Bucketed common vector scaling for authorship attribution in heterogeneous web collections: A scaling approach for authorship attribution


Agun H. V., YILMAZEL Ö.

JOURNAL OF INFORMATION SCIENCE, vol.46, no.5, pp.683-695, 2020 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 46 Issue: 5
  • Publication Date: 2020
  • DOI Number: 10.1177/0165551519863350
  • Journal Name: JOURNAL OF INFORMATION SCIENCE
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Metadex, Civil Engineering Abstracts, Library, Information Science & Technology Abstracts (LISTA)
  • Page Numbers: pp.683-695
  • Keywords: Authorship attribution, common vector approach, domain scaling, text classification, IDENTIFICATION
  • Affiliated with Anadolu University: Yes

Abstract

Domain, genre and topic influences on author style adversely affect the performance of authorship attribution (AA) in multi-genre and multi-domain data sets. Although recent approaches to AA tasks focus on proposing new feature sets and sampling techniques to improve the robustness of a classification system, they do not incorporate domain-specific properties to reduce the negative impact of irrelevant features on AA. This study presents a novel scaling approach, bucketed common vector scaling, that efficiently reduces negative domain influence without reducing the dimensionality of existing features; the approach is therefore easy to transfer to and apply in an existing classification system. Classification results on English-language competition data sets consisting of emails and articles, and on Turkish-language web documents consisting of blogs, articles and tweets, indicate that the approach is highly competitive with top-performing approaches on the English competition data sets and significantly improves on the best classification performance in mixed-domain experiments on blogs, articles and tweets.