Bucketed common vector scaling for authorship attribution in heterogeneous web collections: A scaling approach for authorship attribution

Agun, Hayri; YILMAZEL, ÖZGÜR

doi:10.1177/0165551519863350

Agun H. V., YILMAZEL Ö.

JOURNAL OF INFORMATION SCIENCE, vol.46, no.5, pp.683-695, 2020 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 46 Issue: 5
Publication Date: 2020
DOI Number: 10.1177/0165551519863350
Journal Name: JOURNAL OF INFORMATION SCIENCE
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Metadex, Civil Engineering Abstracts, Library, Information Science & Technology Abstracts (LISTA)
Pages: pp.683-695
Keywords: Authorship attribution, common vector approach, domain scaling, text classification, IDENTIFICATION
Anadolu University Affiliated: Yes

Abstract

Domain, genre and topic influences on author style adversely affect the performance of authorship attribution (AA) in multi-genre and multi-domain data sets. Although recent approaches to AA tasks focus on suggesting new feature sets and sampling techniques to improve the robustness of a classification system, they do not incorporate domain-specific properties to reduce the negative impact of irrelevant features on AA. This study presents a novel scaling approach, namely, bucketed common vector scaling, to efficiently reduce negative domain influence without reducing the dimensionality of existing features; therefore, this approach is easily transferable and applicable in a classification system. Classification performance on English-language competition data sets consisting of emails and articles, and on Turkish-language web documents consisting of blogs, articles and tweets, indicates that our approach is competitive with top-performing approaches on the English competition data sets and significantly improves the top classification performance in mixed-domain experiments on blogs, articles and tweets.