Bucketed common vector scaling for authorship attribution in heterogeneous web collections: A scaling approach for authorship attribution


Agun H. V., YILMAZEL Ö.

JOURNAL OF INFORMATION SCIENCE, vol.46, no.5, pp.683-695, 2020 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 46 Issue: 5
  • Publication Date: 2020
  • Doi Number: 10.1177/0165551519863350
  • Journal Name: JOURNAL OF INFORMATION SCIENCE
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Metadex, Civil Engineering Abstracts, Library, Information Science & Technology Abstracts (LISTA)
  • Page Numbers: pp.683-695
  • Keywords: Authorship attribution, common vector approach, domain scaling, text classification, IDENTIFICATION
  • Anadolu University Affiliated: Yes

Abstract

Domain, genre and topic influences on author style adversely affect the performance of authorship attribution (AA) in multi-genre and multi-domain data sets. Although recent approaches to AA tasks focus on suggesting new feature sets and sampling techniques to improve the robustness of a classification system, they do not incorporate domain-specific properties to reduce the negative impact of irrelevant features on AA. This study presents a novel scaling approach, namely, bucketed common vector scaling, to efficiently reduce negative domain influence without reducing the dimensionality of existing features; therefore, this approach is easily transferable and applicable in a classification system. Classification performance on English-language competition data sets consisting of emails and articles and on Turkish-language web documents consisting of blogs, articles and tweets indicates that our approach is highly competitive with top-performing approaches on the English competition data sets and significantly improves the top classification performance in mixed-domain experiments on blogs, articles and tweets.