Incorporating Topic Information in a Global Feature Selection Schema for Authorship Attribution


Creative Commons License

Agun H. V., YILMAZEL Ö.

IEEE ACCESS, vol.7, pp.98522-98529, 2019 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 7
  • Publication Date: 2019
  • DOI: 10.1109/access.2019.2930536
  • Journal Name: IEEE ACCESS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.98522-98529
  • Keywords: Authorship attribution, feature selection, text classification
  • Affiliated with Anadolu University: Yes

Abstract

Authorship attribution (AA) is the stylometric analysis task of finding the author of an anonymous or disputed text document. In AA, the performance improvement of class-based feature selection schemas, such as Chi-square and Gini index, over frequency-based feature selection schemas, such as document frequency, common n-grams, and inverted document frequency, has been shown to be limited. The feature selection process in AA is also significantly affected by topic distributions. In this paper, we assess the performance of a global feature selection approach in which the document's topic category is incorporated to scale the existing feature weights. In this approach, features that an author uses across different topics indicate higher relevance for the author and thus receive higher weights. On the other hand, features with biased topic distributions are assumed to have high topic relevance and receive lower weights. The global topic measure and the author-specific topic measure are combined to scale the existing selection weights of the features. The results of a ten-fold cross-validation experiment on a multi-topic dataset with a random topic distribution indicate that our approach significantly improves the performance of the Chi-square, modified Gini index, and common n-grams schemas in the best performing configurations of the classifiers.
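
The abstract does not give the formulas for the topic measures, so the following is only a rough sketch of the general idea of scaling an existing feature weight by how evenly the feature is spread across topics. Everything in it is an assumption made for illustration: the use of normalized entropy as the topic measure, the choice to average the global and author-specific measures, and the names `normalized_entropy` and `topic_scaled_weights`. It should not be read as the paper's method.

```python
import math
from collections import Counter, defaultdict


def normalized_entropy(counts, n_topics):
    """Entropy of a topic-count distribution, scaled to [0, 1].

    Values near 1 mean the feature is spread evenly over topics;
    values near 0 mean it is concentrated in a single topic.
    (Assumed topic measure; not taken from the paper.)
    """
    total = sum(counts.values())
    if total == 0 or n_topics < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)
    return h / math.log(n_topics)


def topic_scaled_weights(docs, base_weights):
    """Scale existing feature weights by topic spread.

    docs: list of (author, topic, set_of_features) tuples.
    base_weights: {feature: weight} from any selection schema
                  (e.g. Chi-square scores computed elsewhere).
    """
    topics = {topic for _, topic, _ in docs}
    global_counts = defaultdict(Counter)  # feature -> topic -> doc count
    author_counts = defaultdict(lambda: defaultdict(Counter))  # feature -> author -> topic -> doc count
    for author, topic, feats in docs:
        for f in feats:
            global_counts[f][topic] += 1
            author_counts[f][author][topic] += 1

    scaled = {}
    for f, w in base_weights.items():
        # Global measure: spread of the feature over topics in the whole corpus.
        g = normalized_entropy(global_counts[f], len(topics))
        # Author-specific measure: best spread over topics within a single author.
        per_author = [normalized_entropy(c, len(topics)) for c in author_counts[f].values()]
        a = max(per_author, default=0.0)
        # Assumed combination: a simple average of the two measures.
        scaled[f] = w * (g + a) / 2
    return scaled


# Toy corpus: "the" appears across topics, "goal" only in sports.
docs = [
    ("A", "sports", {"the", "goal"}),
    ("A", "politics", {"the", "vote"}),
    ("B", "sports", {"goal", "of"}),
    ("B", "politics", {"of", "vote"}),
]
base = {"the": 1.0, "goal": 1.0, "of": 1.0, "vote": 1.0}
scaled = topic_scaled_weights(docs, base)
```

In this toy corpus, a feature used by one author in both topics keeps its weight, while a feature that occurs only in one topic is pushed toward zero, which mirrors the qualitative behaviour described in the abstract.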