Hybrid feature selection for text classification

Gunal, Serkan

doi:10.3906/elk-1101-1064

TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, vol. 20, pp. 1296-1311, 2012 (SCI-Expanded, Scopus, TRDizin)

Publication Type: Article / Full Article
Volume: 20
Publication Date: 2012
DOI: 10.3906/elk-1101-1064
Journal Name: TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, TR DİZİN (ULAKBİM)
Pages: pp. 1296-1311
Keywords: Feature extraction, feature selection, pattern recognition, text classification, SUPPORT VECTOR MACHINES, LINEAR DISCRIMINANT-ANALYSIS, FEATURE SUBSET-SELECTION, PATTERN-RECOGNITION, SEMANTIC ANALYSIS, CATEGORIZATION, ALGORITHM, INFORMATION
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Anadolu University: Yes

Abstract

Feature selection is vital in pattern classification because it affects both accuracy and processing time. Selecting proper features matters even more when the initial feature set is considerably large. Text classification is a typical example of this situation, where the size of the initial feature set may reach hundreds or even thousands. Numerous studies in the literature offer different feature selection strategies for text classification, mostly focused on filters. Despite the large number of these studies, there is no significant work investigating the efficacy of combining features that are selected by different selection methods under different conditions. In this study, a hybrid feature selection strategy, which consists of both filter and wrapper feature selection steps, is proposed to comprehensively analyze the redundancy or relevancy of the text features selected by different methods for different feature set sizes, dataset characteristics, classifiers, and success measures. The results of the experimental study reveal that a combination of the features selected by various methods is more effective than the features selected by a single selection method. The profile of the combination is, however, influenced by the characteristics of the dataset, the choice of classification algorithm, and the success measure.
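
The two-step idea in the abstract, filters first and a wrapper second, can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the study: the toy corpus, the two filter scores (`doc_frequency` and `class_df_gap`), the term-overlap voter used as a classifier, and the greedy forward search used as the wrapper. The abstract does not name the filters, the wrapper search, or the classifiers that were actually used.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, label). Not data from the study.
DOCS = [
    ("buy cheap pills now".split(), "spam"),
    ("buy cheap watches today".split(), "spam"),
    ("limited offer buy now".split(), "spam"),
    ("team meeting agenda today".split(), "ham"),
    ("project meeting notes now".split(), "ham"),
    ("team project agenda update".split(), "ham"),
]


def doc_frequency(docs):
    """Filter A (assumed): number of documents that contain each term."""
    df = Counter()
    for tokens, _ in docs:
        df.update(set(tokens))
    return dict(df)


def class_df_gap(docs, positive):
    """Filter B (assumed): absolute gap in document frequency between classes."""
    pos, neg = Counter(), Counter()
    for tokens, label in docs:
        (pos if label == positive else neg).update(set(tokens))
    return {t: abs(pos[t] - neg[t]) for t in set(pos) | set(neg)}


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def loo_accuracy(features, docs):
    """Leave-one-out accuracy of a term-overlap voter restricted to `features`."""
    feats = set(features)
    correct = 0
    for i, (tokens, label) in enumerate(docs):
        votes = Counter()
        for j, (other, other_label) in enumerate(docs):
            if i != j:
                votes[other_label] += len(feats & set(tokens) & set(other))
        ranked = votes.most_common()
        top_label, top_votes = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # A prediction counts only when one class wins outright.
        if top_votes > runner_up and top_label == label:
            correct += 1
    return correct / len(docs)


def forward_select(candidates, docs):
    """Wrapper (assumed): greedily add the term that most improves accuracy."""
    selected, best = [], 0.0
    while True:
        best_term = None
        for term in candidates:
            if term in selected:
                continue
            score = loo_accuracy(selected + [term], docs)
            if score > best:
                best, best_term = score, term
        if best_term is None:
            return selected, best
        selected.append(best_term)


# Filter step: pool the top-ranked terms from both filters
# (order kept, duplicates dropped).
pool = list(dict.fromkeys(
    top_k(doc_frequency(DOCS), 4) + top_k(class_df_gap(DOCS, "spam"), 4)
))
# Wrapper step: search the pooled candidates for a subset that scores well.
selected, best = forward_select(pool, DOCS)
print("pool:", pool)
print("selected:", selected, "accuracy:", best)
```

The pooled candidate list is what makes the sketch "hybrid": each filter contributes terms the other may rank lower, and the wrapper then decides which of them are worth keeping for a given classifier and success measure. In this toy setup the wrapper is scored on the same six documents it searches over, which a real experiment would avoid by using a separate validation split.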