On classification of abstracts obtained from medical journals


JOURNAL OF INFORMATION SCIENCE, vol.46, no.5, pp.648-663, 2020 (SCI-Expanded) identifier identifier

  • Publication Type: Article / Article
  • Volume: 46 Issue: 5
  • Publication Date: 2020
  • Doi Number: 10.1177/0165551519860982
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, FRANCIS, IBZ Online, ABI/INFORM, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EBSCO Education Source, Education Abstracts, Index Islamicus, Information Science and Technology Abstracts, INSPEC, Library and Information Science Abstracts, Library Literature and Information Science, Metadex, Civil Engineering Abstracts, Library, Information Science & Technology Abstracts (LISTA)
  • Page Numbers: pp.648-663
  • Keywords: Feature selection, medical documents, preprocessing, text classification, text representation, TEXT CLASSIFICATION, ALGORITHM, CATEGORIZATION, TERMS
  • Anadolu University Affiliated: Yes


Classification of medical documents was mostly carried out on English data sets and these studies were performed on hospital records rather than academic texts. The main reasons behind this situation are the lack of publicly available data sets and the tasks being costly and time-consuming. As the first contribution of this study, two data sets including Turkish and English counterparts of the same abstracts published in Turkish medical journals were constructed. Turkish is one of the widely used agglutinative languages worldwide and English is a good example of non-agglutinative languages. While English abstracts were obtained automatically from MEDLINE database with a computer program, Turkish counterparts of these documents were collected manually from the Internet. As the second contribution of this study, an extensive comparison on classification of abstracts obtained from Turkish medical journals was made by using these two equivalent data sets. Features were extracted from text documents with three different approaches: unigram, bigram and hybrid. Hybrid approach includes a combination of unigram and bigram features. In the experiments, three different feature selection methods and seven different classifiers were utilised. According to the results on both data sets, classification performance of the English abstracts outperformed the Turkish counterparts. Maximum accuracies were obtained from the combination of unigram features, distinguishing feature selector (DFS) and multinomial naive Bayes (MNB) classifier for both data sets. Unigram features were generally more efficient than bigram and hybrid features. However, analysis of top-10 features indicated that nearly half of the features were translations of each other for Turkish and English data sets.