23rd Signal Processing and Communications Applications Conference (SIU), Malatya, Turkey, 16-19 May 2015, pp. 1635-1638
Medical text classification remains a popular research problem within the text classification domain. Apart from some text data compiled from hospital records, most researchers in this field evaluate their classification methods on documents from the MEDLINE database. Taken as a whole, MEDLINE is a multi-class, multi-label database. In this study, a dataset is constructed from a small subset of MEDLINE documents belonging to disease categories; it is multi-class but single-label. Because the class distribution of this dataset is highly unbalanced, only documents belonging to the top 10 disease categories are used in the experiments. The performance of three pattern classifiers is analyzed on the disease classification problem using this dataset: a Bayesian network, a C4.5 decision tree, and a Random Forest. Experiments are carried out for two cases, with and without the stemming preprocessing step. The experimental results show that the Bayesian network is the most successful of the three classifiers, and that the best performance is obtained without stemming.
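The restriction to the top 10 disease categories can be illustrated with a minimal sketch. It assumes that "top 10" means the ten categories with the most documents, which the abstract does not state explicitly. The function name `keep_top_k_classes` and the toy documents and labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def keep_top_k_classes(docs, labels, k=10):
    """Keep only the documents whose label is among the k most frequent labels.

    Assumption: "top-k" is read as "most documents per category".
    Ties at the cut-off are broken by first occurrence (Counter behaviour).
    """
    counts = Counter(labels)
    top = {label for label, _ in counts.most_common(k)}
    kept = [(doc, label) for doc, label in zip(docs, labels) if label in top]
    kept_docs = [doc for doc, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_docs, kept_labels


# Hypothetical toy data: one label per document (single-label setting).
docs = ["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"]
labels = ["asthma", "asthma", "asthma", "diabetes", "diabetes", "gout"]

# With k=2, the single "gout" document is dropped.
top_docs, top_labels = keep_top_k_classes(docs, labels, k=2)
```

Filtering this way removes the rarest categories but does not balance the remaining ones, so the retained classes can still differ considerably in size.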