DOI: 10.5176/2251-3833_GHC16.13
Authors: Yu-Tsung Tsai, Bor-Wen Cheng, Cheng-Min Chao, Hsien-Jy Ma and Yao-Lung Kuo
Abstract: As medical technology advances, the volume of medical data grows with it. Data mining has therefore become an essential tool for hospital management and medical research, because it can extract meaningful information from these databases for use in further research. The aim of this study is to compare three data mining methods and to identify the determining factors of breast cancer. The data were collected from a hospital in central Taiwan and comprised 1357 cases and 7 variables. The dataset was divided into 10 groups of training and testing sets. Three popular data mining algorithms (decision tree C5.0, support vector machines (SVM), and logistic regression) were used to predict whether patients survived or died. The results showed that decision tree C5.0 outperformed SVM and logistic regression. On the training sets, decision tree C5.0 achieved a classification accuracy of 95.8% with a sensitivity of 97.7% and a specificity of 94.7%. On the testing sets, it achieved a classification accuracy of 94.9% with a sensitivity of 95.7% and a specificity of 94.3%. These results suggest that decision tree C5.0 could yield a better prognosis model in practice. Our findings are expected to give doctors a reference when assessing new cases.
Keywords: Fold cross-validation; Logistic Regression; Decision tree C5.0; Support vector machines (SVM); Breast Cancer
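
To make the evaluation terms in the abstract concrete, the sketch below shows one way a dataset of 1357 cases might be split into 10 folds, and how accuracy, sensitivity, and specificity follow from predicted and true labels. It is an illustration only, not the authors' procedure: the function names `k_fold_indices` and `classification_metrics`, the random seed, the toy labels, and the choice of which outcome counts as the positive class are all assumptions, and reading the 10-way split as 10-fold cross-validation is inferred from the keywords.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Shuffle case indices and deal them into k nearly equal folds.

    Hypothetical helper: each fold serves once as the testing set,
    with the remaining k - 1 folds forming the training set.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, and specificity for binary labels.

    Assumption: `positive` marks the outcome treated as the positive
    class; the abstract does not say which outcome that is.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Sensitivity: share of actual positives predicted as positive.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of actual negatives predicted as negative.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# 1357 cases, as reported in the abstract, dealt into 10 folds.
folds = k_fold_indices(1357, k=10)

# Toy labels, invented purely to exercise the metric function.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 0, 0, 1]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

With 1357 cases the folds cannot be exactly equal, so this split yields seven folds of 136 cases and three of 135. Reporting metrics separately for training and testing sets, as the abstract does, would mean calling `classification_metrics` once on each.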
