Diabetes prediction using feature selection and classification
Keywords:
Data mining, Feature selection, F-score, SVM classifier, K-means clustering.
Abstract
Medical data mining is becoming increasingly important in healthcare. The diversity of
medical data collected and stored for diagnosis and prognosis, together with the wide availability of data mining
techniques to process these data, places medical data mining in a unique position to improve patient
care. Medical data are high-dimensional in nature. They contain irrelevant and
redundant features that reduce prediction accuracy, so pre-processing is required to prepare the data for
the mining task. Feature selection has been an active and fruitful field of research and development for
decades in statistical machine learning and data mining. It is effective in enhancing learning efficiency,
increasing predictive accuracy, and reducing the complexity of learned results. Feature selection is a pre-processing technique that selects an optimal feature subset from the full feature set. The F-score method and K-means clustering are used for feature selection. The performance of the SVM classifier is empirically
evaluated on the reduced feature subset of the Pima Indian diabetes dataset, one of the standard datasets
available from the UCI Machine Learning Repository and used for testing the
prediction accuracy of data mining algorithms in diabetes data classification.
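
To make the F-score ranking step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the F-score here is the usual two-class discriminant score: the squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The function names (f_score, rank_features, select_features) are hypothetical, and the K-means step and the SVM training are not shown.

```python
from statistics import mean, variance


def f_score(column, labels):
    """F-score of one feature for a binary problem (labels are 0 or 1).

    Assumes each class has at least two samples, since the sample
    variance (n - 1 denominator) is undefined otherwise.
    """
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    overall = mean(column)
    # Numerator: how far each class mean lies from the overall mean.
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    # Denominator: spread of the feature inside each class.
    within = variance(pos) + variance(neg)
    return between / within if within > 0 else 0.0


def rank_features(rows, labels):
    """Return (feature indices sorted by descending F-score, raw scores)."""
    columns = list(zip(*rows))
    scores = [f_score(col, labels) for col in columns]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def select_features(rows, keep):
    """Keep only the columns whose indices are listed in `keep`."""
    return [[row[i] for i in keep] for row in rows]
```

A higher score suggests a feature separates the two classes better. In a pipeline like the one described, the top-ranked columns returned by select_features would then be passed to the SVM classifier; the number of columns to keep is a tuning choice that this sketch does not fix.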