A Similarity Measure for Text Classification and Clustering
Keywords:
Document classification, document clustering, entropy, accuracy, classifiers, clustering algorithms
Abstract
Clustering is one of the essential techniques in machine learning and data mining; it groups similar data items
together. In a document vector, each component indicates the value of the corresponding feature in the document.
The feature value could be term frequency or a similar quantity such as relative term frequency. Similarity
Measurement for Text Process (SMTP) is used to measure the similarity between two documents with respect to a
feature. The presence or absence of each feature in the two documents is used to estimate the similarity value.
SMTP is extended to estimate the similarity between two sets of documents. The SMTP scheme is applied to text
clustering and classification tasks. The K-means algorithm is used for clustering.
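
The document-vector representation mentioned above can be illustrated with a minimal sketch. It builds term-frequency and relative-term-frequency vectors over a shared vocabulary. The function name `term_frequency_vectors`, the lowercase whitespace tokenization, and the sample documents are assumptions made for illustration only; the sketch does not implement the SMTP measure itself.

```python
from collections import Counter


def term_frequency_vectors(docs):
    """Build document vectors over a shared vocabulary.

    Hypothetical helper: tokenization is plain lowercase whitespace
    splitting, which is an assumption and not taken from the paper.
    Returns (vocabulary, tf_vectors, rel_tf_vectors).
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One vector component per distinct term (feature), in sorted order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})

    tf_vectors = []
    rel_tf_vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency: raw count of each vocabulary term in the document.
        tf = [counts[term] for term in vocabulary]
        total = sum(tf)
        # Relative term frequency: count divided by document length.
        rel = [c / total if total else 0.0 for c in tf]
        tf_vectors.append(tf)
        rel_tf_vectors.append(rel)
    return vocabulary, tf_vectors, rel_tf_vectors


# Made-up sample documents, for illustration only.
docs = ["data mining data", "text mining"]
vocab, tf, rel = term_frequency_vectors(docs)
print(vocab)
print(tf)
print(rel)
```

A zero component means the feature is absent from that document, which is the presence or absence information a feature-wise similarity measure can use.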