Survey On A Similarity Measure For Text Classification And Clustering
Keywords:
Text Classification, Text Clustering, Clustering Algorithms, Preprocessing, Tokenization, Stemming
Abstract
Computing the similarity between documents is an important operation in text processing. In this paper, a
new similarity measure is proposed. To calculate the similarity between two documents with respect to a feature, the
proposed measure takes the following three cases into account: I) the feature appears in both documents, II) the
feature appears in only one document, and III) the feature appears in neither document. For the first
case, the similarity increases as the difference between the two feature values decreases. For the second
case, a fixed value is contributed to the similarity. For the last case, the feature makes no contribution to the
similarity. The proposed measure is extended to measure the similarity between two sets of documents. The
effectiveness of the measure is evaluated on a number of data sets for text clustering and classification. The
performance obtained with the proposed measure is better than that achieved with other measures.
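The three cases can be illustrated with a minimal Python sketch. The abstract does not give the formula, so everything specific here is an assumption made for illustration: a Gaussian-shaped term for case I, the names `feature_contribution` and `similarity`, the parameters `scale` and `penalty`, averaging over the features present in at least one document, and the final rescaling to the range [0, 1]. It is not presented as the measure defined in the paper.

```python
import math


def feature_contribution(x, y, scale=1.0, penalty=1.0):
    """Contribution of one feature to the similarity of two documents.

    x, y are the (non-negative) feature values in the two documents.
    Returns None when the feature is absent from both (case III).
    The Gaussian form and the default parameters are assumptions.
    """
    if x > 0 and y > 0:
        # Case I: present in both; grows as |x - y| shrinks, maximum 1.0.
        return 0.5 * (1.0 + math.exp(-((x - y) / scale) ** 2))
    if x == 0 and y == 0:
        # Case III: present in neither; contributes nothing.
        return None
    # Case II: present in only one document; fixed (assumed negative) value.
    return -penalty


def similarity(doc_a, doc_b, scale=1.0, penalty=1.0):
    """Average the per-feature contributions and rescale to [0, 1]."""
    if len(doc_a) != len(doc_b):
        raise ValueError("documents must use the same feature space")
    total, counted = 0.0, 0
    for x, y in zip(doc_a, doc_b):
        c = feature_contribution(x, y, scale, penalty)
        if c is not None:
            total += c
            counted += 1
    if counted == 0:
        # Assumption: two empty documents are given similarity 0.0.
        return 0.0
    average = total / counted  # lies in [-penalty, 1]
    return (average + penalty) / (1.0 + penalty)


# Features 0 and 2 match exactly, feature 1 is in one document only,
# feature 3 is in neither and is ignored.
print(round(similarity([1, 0, 2, 0], [1, 3, 2, 0]), 3))  # 0.667
```

With these assumed choices, identical documents score 1.0 and documents with no shared features score 0.0, which matches the qualitative behavior described for the three cases.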