CUES: A New Hierarchical Approach for Document Clustering
Tanmay Basu, C.A. Murthy
Abstract
The objective of document clustering is to group similar documents together and to separate dissimilar ones. Unlike document classification, document clustering is given no labeled documents. One of the main challenges for any document clustering algorithm is the selection of a good similarity measure. Traditionally, under the vector space model, the similarity between two documents is determined from the number of words they have in common. This paper introduces a document similarity measure called extensive similarity. Under this measure, two documents are considered similar if they share a minimum number of common words and have almost the same distance to every other document in the corpus, i.e., both are either similar or dissimilar to each of the other documents. A hierarchical document clustering algorithm based on extensive similarity is proposed in this article. Experiments on several text data sets show that the proposed algorithm performs significantly better than traditional document clustering techniques, with comparisons based on F-measure and normalized mutual information.
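As an illustration of the idea, the following minimal Python sketch computes a pairwise score in the spirit of extensive similarity. The abstract does not state the exact formula, so everything here is an assumption: cosine similarity over raw term counts stands in for the "minimum number of common words" test, the threshold `theta` and the names `cosine` and `extensive_similarity` are hypothetical, and the score simply counts how many documents in the corpus the two documents agree on.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extensive_similarity(docs, theta=0.3):
    """Return a matrix of illustrative extensive-similarity scores.

    Assumption: a pair is 'content-similar' when its cosine similarity
    exceeds theta (a hypothetical threshold, not taken from the paper).
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    # dis[i][j] is 0 when documents i and j share enough content, else 1.
    dis = [[0 if cosine(vecs[i], vecs[j]) > theta else 1 for j in range(n)]
           for i in range(n)]
    # Pairs that fail the content test get -1 (treated as dissimilar).
    es = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if dis[i][j] == 0:
                # Count the documents k on which i and j disagree, i.e. one
                # of them is similar to k and the other is not.
                disagreements = sum(abs(dis[i][k] - dis[j][k]) for k in range(n))
                es[i][j] = n - disagreements
    return es


docs = [
    "apple banana fruit",
    "banana apple fruit salad",
    "car engine wheel",
    "engine car wheel road",
]
es = extensive_similarity(docs)
print(es[0][1], es[0][2])
```

In this toy corpus the two fruit documents agree on every other document and so receive the maximum score, while a fruit document and a car document fail the content test and are marked dissimilar. A hierarchical algorithm could then merge the pairs with the highest scores first; how the proposed algorithm actually performs the merging is not specified in the abstract.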