CUES: A New Hierarchical Approach for Document Clustering
The Journal of Pattern Recognition Research (JPRR) provides an international forum for the electronic publication of high-quality research and industrial experience articles in all areas of pattern recognition, machine learning, and artificial intelligence. JPRR is committed to rigorous yet rapid reviewing. Final versions are published electronically (ISSN 1558-884X) immediately upon acceptance.
Tanmay Basu, C.A. Murthy
JPRR Vol 8, No 1 (2013); doi:10.13176/11.459 
Abstract
The objective of document clustering techniques is to group similar documents and separate dissimilar ones. Unlike document classification, document clustering provides no labeled documents. One of the main challenges for any document clustering algorithm is selecting a good similarity measure. Traditionally, under the vector space model, the number of words common to two documents is used to determine their similarity. This paper introduces a document similarity measure, the extensive similarity between documents. In this approach, two documents are considered similar if they share a minimum number of common words and have almost the same distance to every other document in the corpus, i.e., both are either similar or dissimilar to each of the other documents. A hierarchical document clustering algorithm that uses the extensive similarity between documents is proposed in this article. Experiments on several text data sets show that the proposed algorithm performs significantly better than traditional document clustering techniques; the comparisons are based on F-measure and normalized mutual information.
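The idea of extensive similarity described in the abstract can be illustrated with a rough Python sketch. The abstract does not give the formal definition, so everything concrete here is an assumption made for illustration: the use of cosine similarity over raw word counts as the content test, the threshold name `theta`, the function names, and the choice to score a pair by counting the other documents on which the two disagree. It should not be read as the authors' exact measure or algorithm.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extensive_similarity(docs: list[str], theta: float) -> list[list[int]]:
    """Illustrative pairwise score; -1 marks pairs that fail the content test.

    Assumption (not from the abstract): a pair whose cosine similarity is
    below `theta` is treated as dissimilar. For every other pair, the score
    counts the documents in the corpus on which the two disagree (one is
    similar to it, the other is not). A LOWER score therefore means the two
    documents relate to the rest of the corpus more alike.
    """
    bags = [Counter(d.lower().split()) for d in docs]
    n = len(bags)
    # similar[i][j] is True when documents i and j pass the content test.
    similar = [
        [cosine(bags[i], bags[j]) >= theta for j in range(n)] for i in range(n)
    ]
    scores = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if similar[i][j]:
                scores[i][j] = sum(
                    similar[i][k] != similar[j][k] for k in range(n)
                )
    return scores


if __name__ == "__main__":
    corpus = [
        "apple banana fruit",
        "apple banana juice",
        "car engine wheel",
        "car engine road",
    ]
    for row in extensive_similarity(corpus, theta=0.5):
        print(row)
```

In this toy corpus the two fruit documents pass the content test and agree on every other document, while any fruit and car pair fails the content test outright. A hierarchical procedure could then merge pairs with the lowest non-negative score first, although the merging rule used in the paper is not stated in the abstract.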