A Comparative Approach to Cluster Validation
Renata Avros, Mati Golani, Zeev Volkovich
Abstract
Estimating the appropriate number of clusters is a well-known problem in cluster analysis, and it directly affects cluster stability. A stable cluster is one in which the majority of instances are correctly assigned to it. Given a stable clustering result, we expect that re-clustering matched pairs of samples taken from the same cluster would yield similar results. Our implementation of this approach uses a dedicated technique for constructing repeated clusterings of the same data. The similarity of the solutions is measured by the total misclassification rate between the clustering of the combined sample and the clustering of each individual sample. A second characteristic is the adjusted Rand index, calculated between the clustering of the combined sample and the one induced by the partitions of the individual samples. Shortcomings of clustering algorithms, together with inconsistencies in the data, can substantially increase model uncertainty. The inference about the “true” number of clusters is therefore based on a sufficiently large amount of information: the process is repeated many times to construct empirical distributions of the disagreement rate and of the adjusted Rand index. The “true” number of clusters corresponds to the disagreement-rate distribution with the smallest right tail and the index distribution with the smallest left tail. Numerical experiments demonstrate the effectiveness of the proposed methodology.
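As an illustration of the two agreement measures named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes both clusterings are given as label lists over the same items. The function names `adjusted_rand_index` and `misclassification_rate` are hypothetical, and the brute-force label matching is an assumption that is practical only for a small number of clusters.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must cover the same items")
    if n < 2:
        return 1.0  # no pairs to compare; treated here as full agreement
    # Pair counts taken from the contingency table and its margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def misclassification_rate(labels_a, labels_b):
    """Fraction of items that disagree under the best relabeling of labels_b.

    Assumption: labels are matched by brute force over permutations,
    and None is not used as a cluster label.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    names_a = list(dict.fromkeys(labels_a))
    names_b = list(dict.fromkeys(labels_b))
    # Pad with None so every label in labels_b can be assigned a target.
    k = max(len(names_a), len(names_b))
    targets = names_a + [None] * (k - len(names_a))
    best_agree = 0
    for perm in permutations(targets, len(names_b)):
        mapping = dict(zip(names_b, perm))
        agree = sum(mapping[b] == a for a, b in zip(labels_a, labels_b))
        best_agree = max(best_agree, agree)
    return 1.0 - best_agree / n
```

In a procedure of the kind described above, both values would be collected over many repeated samplings for each candidate number of clusters, and the resulting empirical distributions compared by their tails.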