Validation in Statistics and Machine Learning - Abstract
(joint work with Ghada Safi (Aleppo))
Anyone who sets out to compare clustering methods is faced with the problem of choosing reliable clustering quality measures. The classical measures for evaluating the quality of a clustering are mainly based on intra-cluster inertia and inter-cluster inertia. Using these two measures, or adaptations of them such as the Davies-Bouldin, Calinski-Harabasz, and Dunn indexes, a clustering result is considered good if its intra-cluster inertia is low compared to its inter-cluster inertia. However, as shown in [Lamirel and Al Shehabi 2004], inertia measures based on cluster profiles are often strongly biased and highly dependent on the clustering method; they thus cannot be used for comparing different methods. Moreover, as also shown in [Kassab and Lamirel 2008], they are often unable to identify an optimal clustering model when the dataset consists of complex data that must be represented in a highly multidimensional and sparse description space. We will demonstrate that the classical measures may even lead to misinterpretation of clustering quality results when they are used to identify both an optimal model and an optimal method on a heterogeneous dataset of sparse data. To assess clustering quality properly, and to clearly highlight the defects of the classical distance-based indexes, we use an original approach inspired by the behavior of symbolic classifiers, which has the advantage of being independent of the clustering methods and of their operating mode [Lamirel et al. 2004]. The Recall and Precision measures introduced in our approach evaluate the quality of a clustering method in an unsupervised way by measuring the relevance of cluster contents in terms of shared properties.
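As a minimal illustration of the classical indexes discussed above, the following sketch computes the Davies-Bouldin and Calinski-Harabasz scores for a K-means clustering of synthetic toy data (the data and parameters are illustrative assumptions, not the dataset or settings used in this work):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

# Toy, well-separated synthetic data (illustrative only; not the PASCAL dataset)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Davies-Bouldin: lower is better (compact clusters that are far apart).
db = davies_bouldin_score(X, labels)
# Calinski-Harabasz: higher is better (ratio of inter- to intra-cluster inertia).
ch = calinski_harabasz_score(X, labels)
print(f"Davies-Bouldin: {db:.3f}, Calinski-Harabasz: {ch:.1f}")
```

Both indexes compare cluster compactness to cluster separation, which is exactly why, as argued above, they inherit the biases of the distance-based cluster profiles they are computed from.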
Extensions of these measures, based on the maximization of cluster data properties, allow us to characterize cluster contents and overall clustering quality more precisely, while also measuring the noise produced by a clustering method. To this end, properties that co-occur in the data associated with a cluster are gathered into connected property groups.
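The core idea can be sketched as follows. This is a simplified reading of the property-based Recall/Precision proposal, with a hypothetical toy data/property matrix; it is not the authors' exact formulation: for a property p and cluster c, Recall relates the weight of p inside c to the weight of p in the whole dataset, and Precision relates it to the total property weight inside c.

```python
import numpy as np

# Toy data/property matrix (rows = data items, columns = binary keyword
# indicators) and a cluster assignment; purely illustrative values.
X = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
labels = np.array([0, 0, 1, 1])

def cluster_recall_precision(X, labels):
    """Unsupervised Recall/Precision per (cluster, property).

    Recall(p, c)    = weight of p in c / weight of p in the whole dataset
    Precision(p, c) = weight of p in c / total property weight in c
    (Simplified sketch of the property-based approach described above.)
    """
    clusters = np.unique(labels)
    # Weight of each property per cluster.
    W = np.vstack([X[labels == c].sum(axis=0) for c in clusters])
    recall = W / np.clip(W.sum(axis=0), 1e-12, None)                 # over clusters
    precision = W / np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)  # within cluster
    return recall, precision

recall, precision = cluster_recall_precision(X, labels)
```

A property concentrated in a single cluster scores high on both measures there, which is how cluster contents can be judged without any distance computation on cluster profiles.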
For our experiment, we used a highly heterogeneous, poly-thematic textual dataset of 1341 bibliographic records drawn from the INIST PASCAL database and covering one year of research performed in the Lorraine region. The dataset thus spanned a wide range of topics, as distant from one another as medicine, structural physics, and forest cultivation. The resulting data description space is sparse, comprising 889 keywords with an average of approximately 5 indexing keywords per record.
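A description space of this kind can be sketched as a sparse binary data/keyword matrix. The mini-corpus below is hypothetical, standing in for the keyword-indexed PASCAL records:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical records: each one is the set of its indexing keywords.
records = [
    {"medicine", "cardiology", "surgery"},
    {"structural physics", "materials"},
    {"forest cultivation", "ecology", "materials"},
]

# Sparse output keeps memory manageable when the real keyword vocabulary
# is large (889 keywords in the experiment) and each record uses few of them.
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(records)  # sparse binary records-by-keywords matrix
print(X.shape)                  # (number of records, number of distinct keywords)
```

Each row then has only a handful of non-zero entries, which is precisely the sparsity regime in which, as noted above, the classical inertia-based indexes tend to break down.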
The tested methods range from standard reference methods, such as K-means [MacQueen 1967] and SOM [Kohonen 1982], to more recent methods, such as Neural Gas [Martinetz 1991], Growing Neural Gas [Fritzke 1995], and Incremental Growing Neural Gas [Prudent and Ennaji 2005], as well as graph-based methods, such as Walktrap [Pons and Latapy 2006].