Stability of K-means clustering

Alexander Rakhlin, Andrea Caponnetto

Research output: Chapters, Conference Papers, Creative and Literary WorksRGC 32 - Refereed conference paper (with host publication)peer-review

68 Citations (Scopus)

Abstract

We phrase K-means clustering as an empirical risk minimization procedure over a class HK and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of H K with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of Ω(n 1/2) samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in H K implies stability of the actualcenters of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.
Original languageEnglish
Title of host publicationAdvances in Neural Information Processing Systems
Pages1121-1128
Publication statusPublished - 2007
Externally publishedYes
Event20th Annual Conference on Neural Information Processing Systems, NIPS 2006 - Vancouver, BC, Canada
Duration: 4 Dec 20067 Dec 2006

Publication series

Name
ISSN (Print)1049-5258

Conference

Conference20th Annual Conference on Neural Information Processing Systems, NIPS 2006
PlaceCanada
CityVancouver, BC
Period4/12/067/12/06

Fingerprint

Dive into the research topics of 'Stability of K-means clustering'. Together they form a unique fingerprint.

Cite this