TY - JOUR
T1 - A topological view of unsupervised learning from noisy data
AU - Niyogi, P.
AU - Smale, S.
AU - Weinberger, S.
PY - 2011
Y1 - 2011
N2 - In this paper, we take a topological view of unsupervised learning. From this point of view, clustering may be interpreted as trying to find the number of connected components of any underlying geometrically structured probability distribution in a certain sense that we will make precise. We construct a geometrically structured probability distribution that seems appropriate for modeling data in very high dimensions. A special case of our construction is the mixture of Gaussians where there is Gaussian noise concentrated around a finite set of points (the means). More generally we consider Gaussian noise concentrated around a low dimensional manifold and discuss how to recover the homology of this underlying geometric core from data that do not lie on it. We show that if the variance of the Gaussian noise is small in a certain sense, then the homology can be learned with high confidence by an algorithm that has a weak (linear) dependence on the ambient dimension. Our algorithm has a natural interpretation as a spectral learning algorithm using a combinatorial Laplacian of a suitable data-derived simplicial complex. © 2011 Society for Industrial and Applied Mathematics.
AB - In this paper, we take a topological view of unsupervised learning. From this point of view, clustering may be interpreted as trying to find the number of connected components of any underlying geometrically structured probability distribution in a certain sense that we will make precise. We construct a geometrically structured probability distribution that seems appropriate for modeling data in very high dimensions. A special case of our construction is the mixture of Gaussians where there is Gaussian noise concentrated around a finite set of points (the means). More generally we consider Gaussian noise concentrated around a low dimensional manifold and discuss how to recover the homology of this underlying geometric core from data that do not lie on it. We show that if the variance of the Gaussian noise is small in a certain sense, then the homology can be learned with high confidence by an algorithm that has a weak (linear) dependence on the ambient dimension. Our algorithm has a natural interpretation as a spectral learning algorithm using a combinatorial Laplacian of a suitable data-derived simplicial complex. © 2011 Society for Industrial and Applied Mathematics.
KW - Data
KW - Manifolds
KW - Topology
UR - http://www.scopus.com/inward/record.url?scp=79960379431&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-79960379431&origin=recordpage
U2 - 10.1137/090762932
DO - 10.1137/090762932
M3 - RGC 21 - Publication in refereed journal
VL - 40
SP - 646
EP - 663
JO - SIAM Journal on Computing
JF - SIAM Journal on Computing
SN - 0097-5397
IS - 3
ER -