Clustering overview

The goal in clustering is to take unlabeled data \((x_i)_{i=1}^n\), and produce a partition so that similar data is put in the same group, and dissimilar data is put in different groups.

Remark. Before giving more details, note why we are doing clustering:

There are many ways to formalize the search for a partition. There are two broad classes:

Remark. Before explaining this further, note that we had this same issue with classification. [ Note to future matus: we should have discussed it then! Whole course is designed around prediction, which is my own bias, which thus should be de-emphasized… ] Consider the problem of least squares, which we’ve only formulated as minimization of \[ w \mapsto \frac 1 2 \|Xw - Y\|_2^2. \]

In clustering, this distinction leads to similar complications.

Example. Suppose data is uniformly distributed on a circle. [ Picture drawn in class. ] Therefore if someone claims that some fixed partition/clustering is good, then any rotation of that clustering should also be good (because the data is invariant to rotations). So the fit/prediction problem is easy, but the recovery problem is impossible.
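
As a small numerical illustration of the circle example, the sketch below places \(n\) equally spaced points on the unit circle (a discrete stand-in for the uniform distribution) and scores a pair of centers by the sum of squared distances to the nearest center. The scoring function, the particular centers, and the helper names `cost` and `rotate` are assumptions made for illustration, not something fixed by the lecture. Rotating the centers by any multiple of \(2\pi/n\) maps the point set to itself, so the score should not change up to floating-point error.

```python
import math

def cost(X, S):
    """Sum over points of the squared distance to the nearest center (2-d points)."""
    return sum(min((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2 for c in S) for x in X)

def rotate(points, theta):
    """Rotate each 2-d point about the origin by angle theta."""
    ct, st = math.cos(theta), math.sin(theta)
    return [(ct * p[0] - st * p[1], st * p[0] + ct * p[1]) for p in points]

n = 360
# n equally spaced points on the unit circle (assumed discretization).
X = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)) for i in range(n)]

# An arbitrary (hypothetical) pair of centers.
S = [(0.5, 0.0), (-0.5, 0.0)]
base = cost(X, S)

# Cost of every rotation of S by a multiple of 2*pi/n.
rotated_costs = [cost(X, rotate(S, 2 * math.pi * j / n)) for j in range(n)]
max_gap = max(abs(c - base) for c in rotated_costs)
print(max_gap)  # tiny: only floating-point error
```

Every rotated pair of centers fits the data equally well, which is the sense in which the fit problem is easy while singling out one "true" clustering is not possible.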

Remark. As discussed in class, the choice between a recovery and a prediction setup is problem-dependent. For some problems, it is crucial to recover the value of a key underlying parameter, and natural, reasonable modeling assumptions are available.

Example. [ In class we also discussed mixtures of Gaussians. One interesting case that arises is distinguishing between a single wide Gaussian and a mixture of \(k\) tightly packed narrow Gaussians; information-theoretically, \(2^k\) samples are required here for recovery, but of course the prediction/fit problem is trivial. (Moitra and Valiant 2010, Figure 1) ]


To make things concrete, we’ll discuss the problem of \(k\)-means: we are given data \(X := (x_1,\ldots,x_n)\), and are expected to find \(k\) centers \(S:=(c_1,\ldots, c_k)\) in order to keep the cost \(\phi_X(S)\) small, defined as \[ \phi_X(S) := \sum_{x\in X} \min_{c\in S} \|x-c\|_2^2. \]
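
To make the definition of \(\phi_X(S)\) concrete, here is a minimal sketch in plain Python. The function names `sq_dist` and `kmeans_cost` and the toy data are hypothetical, chosen only for illustration; points are represented as tuples of floats.

```python
def sq_dist(x, c):
    """Squared Euclidean distance ||x - c||_2^2 between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def kmeans_cost(X, S):
    """phi_X(S): each point pays the squared distance to its nearest center."""
    return sum(min(sq_dist(x, c) for c in S) for x in X)

# Toy data: two tight pairs of points, far apart from each other.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good = [(0.0, 1.0), (10.0, 1.0)]   # one center per pair
bad = [(5.0, 0.0), (5.0, 2.0)]     # both centers between the pairs

print(kmeans_cost(X, good))  # 4.0
print(kmeans_cost(X, bad))   # 100.0
```

With the centers placed at the means of the two pairs, every point is at squared distance 1 from its nearest center; with the centers in the middle, every point pays 25.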


The standard algorithm is Lloyd’s method. [ note to future matus, give the proper reference. ]

  1. Start with some initial centers \(S_0 := (c_1^{(0)},\ldots,c_k^{(0)})\).

  2. For \(i \in \{ 1, \ldots, t\}\):

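    1. Find a corresponding partition \((A_1,\ldots,A_k)\) by assigning each point to its closest center in \(S_{i-1}\), breaking ties in a consistent way (i.e., set \(A_j := \{ x\in X : S_{i-1}(x) = c^{(i-1)}_j \}\), where \(S_{i-1}(x)\) denotes the center in \(S_{i-1}\) closest to \(x\)).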

    2. Set \(c_j^{(i)} := \text{mean}(A_j)\) for all \(j\in \{1,\ldots,k\}\), and \(S_i := (c_1^{(i)}, \ldots, c_k^{(i)})\).
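
The two steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: the names `lloyd`, `sq_dist`, and `mean` are hypothetical, ties are broken toward the lowest center index, and an empty cluster keeps its previous center (the notes do not specify what to do in that case, so this is an assumption).

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd(X, centers, t):
    """Run t rounds of Lloyd's method starting from the given centers."""
    centers = list(centers)
    for _ in range(t):
        # Step 1: assign each point to its nearest center.
        # min(...) returns the first minimizer, so ties go to the lowest index.
        clusters = [[] for _ in centers]
        for x in X:
            j = min(range(len(centers)), key=lambda idx: sq_dist(x, centers[idx]))
            clusters[j].append(x)
        # Step 2: move each center to the mean of its cluster.
        # Assumption: an empty cluster leaves its center where it was.
        centers = [mean(A) if A else c for A, c in zip(clusters, centers)]
    return centers

X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
init = [(0.0, 0.0), (10.0, 2.0)]
print(lloyd(X, init, 5))  # [(0.0, 1.0), (10.0, 1.0)]
```

On this toy data the centers reach the two pair means after one round and stay there in later rounds.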


A very nice algorithm for this problem is \(\texttt{kmeans++}\). (The name was given in the second paper on the algorithm, Arthur and Vassilvitskii (2007); the first paper was Ostrovsky et al. (2006).)

  1. Choose the first center \(c_1\) uniformly at random from \(X\).

  2. For \(i \in \{2,\ldots, k'\}\):

    1. Choose center \(c_i\) randomly from \(X\) according to the following distribution: each \(x\in X\) is chosen with probability proportional to \(\min_{j\leq i-1} \|x-c_j\|_2^2\).
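
The seeding procedure above can be sketched in plain Python as follows. The names `kmeanspp_seed`, `sq_dist`, and `num_centers` (which plays the role of the upper limit in step 2) are hypothetical, and the uniform fallback when every point already coincides with a chosen center is an assumption added so that the sketch never divides by zero; it is not part of the procedure as stated.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeanspp_seed(X, num_centers, rng=None):
    """Pick num_centers points of X by squared-distance-proportional sampling."""
    rng = rng or random.Random()
    # First center: uniform over X.
    centers = [rng.choice(X)]
    # d2[i] = squared distance from X[i] to its nearest chosen center so far.
    d2 = [sq_dist(x, centers[0]) for x in X]
    while len(centers) < num_centers:
        if sum(d2) == 0:
            # Assumption: all points coincide with a center, so sample uniformly.
            c = rng.choice(X)
        else:
            # Each x is chosen with probability proportional to d2.
            c = rng.choices(X, weights=d2, k=1)[0]
        centers.append(c)
        # Only the newly added center can decrease the nearest-center distance.
        d2 = [min(d, sq_dist(x, c)) for d, x in zip(d2, X)]
    return centers

X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
seeds = kmeanspp_seed(X, 3, rng=random.Random(0))
print(seeds)
```

Points that have already been chosen have weight zero, so as long as the data contains enough distinct points the sampled centers do not repeat. The resulting centers can be handed to Lloyd's method as its initial centers.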

There are a variety of guarantees for this method:


Moitra, Ankur, and Greg Valiant. 2010. “Settling the Polynomial Learnability of Mixtures of Gaussians.” In FOCS.

Arthur, David, and Sergei Vassilvitskii. 2007. “k-means++: The Advantages of Careful Seeding.” In SODA.

Ostrovsky, Rafail, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. 2006. “The Effectiveness of Lloyd-Type Methods for the \(k\)-Means Problem.” In FOCS.