Printer-friendly version

This is an algorithm developed in the community of vector quantization for the purpose of data compression. Here's the article where this algorithm is developed.

Y. Linde, A. Buzo and R. M. Gray, ”An algorithm for vector quantizer design,”

IEEE Trans. on Communication, Vol. COM-28, pp. 84-95, Jan. 1980.

As we had mentioned before k-means is known as the Lloyd by electrical. Linde, Buzo and Gray extended Lloyd's algorithm by a careful approach to initialization and often achieved better performance in terms of minimizing the total within class distance.

Here is how this algorithm works:

Step 1. First, you find the sample mean or what we call the centroidz_{1}^{(1)}for the entire data set. This is like if we were to have only one prototype. The sample mean is proven to minimize the total within class distance (total mean square distortion) for a single prototype.

Step 2. Setk= 1,l= 1.lis the index for the iteration.kcounts the number of prototypes that have been generated. After the first step, we've only generated one prototype.

Step 3. Ifk<M, (Mbeing the target number of centroids that we're trying to get), split the current centroids by adding small offsets. Let's see how this is done.Since if we already have

kprototypes, we needM−kadditional prototypes.

- If
M−k≥k, split all the existing centroids that have been created so far; otherwise, we split onlyM−kof them.We will denote the number of centroids which are split by = min(

k,M−k), where is the smaller of the two values.How do we make this split?

For example, to split

z_{1}^{(1)}into two centroids, letz_{1}^{(2)}=z_{1}^{(1)},z_{2}^{(2)}=z_{1}^{(1)}+ ε, where ε is a small offset and has a small norm and a random direction.

Step 4.k←k+ ; l ← l + 1. The splitting has taken place.

Step 5. Use {z_{1}^{(1)},z_{2}^{(1)}, ... ,z_{k}^{(1)}} as initial prototypes, which includes the previously generated centroids and the newly split centroids. Apply k-means to refine these prototypes.

Step 6. Check whether the number of prototypes has reached the target number of prototypes. In other words, ifk<M, go back to step 3; otherwise, stop.