Thursday, 2 January 2014

K-means Clustering Algorithm


  How To Work Out K-means Clustering Algorithm


k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. The distance measure used here is Euclidean distance; there are many other distance measures that can be used for k-means clustering.
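As a rough sketch in plain Python (the function name is just for illustration), the Euclidean distance between two 2-D points is:

    import math

    def euclidean_distance(p, q):
        # straight-line distance between two 2-D points p and q
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    print(euclidean_distance((4, 3), (2, 1)))   # approx 2.83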

Let's work out k-means using a simple example.
Problem 1:
Let's take a medicine example.


Here we have 4 medicines as our training data. Each medicine has 2 attributes, which represent the coordinates of the object:

                 Medicine A : (1,1)
                 Medicine B : (2,1)
                 Medicine C : (4,3)
                 Medicine D : (5,4)

These 4 medicines belong to two clusters. We need to identify which medicines belong to cluster 1 and which belong to cluster 2.

k-means clustering starts with a value of k, which is user defined, i.e. the number of clusters. Here we need to classify the medicines into 2 clusters, so k = 2.

Next we need to identify the initial centroids. Let's select the first 2 points as our initial centroids.

 k = 2    Initial Centroids = (1,1) and (2,1)
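In Python, this setup could look like the following sketch (the variable names are just for illustration):

    # the four medicines (A, B, C, D) and the two initial centroids
    points = [(1, 1), (2, 1), (4, 3), (5, 4)]
    centroids = [(1, 1), (2, 1)]
    k = 2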


Iteration 1:

For each point we need to calculate the distance to each centroid and find the minimum distance.

For Point (1,1)
               C1 = (1,1)                                                C2 = (2,1)
                    = sqrt((1-1)^2 + (1-1)^2)                                 = sqrt((2-1)^2 + (1-1)^2)
                    = 0                                                       = 1

For Point (2,1)
               C1 = (1,1)                                                C2 = (2,1)
                    = sqrt((1-2)^2 + (1-1)^2)                                 = sqrt((2-2)^2 + (1-1)^2)
                    = 1                                                       = 0

For Point (4,3)
               C1 = (1,1)                                                C2 = (2,1)
                    = sqrt((1-4)^2 + (1-3)^2)                                 = sqrt((2-4)^2 + (1-3)^2)
                    = 3.6                                                     = 2.82

For Point (5,4)
               C1 = (1,1)                                                C2 = (2,1)
                    = sqrt((1-5)^2 + (1-4)^2)                                 = sqrt((2-5)^2 + (1-4)^2)
                    = 5                                                       = 4.24





Next we assign each point to the cluster with the minimum distance.

                     C1 = (1,1)
                     C2 = (2,1), (4,3), (5,4)
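A minimal, self-contained sketch of this assignment step in Python (assuming the points and initial centroids above):

    import math

    points = [(1, 1), (2, 1), (4, 3), (5, 4)]
    centroids = [(1, 1), (2, 1)]

    def euclidean_distance(p, q):
        return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    # assign each point to the index of its nearest centroid
    assignments = []
    for p in points:
        distances = [euclidean_distance(p, c) for c in centroids]
        assignments.append(distances.index(min(distances)))

    print(assignments)   # [0, 1, 1, 1] -> (1,1) in C1, the rest in C2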

Next, calculate the new centroids and continue the iterations with the new centroids until the algorithm converges.

New Centroids :
                  C1 remains the same, since (1,1) is the only point assigned to it.
                  For C2 the new centroid is the average of the points assigned to it,
                  i.e. ( (2+4+5)/3 , (1+3+4)/3 )

                 C1 = (1,1)
                 C2 = (3.66,2.66)
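The centroid update is just the per-coordinate mean of the points in each cluster; a small sketch (using the cluster 2 points from above):

    # points assigned to cluster 2 after iteration 1
    c2_points = [(2, 1), (4, 3), (5, 4)]

    # new centroid = mean of the assigned points, per coordinate
    new_c2 = (sum(p[0] for p in c2_points) / len(c2_points),
              sum(p[1] for p in c2_points) / len(c2_points))

    print(new_c2)   # approx (3.66, 2.66)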

Iteration 2: 

For each point we need to calculate the distance to each centroid and find the minimum distance.
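The full distance table for this iteration is not reproduced here, but it can be recomputed with a few lines of Python (a sketch, reusing the same distance formula with the centroids from iteration 1):

    import math

    points = [(1, 1), (2, 1), (4, 3), (5, 4)]
    centroids = [(1, 1), (3.66, 2.66)]   # centroids after iteration 1

    for p in points:
        d = [math.sqrt((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) for c in centroids]
        print(p, [round(x, 2) for x in d])
    # approx: (1,1)->[0, 3.14], (2,1)->[1.0, 2.35], (4,3)->[3.61, 0.48], (5,4)->[5.0, 1.9]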



Assign each point to the cluster with the minimum distance.

                     C1 = (1,1), (2,1)
                     C2 = (4,3), (5,4)

New Centroids : 
                 C1 = (1.5,1)
                 C2 = (4.5,3.5)
Check whether the algorithm has converged.
How to check: compare the old centroids with the new ones. If they are equal, k-means has converged; otherwise continue the iterations.
                     Old Centroid :   (1,1) and (3.66,2.66)
                     New Centroid:  (1.5,1) and (4.5,3.5) 

                     Old Centroid != New Centroid

Iteration continues.
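The convergence check can be as simple as comparing the two centroid lists (a sketch; exact equality works for this small example, though in practice a small tolerance is often used):

    old_centroids = [(1, 1), (3.66, 2.66)]
    new_centroids = [(1.5, 1), (4.5, 3.5)]

    converged = old_centroids == new_centroids
    print(converged)   # False, so the iterations continue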

Iteration 3: 

For each point we need to calculate the distance to each centroid and find the minimum distance.


Assign each point to the cluster with the minimum distance.

                     C1 = (1,1), (2,1)
                     C2 = (4,3), (5,4)

New Centroids : 
                 C1 = (1.5,1)
                 C2 = (4.5,3.5)

Check whether the algorithm has converged.

                     Old Centroid :   (1.5,1) and (4.5,3.5)
                     New Centroid:  (1.5,1) and (4.5,3.5) 

                     Old Centroid = New Centroid

CONVERGES.
Therefore the algorithm stops.
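Putting the steps together, here is a minimal, self-contained k-means sketch in plain Python (variable names are illustrative; it assumes no cluster ever becomes empty, and it reproduces the walkthrough above when started from the same initial centroids):

    import math

    def kmeans(points, centroids):
        while True:
            # assignment step: put each point in the cluster of its nearest centroid
            clusters = [[] for _ in centroids]
            for p in points:
                d = [math.sqrt((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) for c in centroids]
                clusters[d.index(min(d))].append(p)

            # update step: each centroid becomes the mean of its cluster
            new_centroids = [(sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c)) for c in clusters]

            # convergence check: stop when the centroids no longer move
            if new_centroids == centroids:
                return centroids, clusters
            centroids = new_centroids

    points = [(1, 1), (2, 1), (4, 3), (5, 4)]
    centroids, clusters = kmeans(points, [(1, 1), (2, 1)])
    print(centroids)   # [(1.5, 1.0), (4.5, 3.5)]
    print(clusters)    # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]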

Medicines A and B belong to Cluster 1.
Medicines C and D belong to Cluster 2.
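For completeness, the same result can be cross-checked with scikit-learn's KMeans, assuming scikit-learn is installed (passing the same starting centroids via init so the run matches the hand computation):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
    km = KMeans(n_clusters=2, init=np.array([[1, 1], [2, 1]]), n_init=1).fit(X)

    print(km.labels_)           # [0 0 1 1] -> A, B in one cluster; C, D in the other
    print(km.cluster_centers_)  # [[1.5 1. ], [4.5 3.5]]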