Thursday 2 January 2014

K-means Clustering Algorithm


  How To Work Out K-means Clustering Algorithm


k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.Distance measure here used is Euclidean, there are many other Distance measures used for k-means clustering.

Lets workout k-means using a simple example.
Problem 1:
Lets take a medicine example.


Here we have  4 medicine as our training data. each medicine has 2 coordinates, which represent the coordinates of the object.These 4 medicines belong to two cluster.We need to identify which medicine belongs to cluster 1 and cluster 2.

k-means clustering starts with a k value where k is user defined ie how many clusters . here we need to classify the medicines into 2 clusters so k = 2.

Next we need to identify initial centroids. lets selects the first 2 points as our initial centroids.

 k = 2    Initial Centroids = (1,1) and (2,1)


Iteration 1:

For each feature we need to calculate the distance to the cluster and find the minimum distance.

For Feature  (1,1)
               C1 = (1,1)                                                                      C2 = (2,1)
                    = sqrt((1-1)^2  (1-1)^2)                                                     = sqrt((2-1)^2  (1-1)^2)
                    = 0                                                                                = 1
               
For Feature (2,1)
               C1 = (1,1)                                                                      C2 = (2,1)
                    = sqrt((1-2)^2  (1-1)^2)                                                     = sqrt((2-2)^2  (1-1)^2)
                    = 1                                                                                = 0

For Feature  (4,3)
               C1 = (1,1)                                                                      C2 = (2,1)
                    = sqrt((1-4)^2  (1-3)^2)                                                     = sqrt((2-4)^2  (1-3)^2)
                    = 3.6                                                                             = 2.82
               
For Feature (5,4)
               C1 = (1,1)                                                                      C2 = (2,1)
                    = sqrt((1-5)^2  (1-4)^2)                                                     = sqrt((2-5)^2  (1-4)^2)
                    = 5                                                                                = 4.24





Next we need to identify the cluster having minimum distance for each feature.


                  
                      C1 = (1,1)
                   C2 = (2,1),(4,3),(5,4)

Calculate the New Centroid till it converges.
New Centroids : 
Continue the iteration.for all features with new centroids.

                  C1 remains the same.
                  For C2 the new centroid will be the average of the above listed points.
                  ie   (2+4+5/3 , 1+3+4/3) 

                 C1 = (1,1)
                 C2 = (3.66,2.66)

Iteration 2: 

For each feature we need to calculate the distance to the cluster and find the minimum distance



Identify the cluster having minimum distance for each feature.


                       C1 = (1,1) , (2,1)
                   C2 = (4,3),(5,4)

New Centroids : 
                 C1 = (1.5,1)
                 C2 = (4.5,3.5)
Check if the points converges.
How to check: Compare the old centroid with new one.If both are equal then k-means converges else continue the iteration.
                     Old Centroid :   (1,1) and (3.66,2.66)
                     New Centroid:  (1.5,1) and (4.5,3.5) 

                     Old Centroid != New Centroid

Iteration continues.

Iteration 3: 

For each feature we need to calculate the distance to the cluster and find the minimum distance.


Identify the cluster having minimum distance for each feature.


                       C1 = (1,1) , (2,1)
                   C2 = (4,3),(5,4)

New Centroids : 
                 C1 = (1.5,1)
                 C2 = (4.5,3.5)

Check if the points converges.

                     Old Centroid :   (1.5,1) and (4.5,3.5)
                     New Centroid:  (1.5,1) and (4.5,3.5) 

                     Old Centroid = New Centroid

CONVERGES.
Therefore the algorithm stops.

Medicine A and B belongs to Cluster 1
Medicine C and D belongs to Cluster 2

   



3 comments:

  1. hiii
    good evening sreevani..kindly post the code in java for k-means algorithm in java

    where can i run?? is it on mahout or normal hadoop/mapreduce

    ReplyDelete
    Replies
    1. Siva you can find many source code related to kmeans over internet. Let me know if you need any help.

      Delete
    2. it not working
      ..i searched a lot
      please send the code if u have
      sivanagaraju46@gmail.com

      Delete