Monday 3 November 2014

K-Means Clustering in Mahout


This example uses Cloudera's Mahout distribution (Hadoop 2.0.0-cdh4.5.0 with mahout-0.7).


1. Download the input data set


unmesha@client:~$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
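Before uploading, it can help to confirm that the download is complete. UCI describes this data set as 600 control charts with 60 values each, so the file should contain 600 rows of 60 whitespace-separated numbers. The following is a minimal sketch; `describe_dataset` is a hypothetical helper name chosen for illustration and is not part of Mahout or Hadoop.

```shell
# describe_dataset FILE: print the number of rows and the widest row (in fields).
# Hypothetical helper for illustration; not a Mahout or Hadoop command.
describe_dataset() {
  awk '{ if (NF > max) max = NF } END { printf "rows=%d cols=%d\n", NR, max }' "$1"
}

# Run the check only if the file from the wget step is in the current directory.
if [ -f synthetic_control.data ]; then
  describe_dataset synthetic_control.data
fi
```

If the numbers differ from 600 rows and 60 columns, the download was probably truncated and is worth repeating before the upload step.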

2. Place the data into HDFS under "testdata"
unmesha@client:~$ hadoop fs -mkdir testdata
unmesha@client:~$ echo $MAHOUT_HOME
/usr/lib/mahout/bin
unmesha@client:~$ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata

*The HDFS input directory must be named "testdata", because the example job reads its input from that path when run without arguments.



3. Run K-Means clustering

unmesha@client:~$ $MAHOUT_HOME/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

The result is stored in HDFS in the "output" directory:

unmesha@client:~$ hadoop fs -ls output
Found 14 items
-rwxr-xr-x   1 unmesha unmesha        194 2014-11-04 09:06 output/_policy
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusteredPoints
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-0
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-1
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-10-final
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-2
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-3
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-4
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-5
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-6
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-7
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-8
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-9
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/data
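In the listing above, the directory whose name ends in "-final" holds the clusters from the last iteration. The iteration number in that name can differ between runs, so it may be safer to look it up than to hard-code it. A minimal sketch, assuming the listing format shown above; `final_clusters_dir` is a hypothetical helper name, not a Hadoop command.

```shell
# final_clusters_dir: read `hadoop fs -ls` output on stdin and print every
# path whose name ends in "-final". Hypothetical helper for illustration.
final_clusters_dir() {
  awk '$NF ~ /-final$/ { print $NF }'
}

# Usage (requires a working Hadoop client):
#   hadoop fs -ls output | final_clusters_dir
```

For the listing shown here, this would print output/clusters-10-final.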

The clustering output is stored as Hadoop SequenceFiles, which are not human readable. Mahout's clusterdump utility converts them into a readable text format.


Create a local directory, copy the cluster output from HDFS into it, and then run clusterdump so that its report is written to the same local directory.


unmesha@client:~$ mkdir kmeansoutput
unmesha@client:~$ hadoop fs -get output kmeansoutput
unmesha@client:~$ mahout clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output kmeansoutput/clusteranalyze.txt

You can now view the results in kmeansoutput/clusteranalyze.txt.
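The dump can be long, because it lists every point under its cluster. A quick summary of cluster sizes is often easier to read first. The sketch below assumes that each cluster header line in the dump starts with an identifier such as VL-579 or CL-12 followed by `{n=<point count> ...}`; this format is an assumption about mahout-0.7 text output and may differ in other versions. `summarize_clusters` is a hypothetical helper name.

```shell
# summarize_clusters FILE: print "<cluster id> <point count>" for each cluster
# header line in a clusterdump text file.
# Assumption: header lines look like "VL-579{n=35 c=[...] r=[...]}".
summarize_clusters() {
  awk '
    /^(VL|CL)-[0-9]+/ {
      match($0, /^(VL|CL)-[0-9]+/)
      id = substr($0, RSTART, RLENGTH)
      n = "?"                       # shown when no n= field is found
      if (match($0, /n=[0-9]+/)) n = substr($0, RSTART + 2, RLENGTH - 2)
      print id, n
    }
  ' "$1"
}

# Run only if the dump from the clusterdump step exists locally.
if [ -f kmeansoutput/clusteranalyze.txt ]; then
  summarize_clusters kmeansoutput/clusteranalyze.txt
fi
```

If nothing is printed, the header format in your Mahout version likely differs from the one assumed here, and the patterns need adjusting.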


