This example uses Cloudera's Mahout distribution (Hadoop 2.0.0-cdh4.5.0 with mahout-0.7).
1. Download the input data set
unmesha@client:~$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
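Before uploading, it can help to confirm that the download is complete. The sketch below is a hypothetical helper (check_dataset is not part of Mahout or Hadoop) that reports the number of rows and the smallest and largest number of whitespace-separated values per row. The UCI description of this data set lists 600 series of 60 points each, so a complete file should report 600 rows with 60 columns.

```shell
# Hypothetical helper: print the row count and the min/max number of
# whitespace-separated fields per row of the file given as $1.
check_dataset() {
    awk '
        NR == 1  { min = NF; max = NF }   # initialise from the first row
        NF < min { min = NF }
        NF > max { max = NF }
        END { printf "rows=%d min_cols=%d max_cols=%d\n", NR, min, max }
    ' "$1"
}

# Usage (assumes the file is in the current directory):
#   check_dataset synthetic_control.data
```

If min_cols and max_cols differ, some rows are truncated or malformed and the file is worth downloading again.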
2. Place the data into HDFS under "testdata"
unmesha@client:~$ hadoop fs -mkdir testdata
unmesha@client:~$ echo $MAHOUT_HOME
/usr/lib/mahout/bin
unmesha@client:~$ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata
*The HDFS input directory should be named "testdata".
Run k-means clustering
unmesha@client:~$ $MAHOUT_HOME/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
The results are stored in HDFS in a directory named "output".
unmesha@client:~$ hadoop fs -ls output
Found 14 items
-rwxr-xr-x 1 unmesha unmesha 194 2014-11-04 09:06 output/_policy
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusteredPoints
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-0
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-1
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-10-final
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-2
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-3
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-4
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-5
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-6
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-7
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-8
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-9
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/data
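The number in the clusters-N-final directory name depends on how many iterations the run took, so it may not be 10 on every run. As a minimal sketch, the hypothetical helper below (final_cluster_dir is an illustrative name, not a Hadoop or Mahout command) reads a listing like the one above on standard input and prints the entry whose name ends in "-final".

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print
# the path (last column) of any entry whose name ends in "-final".
final_cluster_dir() {
    awk '$NF ~ /-final$/ { print $NF }'
}

# Usage (assumes the job wrote to the HDFS directory "output"):
#   hadoop fs -ls output | final_cluster_dir
```

The printed path can then be passed to clusterdump as the --input value instead of typing the directory name by hand.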
The clustering output is in SequenceFile format, which is not human-readable. Mahout provides a utility called clusterdump that converts it into a human-readable format.
Copy the cluster output from HDFS onto your local file system
unmesha@client:~$ hadoop fs -mkdir kmeansoutput
unmesha@client:~$ hadoop fs -get output kmeansoutput
unmesha@client:~$ mahout clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output kmeansoutput/clusteranalyze.txt
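Once clusteranalyze.txt exists, a quick summary of cluster sizes is often more useful than reading the whole dump. The sketch below assumes that each cluster summary line in the text dump starts with an identifier such as VL-591 or CL-7 followed by {n=<point count> ...}; that layout is an assumption about the clusterdump text output of this Mahout version, so check it against your own file first. cluster_sizes is a hypothetical helper name.

```shell
# Hypothetical helper: print "<cluster id> <point count>" for every
# cluster summary line in the dump file given as $1.
# Assumed line layout: VL-591{n=50 c=[...] r=[...]}
cluster_sizes() {
    sed -n 's/^\([VC]L-[0-9]*\){n=\([0-9]*\) .*/\1 \2/p' "$1"
}

# Usage (path taken from the clusterdump command above):
#   cluster_sizes kmeansoutput/clusteranalyze.txt
```

Lines that do not match the assumed layout, such as the per-point lines, are skipped, so empty output means the dump uses a different format.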