Monday, 4 May 2015

Hadoop Word Count Using C Language - Hadoop Streaming


Prerequisites
1. Hadoop (Example based on cloudera distribution cdh5)
2. gcc compiler

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. 

We need 2 programs mapper.c and reducer.c. You can find the code in GitHub.

1. Compile mapper.c and reducer.c
hadoop@namenode2:~/hadoopstreaming$ gcc -o mapper.out  mapper.c
hadoop@namenode2:~/hadoopstreaming$ gcc -o reducer.out  reducer.c
hadoop@namenode2:~/hadoopstreaming$ ls
mapper.c  mapper.out  reducer.c  reducer.out
Here you can see 2 executables mapper.out and reducer.out.

2. Place your wordcount input file in HDFS
hadoop@namenode2:~$ hadoop fs -put /home/hadoop/wc /
hadoop@namenode2:~$ hadoop fs -ls /
drwxr-xr-x   - hadoop hadoop         0 2015-05-04 15:50 /wc

3. Now we will run our C program in HDFS with the help of  Hadoop Streaming jar.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.3.1.jar 
-files hadoopstreaming/mapper.out -mapper hadoopstreaming/mapper.out 
-files hadoopstreaming/reducer.out -reducer hadoopstreaming/reducer.out 
-input /wc -output /wordcount-out

For Apache Hadoop
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar 
-files hadoopstreaming/mapper.out -mapper hadoopstreaming/mapper.out 
-files hadoopstreaming/reducer.out -reducer hadoopstreaming/reducer.out 
-input /wc -output /wordcount-out

Run the Job
hadoop@namenode2:~$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.3.1.jar -files hadoopstreamingtrail/mapper.out -mapper hadoopstreamingtrail/mapper.out -files hadoopstreaming/reducer.out -reducer hadoopstreaming/reducer.out -input /wc -output /wordcount-out
packageJobJar: [hadoopstreaming/mapper.out, hadoopstreaming/reducer.out] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.3.1.jar] /tmp/streamjob7616955264406618684.jar tmpDir=null
15/05/04 15:50:28 INFO mapred.FileInputFormat: Total input paths to process : 2
15/05/04 15:50:28 INFO mapreduce.JobSubmitter: number of splits:3
15/05/04 15:50:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1426753134244_0119
15/05/04 15:50:29 INFO impl.YarnClientImpl: Submitted application application_1426753134244_0119
15/05/04 15:50:29 INFO mapreduce.Job: Running job: job_1426753134244_0119
15/05/04 15:50:37 INFO mapreduce.Job:  map 0% reduce 0%
15/05/04 15:50:46 INFO mapreduce.Job:  map 67% reduce 0%
15/05/04 15:50:47 INFO mapreduce.Job:  map 100% reduce 0%
15/05/04 15:50:53 INFO mapreduce.Job:  map 100% reduce 100%
15/05/04 15:50:55 INFO mapreduce.Job: Job job_1426753134244_0119 completed successfully

4. Lets see the results
hadoop@namenode2:~$ hadoop fs -ls  /wordcount-out
Found 2 items
-rw-r--r--   3 hadoop hadoop          0 2015-05-04 15:50 /wordcount-out/_SUCCESS
-rw-r--r--   3 hadoop hadoop      11685 2015-05-04 15:50 /wordcount-out/part-00000


Happy Hadooping

15 comments:

  1. For latest and updated Cloudera certification dumps in PDF format contact us at completeexamcollection@gmail.com.
    Refer our blog for more details http://completeexamcollection.blogspot.in/2015/04/cloudera-hadoop-certification-dumps.html

    ReplyDelete
  2. Thanks for sharing this article.. You may also refer http://www.s4techno.com/blog/2016/07/11/hadoop-administrator-interview-questions/..

    ReplyDelete
  3. Thanks for providing this informative information…..
    You may also refer-
    http://www.s4techno.com/blog/category/hadoop/

    ReplyDelete
  4. Thanks for sharing Valuable information. Greatful Info about hadoop. Really helpful. Keep sharing........... If it possible share some more tutorials.........

    ReplyDelete
  5. Excellent .. Amazing .. I will bookmark your blog and take the feeds additionally? I’m satisfied to find so many helpful information here within the put up, we want work out extra strategies in this regard, thanks for sharing..

    Hadoop Training in Chennai

    Java Training in Chennai

    ReplyDelete
  6. Thanks for sharing this blog. This very important and informative blog
    Learned a lot of new things from your post! Good creation and HATS OFF to the creativity of your mind.
    Very interesting and useful blog!

    best Hadoop training in gurgaon


    ReplyDelete
  7. awesome post presented by you..your writing style is fabulous and keep update with your blogs
    Big data hadoop online Course

    ReplyDelete
  8. Hey, Wow all the posts are very informative for the people who visit this site. Good work! We also have a Website. Please feel free to visit our site. Thank you for sharing.Well written article Thank You Sharing with Us pmp training in chennai | pmp training institute in chennai | pmp training centers in chennai| pmp training in velachery | pmp training near me | pmp training courses online

    ReplyDelete
  9. This blog is full of Innovative ideas.surely i will look into this insight.please add more information's like this soon.
    AWS training courses near me
    AWS training in Chennai
    AWS Training Institutes in Vadapalani
    AWS training courses near me

    ReplyDelete
  10. such an effective blog you are posted.this blog is full of innovative ideas and i really like your information's. i expect more ideas from your site please add more details in future.
    Salesforce Training in Perambur
    Salesforce Training in Mogappair
    Salesforce Training in Ashok Nagar
    Salesforce Training in Nungambakkam

    ReplyDelete
  11. Very nice article, thanks for the information. : Best Digital Marketing Agency in Chennai offering Growth Strategies for Business. ✔️6+ Yrs. Industry Exp. ✔️Certified Professionals. ☎️ 8608954456
    best digital marketing services in chennai

    ReplyDelete
  12. So nice.. https://earningmoneyonlinefirst7.blogspot.com/

    ReplyDelete