Thursday 10 December 2015

A faster way to count the number of lines in a file/dir using the MapReduce framework


On this site you can see one way to count the number of lines in a file.
That version emits a count of one for every record in every map task. So if one map task holds 10,000 lines, 10,000 values are passed to the reducer. With more than one mapper, each of them causes that many intermediate writes and reads.
Let's reduce the intermediate writes.

Below is an optimized way to count the number of lines in a file/dir.
The changes are made in:
1. Mapper
Instead of emitting 'one' for each record, we increment a line count in map() and emit it once in the cleanup() phase.
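import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LineCntMapper extends
  Mapper<LongWritable, Text, Text, IntWritable> {

 // Every map task uses the same key, so all partial counts reach one reduce call.
 private final Text keyEmit = new Text("Total Lines");
 private final IntWritable valEmit = new IntWritable();
 // Number of lines this map task has seen so far.
 private int partialSum = 0;

 @Override
 public void map(LongWritable key, Text value, Context context) {
  // Only count the line; nothing is written per record.
  partialSum++;
 }

 @Override
 public void cleanup(Context context) throws IOException, InterruptedException {
  // Runs once after the last record: emit this task's total.
  // Exceptions propagate so the framework can fail and retry the task.
  valEmit.set(partialSum);
  context.write(keyEmit, valEmit);
 }
}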
So if we have 5 map tasks, we emit only 5 intermediate key-value pairs.
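The saving can be illustrated without a cluster. The following is only a stand-alone model in plain Java, not Hadoop code: the class name IntermediateRecordCount, its methods, and the split sizes are made up for illustration. It compares how many intermediate values each style of mapper would hand to the reducer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the map phase; no Hadoop classes are used.
final class IntermediateRecordCount {

    // Per-record style: one value of 1 is emitted for every input line.
    static List<Integer> emitPerRecord(int[] linesPerSplit) {
        List<Integer> emitted = new ArrayList<>();
        for (int lines : linesPerSplit) {
            for (int i = 0; i < lines; i++) {
                emitted.add(1);
            }
        }
        return emitted;
    }

    // cleanup() style: each split counts locally and emits its total once.
    static List<Integer> emitPerTask(int[] linesPerSplit) {
        List<Integer> emitted = new ArrayList<>();
        for (int lines : linesPerSplit) {
            int partialSum = 0;
            for (int i = 0; i < lines; i++) {
                partialSum++;
            }
            emitted.add(partialSum);
        }
        return emitted;
    }

    // What the reducer does in both cases: add up whatever arrived.
    static long total(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumed input: 5 map tasks with 10,000 lines each.
        int[] linesPerSplit = {10_000, 10_000, 10_000, 10_000, 10_000};
        List<Integer> perRecord = emitPerRecord(linesPerSplit);
        List<Integer> perTask = emitPerTask(linesPerSplit);
        System.out.println("per-record values: " + perRecord.size() + ", total " + total(perRecord));
        System.out.println("per-task values: " + perTask.size() + ", total " + total(perTask));
    }
}
```

Both styles give the same total of 50,000 lines, but the per-record style produces 50,000 intermediate values while the per-task style produces 5.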

2. Driver
In the driver we also set a combiner:
job.setMapperClass(LineCntMapper.class);
job.setCombinerClass(LineCntReducer.class);
job.setReducerClass(LineCntReducer.class);
The combiner does nothing more than the reducer does, so we can use the reducer class itself as the combiner.
The reducer doesn't need any change.
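Reusing a reducer as a combiner is safe only when its operation gives the same answer however the values are grouped, and when its input and output types match. Summing meets both conditions. Here is a small plain-Java check of that property, with no Hadoop classes involved: the class name SumAsCombiner and the partial counts are hypothetical, and reduce() only mirrors what a summing reducer is assumed to do for one key.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: shows that pre-summing groups of values does not change the final sum.
final class SumAsCombiner {

    // Same logic a summing reducer applies to the values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumed partial counts emitted by five map tasks.
        List<Integer> mapOutputs = Arrays.asList(120, 80, 200, 50, 50);

        // No combiner: the reducer sees all five values.
        int direct = reduce(mapOutputs);

        // With a combiner: values are pre-summed in groups, then reduced again.
        int groupA = reduce(mapOutputs.subList(0, 2));
        int groupB = reduce(mapOutputs.subList(2, 5));
        int viaCombiner = reduce(Arrays.asList(groupA, groupB));

        System.out.println(direct + " " + viaCombiner);
    }
}
```

Both paths print 500. Hadoop may run a combiner zero, one, or several times, so the job has to be correct either way; a sum is, whereas something like an average would not be.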

If you run this code, you will get the result faster than with the code mentioned previously on this site.

Working code is here

Happy Hadooping........
