Sunday, 7 December 2014

Joining Two Files Using MultipleInputs In Hadoop MapReduce - Map-Side Join

There are cases where we need to take two files as input and join them based on an id or some other common field.
Two large datasets can be joined in MapReduce as well. A join performed in the map phase is referred to as a map-side join, while a join performed at the reduce side is called a reduce-side join.
Reading two different inputs, each with its own mapper, can be achieved using the MultipleInputs class in Hadoop.

Say I have two files: one with EmployeeID, Name, Designation and another with EmployeeID, Salary, Department.

File1.txt
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin

AND

File2.txt
1 50000,A
2 50000,B
3 60000,A
4 50000,C

We will join these two files into one based on EmployeeID.
The result we aim at is:

1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C

In both files, File1.txt and File2.txt, the records need to be joined on the common EmployeeID.
We will write two mapper classes, one for each file, inside a single MapReduce job.

Processing File1.txt
public class MultipleMap1 extends Mapper<LongWritable, Text, Text, Text>
{
 private final Text keyEmit = new Text();
 private final Text valEmit = new Text();

 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
 {
  String line = value.toString();
  String[] words = line.split("\t"); // File1.txt is tab delimited: EmployeeID<TAB>Name,Designation
  keyEmit.set(words[0]);             // EmployeeID becomes the join key
  valEmit.set(words[1]);             // Name,Designation becomes the value
  context.write(keyEmit, valEmit);
 }
}

The above mapper processes File1.txt.
String[] words=line.split("\t");
splits each line on the tab character, so words[0] holds the EmployeeID, which we emit as the key, and words[1] holds the rest of the record, which we emit as the value.

eg: 1 Anne,Admin
words[0] = 1
words[1] = Anne,Admin

Alternatively, you can use KeyValueTextInputFormat.class as the InputFormat. It splits each line at the first tab, handing you the EmployeeID as the key and the rest of the line as the value,
so you do not need to split the line yourself.
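
A minimal sketch of such a mapper is below (the class name KeyValueMap is just illustrative; note that the input key type becomes Text instead of LongWritable because the framework has already split the line):

public class KeyValueMap extends Mapper<Text, Text, Text, Text>
{
 @Override
 public void map(Text key, Text value, Context context) throws IOException, InterruptedException
 {
  // key already holds the EmployeeID and value the rest of the line (e.g. Anne,Admin)
  context.write(key, value);
 }
}

In the driver you would register it with MultipleInputs.addInputPath(job, p1, KeyValueTextInputFormat.class, KeyValueMap.class). The key/value separator defaults to a tab; in Hadoop 2.x it can be changed through the mapreduce.input.keyvaluelinerecordreader.key.value.separator property if needed.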

Processing File2.txt
public class MultipleMap2 extends Mapper<LongWritable, Text, Text, Text>
{
 private final Text keyEmit = new Text();
 private final Text valEmit = new Text();

 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
 {
  String line = value.toString();
  String[] words = line.split(" "); // File2.txt here is space delimited: EmployeeID<SPACE>Salary,Department
  keyEmit.set(words[0]);            // EmployeeID becomes the join key
  valEmit.set(words[1]);            // Salary,Department becomes the value
  context.write(keyEmit, valEmit);
 }
}

The above mapper processes File2.txt.

eg: 1 50000,A
words[0] = 1
words[1] = 50000,A

If both files use the same delimiter and the ID comes first, you can reuse the same mapper class for both inputs, as sketched below.
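
Here is a minimal sketch of such a shared mapper. The join.delimiter property name is made up for this example (it is not a built-in Hadoop setting); you would set it on the Configuration in the driver before submitting the job:

public class SharedJoinMap extends Mapper<LongWritable, Text, Text, Text>
{
 private final Text keyEmit = new Text();
 private final Text valEmit = new Text();
 private String delimiter;

 @Override
 protected void setup(Context context)
 {
  // read the delimiter chosen by the driver, defaulting to a tab
  delimiter = context.getConfiguration().get("join.delimiter", "\t");
 }

 @Override
 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
 {
  // the delimiter is treated as a regular expression; split once into id and the rest of the record
  String[] words = value.toString().split(delimiter, 2);
  keyEmit.set(words[0]);
  valEmit.set(words[1]);
  context.write(keyEmit, valEmit);
 }
}

With that in place, both MultipleInputs.addInputPath calls in the driver can point at the same mapper class.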

Let's write a common Reducer to join the data using the key.
public class MultipleReducer extends Reducer<Text, Text, Text, Text>
{
 private final Text valEmit = new Text();

 @Override
 public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
 {
  String merge = "";
  int i = 0;
  for (Text value : values)
  {
   if (i == 0) {
    merge = value.toString() + ","; // first value seen for this EmployeeID
   } else {
    merge += value.toString();      // value coming from the other file
   }
   i++;
  }
  valEmit.set(merge);
  context.write(key, valEmit);
 }
}

Here we buffer the first value that arrives for a key in the string "merge" (with a trailing comma) and append the value coming from the other file to it,
then emit the EmployeeID as the key and the merged record as the value. For example, for key 1 the reducer receives the two values "Anne,Admin" and "50000,A" (in no guaranteed order) and emits 1 Anne,Admin,50000,A.

Now we need to furnish our Driver class to take two inputs and wire each input to its own mapper using MultipleInputs.


public int run(String[] args) throws Exception {
 Configuration c = new Configuration();
 String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
 Path p1 = new Path(files[0]); // File1.txt
 Path p2 = new Path(files[1]); // File2.txt
 Path p3 = new Path(files[2]); // output directory
 FileSystem fs = FileSystem.get(c);
 if (fs.exists(p3)) {
  fs.delete(p3, true); // remove any previous output
 }
 Job job = Job.getInstance(c, "Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);
 FileOutputFormat.setOutputPath(job, p3);
 return job.waitForCompletion(true) ? 0 : 1;
}

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
p1 and p2 are the Path variables holding the two input files; each input path is bound to its own InputFormat and mapper class.
You can find the code on GitHub.
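
Assuming the classes above are packaged into a jar and MultipleFiles has the usual main() that delegates to ToolRunner (the jar name and HDFS paths below are only placeholders), the job can be launched and its result inspected like this:

hadoop jar multiplefiles.jar MultipleFiles /user/hadoop/File1.txt /user/hadoop/File2.txt /user/hadoop/joinoutput
hadoop fs -cat /user/hadoop/joinoutput/part-r-00000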

There is one more case: making the joined fields come out in a fixed order.
Hadoop does not guarantee the order in which a key's values reach the reducer, so with the reducer above some rows may come out as 50000,A,Anne,Admin instead of Anne,Admin,50000,A. Say we need the output exactly as below:
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
In order to achieve this we can make use of the TextPair Writable concept in Hadoop.
You can find the working code on GitHub. Thanks to one of my blog readers, Ravi Kumar, who sorted out the sequence order.
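
A lighter-weight alternative to the TextPair approach, sketched below, is to tag each value with the file it came from in the mappers and assemble the pieces in a fixed order in the reducer (one of the comments further down describes the same trick). The tag characters "1" and "2" are arbitrary markers chosen for this example, and the reduce method shown would replace the one in MultipleReducer above:

// In MultipleMap1, tag values from File1.txt:  valEmit.set("1" + words[1]);
// In MultipleMap2, tag values from File2.txt:  valEmit.set("2" + words[1]);

// In MultipleReducer, strip the tags and always put File1's fields first:
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
 String nameAndDesignation = "";
 String salaryAndDepartment = "";
 for (Text value : values)
 {
  String v = value.toString();
  if (v.startsWith("1")) {
   nameAndDesignation = v.substring(1);  // Name,Designation from File1.txt
  } else if (v.startsWith("2")) {
   salaryAndDepartment = v.substring(1); // Salary,Department from File2.txt
  }
 }
 valEmit.set(nameAndDesignation + "," + salaryAndDepartment);
 context.write(key, valEmit);
}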

78 comments:

  3. Hi, thank you for the valuable inputs. I tried running the driver and I am getting the output below:

    1 50000,A,Anne,Admin
    2 Gokul,Admin,50000,B
    3 60000,A,Janet,Sales
    4 Hari,Admin,50000,C

    but I need the output with the fields in a fixed order, i.e.

    1 Anne,Admin,50000,A
    2 Gokul,Admin,50000,B
    3 Janet,Sales,60000,A
    4 Hari,Admin,50000,C

    Could you please help me with this?

    1. Hi Shaila,
      Thanks for reading and trying it out.
      In order to achieve that order you can make use of the TextPair concept in Hadoop.

      The blog post is updated with working code.

    2. Could you explain the TextPair approach in more detail, please?

    4. I solved this problem with a simple workaround: just add the table number in front of the output value in each Mapper class, like

      in your 1st mapper class:
      outValue.set("1" + othCol.toString());
      context.write(primaryKey, outValue);

      in your 2nd mapper class:
      outValue.set("2" + othCol.toString());
      context.write(primaryKey, outValue);

      in your reducer class:
      StringBuilder stb1 = new StringBuilder();
      StringBuilder stb2 = new StringBuilder();
      for (Text t : valueIt) {
       if (t.toString().substring(0, 1).equals("1")) {
        stb1.append(t.toString().substring(1, t.getLength()));
       } else if (t.toString().substring(0, 1).equals("2")) {
        stb2.append(t.toString().substring(1, t.getLength()));
       }
      }

  5. Hi brother, I have 2 large datasets where every line looks like (0,tcp,http,SF,335,10440,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,7,0.00,0.00,0.00,0.00,1.00,0.00,0.29,27,255,1.00,0.00,0.04,0.08,0.00,0.00,0.00,0.00,normal). Should I create 42 keyEmit fields to combine those 2 datasets?

    1. Yes. You will experience a performance delay. You can also try joining the files using Hive.

    2. Thanks for the reply. If I keep your project and just change the number of emitted fields in MultipleMap1 and MultipleMap2 (in my case I would create 42), will it join the 2 datasets?

  6. I get an ArrayIndexOutOfBoundsException when I try to run your example with a large dataset. Any solution, please?

    1. Are you using an array to store values in the Reducer?

    2. No, I'm not using an array. I just used your code, tried to run it with my dataset, and got this ArrayIndexOutOfBoundsException.

  7. May I ask you about a follow-up task. File1.txt
    --------------
    Admin,Anne
    Admin,Gokul
    Sales,Janet
    --------------
    File2.txt
    --------------
    Anne,100
    Gokul,200
    Janet,300
    --------------
    In a third file we should aggregate, for each position, the total salary. In our case:
    --------------
    Admin,300
    Sales,300
    --------------
    What is the workflow in this case? Do we need to use MapReduce twice?

  8. Hey big boss... put a tab space in your input files instead of directly copy-pasting the file, like below:
    1 Anne,Admin
    2 Gokul,Admin
    3 Janet,Sales
    4 Hari,Admin

  11. Hi Unmesha,
    I have a similar problem to solve, but slightly more complicated.
    I have several file1s and file2s coming from different servers in production.
    Example: file1.1.txt, file1.2.txt, file1.3.txt etc.
    file2.1.txt, file2.2.txt, file2.3.txt
    I have structured data in file1 and file2, with a few common columns between the file1s and file2s.
    My questions are:
    1) How will you define your Driver class? (Can you use a regular expression or something?)
    2) What happens if, on a given Hadoop instance/node, you have data as follows:
    file1.txt
    1 sri,kon
    2 sai,kon

    file2.txt
    2 'kg'
    3 'pg'

    How will the MapReduce program work? The file2 data for a corresponding key in file1 may be on a different node, and vice versa.

    Thanks
    Sri
