Sunday, 7 December 2014

Joining Two Files Using MultipleInputs In Hadoop MapReduce - Reduce-Side Join

There are cases where we need to take two files as input and join them on a common field such as an ID.
Two large datasets can also be joined in MapReduce: a join performed in the map phase is referred to as a map-side join, while a join performed at the reduce side is called a reduce-side join.
Here we use MultipleInputs in Hadoop to feed both files into a single job, with a separate mapper for each input, and join the records in the reducer.

Say I have two files: one with EmployeeID, Name, Designation and another with EmployeeID, Salary, Department.

File1.txt
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin

AND

File2.txt
1 50000,A
2 50000,B
3 60000,A
4 50000,C

We will join these files into one based on EmployeeID.
The result we aim at is:

1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C

In both File1.txt and File2.txt we can see that the records need to be joined on the ID; the EmployeeID is the common field.
We will write two mapper classes, one for each file.

Processing File1.txt
private Text keyEmit = new Text();
private Text valEmit = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
 String line = value.toString();
 // File1.txt is tab separated: EmployeeID<TAB>Name,Designation
 String[] words = line.split("\t");
 keyEmit.set(words[0]);  // EmployeeID becomes the join key
 valEmit.set(words[1]);  // Name,Designation becomes the value
 context.write(keyEmit, valEmit);
}

The above mapper processes File1.txt. The call line.split("\t") splits each line on the tab character, so words[0] is the EmployeeID, which we emit as the key, and words[1] is the rest of the record, which we emit as the value.

eg: 1 Anne,Admin
words[0] = 1
words[1] = Anne,Admin

Alternatively, you can use KeyValueTextInputFormat.class as the InputFormat. It splits each line on the first tab, so the key already comes in as the EmployeeID and the value as the rest of the line; you don't need to split it yourself.
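
For example, a minimal sketch of a File1.txt mapper using KeyValueTextInputFormat could look like this (the class name File1KVMapper is only illustrative, not from the original code):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class File1KVMapper extends Mapper<Text, Text, Text, Text> {
 @Override
 protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // KeyValueTextInputFormat has already split the line on the first tab:
  // key = EmployeeID, value = Name,Designation
  context.write(key, value);
 }
}

It would then be registered in the driver with KeyValueTextInputFormat.class instead of TextInputFormat.class:

MultipleInputs.addInputPath(job, p1, KeyValueTextInputFormat.class, File1KVMapper.class);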

Processing File2.txt
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
 String line = value.toString();
 // File2.txt is split on a space here: EmployeeID Salary,Department
 // (use "\t" instead if your file is tab separated)
 String[] words = line.split(" ");
 keyEmit.set(words[0]);  // EmployeeID becomes the join key
 valEmit.set(words[1]);  // Salary,Department becomes the value
 context.write(keyEmit, valEmit);
}

The above mapper processes File2.txt.

eg: 1 50000,A
words[0] = 1
words[1] = 50000,A

If both files use the same delimiter and the EmployeeID comes first, you can reuse the same mapper for both inputs, as in the sketch below.
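
In that case the driver would simply register one mapper class for both paths; for example (CommonMap is a hypothetical class name):

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, CommonMap.class);
MultipleInputs.addInputPath(job, p2, TextInputFormat.class, CommonMap.class);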

Let's write a common Reducer to join the records on the key.

private Text valEmit = new Text();

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
 String merge = "";
 int i = 0;
 for (Text value : values)
 {
  if (i == 0) {
   // cache the first value for this EmployeeID
   merge = value.toString() + ",";
  } else {
   // append the value coming from the other file
   merge += value.toString();
  }
  i++;
 }
 valEmit.set(merge);
 context.write(key, valEmit);
}

For each EmployeeID the reducer receives one value from each mapper: the first value is cached in the string merge and the second is appended to it, and we emit the EmployeeID as the key and merge as the value. Note that the order in which the two values arrive is not guaranteed, which is why the output columns may not always appear in the same sequence (more on this below).

Now we need to furnish our Driver class to take two inputs and, using MultipleInputs, register a mapper for each of them.


public int run(String[] args) throws Exception {
 Configuration c = new Configuration();
 String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
 Path p1 = new Path(files[0]);   // File1.txt
 Path p2 = new Path(files[1]);   // File2.txt
 Path p3 = new Path(files[2]);   // output directory
 FileSystem fs = FileSystem.get(c);
 if (fs.exists(p3)) {
  fs.delete(p3, true);           // remove the output directory if it already exists
 }
 Job job = new Job(c, "Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 .
 .
}

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
p1 and p2 are the Path variables holding the two input files, and each call registers the mapper class that should process that input.
You can find the code on GitHub.
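
The part elided above is the usual driver boilerplate; a sketch of what it typically contains (the exact output types and submission in the original code may differ):

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);
 FileOutputFormat.setOutputPath(job, p3);
 return job.waitForCompletion(true) ? 0 : 1;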

There is one more case, where we want the output columns to appear in a fixed order.
Say we need the output exactly as below:
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
In order to achieve this we can make use of the TextPair Writable concept in Hadoop.
You can find the working code on GitHub. Thanks to my blog reader Ravi Kumar, who sorted out the sequence ordering.
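
If you prefer not to implement a custom TextPair Writable, a simpler workaround (similar to the one a reader describes in the comments below) is to tag each value with its source file in the mappers and reassemble the two parts in a fixed order in the reducer. A rough sketch, reusing the words array and valEmit field from the code above:

// In the File1.txt mapper:
valEmit.set("1" + words[1]);   // tag values coming from File1 with "1"
// In the File2.txt mapper:
valEmit.set("2" + words[1]);   // tag values coming from File2 with "2"

// In the reducer, strip the tags and always put File1's part first:
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
 String first = "";
 String second = "";
 for (Text value : values) {
  String v = value.toString();
  if (v.startsWith("1")) {
   first = v.substring(1);    // Name,Designation from File1
  } else {
   second = v.substring(1);   // Salary,Department from File2
  }
 }
 valEmit.set(first + "," + second);
 context.write(key, valEmit);
}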

Comments:

  1. Hi, thank you for the valuable inputs. I tried running the driver and I am getting the output below:

    1 50000,A,Anne,Admin
    2 Gokul,Admin,50000,B
    3 60000,A,Janet,Sales
    4 Hari,Admin,50000,C

    but I need the output in a fixed column order, i.e.

    1 Anne,Admin,50000,A
    2 Gokul,Admin,50000,B
    3 Janet,Sales,60000,A
    4 Hari,Admin,50000,C

    Could you please help me with this?

    Replies
    1. Hi Shaila,
      Thanks for reading and trying it out.
      In order to achieve that order you can make use of the TextPair concept in Hadoop.

      The blog post has been updated with working code.
    2. Could you explain the TextPair approach, please?
    3. I solved this problem with a simple workaround: just add the source-file number in front of the output value in each Mapper class, like this.
      In your 1st mapper class:
       outValue.set("1" + othCol.toString());
       context.write(primaryKey, outValue);

      In your 2nd mapper class:
       outValue.set("2" + othCol.toString());
       context.write(primaryKey, outValue);

      In your reducer class:
       StringBuilder stb1 = new StringBuilder();
       StringBuilder stb2 = new StringBuilder();
       for (Text t : valueIt) {
        if (t.toString().substring(0, 1).equals("1")) {
         stb1.append(t.toString().substring(1, t.getLength()));
        }
        else if (t.toString().substring(0, 1).equals("2")) {
         stb2.append(t.toString().substring(1, t.getLength()));
        }
       }
  2. Hi, I have two large datasets where every line looks like (0,tcp,http,SF,335,10440,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,7,0.00,0.00,0.00,0.00,1.00,0.00,0.29,27,255,1.00,0.00,0.04,0.08,0.00,0.00,0.00,0.00,normal). Should I create 42 keyEmit values to combine those two datasets?

    Replies
    1. Yes, but you will experience a performance delay. You can also try joining the files using Hive.
    2. Thanks for the reply. If I keep your project and only change the number of keyEmit values in MultipleMap1 and MultipleMap2 (in my case 42), will it join the two datasets?
  3. I get an array out of bounds error when I try to run your example with a large dataset. Any solution, please?

    Replies
    1. Are you using an array to store values in the Reducer?
    2. No, I'm not using an array. I just used your code and tried to run it with my dataset, and I got the array out of bounds error.
  4. May I ask you about the next task?
    File1.txt
    --------------
    Admin,Anne
    Admin,Gokul
    Sales,Janet
    --------------
    File2.txt
    --------------
    Anne,100
    Gokul,200
    Janet,300
    --------------
    In a third file we should aggregate, for each position, the total salary. In our case:
    --------------
    Admin,300
    Sales,300
    --------------
    What is the workflow in this case? Do we need to use MapReduce twice?
  5. Hey, put a tab character between the ID and the rest of the record in your input files instead of directly copy-pasting from the post, like below:
    1 Anne,Admin
    2 Gokul,Admin
    3 Janet,Sales
    4 Hari,Admin
  6. Hi Unmesha,
    I have a similar problem to solve, but slightly more complicated.
    I have several file1s and file2s coming from different servers in production.
    Example: file1.1.txt, file1.2.txt, file1.3.txt etc. and file2.1.txt, file2.2.txt, file2.3.txt.
    The data in the file1s and file2s is structured, with a few columns common to both.
    My questions are:
    1) How would you define your Driver class? (Can you use a regular expression or something similar?)
    2) What happens if, on a given Hadoop node, you have data as follows:
    file1.txt
    1 sri,kon
    2 sai,kon

    file2.txt
    2 'kg'
    3 'pg'

    How will the MapReduce program work? The file2 data for a corresponding key value in file1 may be on a different node, and vice versa.

    Thanks,
    Sri