Sunday, 7 December 2014

Joining Two Files Using MultipleInputs In Hadoop MapReduce - Reduce-Side Join

There are cases where we need to take two files as input and join them on a common field such as an ID.
Two large datasets can also be joined in MapReduce: a join performed in the map phase is referred to as a map-side join, while a join performed at the reduce side is called a reduce-side join.
Here we use MultipleInputs in Hadoop to feed both files into a single job, with a separate mapper for each input, and join the records in the reducer.

Say I have two files: one with EmployeeID, Name, Designation and another with EmployeeID, Salary, Department.

File1.txt
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin

AND

File2.txt
1 50000,A
2 50000,B
3 60000,A
4 50000,C

We will join these files into one based on EmployeeID.
The result we aim at is:

1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C

In both File1.txt and File2.txt we can see that the records need to be joined on the ID; the EmployeeID is the common field.
We will write two mapper classes, one for each file.

Processing File1.txt
private Text keyEmit = new Text();
private Text valEmit = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
 String line = value.toString();
 // File1.txt is tab separated: EmployeeID<TAB>Name,Designation
 String[] words = line.split("\t");
 keyEmit.set(words[0]);  // EmployeeID becomes the join key
 valEmit.set(words[1]);  // Name,Designation becomes the value
 context.write(keyEmit, valEmit);
}

The above mapper processes File1.txt. The call line.split("\t") splits each line on the tab character, so words[0] is the EmployeeID, which we emit as the key, and words[1] is the rest of the record, which we emit as the value.

eg: 1 Anne,Admin
words[0] = 1
words[1] = Anne,Admin

Alternatively, you can use KeyValueTextInputFormat.class as the InputFormat. It splits each line on the first tab, so the key already comes in as the EmployeeID and the value as the rest of the line; you don't need to split it yourself.
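
For example, a minimal sketch of a File1.txt mapper using KeyValueTextInputFormat could look like this (the class name File1KVMapper is only illustrative, not from the original code):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class File1KVMapper extends Mapper<Text, Text, Text, Text> {
 @Override
 protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  // KeyValueTextInputFormat has already split the line on the first tab:
  // key = EmployeeID, value = Name,Designation
  context.write(key, value);
 }
}

It would then be registered in the driver with KeyValueTextInputFormat.class instead of TextInputFormat.class:

MultipleInputs.addInputPath(job, p1, KeyValueTextInputFormat.class, File1KVMapper.class);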

Processing File2.txt
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
 String line = value.toString();
 // File2.txt is split on a space here: EmployeeID Salary,Department
 // (use "\t" instead if your file is tab separated)
 String[] words = line.split(" ");
 keyEmit.set(words[0]);  // EmployeeID becomes the join key
 valEmit.set(words[1]);  // Salary,Department becomes the value
 context.write(keyEmit, valEmit);
}

The above mapper processes File2.txt.

eg: 1 50000,A
words[0] = 1
words[1] = 50000,A

If both files use the same delimiter and the EmployeeID comes first, you can reuse the same mapper for both inputs, as in the sketch below.
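
In that case the driver would simply register one mapper class for both paths; for example (CommonMap is a hypothetical class name):

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, CommonMap.class);
MultipleInputs.addInputPath(job, p2, TextInputFormat.class, CommonMap.class);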

Let's write a common Reducer to join the records on the key.

private Text valEmit = new Text();

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
 String merge = "";
 int i = 0;
 for (Text value : values)
 {
  if (i == 0) {
   // cache the first value for this EmployeeID
   merge = value.toString() + ",";
  } else {
   // append the value coming from the other file
   merge += value.toString();
  }
  i++;
 }
 valEmit.set(merge);
 context.write(key, valEmit);
}

For each EmployeeID the reducer receives one value from each mapper: the first value is cached in the string merge and the second is appended to it, and we emit the EmployeeID as the key and merge as the value. Note that the order in which the two values arrive is not guaranteed, which is why the output columns may not always appear in the same sequence (more on this below).

Now we need to furnish our Driver class to take two inputs and, using MultipleInputs, register a mapper for each of them.


public int run(String[] args) throws Exception {
 Configuration c = new Configuration();
 String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
 Path p1 = new Path(files[0]);   // File1.txt
 Path p2 = new Path(files[1]);   // File2.txt
 Path p3 = new Path(files[2]);   // output directory
 FileSystem fs = FileSystem.get(c);
 if (fs.exists(p3)) {
  fs.delete(p3, true);           // remove the output directory if it already exists
 }
 Job job = new Job(c, "Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 .
 .
}

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
p1 and p2 are the Path variables holding the two input files, and each call registers the mapper class that should process that input.
You can find the code on GitHub.
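
The part elided above is the usual driver boilerplate; a sketch of what it typically contains (the exact output types and submission in the original code may differ):

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);
 FileOutputFormat.setOutputPath(job, p3);
 return job.waitForCompletion(true) ? 0 : 1;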

There is one more case, where we want the output columns to appear in a fixed order.
Say we need the output exactly as below:
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
In order to achieve this we can make use of the TextPair Writable concept in Hadoop.
You can find the working code on GitHub. Thanks to my blog reader Ravi Kumar, who sorted out the sequence ordering.
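
If you prefer not to implement a custom TextPair Writable, a simpler workaround (similar to the one a reader describes in the comments below) is to tag each value with its source file in the mappers and reassemble the two parts in a fixed order in the reducer. A rough sketch, reusing the words array and valEmit field from the code above:

// In the File1.txt mapper:
valEmit.set("1" + words[1]);   // tag values coming from File1 with "1"
// In the File2.txt mapper:
valEmit.set("2" + words[1]);   // tag values coming from File2 with "2"

// In the reducer, strip the tags and always put File1's part first:
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
 String first = "";
 String second = "";
 for (Text value : values) {
  String v = value.toString();
  if (v.startsWith("1")) {
   first = v.substring(1);    // Name,Designation from File1
  } else {
   second = v.substring(1);   // Salary,Department from File2
  }
 }
 valEmit.set(first + "," + second);
 context.write(key, valEmit);
}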

Comments:

  1. Hi, thank you for the valuable inputs. I tried running the driver and I am getting the output below:

    1 50000,A,Anne,Admin
    2 Gokul,Admin,50000,B
    3 60000,A,Janet,Sales
    4 Hari,Admin,50000,C

    but I need the output in a fixed column order, i.e.

    1 Anne,Admin,50000,A
    2 Gokul,Admin,50000,B
    3 Janet,Sales,60000,A
    4 Hari,Admin,50000,C

    Could you please help me with this?

    Replies
    1. Hi Shaila,
      Thanks for reading and trying it out.
      In order to achieve that order you can make use of the TextPair concept in Hadoop.

      The blog post has been updated with working code.
    2. Could you explain the TextPair approach, please?
    3. I solved this problem with a simple workaround: just add the source-file number in front of the output value in each Mapper class, like this.
      In your 1st mapper class:
       outValue.set("1" + othCol.toString());
       context.write(primaryKey, outValue);

      In your 2nd mapper class:
       outValue.set("2" + othCol.toString());
       context.write(primaryKey, outValue);

      In your reducer class:
       StringBuilder stb1 = new StringBuilder();
       StringBuilder stb2 = new StringBuilder();
       for (Text t : valueIt) {
        if (t.toString().substring(0, 1).equals("1")) {
         stb1.append(t.toString().substring(1, t.getLength()));
        }
        else if (t.toString().substring(0, 1).equals("2")) {
         stb2.append(t.toString().substring(1, t.getLength()));
        }
       }
  2. Hi, I have two large datasets where every line looks like (0,tcp,http,SF,335,10440,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,7,0.00,0.00,0.00,0.00,1.00,0.00,0.29,27,255,1.00,0.00,0.04,0.08,0.00,0.00,0.00,0.00,normal). Should I create 42 keyEmit values to combine those two datasets?

    Replies
    1. Yes, but you will experience a performance delay. You can also try joining the files using Hive.
    2. Thanks for the reply. If I keep your project and only change the number of keyEmit values in MultipleMap1 and MultipleMap2 (in my case 42), will it join the two datasets?
  3. I get an array out of bounds error when I try to run your example with a large dataset. Any solution, please?

    Replies
    1. Are you using an array to store values in the Reducer?
    2. No, I'm not using an array. I just used your code and tried to run it with my dataset, and I got the array out of bounds error.
  4. May I ask you about the next task?
    File1.txt
    --------------
    Admin,Anne
    Admin,Gokul
    Sales,Janet
    --------------
    File2.txt
    --------------
    Anne,100
    Gokul,200
    Janet,300
    --------------
    In a third file we should aggregate, for each position, the total salary. In our case:
    --------------
    Admin,300
    Sales,300
    --------------
    What is the workflow in this case? Do we need to use MapReduce twice?
  5. Hey, put a tab character between the ID and the rest of the record in your input files instead of directly copy-pasting from the post, like below:
    1 Anne,Admin
    2 Gokul,Admin
    3 Janet,Sales
    4 Hari,Admin
  6. Hi Unmesha,
    I have a similar problem to solve, but slightly more complicated.
    I have several file1s and file2s coming from different servers in production.
    Example: file1.1.txt, file1.2.txt, file1.3.txt etc. and file2.1.txt, file2.2.txt, file2.3.txt.
    The data in the file1s and file2s is structured, with a few columns common to both.
    My questions are:
    1) How would you define your Driver class? (Can you use a regular expression or something similar?)
    2) What happens if, on a given Hadoop node, you have data as follows:
    file1.txt
    1 sri,kon
    2 sai,kon

    file2.txt
    2 'kg'
    3 'pg'

    How will the MapReduce program work? The file2 data for a corresponding key value in file1 may be on a different node, and vice versa.

    Thanks,
    Sri