There are cases where we need to take 2 files as input and join them on a common field such as an id.
Two large datasets can also be joined in MapReduce. A join performed in the map phase is called a map-side join, while a join performed in the reduce phase is called a reduce-side join.
The join shown here happens in the reducer, so it is a reduce-side join. MultipleInputs in Hadoop lets each input file be read by its own mapper.
Say I have 2 files: one with EmployeeID, Name, Designation and another with EmployeeID, Salary, Department.
File1.txt
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin
AND
File2.txt
1 50000,A
2 50000,B
3 60000,A
4 50000,C
We will join these files into one based on EmployeeID.
The result we aim for is:
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
In both File1.txt and File2.txt the employeeId is the common field, so we join the records on it.
We will write 2 mappers to process these files.
Processing File1.txt
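// keyEmit and valEmit are Text fields of the mapper class
public void map(LongWritable k, Text value, Context context) throws IOException, InterruptedException
{
    String line = value.toString();
    // Split on the tab between the employeeId and the remaining columns
    String[] words = line.split("\t");
    keyEmit.set(words[0]);
    valEmit.set(words[1]);
    context.write(keyEmit, valEmit);
}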
The mapper above processes File1.txt.
String[] words=line.split("\t");
splits each line on the tab character, so words[0] is the employeeId, which we emit as the key, and words[1] is the rest of the line, which we emit as the value.
eg: 1 Anne,Admin
words[0] = 1
words[1] = Anne,Admin
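Alternatively, you can use KeyValueTextInputFormat.class as the InputFormat. By default it splits each line at the first tab and hands the mapper the employeeId as the key and the rest of the line as the value, both as Text.
You don't need to split the line yourself.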
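The split step can be tried outside Hadoop. The sketch below is plain Java, not the mapper itself; the class SplitSketch and the method splitRecord are names made up for illustration. It splits on the first tab only and returns null for a line that has no tab, which is an assumption about how malformed lines should be treated that the mapper above does not make.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original job
class SplitSketch {
    // Returns (employeeId, rest of line), or null if the line contains no tab
    static Map.Entry<String, String> splitRecord(String line) {
        String[] words = line.split("\t", 2); // limit 2: split at the first tab only
        if (words.length < 2) {
            return null; // malformed line: no tab found
        }
        return new SimpleEntry<>(words[0], words[1]);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record = splitRecord("1\tAnne,Admin");
        System.out.println(record.getKey() + " -> " + record.getValue());
    }
}
```

Checking the array length before reading words[1] is one way to avoid an ArrayIndexOutOfBoundsException when an input line was pasted with spaces instead of a tab.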
Processing File2.txt
// keyEmit and valEmit are Text fields of the mapper class
public void map(LongWritable k, Text v, Context context) throws IOException, InterruptedException
{
    String line = v.toString();
    // This mapper splits on a space rather than a tab
    String[] words = line.split(" ");
    keyEmit.set(words[0]);
    valEmit.set(words[1]);
    context.write(keyEmit, valEmit);
}
The mapper above processes File2.txt.
eg: 1 50000,A
words[0] = 1
words[1] = 50000,A
If both files use the same delimiter and the ID comes first, you can reuse the same mapper.
Let's write a common Reducer to join the data by key.
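public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
    // Build the joined record locally so nothing leaks from one key to the next
    StringBuilder merge = new StringBuilder();
    for (Text value : values)
    {
        // Put a comma between values, but not before the first one
        if (merge.length() > 0) {
            merge.append(",");
        }
        merge.append(value.toString());
    }
    valEmit.set(merge.toString());
    context.write(key, valEmit);
}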
Here the reducer receives the values emitted for one employeeId, one from each mapper, and concatenates them, separated by a comma, into "merge".
It then emits employeeId as the key and merge as the value.
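To see what the shuffle and the reducer do together, here is a minimal plain-Java sketch that runs without Hadoop. The class JoinSketch and its join method are hypothetical names, and both inputs are assumed to be tab-delimited for simplicity. It groups records by id the way the framework groups mapper output by key, then joins each group's values with commas. Unlike a real job, this sketch always sees File1's value first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the reduce-side join; all names are hypothetical
class JoinSketch {
    // Groups "id<TAB>rest" records by id, then joins each group's values with commas
    static Map<String, String> join(List<String> file1, List<String> file2) {
        List<String> all = new ArrayList<>(file1);
        all.addAll(file2);

        // "Map" and "shuffle": emit (id, rest) and collect values per id
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : all) {
            String[] words = line.split("\t", 2);
            grouped.computeIfAbsent(words[0], k -> new ArrayList<>()).add(words[1]);
        }

        // "Reduce": concatenate the values that share an id
        Map<String, String> joined = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            joined.put(e.getKey(), String.join(",", e.getValue()));
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("1\tAnne,Admin", "2\tGokul,Admin");
        List<String> file2 = Arrays.asList("1\t50000,A", "2\t50000,B");
        join(file1, file2).forEach((id, row) -> System.out.println(id + "\t" + row));
    }
}
```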
Now we need to set up our Driver class to take 2 inputs and use MultipleInputs to assign a mapper to each one.
public int run(String[] args) throws Exception {
    Configuration c = new Configuration();
    String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
    Path p1 = new Path(files[0]);
    Path p2 = new Path(files[1]);
    Path p3 = new Path(files[2]);
    FileSystem fs = FileSystem.get(c);
    if (fs.exists(p3)) {
        fs.delete(p3, true);
    }
    Job job = new Job(c, "Multiple Job");
    job.setJarByClass(MultipleFiles.class);
    MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
    MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
    job.setReducerClass(MultipleReducer.class);
    .
    .
}
MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
MultipleInputs.addInputPath(job, p2, TextInputFormat.class, MultipleMap2.class);
p1 and p2 are the Path variables holding the 2 input files.
You can find the code on GitHub.
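There is one more case to handle: the order of the joined fields. The reducer does not guarantee which file's value arrives first for a key, so a record can come out as 50000,A,Anne,Admin.
Say we always need the output in the order below: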
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
In order to achieve this we can make use of the TextPair Writable concept in Hadoop.
You can find the working code on GitHub. Thanks to one of my blog readers, Ravi Kumar, who sorted out the ordering.
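The idea behind a pair key can be sketched in plain Java. This is not the TextPair code from the repository; TaggedKey, its fields and joinInOrder are hypothetical names. Each record carries its id plus a tag saying which file it came from, and sorting by id and then by tag guarantees that File1's value precedes File2's. In a real job the pair would typically be a WritableComparable, with partitioning and grouping done on the id alone; that wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a TextPair-style composite key: (employeeId, source tag)
class TaggedKey implements Comparable<TaggedKey> {
    final String id;
    final int tag;      // 0 = File1.txt, 1 = File2.txt
    final String value; // carried along for this demo only

    TaggedKey(String id, int tag, String value) {
        this.id = id;
        this.tag = tag;
        this.value = value;
    }

    // Order by id first, then by tag, so File1's record precedes File2's
    @Override
    public int compareTo(TaggedKey other) {
        int byId = id.compareTo(other.id);
        return byId != 0 ? byId : Integer.compare(tag, other.tag);
    }

    // Sorts the tagged records and joins the values that share an id
    static List<String> joinInOrder(List<TaggedKey> records) {
        List<TaggedKey> sorted = new ArrayList<>(records);
        Collections.sort(sorted);
        List<String> out = new ArrayList<>();
        String currentId = null;
        StringBuilder row = new StringBuilder();
        for (TaggedKey r : sorted) {
            if (!r.id.equals(currentId)) {
                if (currentId != null) {
                    out.add(currentId + "\t" + row);
                }
                currentId = r.id;
                row = new StringBuilder(r.value);
            } else {
                row.append(",").append(r.value);
            }
        }
        if (currentId != null) {
            out.add(currentId + "\t" + row);
        }
        return out;
    }
}
```

Even when File2's record is supplied first, the sort on (id, tag) puts the name and designation ahead of the salary and department.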
Hi, thank you for the valuable inputs. I tried running the driver and I am getting the output below:
1 50000,A,Anne,Admin
2 Gokul,Admin,50000,B
3 60000,A,Janet,Sales
4 Hari,Admin,50000,C
but I need the sequential output i.e
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
could you please help me with this
Hi Shaila,
Thanks for reading and trying it out.
In order to achieve that order you can make use of the TextPair concept in Hadoop.
The blog post is updated with working code.
Could you say more about TextPair, please?
I solved this problem with a simple workaround: just add the table number before the output value in the Mapper class, like this.
In your 1st mapper class:
outValue.set("1" + othCol.toString());
context.write(primaryKey,outValue);
in your 2nd mapper class:
outValue.set("2" + othCol.toString());
context.write(primaryKey,outValue);
in your reducer class:
StringBuilder stb1 = new StringBuilder();
StringBuilder stb2 = new StringBuilder();
for (Text t : valueIt) {
    String s = t.toString();
    // The first character is the table number added by the mapper
    if (s.startsWith("1")) {
        stb1.append(s.substring(1));
    }
    else if (s.startsWith("2")) {
        stb2.append(s.substring(1));
    }
}
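The tagging approach in this comment stops before the merged value is built. A minimal plain-Java sketch of the remaining step follows, assuming the values arrive as strings prefixed with "1" or "2" as above; TagPrefixSketch and mergeTagged are hypothetical names, and the final concatenation order is an assumption about what the commenter intended.

```java
import java.util.Arrays;

// Hypothetical completion of the tag-prefix idea; runs without Hadoop
class TagPrefixSketch {
    // Joins values tagged "1..." (first file) and "2..." (second file) in a fixed order
    static String mergeTagged(Iterable<String> values) {
        StringBuilder stb1 = new StringBuilder();
        StringBuilder stb2 = new StringBuilder();
        for (String s : values) {
            if (s.startsWith("1")) {
                stb1.append(s.substring(1)); // drop the tag, keep the columns
            } else if (s.startsWith("2")) {
                stb2.append(s.substring(1));
            }
        }
        // First file's columns always come before the second file's
        return stb1 + "," + stb2;
    }

    public static void main(String[] args) {
        // The second file's value arrives first, yet the output order is fixed
        System.out.println(mergeTagged(Arrays.asList("250000,A", "1Anne,Admin")));
    }
}
```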
Hi brother, I have 2 large datasets where every line looks like (0,tcp,http,SF,335,10440,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,7,0.00,0.00,0.00,0.00,1.00,0.00,0.29,27,255,1.00,0.00,0.04,0.08,0.00,0.00,0.00,0.00,normal). Should I create 42 keyEmit to combine those 2 datasets?
Yes. You will experience a performance delay. You can also try joining the files using Hive.
Thanks for the reply. I kept your project and just changed the number of keyEmit in MultipleMap1 and MultipleMap2 (in my case, should I create 42 keyEmit?). Will it then join the 2 datasets?
I get an array out of bounds error when I try to run your example with a large dataset. Any solution, please?
Are you using an array to store values in the Reducer?
No, I'm not using an array. I just used your code and tried to run it with my dataset, and I got this array out of bounds error.
May I ask you about the next task? File1.txt
--------------
Admin,Anne
Admin,Gokul
Sales,Janet
--------------
File2.txt
--------------
Anne,100
Gokul,200
Janet,300
--------------
In a third file we should aggregate the total salary for each position. In our case:
--------------
Admin,300
Sales,300
--------------
What is the workflow in this case? Use MapReduce twice?
Not giving proper output.
Hey, put a tab in your input files instead of directly copy-pasting the file, like below:
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin
Hi Unmesha,
I have a similar problem to solve, but slightly more complicated.
I have several file1s and file2s coming from different servers in productions.
Example: file1.1.txt, file1.2.txt, file1.3.txt etc
file2.1.txt, file2.2.txt, file2.3.txt
I have structured data in file1 and file2, with a few columns common to both.
My questions are:
1) How will you define your Driver class? (Can you use a regular expression or something?)
2) What happens if, on a given Hadoop instance/node, you have data as follows:
file1.txt
1 sri,kon
2 sai,kon
file2.txt
2 'kg'
3 'pg'
How will the MapReduce program work? The file2 data for a corresponding key value in file1 may be on a different node, and vice versa.
Thanks
Sri