Wednesday, 9 December 2015

Partitioning Data Using Hadoop MultipleOutputs

There may be cases where we need to partition our data based on a certain condition. For example, consider this Employee data:
EmpId,EmpName,Age,Gender,Salary
1201,gopal,45,Male,50000
1202,manisha,40,Female,51000
1203,khaleel,34,Male,30000
1204,prasanth,30,Male,31000
1205,kiran,20,Male,40000
1206,laxmi,25,Female,35000
1207,bhavya,20,Female,15000
1208,reshma,19,Female,14000
1209,kranthi,22,Male,22000
1210,Satish,24,Male,25000
1211,Krishna,25,Male,26000
1212,Arshad,28,Male,20000
1213,lavanya,18,Female,8000
Let's assume one condition: we need to separate the above data based on Gender (there can be more scenarios). The expected outcome will look like this:
Female

1213,lavanya,18,Female,8000
1202,manisha,40,Female,51000
1206,laxmi,25,Female,35000
1207,bhavya,20,Female,15000
1208,reshma,19,Female,14000
Male

1211,Krishna,25,Male,26000
1212,Arshad,28,Male,20000
1201,gopal,45,Male,50000
1209,kranthi,22,Male,22000
1210,Satish,24,Male,25000
1203,khaleel,34,Male,30000
1204,prasanth,30,Male,31000
1205,kiran,20,Male,40000
This can be achieved using MultipleOutputs in Hadoop. The name itself gives an idea of what MultipleOutputs does: it writes output data to multiple output files.
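Before the Hadoop code, here is a plain-Java sketch (not part of the original post; `partitionByGender` is a hypothetical helper) of what the job does: group records by their Gender field, one group per output file. The MapReduce job below does the same thing at scale:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GenderPartitionSketch {

    // Group CSV records by their Gender field (the 4th column, index 3).
    // In the real job, each group becomes one output file.
    static Map<String, List<String>> partitionByGender(List<String> records) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (String record : records) {
            String gender = record.split(",")[3];
            if (!groups.containsKey(gender)) {
                groups.put(gender, new ArrayList<String>());
            }
            groups.get(gender).add(record);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<String>();
        records.add("1201,gopal,45,Male,50000");
        records.add("1202,manisha,40,Female,51000");
        records.add("1213,lavanya,18,Female,8000");

        Map<String, List<String>> groups = partitionByGender(records);
        System.out.println(groups.get("Male").size());   // 1
        System.out.println(groups.get("Female").size()); // 2
    }
}
```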

Let's see how to implement this.
Driver Class
public class PartitionerDriver extends Configured implements Tool {

 public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(new Configuration(), new PartitionerDriver(), args);
  System.exit(res);
 }

 public int run(String[] args) throws Exception {
  System.out.println("Partitioning file based on Gender...");
  if (args.length != 3) {
   System.err
     .println("Usage: PartitionerDriver <input> <output> <delimiter>");
   return -1;
  }
  /*
   * Arguments
   */
  String source = args[0];
  String dest = args[1];
  String delimiter = args[2];

  // Use the Configuration supplied by ToolRunner and hand the delimiter to the tasks
  Configuration conf = getConf();
  conf.set("delimiter", delimiter);

  FileSystem fs = FileSystem.get(conf);

  Path in = new Path(source);
  Path out = new Path(dest);

  /*
   * Delete output dir if it exists
   */
  if (fs.exists(out)) {
   fs.delete(out, true);
  }

  Job job0 = new Job(conf, "Partition Records");
  job0.setJarByClass(PartitionerDriver.class);
  job0.setMapperClass(PartitionMapper.class);
  job0.setReducerClass(PartitionReducer.class);
  job0.setMapOutputKeyClass(Text.class);
  job0.setMapOutputValueClass(Text.class);
  job0.setOutputKeyClass(NullWritable.class);
  job0.setOutputValueClass(Text.class);
  TextInputFormat.addInputPath(job0, in);
  TextOutputFormat.setOutputPath(job0, out);

  boolean success = job0.waitForCompletion(true);
  if (success) {
   System.out.println("Successfully partitioned data based on Gender!");
  }
  return success ? 0 : 1;
 }
}


Mapper Class
The mapper gets each record and splits it using the delimiter.
The key will be Gender and the value will be the entire record.

public class PartitionMapper extends Mapper<LongWritable, Text, Text, Text> {

 Text keyEmit = new Text();

 protected void map(LongWritable key, Text value, Context context)
   throws IOException, InterruptedException {
  Configuration conf = context.getConfiguration();
  String delim = conf.get("delimiter");
  String line = value.toString();

  // Gender is the 4th field (index 3)
  String[] record = line.split(delim);
  keyEmit.set(record[3]);
  context.write(keyEmit, value);
 }
}

Reducer Class
public class PartitionReducer extends Reducer<Text, Text, NullWritable, Text> {

 MultipleOutputs<NullWritable, Text> mos;
 NullWritable out = NullWritable.get();

 @Override
 protected void setup(Context context) {
  mos = new MultipleOutputs<NullWritable, Text>(context);
 }

 public void reduce(Text key, Iterable<Text> values, Context context)
   throws IOException, InterruptedException {
  // Every record for this key goes to a file named after the key (Male/Female)
  for (Text value : values) {
   mos.write(out, value, key.toString());
  }
 }

 @Override
 protected void cleanup(Context context)
   throws IOException, InterruptedException {
  mos.close();
 }
}

Here we declare the MultipleOutputs as
MultipleOutputs<NullWritable, Text> mos;
Our output does not need a key; we are only interested in the data itself, so the key is NullWritable and the value is Text.
In the setup() method we initialize our MultipleOutputs:
mos = new MultipleOutputs(context);
Let's look at reduce().
As we know, the reducer aggregates values by key. The key from the mapper was Gender, and we have two genders, Male and Female, so we will receive two keys, each followed by its values.
for (Text value : values) {
 mos.write(out, value, key.toString());
}
This write() takes 3 arguments:
1. key
2. value
3. base output file name
Here the key is NullWritable, the value is each record for that key, and we name the output file after the key.
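One optional refinement (not in the original post): because every record is written through MultipleOutputs, the reducer's default output file comes out empty. Hadoop's LazyOutputFormat can suppress it. A sketch of the extra driver lines, assuming the same job0 object as above:

```java
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// In the driver, register the output format lazily so the empty default
// part-r-00000 file is not created when only MultipleOutputs writes records.
LazyOutputFormat.setOutputFormatClass(job0, TextOutputFormat.class);
TextOutputFormat.setOutputPath(job0, out);
```

This is a driver configuration fragment, not a standalone program; the MultipleOutputs writes in the reducer are unaffected.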

Once you run this code you will see two output files:
1. Female-r-00000
2. Male-r-00000
(You may also notice an empty default part-r-00000 file, since every record is written through MultipleOutputs.)
Each file contains only the records for that gender.

You can find the code here.

Happy Hadooping...

Comments:

  2. Hi Unmesha,

    Is it possible to print both the mapper and reducer output in a single MR job?
    My requirement is that I need to create two files; the 2nd file is just a shrunk version of the 1st file.
    For Eg:
    Output of File1 is:
    A B C D E
    A B C D E
    A B C D E
    F G H I J
    F G H I J
    F G H I J

    Output of File2 is:
    A C E
    F H J

    Currently I am doing Job Chaining. Output of job1 will be input of Job2

    Job1 contains only Mapper Phase, Job2 Contains both MR Phases. Reducer in Job2 is just to remove the duplicate rows

    Thanks
    Abhinay

    1. Do you need to get both job1's output and job2's output? Is that the case, or have I misunderstood what you mentioned? Let me know if I am wrong.

    2. I will explain my problem clearly

      Currently what I am doing:
      In my program I have two jobs: Job1 (mapper only) and Job2 (mapper and reducer).

      Input of Job1 is "sortcolumn/input.txt"
      Output of Job1 is "sortcolumn/output/job1out/part-m-00000"
      Input of Job2 is "sortcolumn/output/job1out/part-m-00000"
      Output of Job2 is "sortcolumn/output/job2out/part-r-00000"

      So the output of Job1 is input of Job2.

      eg of input.txt:
      d,c,b,e,a
      d,c,b,e,a
      g,f,h,j,i
      h,j,f,i,g

      Final output of part-r-00000
      a,c,e
      f,h,j

      In job1 mapper in map( ) function
      {
      in[] = input line; //d,c,b,e,a
      sort it and produce out[] // a,b,c,d,e
      context.write(null,out[])
      }

      In job2 mapper in map()
      {
      in[]= input line; //a,b,c,d,e
      context.write(in[0]+in[2]+in[4],1); //a,c,e
      }

      In job2 reducer in reduce()
      {
      context.write(key,null); //removes duplicate rows
      }

      As you can see, the job2 mapper takes input from the job1 mapper, shrinks the columns, and sends them to the reducer to remove duplicate rows.

      What I need:
      Now my doubt is: can I use only one job (with mapper & reducer), like below?

      map()
      {
      in[] = input line;
      sort in[] and produce out[];
      write_into_hdfs(out[]);
      send_to_reducer(out[0]+out[2]+out[4]);
      }

      reducer will be same as above.

      To make my question simple: can I write the mapper data into HDFS and also pass it to the reducer?

      Sorry for the lengthy comment.

    3. Yes, you can do that in just 1 MR job. And if you have a reducer, the intermediate data (mapper output) will be cleared; it will not get written into HDFS.

    4. github link for the solution: [Link](https://github.com/studhadoop/Query-1)

      You can do something like this

      Mapper
      ------

      public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        NullWritable out = NullWritable.get();
        Text valEmit = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          /*
           * For eg let's use an ArrayList to store the fields
           */
          ArrayList<String> arrayList = new ArrayList<String>();
          String line = value.toString();
          String[] parts = line.split(",");
          for (int i = 0; i < parts.length; i++) {
            arrayList.add(parts[i]);
          }
          /*
           * Sort the list
           */
          Collections.sort(arrayList);
          /*
           * Iterate through the list, keeping the fields at even positions
           */
          String emitdata = "";
          for (int i = 0; i < arrayList.size(); i++) {
            if (i % 2 == 0) {
              if (i == 0) {
                emitdata = arrayList.get(i);
              } else {
                emitdata += "," + arrayList.get(i);
              }
            }
          }
          valEmit.set(emitdata);
          context.write(valEmit, out);
        }
      }

      Reducer
      -------
      public class IdentityReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        NullWritable out = NullWritable.get();

        public void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
          // Emitting only the key removes duplicate rows
          context.write(key, out);
        }
      }

      Input
      -----
      d,c,b,e,a
      d,c,b,e,a
      g,f,h,j,i
      h,j,f,i,g

      Result
      ------
      a,c,e
      f,h,j

    5. Again sorry, I think I am confusing you.
      My problem is that I need 2 outputs for the given input:
      output1
      a,b,c,d,e
      a,b,c,d,e
      f,g,h,i,j
      f,g,h,i,j

      Output2
      a,c,e
      f,h,j

      So I need to store mapper output and also reducer output.

      Thanks,
      Abhinay

  3. Thanks a lot for your time... I am able to do it using MultipleOutputs: I wrote the mapper output using MultipleOutputs and the reducer output normally.

    Thanks,
    Abhinay.

    1. But... if you write the mapper output using MultipleOutputs you will get the desired output. If you have more than one map task, what will you do then?

    2. Each map task generates its output using MultipleOutputs, so if I have multiple map tasks I will get the output of all of them, which satisfies my requirement.

    3. Ok, fine. So you need each map task's output. Fine :)
