Saturday, 23 August 2014

Calculating Mean in Hadoop MapReduce

Given a CSV file, we will find the mean of each column (an optimized approach).


 The mapper takes each input line, accumulates a running sum per column, and counts the number of lines it has summed. The sums are stored in a hash map keyed by column id. The cleanup() method then emits, for every column, the partial sum together with the line count, so that the reducer can compute the overall mean. Since each mapper only sees one block of the input data, the reducer must know how many elements went into each partial sum.

// Accumulating per-column sums in the mapper
if (sumVal.isEmpty()) {
    // first line seen: initialize sumVal with this line's values
    sumVal.putAll(mapLine);
} else {
    // add this line's value to the running sum for each column
    for (Integer colId : mapLine.keySet()) {
        double val1 = mapLine.get(colId);
        double val2 = sumVal.get(colId);
        sumVal.put(colId, val1 + val2);
    }
}
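To see the map-side logic end to end, including what cleanup() would emit, here is a Hadoop-free sketch in plain Java. The class and method names are illustrative, not the post's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnSumSketch {
    // Simulates one mapper's block: per-column running sums plus a line count.
    public static Object[] partialSums(String[] csvLines) {
        Map<Integer, Double> sumVal = new HashMap<>();
        long lineCount = 0;
        for (String line : csvLines) {
            String[] fields = line.split(",");
            for (int colId = 0; colId < fields.length; colId++) {
                double v = Double.parseDouble(fields[colId].trim());
                // merge() adds v to the existing sum, or initializes it
                sumVal.merge(colId, v, Double::sum);
            }
            lineCount++;
        }
        // In the real mapper, cleanup() would emit (colId, (sum, lineCount)) here.
        return new Object[] { sumVal, lineCount };
    }

    public static void main(String[] args) {
        Object[] out = partialSums(new String[] { "1,2", "3,4", "5,6" });
        System.out.println(out[0] + " lines=" + out[1]);
    }
}
```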


 The reducer computes two sums for each key:
  1. the sum of the per-mapper partial sums, and
  2. the total line count.

double sum = 0;
long total = 0;
for (TwovalueWritable value : values) {
    // accumulate the partial sums and line counts from all mappers
    sum += value.getSum();
    total += value.getTotalCnt();
}
// sum now holds the total of all elements in this column,
// total the number of elements; compute the mean once, after the loop
mean = sum / total;
valEmit.set(mean);
context.write(key, valEmit);

This approach avoids a large amount of communication with the reducer: the reducer only has to sum a few values from each mapper. Say we have 3 mappers and 4 columns in the input set. The reducer then only waits for 4 values from each mapper (one per column), i.e. 12 records in total, no matter how many input lines there are.
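The arithmetic behind that claim can be checked with a small simulation: three mappers each emit a (sum, count) pair for a column, and the reducer combines just those pairs to get the exact overall mean. Names here are illustrative, not the post's code:

```java
public class MeanFromPartialsSketch {
    // Combines per-mapper (sum, count) pairs into one mean, as the reducer does.
    public static double combine(double[][] partials) {
        double sum = 0;
        long total = 0;
        for (double[] p : partials) {   // one (sum, count) pair per mapper
            sum += p[0];
            total += (long) p[1];
        }
        return sum / total;             // exact mean, not a mean of means
    }

    public static void main(String[] args) {
        // Column values 1..6 split across 3 mappers of 2 lines each:
        // mapper partials (sum, count): (3,2), (7,2), (11,2)
        System.out.println(combine(new double[][] { {3, 2}, {7, 2}, {11, 2} }));
        // prints 3.5  (= (1+2+3+4+5+6)/6)
    }
}
```

Because the counts travel with the sums, the result is exact even when mappers see blocks of different sizes.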

Complete code : GitHub Link
