Here is Something !: Calculating Mean in Hadoop MapReduce

Saturday, 23 August 2014

Calculating Mean in Hadoop MapReduce

Given a CSV file, we will find the mean of each column (optimized approach).

Mapper

The mapper takes each input line, adds its values to a running sum, and keeps a count of the lines it has summed. The sums are stored in a hash map keyed by column ID. cleanup emits each sum together with the total line count so that the overall mean can be computed. Each map task only gets one block of the input data, so along with each sum we need to know how many elements went into it.

// Accumulate per-column sums across the lines seen by this mapper
if (sumVal.isEmpty()) {
    // First line: seed the running sums with this line's values
    sumVal.putAll(mapLine);
} else {
    for (Integer colId : mapLine.keySet()) {
        double val1 = mapLine.get(colId);
        double val2 = sumVal.get(colId);
        // Add this line's value to the running sum for the column
        sumVal.put(colId, val1 + val2);
    }
}
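To make the mapper side concrete, here is a minimal plain-Java sketch of the same accumulation outside Hadoop. The class name ColumnSumAccumulator and its methods are hypothetical, and it assumes every field is numeric, with no header row and no quoted commas. In a real job, add() would correspond to the body of map() and the getters would be read in cleanup().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: mirrors the state a mapper keeps between map() calls.
class ColumnSumAccumulator {
    // Running sum per column (column ID -> sum so far).
    private final Map<Integer, Double> sumVal = new HashMap<>();
    // Number of lines folded into the sums.
    private long lineCount = 0;

    // Parse one CSV line of numeric fields and add each field to its column's sum.
    void add(String csvLine) {
        String[] fields = csvLine.split(",");
        for (int colId = 0; colId < fields.length; colId++) {
            double value = Double.parseDouble(fields[colId].trim());
            // merge() covers both the first line (no entry yet) and later lines.
            sumVal.merge(colId, value, Double::sum);
        }
        lineCount++;
    }

    Map<Integer, Double> getSums() {
        return sumVal;
    }

    long getLineCount() {
        return lineCount;
    }
}
```

Using merge() instead of a separate isEmpty() check also avoids a null lookup if a later line has more columns than the first one.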

Reducer

For each key (column ID), the reducer calculates two sums:

The sum of the values for that key, and

The sum of the line counts.

for (TwovalueWritable value : values) {
    // Add up the partial sums and the line counts from each mapper
    sum += value.getSum();
    total += value.getTotalCnt();
}
// sum now holds the total of all elements in this column
// total now holds the number of elements in this column
mean = sum / total;
valEmit.set(mean);
context.write(key, valEmit);
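The definition of TwovalueWritable is not shown in this post, so the following is only a sketch of what such a (sum, count) pair and the reducer arithmetic might look like. PartialSum is a hypothetical stand-in. Its write and readFields methods have the same shape as those in Hadoop's Writable interface, but the class uses only java.io types so that it compiles without Hadoop on the classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical stand-in for TwovalueWritable: one mapper's partial result for a column.
class PartialSum {
    double sum;
    long totalCnt;

    PartialSum() {
    }

    PartialSum(double sum, long totalCnt) {
        this.sum = sum;
        this.totalCnt = totalCnt;
    }

    // Serialize both fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeDouble(sum);
        out.writeLong(totalCnt);
    }

    // Read the fields back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        sum = in.readDouble();
        totalCnt = in.readLong();
    }

    // Reducer logic for one key: add up sums and counts, then divide once.
    static double mean(Iterable<PartialSum> values) {
        double sum = 0;
        long total = 0;
        for (PartialSum value : values) {
            sum += value.sum;
            total += value.totalCnt;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no elements for this key");
        }
        return sum / total;
    }
}
```

Keeping sum and total as local variables means they start from zero for every key, which matters because a reducer instance is reused across keys.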

This approach avoids a large amount of communication with the reducer: the reducer only needs to sum up a few values from each mapper.
Say we have only 3 mappers and 4 columns in the input set. The reducer then only has to wait for 4 values from each mapper (one per column).
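The arithmetic behind that example can be sketched as follows. ShuffleRecordCount is a hypothetical helper, not part of the job. The per-line baseline is an assumption for comparison: a mapper that emits one record per cell from every map() call instead of summing first.

```java
// Hypothetical arithmetic helper, not part of the original job.
class ShuffleRecordCount {
    // This post's approach: one record per column per mapper, emitted from cleanup().
    static long withInMapperSums(long mappers, long columns) {
        return mappers * columns;
    }

    // Assumed baseline: one record per cell, emitted from every map() call.
    static long withPerLineEmit(long lines, long columns) {
        return lines * columns;
    }
}
```

With 3 mappers and 4 columns, withInMapperSums(3, 4) gives 12 records in total. For an input of, say, 1,000,000 lines (an illustrative figure), withPerLineEmit(1_000_000, 4) gives 4,000,000.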

Complete code : GitHub Link

