Saturday, 17 May 2014

Count Frequency Of Values In A Column Using Apache Pig


Sometimes you need to count how many times each value occurs in a field.
Take the following sample input.


user_id   course_name user_name
1           Social      Anju
2           Maths       Malu
1           English     Anju
1           Maths       Anju

Say we need to count the number of occurrences of each user_name. The expected output is:
Anju 3
Malu 1

To achieve this, the built-in COUNT function can be used.


COUNT Function in Apache Pig


The COUNT function computes the number of elements in a bag.
For per-group counts it requires a preceding GROUP BY statement; for a global count it requires a GROUP ALL statement.
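
As a minimal sketch of the global case, the script below groups the whole relation with GROUP ALL and counts the single resulting bag. It reuses the path and schema from this post; the alias names allRows and total are made up for illustration, and the input is assumed to be tab-separated, which is what the default loader expects.

```pig
-- Load the sample data (assumed tab-separated, the PigStorage default).
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
            (user_id:long, course_name:chararray, user_name:chararray);
-- GROUP ALL puts every tuple into one bag under the key 'all'.
allRows = GROUP userAlias ALL;
-- COUNT then returns one number for the whole relation.
total = FOREACH allRows GENERATE COUNT(userAlias) AS cnt;
DUMP total;
```

Because COUNT works on a bag, the GROUP step is needed even when no real grouping key is involved.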

The basic idea for counting the occurrences of each user_name is to group by user_name and count the tuples in each group's bag.


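--count.pig

 -- Load the input; the default loader (PigStorage) expects tab-separated fields.
 userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
             (user_id:long, course_name:chararray, user_name:chararray);
 -- Collect all tuples that share a user_name into one bag per group.
 groupedByUser = GROUP userAlias BY user_name;
 -- Count the tuples in each group's bag.
 counted = FOREACH groupedByUser GENERATE group AS user_name, COUNT(userAlias) AS cnt;
 -- Project the final columns and write them out.
 result = FOREACH counted GENERATE user_name, cnt;
 STORE result INTO '/home/sreeveni/myfiles/pig/OUT/count';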

The COUNT function ignores NULLs: a tuple in the bag is not counted if its first field is NULL.
COUNT_STAR can be used to count all tuples, including those whose first field is NULL.
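
To see the difference, the two functions can be placed side by side in one FOREACH. This is only a sketch that reuses the path and schema from this post; the column names cnt_non_null and cnt_all are illustrative. Here the first field of each tuple in the bag is user_id, so the two counts would differ only for groups that contain rows with a NULL user_id.

```pig
-- Load and group exactly as in count.pig.
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS
            (user_id:long, course_name:chararray, user_name:chararray);
groupedByUser = GROUP userAlias BY user_name;
-- COUNT skips tuples whose first field (user_id) is NULL;
-- COUNT_STAR counts every tuple in the bag.
compared = FOREACH groupedByUser GENERATE
               group AS user_name,
               COUNT(userAlias) AS cnt_non_null,
               COUNT_STAR(userAlias) AS cnt_all;
DUMP compared;
```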



