Showing posts with label count in apache pig. Show all posts
Showing posts with label count in apache pig. Show all posts

Saturday, 17 May 2014

Count Frequency Of Values In A Column Using Apache Pig


There may be situations to count the occurence of a value in a field.
Let this be the sample input bag.


user_id   course_name user_name
1           Social      Anju
2           Maths       Malu
1           English     Anju
1           Maths       Anju

Say we need to calculate no of occurence of each user_name.
Anju 3
Malu 1

Inorder to achieve this - COUNT Built In Function can be used.


COUNT Function in Apache Pig


COUNT function  compute the number of elements in a bag.
To group count a preceding GROUP BY statement and for global counts GROUP ALL statement is required.

The basic idea to do the above example is to group by user_name and count the tuples in the bag.


--count.pig

 userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' as 
             (user_id:long,course_name:chararray,user_name:chararray);
 groupedByUser = group userAlias by user_name;
 counted = FOREACH groupedByUser GENERATE group as user_name,COUNT(userAlias) as cnt;
 result = FOREACH counted GENERATE user_name, cnt;
 store result into '/home/sreeveni/myfiles/pig/OUT/count';

The COUNT function ignores NULLs, that is tuple in the bag will not be counted if the first field in this tuple is NULL.
COUNT_STAR can be used to count fields including NULL values.