Saturday, 17 May 2014

Count Frequency Of Values In A Column Using Apache Pig

Sometimes you need to count how often each value occurs in a field.
Take the following as the sample input.

user_id   course_name user_name
1           Social      Anju
2           Maths       Malu
1           English     Anju
1           Maths       Anju

Say we need the number of occurrences of each user_name. The expected result is:
Anju 3
Malu 1

To achieve this, the built-in COUNT function can be used.

COUNT Function in Apache Pig

The COUNT function computes the number of elements in a bag.
It needs a preceding grouping step: a GROUP BY statement for per-group counts, or a GROUP ALL statement for a global count.
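
As a minimal sketch of the global-count case, the script below groups the whole relation with GROUP ALL and counts the single resulting bag. The schema is an assumption taken from the sample input columns, and the file is assumed to be tab-separated with no header row. The names allRows and total are made up for illustration.

```pig
-- Assumed schema, taken from the sample input columns.
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt'
            AS (user_id:int, course_name:chararray, user_name:chararray);

-- GROUP ALL collects every tuple into one bag under the key 'all'.
allRows = GROUP userAlias ALL;

-- COUNT returns the number of tuples in that single bag.
total = FOREACH allRows GENERATE COUNT(userAlias) AS cnt;

DUMP total;
```

Under those assumptions, the four sample rows should give a count of 4.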

The basic idea for the example above is to group by user_name and count the tuples in each group's bag.


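 -- Schema assumed from the sample input columns above; the default loader expects tab-separated fields.
 userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' AS (user_id:int, course_name:chararray, user_name:chararray);
 -- Group the tuples by user_name: one bag per distinct name.
 groupedByUser = GROUP userAlias BY user_name;
 -- Count the tuples in each bag.
 counted = FOREACH groupedByUser GENERATE group AS user_name, COUNT(userAlias) AS cnt;
 result = FOREACH counted GENERATE user_name, cnt;
 STORE result INTO '/home/sreeveni/myfiles/pig/OUT/count';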

The COUNT function ignores NULLs: a tuple in the bag is not counted if its first field is NULL.
To count every tuple, including those whose first field is NULL, use COUNT_STAR instead.
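
To see the difference, one option is to compute both counts side by side. This is only a sketch: the schema is again assumed from the sample input columns, and the aliases non_null_cnt and all_cnt are hypothetical names chosen for illustration.

```pig
-- Assumed schema, taken from the sample input columns.
userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt'
            AS (user_id:int, course_name:chararray, user_name:chararray);

groupedByUser = GROUP userAlias BY user_name;

-- COUNT skips tuples whose first field (user_id here) is NULL;
-- COUNT_STAR counts every tuple in the bag.
counts = FOREACH groupedByUser GENERATE
             group AS user_name,
             COUNT(userAlias) AS non_null_cnt,
             COUNT_STAR(userAlias) AS all_cnt;

DUMP counts;
```

The two columns differ only for users that have at least one row with a NULL user_id. On the sample input, which has no NULLs, they should match.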
