Thursday 11 December 2014

Computing Median In Hive

In statistics and probability theory, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half.

The median is the central point of a data set.

Consider the following sorted data points: 1, 4, 5, 6, 7
The median is 5, the middle value.
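
As a quick illustration of the definition above, here is a minimal Python sketch (not part of the Hive workflow; the function name `median` is just a hypothetical helper). It sorts the values and picks the middle one, averaging the two middle values when the count is even.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("median of an empty sequence is undefined")
    ordered = sorted(values)      # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd count: the single middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 4, 5, 6, 7]))  # 5
```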

Let's see how to find the median in Hive.

Consider a "test" table.
---------------
| Name | Age  |
---------------
| A    | 23   |
| B    | 23   |
| C    | 20   |
---------------
hive> select * from test;
OK
A 23
B 23
C 20
Time taken: 4.219 seconds, Fetched: 3 row(s)

Let's say we want to find the median of the age column in the "test" table.
Sorted, the ages are 20, 23, 23, so our expected median is 23.

The PERCENTILE(BIGINT col, 0.5) function computes the median in Hive, since the 50th percentile is the median.
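
To make the "50th percentile is the median" idea concrete, here is a small Python sketch of an exact percentile. It assumes linear interpolation between the two nearest ranks at position p * (n - 1); that this matches Hive's internal computation is an assumption on my part, and the function name `percentile` here is a plain Python helper, not the Hive UDAF.

```python
import math


def percentile(values, p):
    """Exact p-th percentile (0 <= p <= 1) of a non-empty sequence.

    Assumption: linear interpolation between the two nearest ranks.
    """
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = p * (len(ordered) - 1)   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])    # position falls exactly on an element
    weight = position - lower           # how far we are towards the upper rank
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


print(percentile([23, 23, 20], 0.5))  # 23.0
```

With p = 0.5 this reduces to the ordinary median: for an even number of rows the result is the average of the two middle values.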

Structure of the "test" table:
hive> desc test;      
OK
firstname            string                                   
age                  int                                      
Time taken: 0.32 seconds, Fetched: 2 row(s)

Here we can see that the age column, whose median we want, is of type INT. PERCENTILE takes a BIGINT argument, so we cast the column to BIGINT.

Let's try out the query:
select percentile(cast(age as BIGINT), 0.5) from test;
Here we cast the age column to BIGINT.
hive> select percentile(cast(age as BIGINT), 0.5) from test;
Query ID = aibladmin_20141211140606_c61cb042-ed14-4048-8270-4cea1eece1c7 
Total jobs = 1 
Launching Job 1 out of 1 
.
.
OK 
23.0 
Time taken: 27.659 seconds, Fetched: 1 row(s)
23.0 is the expected result, which is the median of [23, 23, 20].

16 comments:

  5. Can you please tell me how to compute the statistical mode in Hive for multiple columns in a single query?
