In statistics and probability theory, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half.
The median is the central point of a data set.
Consider the following data points: 1,4,5,6,7
The Median is "5".
Lets see how we will find median in Hive.
Consider a "test" table.
-------------------|Name Age|
------------------
|A 23 |
|B 23 |
|C 20 |
------------------
hive> select * from test;
OK
A 23
B 23
C 20
Time taken: 4.219 seconds, Fetched: 3 row(s)
Lets say we are going to find the median for Age column in "test" table.
Our expected median is "23".
PERCENTILE(BIGINT col,0.5) function helps to compute median in hive.The 50th percentile would be the median.
Structure of "test" table
hive> desc test;
OK
firstname string
age int
Time taken: 0.32 seconds, Fetched: 2 row(s)
Here we can see the column we are going to find median is in INT. We need to convert the column into BIGINT.
Lets try out the query
select percentile(cast(age as BIGINT), 0.5) from test;
Here we casted age column into BIGINT.
hive> select percentile(cast(age as BIGINT), 0.5) from test1; Query ID = aibladmin_20141211140606_c61cb042-ed14-4048-8270-4cea1eece1c7 Total jobs = 1 Launching Job 1 out of 1 . . OK 23.0 Time taken: 27.659 seconds, Fetched: 1 row(s)
23.0 is the expected result which is the median for [23,23,20].