Here is Something !: Hive Partitioning

Tuesday, 18 November 2014

Hive Partitioning

Partitions are horizontal record of data which allows large datasets to get seperated into more managable chunks. In Hive, partitioning is supported for both managed dataset in folders and for external tables also.

1. Hive partition for external tables

Load data into HDFS

Data resides in /user/unmesha/HiveTrail/emp.txt. The file emp.txt is a sample employee data.

1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
6,Margaret,Tech,12000,A
7,Nirmal,Tech,12000,B
8,jinju,Engineer,45000,B
9,Nancy,Admin,50000,A
10,Andrew,Manager,40000,A
11,Arun,Manager,40000,B
12,Harish,Sales,60000,B
13,Robert,Manager,40000,A
14,Laura,Engineer,45000,A
15,Anju,Ceo,100000,B
16,Aarathi,Manager,40000,B
17,Parvathy,Engineer,45000,B
18,Gopika,Admin,50000,B
19,Steven,Engineer,45000,A
20,Michael,Ceo,100000,A

We are going to partition this dataset into 3 Departments A,B,C

2. Create a non partioned table to store the data (Staging table)

create external table Unm_Dup_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/unmesha/HiveTrail';

3. Create Partitioned hive table

create  table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by ",";

Here we are creating partition for Department by using PARTITIONED BY.

4. Insert data into Partitioned table, by using select clause

There are 2 ways to insert data into partition table.

1. Static Partition - Using individual insert

INSERT INTO TABLE Unm_Parti PARTITION(department='A') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='A'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='B') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='B'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='C') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='C';

If we go for the above approach , if we have 50 partitions we need to do the insert statement 50 times. That is a tedeous task and it is known as Static Partition.

2. Dynamic Partition – Single insert to partition table
Inorder to achieve the same we need to set 4 things,

1. set hive.exec.dynamic.partition=true
     This enable dynamic partitions, by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict
     We are using the dynamic partition without a static
     partition (A table can be partitioned based    
     on multiple columns in hive) in such case we have to           
     enable the non strict mode. In strict mode we can use             
     dynamic partition  only with a Static Partition.
3. set hive.exec.max.dynamic.partitions.pernode=3
     The default value is 100, we have to modify the   
     same according to the possible no of partitions
4. hive.exec.max.created.files=150000
     The default values is 100000 but for larger tables  
     it can exceed the default, so we may have to update the same.

INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department FROM Unm_Dup_Parti;

If the table is large enough the above query wont work seems like due to the larger number of files created on initial map task.

So in that cases group the records in your hive query on the map process and process them on the reduce side. You can implement the same in your hive query itself with the usage of DISTRIBUTE BY. Below is the query .

FROM Unm_Dup_Parti 
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) 
SELECT EmployeeID, FirstName,Designation,Salary,department DISTRIBUTE BY department;

With this approach you don’t need to overwrite the hive.exec.max.created.files parameter.

2. Partition on managed Data in HDFS

1. Data are filtered and seperated to different folders in HDFS

2. Create table with partition

create external table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," ;

2. Load data into Unm_Parti table using ALTER statement

ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/unmesha/HIVE/HiveTrailFolder/A';

ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/unmesha/HIVE/HiveTrailFolder/B';

ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/unmesha/HIVE/HiveTrailFolder/C';

34 comments:

SR16 April 2015 at 12:14
Nice!

What if the table outgrows the number of partitions?? Maybe the hive.exec.max.dynamic.partitions.pernode should be set to default??
ReplyDelete
Replies
reavanth10 May 2015 at 17:36
Awesome work, short and straight
ReplyDelete
Replies
Unknown12 May 2015 at 02:22
?
ReplyDelete
Replies
Unknown12 May 2015 at 02:27
If i want to create a External table, which location i should mention in location keyword?
Either '/home/training/anbu/ex_part' or '/user/hive/warehouse/ex_part'(hdfs location)?
ReplyDelete
Replies
Unknown27 July 2015 at 23:07
@Anbu Eswaran: You should use the hdfs location.
ReplyDelete
Replies
Unknown29 July 2015 at 23:07
Hi ,which part of hive query belongs to map side process and which part belongs to reduce side process? please explain.
ReplyDelete
Replies
Unknown26 October 2015 at 01:10
very good explained.Working fine.Thanks.
ReplyDelete
Replies
Unknown3 December 2015 at 00:20
Great article!!! Short and Effective
ReplyDelete
Replies
akhil4 December 2015 at 23:10
Good explanation,Thanks!
ReplyDelete
Replies
Ghanchakkar19 December 2015 at 19:23
How do I add sub partition
ReplyDelete
Replies
Ghanchakkar19 December 2015 at 19:24
How do I add sub partition
ReplyDelete
Replies
TechDude8 February 2016 at 21:14
excellent
ReplyDelete
Replies
Shilpa18 February 2016 at 22:36
Which of the partitions is better, static or dynamic from performance point of view?
ReplyDelete
Replies
Unknown27 March 2016 at 12:47
Great job! thanks for helping..
ReplyDelete
Replies
pa13 April 2016 at 03:11
done!!!!Thank you very much :)
ReplyDelete
Replies
Anonymous22 April 2016 at 01:57
Hi, If want partitioned by (year int,month string, date int).... How would be update query.Please help me in this..
ReplyDelete
Replies
Venu12 May 2016 at 06:33
very very valuable hadoop/hive tip, thanks to share your knowledge, please share more Hive interview tips
ReplyDelete
Replies
Vini11 June 2016 at 20:19
very well explained!
ReplyDelete
Replies
Anonymous6 July 2016 at 04:57
This comment has been removed by the author.
ReplyDelete
Replies
Anonymous6 July 2016 at 04:59
This is the best blog for learning Hive Partitioning
ReplyDelete
Replies
Gyana5 August 2016 at 03:18
http://quicktostudy.com/hive/hive-dynamic-partition.asp
http://quicktostudy.com/hive/hive-partitions.asp
ReplyDelete
Replies
Unknown12 September 2016 at 02:01
Nice explanation..thanks for this..
http://bit.ly/2bZrnGP
ReplyDelete
Replies
James E Lutz3 October 2016 at 09:27
Now you make it easy for me to understand and implement the concept. Thank you for the post really a great efforts. I am really happy to see your blog.cubicles in West Palm Beach
ReplyDelete
Replies
Rakesh17 November 2016 at 19:08
elegant and simple! Thanks.
ReplyDelete
Replies
Dinesh25 April 2017 at 12:12
folder are created as per department but i cant see the data in respective folder.
select * from Unm_Parti where Department = 'A'; also not working.
help me out.
thanks in advance
ReplyDelete
Replies
Unknown29 April 2017 at 06:14
shala khali copy paste . kuch to original likh .
ReplyDelete
Replies
Sunil17 January 2018 at 10:57
Hi Sree, I am trying to come up with best possible solution for the following scenario, Have external hive table partition by effective_date loaded using ETL so I will have a new file each day with new data. Since partition column is at end, when I add new columns to the table what is the best way to correct the existing file system and avoid reload of complete table.
ReplyDelete
Replies
Unknown9 April 2018 at 02:55
Hi,
Thanks for helping me to understand hive partitioning clearly, great job.
Hadoop Training in Velachery | Hadoop Training .
Hadoop Training in Chennai | Hadoop .
ReplyDelete
Replies
Tejuteju20 April 2018 at 06:40
This comment has been removed by the author.
ReplyDelete
Replies
Padminiprwatech27 February 2020 at 21:24
You are doing a great job by sharing useful information about Hadoop course. It is one of the post to read and improve my knowledge in Hadoop.You can check our Hadoop Hive Partition example,for more information about Hadoop static partition in hive.
ReplyDelete
Replies
Abhishek Chavan21 January 2022 at 23:41
I couldn't find out that which location i need to put while creating external table
ReplyDelete
Replies
aditya18 May 2022 at 00:20
To keep current, you need to find trustworthy technology news sources that will provide you with up-to-date information.

Samsung Flash Drive
Green Dinosaur Pokemon
Pubfilm
ReplyDelete
Replies

Add comment