Partitions split a table horizontally, allowing large datasets to be separated into more manageable chunks. In Hive, partitioning is supported both for data loaded through an external staging table and for data that is already organized into folders in HDFS; both cases are covered below.
1. Hive partitioning for external tables
- Load data into HDFS
The data resides in /user/unmesha/HiveTrail/emp.txt; emp.txt contains sample employee records (a sketch of one way to upload it appears after the sample data below).
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
6,Margaret,Tech,12000,A
7,Nirmal,Tech,12000,B
8,jinju,Engineer,45000,B
9,Nancy,Admin,50000,A
10,Andrew,Manager,40000,A
11,Arun,Manager,40000,B
12,Harish,Sales,60000,B
13,Robert,Manager,40000,A
14,Laura,Engineer,45000,A
15,Anju,Ceo,100000,B
16,Aarathi,Manager,40000,B
17,Parvathy,Engineer,45000,B
18,Gopika,Admin,50000,B
19,Steven,Engineer,45000,A
20,Michael,Ceo,100000,A
We are going to partition this dataset into three departments, A, B, and C (the last column).
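If emp.txt is not on HDFS yet, one way to upload it is from the Hive CLI using its dfs commands (a minimal sketch; the local path /tmp/emp.txt is an assumption, and running hdfs dfs -put from a shell works equally well):
dfs -mkdir -p /user/unmesha/HiveTrail;
dfs -put /tmp/emp.txt /user/unmesha/HiveTrail/;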
2. Create a non-partitioned table to store the data (staging table)
create external table Unm_Dup_Parti (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/unmesha/HiveTrail';
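A quick sanity check on the staging table (a sketch; the external table simply reads the delimited file in place):
SELECT * FROM Unm_Dup_Parti LIMIT 5;
SELECT Department, COUNT(*) FROM Unm_Dup_Parti GROUP BY Department;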
3. Create a partitioned table
create table Unm_Parti (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by ",";
Here we create a partition on Department using the PARTITIONED BY clause.
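Note that Department appears only in the PARTITIONED BY clause, not in the regular column list; Hive treats it as a partition column whose value is stored in the directory name rather than in the data files. This can be confirmed with:
DESCRIBE Unm_Parti;
-- Department is listed under the partition information section.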
4. Insert data into the partitioned table using a SELECT clause
There are two ways to insert data into a partitioned table.
1. Static Partition – using individual inserts
INSERT INTO TABLE Unm_Parti PARTITION(department='A') SELECT EmployeeID, FirstName, Designation, Salary FROM Unm_Dup_Parti WHERE department='A';
INSERT INTO TABLE Unm_Parti PARTITION(department='B') SELECT EmployeeID, FirstName, Designation, Salary FROM Unm_Dup_Parti WHERE department='B';
INSERT INTO TABLE Unm_Parti PARTITION(department='C') SELECT EmployeeID, FirstName, Designation, Salary FROM Unm_Dup_Parti WHERE department='C';
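Once the static inserts finish, queries that filter on the partition column read only the matching partition directory (partition pruning). A quick check, as a sketch:
SELECT * FROM Unm_Parti WHERE Department = 'A';
-- scans only the department=A partition, not the whole table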
2. Dynamic Partition – a single insert into the partitioned table
In order to achieve this we need to set four properties:
1. set hive.exec.dynamic.partition=true;
This enables dynamic partitioning; by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict;
We are using dynamic partitioning without any static partition (a table can be partitioned on multiple columns in Hive), so we have to enable non-strict mode. In strict mode a dynamic partition can only be used together with a static partition.
3. set hive.exec.max.dynamic.partitions.pernode=3;
The default value is 100; adjust it according to the possible number of partitions.
4. set hive.exec.max.created.files=150000;
The default value is 100000, but for larger tables the job can exceed it, so we may have to raise it.
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department FROM Unm_Dup_Parti;
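In a dynamic-partition insert the partition column (department here) must be the last expression in the SELECT list; Hive assigns partition values by position, not by name. After the insert, the created partitions can be listed as a check (a sketch):
SHOW PARTITIONS Unm_Parti;
-- expected: department=A, department=B, department=C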
If the insert above has to create many files (each mapper may write a file for every partition it sees), hive.exec.max.created.files can be exceeded. In that case, group the records on the partition column on the map side and write them out on the reduce side. You can do this in the Hive query itself with DISTRIBUTE BY. Below is the query.
FROM Unm_Dup_Parti INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department DISTRIBUTE BY department;
With this approach you don't need to override the hive.exec.max.created.files parameter.
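To verify the effect on file counts, you can list a partition directory from the Hive CLI; the warehouse path used here is an assumption, so check the table's actual location first:
-- shows the table's storage Location
DESCRIBE FORMATTED Unm_Parti;
-- the path below is an assumption; adjust it to the Location reported above
dfs -ls /user/hive/warehouse/unm_parti/department=A;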
2. Partitioning data already organized into folders in HDFS
1. Data is filtered and separated into different folders in HDFS, one folder per department
2. Create a table with partitions
create external table Unm_Parti (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," ;
3. Load data into the Unm_Parti table using ALTER TABLE statements
ALTER TABLE Unm_Parti ADD PARTITION (Department='A') location '/user/unmesha/HIVE/HiveTrailFolder/A';
ALTER TABLE Unm_Parti ADD PARTITION (Department='B') location '/user/unmesha/HIVE/HiveTrailFolder/B';
ALTER TABLE Unm_Parti ADD PARTITION (Department='C') location '/user/unmesha/HIVE/HiveTrailFolder/C';
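ALTER TABLE ... ADD PARTITION works regardless of how the folders are named. As a side note (an alternative not used in this walkthrough), if the folders under the table's location followed Hive's key=value naming convention (.../Department=A, .../Department=B, .../Department=C), all partitions could be registered in one statement:
MSCK REPAIR TABLE Unm_Parti;
SHOW PARTITIONS Unm_Parti;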
Nice!
What if the table outgrows the number of partitions? Maybe hive.exec.max.dynamic.partitions.pernode should be left at its default?
Awesome work, short and straight.
If I want to create an external table, which location should I mention in the LOCATION keyword: '/home/training/anbu/ex_part' or '/user/hive/warehouse/ex_part' (the HDFS location)?
@Anbu Eswaran: You should use the HDFS location.
Hi, which part of a Hive query runs on the map side and which part runs on the reduce side? Please explain.
Very well explained. Working fine. Thanks.
Great
Great article! Short and effective.
Good explanation, thanks!
How do I add a sub-partition?
Excellent.
Which of the partitions is better from a performance point of view, static or dynamic?
Great job! Thanks for helping.
Thanks, Chandra.
Done! Thank you very much :)
Hi, if I want PARTITIONED BY (year int, month string, date int), how would I update the query? Please help me with this.
Very valuable Hadoop/Hive tip, thanks for sharing your knowledge. Please share more Hive interview tips.
Very well explained!
This is the best blog for learning Hive partitioning.
Nice explanation, thanks for this.
Now you make it easy to understand and implement the concept. Thank you for the post, really a great effort.
Elegant and simple! Thanks.
Folders are created per department, but I can't see the data in the respective folders. select * from Unm_Parti where Department = 'A'; is also not working. Help me out, thanks in advance.
Just copy-paste. At least write something original.
Hi Sree, I am trying to come up with the best possible solution for the following scenario: I have an external Hive table partitioned by effective_date and loaded by ETL, so a new file with new data arrives each day. Since the partition column comes last, when I add new columns to the table, what is the best way to correct the existing files and avoid reloading the complete table?
Hi, thanks for helping me understand Hive partitioning clearly. Great job.
You are doing a great job by sharing useful information about Hadoop. It is a good post to read and improve my knowledge of Hadoop.
I couldn't find out which location I need to specify while creating the external table.