Tuesday, 18 November 2014

Hive Partitioning

 Partitions are horizontal record of data which allows large datasets to get seperated into more managable chunks. In Hive, partitioning is supported for both managed dataset in folders and for external tables also.


1. Hive partition for external tables

  1. Load data into HDFS

       Data resides in /user/unmesha/HiveTrail/emp.txt. The file emp.txt is a sample employee data.


1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
6,Margaret,Tech,12000,A
7,Nirmal,Tech,12000,B
8,jinju,Engineer,45000,B
9,Nancy,Admin,50000,A
10,Andrew,Manager,40000,A
11,Arun,Manager,40000,B
12,Harish,Sales,60000,B
13,Robert,Manager,40000,A
14,Laura,Engineer,45000,A
15,Anju,Ceo,100000,B
16,Aarathi,Manager,40000,B
17,Parvathy,Engineer,45000,B
18,Gopika,Admin,50000,B
19,Steven,Engineer,45000,A
20,Michael,Ceo,100000,A

We are going to partition this dataset into 3 Departments A,B,C


 2. Create a non partioned table to store the data (Staging table)

create external table Unm_Dup_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/unmesha/HiveTrail';

3. Create Partitioned hive table
create  table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by ","; 
Here we are creating partition for Department by using PARTITIONED BY.

4. Insert data into Partitioned table, by using select clause

       There are 2 ways to insert data into partition table.

 1. Static Partition - Using individual insert
INSERT INTO TABLE Unm_Parti PARTITION(department='A') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='A'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='B') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='B'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='C') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='C';

  If we go for the above approach , if we have 50 partitions we need to do the insert statement 50 times. That is a tedeous task and it is known as Static Partition.

 2. Dynamic Partition – Single insert to partition table
             Inorder to achieve the same we need to set 4 things,
1. set hive.exec.dynamic.partition=true
     This enable dynamic partitions, by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict
     We are using the dynamic partition without a static
     partition (A table can be partitioned based    
     on multiple columns in hive) in such case we have to           
     enable the non strict mode. In strict mode we can use             
     dynamic partition  only with a Static Partition.
3. set hive.exec.max.dynamic.partitions.pernode=3
     The default value is 100, we have to modify the   
     same according to the possible no of partitions
4. hive.exec.max.created.files=150000
     The default values is 100000 but for larger tables  
     it can exceed the default, so we may have to update the same.            
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department FROM Unm_Dup_Parti; 

If the table is large enough the above query wont work seems like due to the larger number of files created on initial map task. 

So in that cases group the records in your hive query on the map process and process them on the reduce side. You can implement the same in your hive query itself with the usage of DISTRIBUTE BY. Below is the query .
FROM Unm_Dup_Parti 
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) 
SELECT EmployeeID, FirstName,Designation,Salary,department DISTRIBUTE BY department;
With this approach you don’t need to overwrite the hive.exec.max.created.files parameter.


2. Partition on managed Data in HDFS


 1. Data are filtered and seperated to different folders in HDFS

2. Create table with partition

create external table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," ;

 2. Load data into Unm_Parti table using ALTER statement

ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/unmesha/HIVE/HiveTrailFolder/A';

ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/unmesha/HIVE/HiveTrailFolder/B';

ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/unmesha/HIVE/HiveTrailFolder/C';



34 comments:

  1. Nice!

    What if the table outgrows the number of partitions?? Maybe the hive.exec.max.dynamic.partitions.pernode should be set to default??

    ReplyDelete
  2. Awesome work, short and straight

    ReplyDelete
  3. If i want to create a External table, which location i should mention in location keyword?
    Either '/home/training/anbu/ex_part' or '/user/hive/warehouse/ex_part'(hdfs location)?

    ReplyDelete
  4. @Anbu Eswaran: You should use the hdfs location.

    ReplyDelete
  5. Hi ,which part of hive query belongs to map side process and which part belongs to reduce side process? please explain.

    ReplyDelete
  6. very good explained.Working fine.Thanks.

    ReplyDelete
  7. Great article!!! Short and Effective

    ReplyDelete
  8. Which of the partitions is better, static or dynamic from performance point of view?

    ReplyDelete
  9. Great job! thanks for helping..

    ReplyDelete
  10. done!!!!Thank you very much :)

    ReplyDelete
  11. Hi, If want partitioned by (year int,month string, date int).... How would be update query.Please help me in this..

    ReplyDelete
  12. very very valuable hadoop/hive tip, thanks to share your knowledge, please share more Hive interview tips

    ReplyDelete
  13. This comment has been removed by the author.

    ReplyDelete
  14. This is the best blog for learning Hive Partitioning

    ReplyDelete
  15. http://quicktostudy.com/hive/hive-dynamic-partition.asp
    http://quicktostudy.com/hive/hive-partitions.asp

    ReplyDelete
  16. Now you make it easy for me to understand and implement the concept. Thank you for the post really a great efforts. I am really happy to see your blog.cubicles in West Palm Beach

    ReplyDelete
  17. folder are created as per department but i cant see the data in respective folder.
    select * from Unm_Parti where Department = 'A'; also not working.
    help me out.
    thanks in advance

    ReplyDelete
  18. shala khali copy paste . kuch to original likh .

    ReplyDelete
  19. Hi Sree, I am trying to come up with best possible solution for the following scenario, Have external hive table partition by effective_date loaded using ETL so I will have a new file each day with new data. Since partition column is at end, when I add new columns to the table what is the best way to correct the existing file system and avoid reload of complete table.

    ReplyDelete
  20. Hi,
    Thanks for helping me to understand hive partitioning clearly, great job.
    Hadoop Training in Velachery | Hadoop Training .
    Hadoop Training in Chennai | Hadoop .

    ReplyDelete
  21. This comment has been removed by the author.

    ReplyDelete
  22. You are doing a great job by sharing useful information about Hadoop course. It is one of the post to read and improve my knowledge in Hadoop.You can check our Hadoop Hive Partition example,for more information about Hadoop static partition in hive.

    ReplyDelete
  23. I couldn't find out that which location i need to put while creating external table

    ReplyDelete
  24. To keep current, you need to find trustworthy technology news sources that will provide you with up-to-date information.

    Samsung Flash Drive
    Green Dinosaur Pokemon
    Pubfilm

    ReplyDelete