Here is Something !: Hive Partitioning

Tuesday, 18 November 2014

Hive Partitioning

Partitions are horizontal record of data which allows large datasets to get seperated into more managable chunks. In Hive, partitioning is supported for both managed dataset in folders and for external tables also.

1. Hive partition for external tables

Load data into HDFS

Data resides in /user/unmesha/HiveTrail/emp.txt. The file emp.txt is a sample employee data.

1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
6,Margaret,Tech,12000,A
7,Nirmal,Tech,12000,B
8,jinju,Engineer,45000,B
9,Nancy,Admin,50000,A
10,Andrew,Manager,40000,A
11,Arun,Manager,40000,B
12,Harish,Sales,60000,B
13,Robert,Manager,40000,A
14,Laura,Engineer,45000,A
15,Anju,Ceo,100000,B
16,Aarathi,Manager,40000,B
17,Parvathy,Engineer,45000,B
18,Gopika,Admin,50000,B
19,Steven,Engineer,45000,A
20,Michael,Ceo,100000,A

We are going to partition this dataset into 3 Departments A,B,C

2. Create a non partioned table to store the data (Staging table)

create external table Unm_Dup_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/unmesha/HiveTrail';

3. Create Partitioned hive table

create  table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by ",";

Here we are creating partition for Department by using PARTITIONED BY.

4. Insert data into Partitioned table, by using select clause

There are 2 ways to insert data into partition table.

1. Static Partition - Using individual insert

INSERT INTO TABLE Unm_Parti PARTITION(department='A') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='A'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='B') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='B'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='C') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='C';

If we go for the above approach , if we have 50 partitions we need to do the insert statement 50 times. That is a tedeous task and it is known as Static Partition.

2. Dynamic Partition – Single insert to partition table
Inorder to achieve the same we need to set 4 things,

1. set hive.exec.dynamic.partition=true
     This enable dynamic partitions, by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict
     We are using the dynamic partition without a static
     partition (A table can be partitioned based    
     on multiple columns in hive) in such case we have to           
     enable the non strict mode. In strict mode we can use             
     dynamic partition  only with a Static Partition.
3. set hive.exec.max.dynamic.partitions.pernode=3
     The default value is 100, we have to modify the   
     same according to the possible no of partitions
4. hive.exec.max.created.files=150000
     The default values is 100000 but for larger tables  
     it can exceed the default, so we may have to update the same.

INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department FROM Unm_Dup_Parti;

If the table is large enough the above query wont work seems like due to the larger number of files created on initial map task.

So in that cases group the records in your hive query on the map process and process them on the reduce side. You can implement the same in your hive query itself with the usage of DISTRIBUTE BY. Below is the query .

FROM Unm_Dup_Parti 
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) 
SELECT EmployeeID, FirstName,Designation,Salary,department DISTRIBUTE BY department;

With this approach you don’t need to overwrite the hive.exec.max.created.files parameter.

2. Partition on managed Data in HDFS

1. Data are filtered and seperated to different folders in HDFS

2. Create table with partition

create external table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," ;

2. Load data into Unm_Parti table using ALTER statement

ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/unmesha/HIVE/HiveTrailFolder/A';

ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/unmesha/HIVE/HiveTrailFolder/B';

ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/unmesha/HIVE/HiveTrailFolder/C';