Partitions are horizontal record of data which allows large datasets to get seperated into more managable chunks. In Hive, partitioning is supported for both managed dataset in folders and for external tables also.
1. Hive partition for external tables
- Load data into HDFS
Data resides in /user/unmesha/HiveTrail/emp.txt. The file emp.txt is a sample employee data.
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
6,Margaret,Tech,12000,A
7,Nirmal,Tech,12000,B
8,jinju,Engineer,45000,B
9,Nancy,Admin,50000,A
10,Andrew,Manager,40000,A
11,Arun,Manager,40000,B
12,Harish,Sales,60000,B
13,Robert,Manager,40000,A
14,Laura,Engineer,45000,A
15,Anju,Ceo,100000,B
16,Aarathi,Manager,40000,B
17,Parvathy,Engineer,45000,B
18,Gopika,Admin,50000,B
19,Steven,Engineer,45000,A
20,Michael,Ceo,100000,A
We are going to partition this dataset into 3 Departments A,B,C
2. Create a non partioned table to store the data (Staging table)
create external table Unm_Dup_Parti (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/unmesha/HiveTrail';
3. Create Partitioned hive table
create table Unm_Parti (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by ",";
Here we are creating partition for Department by using PARTITIONED BY.
4. Insert data into Partitioned table, by using select clause
There are 2 ways to insert data into partition table.
1. Static Partition - Using individual insert
INSERT INTO TABLE Unm_Parti PARTITION(department='A')
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='A';
INSERT INTO TABLE Unm_Parti PARTITION (department='B')
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='B';
INSERT INTO TABLE Unm_Parti PARTITION (department='C')
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='C';
If we go for the above approach , if we have 50 partitions we need to do the insert statement 50 times. That is a tedeous task and it is known as Static Partition.
2. Dynamic Partition – Single insert to partition table
Inorder to achieve the same we need to set 4 things,
1. set hive.exec.dynamic.partition=true
This enable dynamic partitions, by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict
We are using the dynamic partition without a static
partition (A table can be partitioned based
on multiple columns in hive) in such case we have to
enable the non strict mode. In strict mode we can use
dynamic partition only with a Static Partition.
3. set hive.exec.max.dynamic.partitions.pernode=3
The default value is 100, we have to modify the
same according to the possible no of partitions
4. hive.exec.max.created.files=150000
The default values is 100000 but for larger tables
it can exceed the default, so we may have to update the same.
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department FROM Unm_Dup_Parti;
If the table is large enough the above query wont work seems like due to the larger number of files created on initial map task.
So in that cases group the records in your hive query on the map process and process them on the reduce side. You can implement the same in your hive query itself with the usage of DISTRIBUTE BY. Below is the query .
FROM Unm_Dup_Parti
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department)
SELECT EmployeeID, FirstName,Designation,Salary,department DISTRIBUTE BY department;
With this approach you don’t need to overwrite the hive.exec.max.created.files parameter.
2. Partition on managed Data in HDFS
1. Data are filtered and seperated to different folders in HDFS
2. Create table with partition
create external table Unm_Parti (EmployeeID Int,FirstName String,Designation String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," ;
2. Load data into Unm_Parti table using ALTER statement
ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/unmesha/HIVE/HiveTrailFolder/A';
ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/unmesha/HIVE/HiveTrailFolder/B';
ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/unmesha/HIVE/HiveTrailFolder/C';