Tuesday, 2 December 2014

Hive Bucketed Tables


In the previous post we saw how to create partitioned tables in Hive.

Let's see how to create buckets in a Hive table.


The main difference between Hive partitioning and bucketing is that partitioning creates one partition for each unique value of the partition column. When that column has many distinct values, this can leave you with a large number of tiny partitions. With bucketing, you choose a fixed number of buckets and Hive distributes the rows across them. In Hive, a partition is a directory, whereas a bucket is a file.
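To make the contrast concrete, here is a minimal sketch of the two kinds of table definition. The table names emp_partitioned and emp_bucketed_sketch are hypothetical and used only for illustration; the columns follow the employee example used later in this post.

```sql
-- Partitioned (hypothetical table): one directory per distinct Department value.
-- The partition column is declared in PARTITIONED BY, not in the column list.
create table emp_partitioned (
  EmployeeID Int,
  FirstName String,
  Designation String,
  Salary Int
)
partitioned by (Department String);

-- Bucketed (hypothetical table): a fixed number of files, whatever the number
-- of distinct Department values. The bucketing column is a regular column.
create table emp_bucketed_sketch (
  EmployeeID Int,
  FirstName String,
  Designation String,
  Salary Int,
  Department String
)
clustered by (Department) into 3 buckets;
```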



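In Hive, bucketing is not enforced by default. You have to set the following property, in the same session and before loading data into the bucketed table, to enable it: set hive.enforce.bucketing=true; (In Hive 2.0 and later this property was removed and bucketing is always enforced.)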


1. Create a staging table to store your data

create external table stagingtbl (
  EmployeeID Int,
  FirstName String,
  Designation String,
  Salary Int,
  Department String
)
row format delimited fields terminated by ','
location '/user/aibladmin/Hive';

2. Create bucketed table

create table emp_bucket (
  EmployeeID Int,
  FirstName String,
  Designation String,
  Salary Int,
  Department String
)
clustered by (Department) into 3 buckets
row format delimited fields terminated by ',';

3. Load data from stagingtbl to bucketed table

from stagingtbl
insert into table emp_bucket
select employeeid, firstname, designation, salary, department;
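If only one file shows up after the load, one possible cause on older Hive releases is that bucketing enforcement was not switched on in the session that ran the insert. The following is a minimal sketch of the whole load in a single session, written in the more familiar INSERT ... SELECT order rather than the FROM-first form above. It assumes the stagingtbl and emp_bucket tables from steps 1 and 2 already exist.

```sql
-- Must be set in the same session, before the insert (needed on Hive releases
-- that still have this property).
set hive.enforce.bucketing=true;

-- Same load as in step 3, with the select list first and the source table last.
insert into table emp_bucket
select employeeid, firstname, designation, salary, department
from stagingtbl;
```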


4. Check how many data files were created in the Hive warehouse.


Let's check the table contents in the Hive warehouse directory.
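One way to list the data files without leaving the Hive shell is the dfs command. The path below is an assumption: it is the default warehouse location for a table named emp_bucket in the default database, and your cluster may be configured with a different hive.metastore.warehouse.dir.

```sql
-- Assumed default warehouse path; adjust it to your installation.
dfs -ls /user/hive/warehouse/emp_bucket;
```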




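We can find 3 files in the warehouse directory, one per bucket. In this data set the departments are A, B and C, and each one landed in its own bucket. In general, Hive picks the bucket by hashing the clustering column and taking the result modulo the number of buckets, so all rows with the same department always go to the same file, while several departments may share a bucket.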
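To see which file each department actually landed in, you can query Hive's INPUT__FILE__NAME virtual column. This is a short sketch against the emp_bucket table from step 2. The TABLESAMPLE query at the end reads a single bucket; bucket numbers there start at 1.

```sql
-- Show the data file (bucket) that holds each department's rows.
select department, INPUT__FILE__NAME as bucket_file, count(*) as row_count
from emp_bucket
group by department, INPUT__FILE__NAME;

-- Read only the first of the 3 buckets.
select * from emp_bucket tablesample(bucket 1 out of 3 on department);
```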

13 comments:

  1. Hi Sreeveni,
    Did you use a bucket map join? Can you explain the use case for a bucket map join with a simple example?


    Thanks
    Hareesh

  3. When I tried this, I didn't see all 3 buckets created; only one was created, with all the data in it. Can you please explain whether I need to set anything other than what was mentioned here?

  4. Hi Sreeveni, nice explanation.
    One small question: when exactly would we use bucketing? Can you explain some real-world scenarios, please?
    Thanks
    Venu
    http://www.apachespark.in

  6. Thank you. Very helpful explanation of Hive bucketing. You can also find full details about Hive partitioning and bucketing, as well as the Hadoop ecosystem, with clear examples at the link below: http://www.geoinsyssoft.com/hive-partition-bucketing/

  9. very nice explanation..thanks for sharing..and visit our site for more on hadoop..
    http://bit.ly/2bZrnGP

  10. its a very good explanation for hive bucketing..easy to learn..

    http://bit.ly/2dcxHPD

  11. Good explanation. One question:
    How is the data distributed among the buckets, and
    how does Hive decide which department goes to which bucket?
