Tuesday, 2 December 2014

Hive Bucketed Tables


In previous post we had seen how  to create partition tables in Hive.

Lets see how to create buckets in Hive table


The main difference between Hive partitioning and Bucketing is ,when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory but a bucket is a file.



In hive, bucketing does not work by default. You will have to set following variable to enable bucketing. set hive.enforce.bucketing=true;


1. Creating a staging table to store your data

create external table stagingtbl (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/aibladmin/Hive'; 

2. Create bucketed table

create table emp_bucket (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) clustered by (department) into 3 buckets row format delimited fields terminated by ",";

3. Load data from stagingtbl to bucketed table

from stagingtbl insert into table emp_bucket 
       select employeeid,firstname,designation,salary,department;


4. Check how many data file have created in Hive metastore.


Lets check the table content in Hive warehouse




We can find 3 files in warehouse directory for department A,B and C.Each bucket contains unique values.

33 comments:

  1. Replies
    1. article is nice,
      hive bucketing

      It is a way of dividing a table into related parts based on the values of partitioned columns. example date, city and department.
      It is helpful when the table has one or more Partition keys.
      Hive partition is a sub-directory in the table directory.
      It is a basic unit of data storage method used in Apache hive

      Delete
  2. Hi Sreeveni,
    Did you use bucket map join. can you explain usecase for bucket map join.explain with simple example


    Thanks
    Hareesh

    ReplyDelete
  3. For latest and updated Cloudera certification dumps in PDF format contact us at completeexamcollection@gmail.com.
    Refer our blog for more details http://completeexamcollection.blogspot.in/2015/04/cloudera-hadoop-certification-dumps.html

    ReplyDelete
  4. When i tried this i don't see all the 3buckets created only one is created with all data can you please explain me whether do i need to set anything other than what was mentioned here.

    ReplyDelete
  5. Hi sreeveni, Nice explanation,
    I need a small info when would we use exactly this bucketing concepts? real time scenarios can you explain pls?!
    Thanks
    Venu
    http://www.apachespark.in

    ReplyDelete
    Replies
    1. hive bucketing
      Advantages
      It reduces the query latency (delay) by scanning only relevant partitioned data instead of the whole data set. i.e No need to search entire table columns for a single record.
      It distributes execution load horizontally.
      Hive partition reduces the query processing time on the partitioned data set.
      It is suitable for low volume of data. e.g. Calculate the population of “Bangalore city” is very fast instead of searching entire population of world.

      Delete
  6. This comment has been removed by the author.

    ReplyDelete
  7. Thank you. Very helpful explanation for hive bucketing. you can also see the full details about hive partition and bucketing as well as the hadoop ecosystems in-depth with clear examples in the below link http://www.geoinsyssoft.com/hive-partition-bucketing/

    ReplyDelete
  8. This comment has been removed by the author.

    ReplyDelete
  9. This comment has been removed by the author.

    ReplyDelete
  10. very nice explanation..thanks for sharing..and visit our site for more on hadoop..
    http://bit.ly/2bZrnGP

    ReplyDelete
  11. its a very good explanation for hive bucketing..easy to learn..

    http://bit.ly/2dcxHPD

    ReplyDelete
  12. Good Explanation..One question
    How the data distributed among buckets and
    How hive decides which department will move to which bucket ?

    ReplyDelete
  13. Excellent blog on Hive Bucketed Tables. Thank you sharing you knowledge with us.
    Devops Training in Bangalore
    itEanz

    ReplyDelete
  14. Is there a way to re-create the same bucketed table again after droping it.
    like in partitioning we use msck repair to get our partitioned data.

    ReplyDelete
  15. Great read
    I feel extremely cheerful to have seen your website page and anticipate such a significant number of all the more engaging circumstances perusing here. Much appreciated yet again for every one of the points of interest.
    Best AWS Training in Bangalore

    ReplyDelete
  16. I have to voice my passion for your kindness giving support to those people that should have guidance on this important matter.

    AWS Training in Bangalore

    ReplyDelete