Thursday 11 December 2014

Computing Median In Hive

In statistics and probability theory, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half.

The median is the central point of a data set.

Consider the following data points: 1,4,5,6,7
The Median is "5".

Lets see how we will find median in Hive.

Consider a "test" table.
-------------------
|Name Age|
------------------
|A  23 |
|B    23 |
|C  20 |
------------------
hive> select * from test;
OK
A 23
B 23
C 20
Time taken: 4.219 seconds, Fetched: 3 row(s)

Lets say we are going to find the median for Age column in "test" table.
Our expected median is "23".

PERCENTILE(BIGINT col,0.5) function helps to compute median in hive.The 50th percentile would be the median.

Structure of  "test" table
hive> desc test;      
OK
firstname            string                                   
age                  int                                      
Time taken: 0.32 seconds, Fetched: 2 row(s)

Here we can see the column we are going to find median is in INT. We need to convert the column into BIGINT.

Lets try out the query
select percentile(cast(age as BIGINT), 0.5) from test; 
Here we casted age column into BIGINT.
hive> select percentile(cast(age as BIGINT), 0.5) from test1; 
Query ID = aibladmin_20141211140606_c61cb042-ed14-4048-8270-4cea1eece1c7 
Total jobs = 1 
Launching Job 1 out of 1 
.
.
OK 
23.0 
Time taken: 27.659 seconds, Fetched: 1 row(s)
23.0 is the expected result which is the median for [23,23,20].

Sunday 7 December 2014

Joining Two Files Using MultipleInput In Hadoop MapReduce - MapSide Join

There are cases where we need to get 2 files as input and join them based on id or something like that.
Two different large data can be joined in map reduce programming also. Joins in Map phase refers as Map side join, while join at reduce side called as reduce side join.  
MapSide can be achieved using MultipleInputFormat in Hadoop.

Say I have 2 files ,One file with EmployeeID,Name,Designation and another file with EmployeeID,Salary,Department.

File1.txt
1 Anne,Admin
2 Gokul,Admin
3 Janet,Sales
4 Hari,Admin

AND

File2.txt
1 50000,A
2 50000,B
3 60000,A
4 50000,C

We will try to join these files into one based on EmployeeID
The result we aim at is 

1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C

Here in both file File1.txt,File2.txt we can see that we need to join the records based on id.  So the employeeId's are common.
We will write 2 map jobs to process these files.

Processing File1.txt
public void map(LongWritable k, Text value, Context context) throws IOException, InterruptedException
{
 String line=value.toString();
 String[] words=line.split("\t");
 keyEmit.set(words[0]);
 valEmit.set(words[1]);
 context.write(keyEmit, valEmit);
}

The above map job process File1.txt
String[] words=line.split("\t");
splits each line with \t space so words[0] will be the employeeId which we pass it as key and the rest as value.

eg: 1 Anne,Admin
words[0] = 1
words[1] = Anne,Admin

Or else you can also use KeyValueTextInputFormat.class as InputFormat. This class gives key as employeeId and the rest as value.
You dont need to split it.

Processing File2.txt
public void map(LongWritable k, Text v, Context context) throws IOException, InterruptedException
{
 String line=v.toString();
 String[] words=line.split(" ");
 keyEmit.set(words[0]);
 valEmit.set(words[1]);
 context.write(keyEmit, valEmit);
}

The above map job process File2.txt

eg: 1 50000,A
words[0] = 1
words[1] = 50000,A

If the files are of same delimiter and ID comes first you can resuse the same map job

Lets write a commomn Reducer task to join the data using key.
String merge = "";
public void reduce(Text key, Iterable<Text> values, Context context)
{
 int i =0;
 for(Text value:values)
 {
  if(i == 0){
   merge = value.toString()+",";
  }
  else{
   merge += value.toString();
  }
  i++;
 }
 valEmit.set(merge);
 context.write(key, valEmit);
}

Here we will be caching 1 data from a mapper and appends it to string "merge".
And emit employeeId as key and merge as value.

Now we need to furnish our Driver class to take 2 inputs and use MultipleInputFormat as InputFormat


public int run(String[] args) throws Exception {
 Configuration c=new Configuration();
 String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
 Path p1=new Path(files[0]);
 Path p2=new Path(files[1]);
 Path p3=new Path(files[2]);
 FileSystem fs = FileSystem.get(c);
 if(fs.exists(p3)){
  fs.delete(p3, true);
  }
 Job job = new Job(c,"Multiple Job");
 job.setJarByClass(MultipleFiles.class);
 MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
 MultipleInputs.addInputPath(job,p2, TextInputFormat.class, MultipleMap2.class);
 job.setReducerClass(MultipleReducer.class);
 .
 .
}

MultipleInputs.addInputPath(job, p1, TextInputFormat.class, MultipleMap1.class);
MultipleInputs.addInputPath(job,p2, TextInputFormat.class, MultipleMap2.class);
p1,p2 are the Path variable holding 2 input files.
You can find the code in Github

There is one more case where we can make our output in a sequential manner.
Say if we need to get the output as below
1 Anne,Admin,50000,A
2 Gokul,Admin,50000,B
3 Janet,Sales,60000,A
4 Hari,Admin,50000,C
Inorder to achieve the same we can make use of TextPair Writable concepts in Hadoop.
You can find the working code in github . Thanks to one of  my blog reader Ravi Kumar who sorted out the sequence in order.

Tuesday 2 December 2014

Hive Bucketed Tables


In previous post we had seen how  to create partition tables in Hive.

Lets see how to create buckets in Hive table


The main difference between Hive partitioning and Bucketing is ,when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory but a bucket is a file.



In hive, bucketing does not work by default. You will have to set following variable to enable bucketing. set hive.enforce.bucketing=true;


1. Creating a staging table to store your data

create external table stagingtbl (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/aibladmin/Hive'; 

2. Create bucketed table

create table emp_bucket (EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) clustered by (department) into 3 buckets row format delimited fields terminated by ",";

3. Load data from stagingtbl to bucketed table

from stagingtbl insert into table emp_bucket 
       select employeeid,firstname,designation,salary,department;


4. Check how many data file have created in Hive metastore.


Lets check the table content in Hive warehouse




We can find 3 files in warehouse directory for department A,B and C.Each bucket contains unique values.

Monday 1 December 2014

How To Drop A Particular Partition in HIVE


Hive Partition can be dropped using  

ALTER TABLE Tablename DROP IF EXISTS
 PARTITION(PartitionedID=PartitionVALUE);

Lets see an example.
Say I have an emp Hive Table where there are 3 partitions for Department(A,B,C).
Inorder to delete a particular Department use the below query.
ALTER TABLE emp DROP IF EXISTS
  PARTITION(Department='A');

Tuesday 25 November 2014

[SOLVED] FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations in hive-0.14.0


CRUD operations are supported in Hive from 0.14 onwards.
See Wiki 

Hive supports data warehouse software facility,which facilitates querying and managing large datasets residing in distributed storage. In data warehouse there are situation where we need to update, delete etc transactions.In hive later versions UPDATE was not supported,but there were workarounds to do update a transaction

1. Update Statement In Hive For Small Tables
2. Update Statement In Hive For Large Tables using INSERT


Lets see how to do INSERT,UPDATE,DELETE in newer version of hive. 

Create a table "test"
CREATE EXTERNAL TABLE 
    test (EmployeeID Int,FirstName String,Designation  
        String,Salary Int,Department String) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY  "," 
    LOCATION '/user/hdfs/Hive';
We will try to update the salary of employee id 19 from 45,000 to 50,000.
 hive> UPDATE test 
           SET salary = 50000 
           WHERE employeeid = 19;

 FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction m anager that does not support these operations.

While applying above query it shows a semantic Exception.In order to allow update and delete we need to add additional settings in hive-site.xml and create table with ACID output format support.

To achieve the same follow below steps:

1. New Configuration Parameters for Transactions
 hive.support.concurrency – true
 hive.enforce.bucketing – true
 hive.exec.dynamic.partition.mode – nonstrict
 hive.txn.manager –org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
 hive.compactor.initiator.on – true
 hive.compactor.worker.threads – 1
You can set these configuration in hive-site.xml (after setting restart Hive ) for ever or via terminal.
Dont Forget to restart Hive once the above settings are applied, else you will get the same error again.
2. Below query creates HiveTest table with ACID support
(To do Update,delete or Insert we need to create a table that support ACID properties)
 create table HiveTest 
   (EmployeeID Int,FirstName String,Designation String,
     Salary Int,Department String) 
   clustered by (department) into 3 buckets 
   stored as orc TBLPROPERTIES ('transactional'='true') ;
3. Load data into HiveTest from a staging table,which contains the original data.
 from stagingtbl 
   insert into table HiveTest 
   select employeeid,firstname,designation,salary,department;

4. UPDATE,DELETE and INSERT operations


1.UPDATE
 update HiveTest 
    set salary = 50000 
    where employeeid = 19; 

SYNOPSIS

  1. The referenced column must be a column of the table being updated.
  2. The value assigned must be an expression that Hive supports in the select clause.  Thus arithmetic operators, UDFs, casts, literals, etc. are supported.  Subqueries are not supported.
  3. Only rows that match the WHERE clause will be updated.
  4. Partitioning columns cannot be updated.
  5. Bucketing columns cannot be updated.
  6. In Hive 0.14, upon successful completion of this operation the changes will be auto-committed.


2. INSERT
 insert into table HiveTest 
     values(21,'Hive','Hive',0,'B');

SYNOPSIS

  1. Each row listed in the VALUES clause is inserted into table tablename.
  2. Values must be provided for every column in the table.  The standard SQL syntax that allows the user to insert values into only some columns is not yet supported.  To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.
  3. Dynamic partitioning is supported in the same way as for INSERT...SELECT.
  4. If the table being inserted into supports ACID and a transaction manager that supports ACID is in use, this operation will be auto-committed upon successful completion.



3. DELETE
 delete from HiveTest
     where employeeid=19;

SYNOPSIS
  1. Only rows that match the WHERE clause will be deleted.
  2. In Hive 0.14, upon successful completion of this operation the changes will be auto-committed.

Tuesday 18 November 2014

Update Statement In Hive For Large Tables


Hive Version used - hive-0.12.0

In Previous Blog  we have seen creating and loading data into partition table.
Now we will try to update one record using INSERT statement as hive doesnt support UPDATE command. In newer version of hive, UPDATE command will be added.

 We will see an example for updating Salary of employee id 19 to 50,000

INSERT INTO TABLE Unm_Parti PARTITION (Department = 'A') SELECT employeeid,firstname,designation, CASE WHEN employeeid=19 THEN 50000 ELSE salary END AS salary FROM Unm_Parti Where employeeid=19;
Using the above command your hive record get updated.

From hive-0.14 onwards UPDATE is available.
How to use CURD operations in hive-0.14.0

Hive Partitioning

 Partitions are horizontal record of data which allows large datasets to get seperated into more managable chunks. In Hive, partitioning is supported for both managed dataset in folders and for external tables also.


1. Hive partition for external tables

  1. Load data into HDFS

       Data resides in /user/unmesha/HiveTrail/emp.txt. The file emp.txt is a sample employee data.


1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
6,Margaret,Tech,12000,A
7,Nirmal,Tech,12000,B
8,jinju,Engineer,45000,B
9,Nancy,Admin,50000,A
10,Andrew,Manager,40000,A
11,Arun,Manager,40000,B
12,Harish,Sales,60000,B
13,Robert,Manager,40000,A
14,Laura,Engineer,45000,A
15,Anju,Ceo,100000,B
16,Aarathi,Manager,40000,B
17,Parvathy,Engineer,45000,B
18,Gopika,Admin,50000,B
19,Steven,Engineer,45000,A
20,Michael,Ceo,100000,A

We are going to partition this dataset into 3 Departments A,B,C


 2. Create a non partioned table to store the data (Staging table)

create external table Unm_Dup_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int,Department String) row format delimited fields terminated by "," location '/user/unmesha/HiveTrail';

3. Create Partitioned hive table
create  table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by ","; 
Here we are creating partition for Department by using PARTITIONED BY.

4. Insert data into Partitioned table, by using select clause

       There are 2 ways to insert data into partition table.

 1. Static Partition - Using individual insert
INSERT INTO TABLE Unm_Parti PARTITION(department='A') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='A'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='B') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='B'; 

INSERT INTO TABLE Unm_Parti PARTITION (department='C') 
SELECT EmployeeID, FirstName,Designation,Salary FROM Unm_Dup_Parti WHERE department='C';

  If we go for the above approach , if we have 50 partitions we need to do the insert statement 50 times. That is a tedeous task and it is known as Static Partition.

 2. Dynamic Partition – Single insert to partition table
             Inorder to achieve the same we need to set 4 things,
1. set hive.exec.dynamic.partition=true
     This enable dynamic partitions, by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict
     We are using the dynamic partition without a static
     partition (A table can be partitioned based    
     on multiple columns in hive) in such case we have to           
     enable the non strict mode. In strict mode we can use             
     dynamic partition  only with a Static Partition.
3. set hive.exec.max.dynamic.partitions.pernode=3
     The default value is 100, we have to modify the   
     same according to the possible no of partitions
4. hive.exec.max.created.files=150000
     The default values is 100000 but for larger tables  
     it can exceed the default, so we may have to update the same.            
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) SELECT EmployeeID, FirstName,Designation,Salary,department FROM Unm_Dup_Parti; 

If the table is large enough the above query wont work seems like due to the larger number of files created on initial map task. 

So in that cases group the records in your hive query on the map process and process them on the reduce side. You can implement the same in your hive query itself with the usage of DISTRIBUTE BY. Below is the query .
FROM Unm_Dup_Parti 
INSERT OVERWRITE TABLE Unm_Parti PARTITION(department) 
SELECT EmployeeID, FirstName,Designation,Salary,department DISTRIBUTE BY department;
With this approach you don’t need to overwrite the hive.exec.max.created.files parameter.


2. Partition on managed Data in HDFS


 1. Data are filtered and seperated to different folders in HDFS

2. Create table with partition

create external table Unm_Parti (EmployeeID Int,FirstName String,Designation  String,Salary Int) PARTITIONED BY (Department String) row format delimited fields terminated by "," ;

 2. Load data into Unm_Parti table using ALTER statement

ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/unmesha/HIVE/HiveTrailFolder/A';

ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/unmesha/HIVE/HiveTrailFolder/B';

ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/unmesha/HIVE/HiveTrailFolder/C';



Sunday 16 November 2014

Update Statement In Hive For Small Tables


Let's see how to update small Hive tables.


1. Create a  table and load data (Assuming the data is placed in HDFS)

You can also refer Previous Post for creating hive tables.


CREATE EXTERNAL TABLEe Non_Parti(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String) ROW FORMAT DELIMITED FIELDS TERMINATED BY  "," LOCATION '/user/hdfs/Hive'; 


hive> select * from Non_Parti;
OK
1 Anne Admin 50000 A
2 Gokul Admin 50000 B
3 Janet Sales 60000 A
4 Hari Admin 50000 C
5 Sanker Admin 50000 C
6 Margaret Tech 12000 A
7 Nirmal Tech 12000 B
8 jinju Engineer 45000 B
9 Nancy Admin 50000 A
10 Andrew Manager 40000 A
11 Arun Manager 40000 B
12 Harish Sales 60000 B
13 Robert Manager 40000 A
14 Laura Engineer 45000 A
15 Anju Ceo 100000 B
16 Aarathi Manager 40000 B
17 Parvathy Engineer 45000 B
18 Gopika Admin 50000 B
19 Steven Engineer 45000 A
20 Michael Ceo 100000 A
Time taken: 0.233 seconds, Fetched: 20 row(s)


2. Updating Department of employeeid 19 's to C.


INSERT OVERWRITE TABLE Non_Parti SELECT employeeid,firstname,designation,salary, CASE WHEN employeeid=19 THEN 'C' ELSE department END AS department FROM Non_Parti;


hive> select * from Non_Parti;
OK
1 Anne Admin 50000 A
2 Gokul Admin 50000 B
3 Janet Sales 60000 A
4 Hari Admin 50000 C
5 Sanker Admin 50000 C
6 Margaret Tech 12000 A
7 Nirmal Tech 12000 B
8 jinju Engineer 45000 B
9 Nancy Admin 50000 A
10 Andrew Manager 40000 A
11 Arun Manager 40000 B
12 Harish Sales 60000 B
13 Robert Manager 40000 A
14 Laura Engineer 45000 A
15 Anju Ceo 100000 B
16 Aarathi Manager 40000 B
17 Parvathy Engineer 45000 B
18 Gopika Admin 50000 B
19 Steven Engineer 45000 C
20 Michael Ceo 100000 A
Time taken: 0.184 seconds, Fetched: 20 row(s)

Your Hive table is now updated. This can be done for small tables only.If you need to update large tables we need to partition Hive tables.

*In newer version of Hive update will be included.

Monday 3 November 2014

K-Means Clustering in Mahout


Example shows Cloudera mahout (Hadoop 2.0.0-cdh4.5.0 with mahout-0.7)


1. Download the input data set


unmesha@client:~$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data

2. Place the data into HDFS under "testdata"
unmesha@client:~$ hadoop fs -mkdir testdata
unmesha@client:~$ echo $MAHOUT_HOME
/usr/lib/mahout/bin
unmesha@client:~$ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata

*HDFS input directory name should be “testdata”



Run Kmeans Clustering

unmesha@client:~$ $MAHOUT_HOME/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

The result get stored in HDFS with "output" foldername

unmesha@client:~$ hadoop fs -ls output
Found 14 items
-rwxr-xr-x   1 unmesha unmesha        194 2014-11-04 09:06 output/_policy
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusteredPoints
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-0
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-1
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-10-final
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-2
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-3
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-4
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-5
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-6
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-7
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-8
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/clusters-9
drwxrwxr-x   - unmesha unmesha       4096 2014-11-04 09:06 output/data

The clustering output is in SequenceFile format which is not human readable. Mahout has a utility known as clusterdump which converts into human readable format.


Copy the cluster output from HDFS onto your local file system


unmesha@client:~$ hadoop fs -mkdir kmeansoutput
unmesha@client:~$ hadoop fs -get output kmeansoutput

unmesha@client:~$ mahout clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output kmeansoutput/clusteranalyze.txt

You can view the results now in kmeansoutput/clusteranalyze.txt



Sunday 2 November 2014

How To Install Apache Mahout on Ubuntu


Prerequisites:

  1.  Hadoop Cluster
  2.  Maven


STEP 1: Download mahout latest source code from

http://www.apache.org/dyn/closer.cgi/lucene/mahout/

Make sure you download .src zipped file.


STEP 2: Unzip the file to a named folder “mahout”

unzip -a mahout-distribution-x.x-src.zip

STEP 3: Move mahout to /usr/local

mv mahout /usr/local

STEP 4: Build Mahout

unmesha@client:~$ cd /usr/local/mahout/mahout-distribution-0.9
unmesha@client:/usr/local/mahout/mahout-distribution-0.9$ ls
bin         core          examples     LICENSE.txt  math-scala  pom.xml     src buildtools  distribution  integration  math         NOTICE.txt  README.txt  target
unmesha@client:/usr/local/mahout/mahout-distribution-0.9$mvn install

Wait untill mahout is build. It would perform some tests also.It is recommended to complete the test for the first time.Later you can skip the test using

mvn install -Dmaven.test.skip=true

Once the tests are done and the mahout is built , we get a success message.


Congratz Apache Mahout is installed...


If you are using Cloudera(CDH) package , you can install Mahout in just 1 step.
apt-get install mahout

You can use mahout commands in /usr/bin and if you want to run mahout in hadoop cluster go to /usr/lib and reference mahout-cdhx-core-job.jar and full class path.



Friday 24 October 2014

How to load a file in DistributedCache in Hadoop MapReduce



We can load an extra file using Distributed Cache.To do that we need to configure the Distributed Cache with needed file in Driver Class


Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path cachefile = new Path("path/to/file");
FileStatus[] list = fs.globStatus(cachefile);
for (FileStatus status : list) {
 DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
And in Reducers setup() or Mappers Setup() we will be able to read this file.
public void setup(Context context) throws IOException{
 Configuration conf = context.getConfiguration();
 FileSystem fs = FileSystem.get(conf);
 URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
 Path getPath = new Path(cacheFiles[0].getPath());  
 BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
 String setupData = null;
 while ((setupData = bf.readLine()) != null) {
   System.out.println("Setup Line in reducer "+setupData);
 }
}
You can give 0,1,... if you supplied more than 1 cache file
Path getPath = new Path(cacheFiles[1].getPath());  

Happy Hadooping ....

Monday 29 September 2014

Comments On CCD-410 Sample Dumps


What do you think of these three questions mentioned in site CCD-410 Practice Exam Questions Demo 100% Pass-Guaranteed or Your Money Back!!!


QUESTION: 3

What happens in a MapReduce job when you set the number of reducers to one?

A. A single reducer gathers and processes all the output from all the mappers. The output is written in as many separate files as there are mappers.
B. A single reducer gathers and processes all the output from all the mappers. The output is written to a single file in HDFS.
C. Setting the number of reducers to one creates a processing bottleneck, and since the number of reducers as specified by the programmer is used as a reference value only, the MapReduceruntime provides a default setting for the number of reducers.
D. Setting the number of reducers to one is invalid, and an exception is thrown.

Answer:A

QUESTION: 4

In the standard word count MapReduce algorithm, why might using a combiner reduce theoverall Job running time?

A. Because combiners perform local aggregation of word counts, thereby allowing the mappers to process input data faster.
B. Because combinersperform local aggregation of word counts, thereby reducing the number of mappers that need to run.
C. Because combiners perform local aggregation of word counts, and then transfer that data toreducers without writing the intermediate data to disk.
D. Because combiners perform local aggregation of word counts, thereby reducing the number of key-value pairs that need to be snuff let across the network to the reducers.

Answer:A

QUESTION: 5

You have user profile records in your OLTP database,that you want to join with weblogs you have already ingested into HDFS.How will you obtain these user records?

A. HDFS commands
B. Pig load
C. Sqoop import
D. Hive

Answer :B


Correct Answers

QUESTION 3: Answer B
QUESTION 4: Answer D
QUESTION 5: Answer C

See reviews on correct answer

Can we change the default key-value input seperator in Hadoop MapReduce

Previous Post


Yes, We can change it using "key.value.separator.in.input.line" property in Driver class.

There may be cases where we need to take each line with specific delimiter.


eg:
one	first line
two	second line
Inorder to read a file like this we will be using KeyValueTextInputFormat.class as it takes the line with TAB  as default seperator.

So while printing each line in map() , the key will be "one" and value will be "first line".


What if we need other delimiters instead of TAB delimiter


eg:
one,first line
two,second line

Here also we need to get key as "one" and value as "first line".

It is possible by adding an extra configuration along with KeyValueTextInputFormat to change the default seperator.

//New API
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ","); 
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);

Monday 22 September 2014

Matrix Multiplication in Hadoop MapReduce

Matrix multiplication is common and important algebraic operation.While coming to MapReduce Paradigm, If we give 2 matrices for multiplication, once we get the data from HDFS and process them in map() we only get one input split.We cannot make sure which row is that. So butter way to do that is to order your data with row and column index.


Matrix multiplication is applied to file of format:MatrixName,row,col,element

A,0,1,1.0
A,0,2,2.0
A,0,3,3.0
A,0,4,4.0
B,3,1,10.0
B,3,2,11.0
A,1,0,5.0
A,1,1,6.0
A,1,2,7.0
A,1,3,8.0
A,1,4,9.0
B,0,1,1.0
B,0,2,2.0
B,1,0,3.0
B,1,1,4.0
B,1,2,5.0
B,2,0,6.0
B,2,1,7.0
B,2,2,8.0
B,3,0,9.0
B,4,0,12.0
B,4,1,13.0
B,4,2,14.0

Find code : GitHub




Monday 8 September 2014

How To Set Counters In Hadoop MapReduce

Counters are a useful channel for gathering statistics about the job: for quality control or for application level-statistics.Lets see an example where Counters count the no of keys processed in reducer.


3 key points to set

1. Define counter in Driver class

public class CounterDriver extends Configured implements Tool{
 long c = 0;
 static enum UpdateCount{
  CNT
 }
 public static void main(String[] args) throws Exception{
     Configuration conf = new Configuration();
     int res = ToolRunner.run(conf, new CounterDriver(), args);
     System.exit(res);
  }
 public int run(String[] args) throws Exception {

2. Increment or set counter in Reducer

public class CntReducer extends Reducer<IntWritable, Text, IntWritable, Text>{
 public void reduce(IntWritable key,Iterable<Text> values,Context context)  {
      //do something
      context.getCounter(UpdateCount.CNT).increment(1);
 }
}

3. Get counter in Driver class 

public class CounterDriver extends Configured implements Tool{
 long c = 0;
 static enum UpdateCount{
  CNT
 }
 public static void main(String[] args) throws Exception{
     Configuration conf = new Configuration();
     int res = ToolRunner.run(conf, new CounterDriver(), args);
     System.exit(res);
  }
 public int run(String[] args) throws Exception {
 .
 .
 .
 job.setInputFormatClass(TextInputFormat.class);
 job.setOutputFormatClass(TextOutputFormat.class);
 FileInputFormat.setInputPaths(job,in );
 FileOutputFormat.setOutputPath(job, out);
 job.waitForCompletion(true);
 c = job.getCounters().findCounter(UpdateCount.CNT).getValue();
 }
}

Full code :  GitHub

You will be able to see the counters in console also.






Saturday 23 August 2014

Example for Apriori Algorithm


Lets take a store data
pen,pencil
pencil,book,eraser
pen,book,eraser,chalk
pen,eraser,chalk
pen,pencil,book
pen,pencil,book,eraser
pen,Ink
pen,pencil,book
pen,pencil,eraser
pencil,book,chalk
To start with Apriori follow the below steps.
Step 1: Initially we need to find Item 1 Frequent Dataset
c1
------
book 6
chalk 3
eraser 6
pen 8
pencil 7
Ink 1
We will say that an item set is frequent if it appears in at least 3 transactions of the itemset: the value 3 is the support threshold.

Support count = 3 (user defined)

So the items less that support count can be discarded form F1 frequent Dataset.
so our new set will be
L1
------
book 6
chalk 3
eraser 6
pen 8
pencil 7
Step 2: We need to generate size 2 frequent item pair sets by joining L1 set
eg:{book} U {chalk} => {book,chalk} and so on..
{book,chalk}
{book,eraser}
{book,pen}
{book,pencil}


{chalk,eraser} 
{chalk,pen} 
{chalk,pencil}

{eraser,pen} 
{eraser,pencil} 

{pen,pencil}
Once the transactions are joined we need to identify the no occurence of the above data items in original transaction(That will be the support count of C2)
C2
----------------
{book,chalk} 2
{book,eraser} 2
{book,pen} 4
{book,pencil} 5


{chalk,eraser} 2
{chalk,pen} 2
{chalk,pencil} 0

{eraser,pen} 5
{eraser,pencil} 3

{pen,pencil} 5
Transactions less that support count can be discarded form C2 frequent Dataset
L2
----------------
{book,pen} 4
{book,pencil} 5
{eraser,pen} 5
{eraser,pencil} 3
{pen,pencil} 5
To find C3 loop through L2
eg: {book,pen} U {book,pencil} => {book,pen,pencil}
C3
-------------------------
{book,pen,pencil} 3
{chalk,eraser,pen} 2
{eraser,pen,pencil} 2
Transactions less that support count can be discarded form C3 frequent Dataset
L3
-------------------------
{book,pen,pencil} 3
There are no transaction to join further.
So our Frequent item sets are
L1:
-------
book 6
chalk 3
eraser 6
pen 8
pencil 7

L2:
-----------------
{book,pen} 4
{book,pencil} 5
{eraser,pen} 5
{eraser,pencil} 3
{pen,pencil} 5


L3
-------------------------
{book,pen,pencil} 3
Step 3: We need to generate Strong Assosiaction  Rules for frequent Set using L1,L2and L3

Say confidence is 60% and Support count is 3.So we have to find the Transactions with no.of item 3  and which has a confidence >=60.Now we can identify L3 set
{book,pen,pencil} 3

Finding Ruleset
{book,pen} => pencil
{book,pencil} => pen
{pen,pencil} => book

pencil => {book,pen}
pen => {book,pencil}
book => {pen,pencil}
Now we need to find the confidence of each transaction
eg: {book,pen} => pencil
           = support Cnt{book,pen,pencil}/ support count({pencil})

Therefore rules having confidence greater than and equal to 60 are
book,pen=>pencil 75.0
book,pencil=>pen 60.0
pen,pencil=>book 60.0
These are the strongest rules.
If a customer buys book and pen he have a tendency to buy a pencil too. Like wise if he buys book and pencil he may buy pen too.