
Monday, 8 September 2014

How To Set Counters In Hadoop MapReduce

Counters are a useful channel for gathering statistics about a job, whether for quality control or for application-level statistics. Let's see an example where a counter counts the number of keys processed in the reducer.


There are 3 key steps:

1. Define counter in Driver class

public class CounterDriver extends Configured implements Tool{
 long c = 0;
 static enum UpdateCount{
  CNT
 }
 public static void main(String[] args) throws Exception{
     Configuration conf = new Configuration();
     int res = ToolRunner.run(conf, new CounterDriver(), args);
     System.exit(res);
  }
 public int run(String[] args) throws Exception {
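  // ... see step 3 below for the body of run()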

2. Increment or set counter in Reducer

public class CntReducer extends Reducer<IntWritable, Text, IntWritable, Text>{
 public void reduce(IntWritable key, Iterable<Text> values, Context context)
   throws IOException, InterruptedException {
      //do something
      // increment the counter once for every key processed;
      // the enum is nested in the driver, so it must be qualified here
      context.getCounter(CounterDriver.UpdateCount.CNT).increment(1);
 }
}
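Counters defined through an enum are aggregated across all map and reduce tasks by the framework, so the value read back in the driver is the job-wide total.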

3. Get counter in Driver class 

public class CounterDriver extends Configured implements Tool{
 long c = 0;
 static enum UpdateCount{
  CNT
 }
 public static void main(String[] args) throws Exception{
     Configuration conf = new Configuration();
     int res = ToolRunner.run(conf, new CounterDriver(), args);
     System.exit(res);
  }
 public int run(String[] args) throws Exception {
 // ... job setup elided ...
 job.setInputFormatClass(TextInputFormat.class);
 job.setOutputFormatClass(TextOutputFormat.class);
 FileInputFormat.setInputPaths(job, in);
 FileOutputFormat.setOutputPath(job, out);
 boolean success = job.waitForCompletion(true);
 // the counter value is only available after the job has finished
 c = job.getCounters().findCounter(UpdateCount.CNT).getValue();
 return success ? 0 : 1;
 }
}

Full code: GitHub

You will also be able to see the counter values in the console output when the job completes.
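If you want to inspect the counters programmatically as well, here is a minimal sketch (assuming job is the completed Job from the run() method above; it needs org.apache.hadoop.mapreduce.Counter and CounterGroup imported):

// print every counter group and every counter once the job has finished
for (CounterGroup group : job.getCounters()) {
 for (Counter counter : group) {
  System.out.println(group.getDisplayName() + "." + counter.getDisplayName()
    + " = " + counter.getValue());
 }
}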

Wednesday, 30 April 2014

How To Create Tables In HIVE


Hive provides data warehousing facilities on top of an existing Hadoop cluster, along with an SQL-like interface.

You can create tables in two different ways.

1. Create External table 

CREATE EXTERNAL TABLE students
(id INT, name STRING, batch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- supply delimiter
LOCATION '/user/hdfs/students'; 
For external tables, Hive does not move the data into its warehouse directory. If the external table is dropped, the table metadata is deleted but the data is not.

2. Create Normal Table 
CREATE TABLE students
(id INT, name STRING, batch STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' -- supply delimiter
LOCATION '/user/hdfs/students';
For normal (managed) tables, Hive manages the data itself and, unless a LOCATION is given, moves it into its warehouse directory. If the table is dropped, both the table metadata and the data are deleted.
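A quick way to see the difference from the Hive CLI (a minimal sketch; '/tmp/students.tsv' is a hypothetical local file):

-- load a local tab-delimited file into the table
LOAD DATA LOCAL INPATH '/tmp/students.tsv' INTO TABLE students;
-- dropping a normal (managed) table deletes the metadata AND the data;
-- dropping the EXTERNAL variant only deletes the metadata
DROP TABLE students;
-- for the external table, the data files are still listed after the drop
dfs -ls /user/hdfs/students;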

Tuesday, 8 April 2014

Custom Parameters To Pig Script


There may be scenarios where we need to write our own Pig scripts that can take arguments.

Below is a sample custom Pig script.

Sample Pig Script

The "customparam.pig" loads an input with custom argument and generates a single field from the input bag to another bag and stores the new bag to HDFS.


Here the input path, the delimiter for the input file, the output path, and the field to project are given as custom arguments to the Pig script.
--customparam.pig
--load hdfs/local fs data
original = load '$input' using PigStorage('$delimiter');
--project a specific field into another bag 
filtered = foreach original generate $split; 
--storing data into hdfs/local fs
store filtered into '$output'; 

Pig scripts can be run in local mode or in MapReduce mode.


Local Mode

pig -x local -f customparam.pig -param input=Pig.csv -param output=OUT/pig -param delimiter="," -param split='$1'

This is the sample "Pig.csv" file used as the custom input on the command line. The custom delimiter is ",".

Pig1,23.5,Matched
Pig2,6.88,Not Matched
Pig3,6.1,Not Matched

This projects the 2nd column of the original bag into a new bag. Fields in Pig are referenced positionally as $0, $1, $2, and so on, so to generate the 2nd column the split parameter should be "$1".


After executing the above command, the part file content will be:

23.5
6.88
6.1

If the command is run in MapReduce mode, the part file gets stored in HDFS.
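For example, the same script can be launched in MapReduce mode by dropping "-x local" (the HDFS paths below are assumed):

pig -x mapreduce -f customparam.pig -param input=/user/hdfs/Pig.csv -param output=/user/hdfs/OUT/pig -param delimiter="," -param split='$1'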