Wednesday, 30 April 2014

Hadoop Installation For Beginners - Pseudo Distributed Mode (Single Node Cluster)


Hadoop is an open-source software framework for storing and processing very large amounts of data. The underlying ideas come from Google's early papers on the Google File System and MapReduce. Hadoop started as part of the open-source Nutch search engine project and was later spun out of Nutch into its own project, with much of the early development happening at Yahoo.

Hadoop comprises 2 core components
1. HDFS for storage
2. MapReduce for processing data in HDFS (in Hadoop 2, MapReduce jobs run on top of YARN)

Hadoop can be installed in 3 different ways

1. Standalone Mode

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked configuration directory (etc/hadoop in Hadoop 2.x) to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
   $ mkdir input
   $ cp etc/hadoop/*.xml input
   $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar grep input output 'dfs[a-z.]+'
   $ cat output/*

2. Pseudo Distributed Mode or Single Node Cluster

Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.

3. Multi Node Cluster

Fully distributed mode, ranging from a few nodes to extremely large clusters with thousands of nodes.

The installation below explains how to set up Hadoop in pseudo-distributed mode.


Prerequisite

1. Java (Latest Version)

> sudo add-apt-repository ppa:webupd8team/java
> sudo apt-get update
> sudo apt-get install oracle-java7-installer
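
If you prefer OpenJDK over the Oracle PPA, the stock Ubuntu package should also work (this is an alternative, not what the rest of this guide assumes; note that its JAVA_HOME path will differ, for example /usr/lib/jvm/java-7-openjdk-amd64 on 64-bit systems):

> sudo apt-get install openjdk-7-jdk
> java -version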

2. SSH

> sudo apt-get install ssh
> ssh localhost
unmesha@localhost's password:
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local


Configuring Passwordless SSH

In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password.

If you cannot ssh to localhost without a passphrase, execute the following commands:

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/unmesha/.ssh/id_rsa): [press enter]
Enter passphrase (empty for no passphrase): [press enter]
Enter same passphrase again: [press enter]
Your identification has been saved in /home/unmesha/.ssh/id_rsa.
Your public key has been saved in /home/unmesha/.ssh/id_rsa.pub.
The key fingerprint is:
61:c5:33:9f:53:1e:4a:5f:e9:4d:19:87:55:46:d3:6b unmesha@unmesha-virtual-machine
The key's randomart image is:
+--[ RSA 2048]----+
|         ..    *%|
|         .+ . ++*|
|        o  = *.+o|
|       . .  = oE.|
|        S    ..  |
|                 |
|                 |
|                 |
|                 |
+-----------------+

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-copy-id localhost
unmesha@localhost's password: 
Now try logging into the machine, with "ssh 'localhost'", and check in:

  ~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Now you should be able to ssh to localhost without a password

unmesha@unmesha-hadoop-virtual-machine:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
unmesha@unmesha-virtual-machine:~$ 
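
If you prefer a non-interactive setup, the following commands are equivalent (assuming the default key location and that you have no existing keys you want to keep):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys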


Setting JAVA_HOME


Before running Hadoop, we need to tell it where Java is located on the system. If the JAVA_HOME environment variable already points to a suitable Java installation, that will be used and you don't have to configure anything further. Otherwise, you can set the Java installation that Hadoop uses by editing etc/hadoop/hadoop-env.sh and specifying the JAVA_HOME variable.

unmesha@unmesha-hadoop-virtual-machine:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
Check the current location of Java:
unmesha@unmesha-hadoop-virtual-machine:~$ sudo update-alternatives --config java
[sudo] password for unmesha: 
There is only one alternative in link group java: /usr/lib/jvm/java-7-oracle/jre/bin/java
Nothing to configure.

If you have only one alternative, it will display as above; otherwise the command lists all the alternatives, with a * marking the currently selected one.
Next, copy the path up to (but not including) /jre/bin and set it in ~/.bashrc:
unmesha@unmesha-hadoop-virtual-machine:~$ vi ~/.bashrc
Note: If vi is not available, install it first:
sudo apt-get install vim

Then add the following line at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Open a new terminal or reload the profile:

unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc 
Check that JAVA_HOME is set correctly:
unmesha@unmesha-hadoop-virtual-machine:~$ echo $JAVA_HOME
/usr/lib/jvm/java-7-oracle

Hadoop Installation


Download the latest stable version of Hadoop from the Apache mirrors.

This guide uses: hadoop-2.3.0.tar.gz


Untarring the file

unmesha@unmesha-hadoop-virtual-machine:~$ tar xvfz hadoop-2.3.0.tar.gz 
unmesha@unmesha-hadoop-virtual-machine:~$ cd hadoop-2.3.0/
unmesha@unmesha-hadoop-virtual-machine:~/hadoop-2.3.0$ ls
bin  include  libexec      NOTICE.txt  sbin
etc  lib      LICENSE.txt  README.txt  share

Move hadoop-2.3.0 to /usr/local/hadoop
unmesha@unmesha-hadoop-virtual-machine:~$ sudo mv hadoop-2.3.0 /usr/local/hadoop
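
Since /usr/local is normally owned by root, it helps to give your user ownership of the install directory so you can edit configuration files and write logs without sudo (assuming a single-user setup):

sudo chown -R $USER:$USER /usr/local/hadoop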

Add the following lines to ~/.bashrc:
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc 
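
To confirm that the shell now picks up the Hadoop binaries on the PATH, check the version (it should report 2.3.0):

unmesha@unmesha-hadoop-virtual-machine:~$ hadoop version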

Configuration

We need to configure 5 files  

1. core-site.xml  
2. mapred-site.xml
3. hdfs-site.xml
4. hadoop-env.sh
5. yarn-site.xml

1. core-site.xml

unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property>
</configuration>
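
Note: fs.default.name still works in Hadoop 2 but is deprecated in favor of fs.defaultFS; if you prefer the newer name, the equivalent property is:

 <property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000</value>
 </property>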

2. mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains a /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be copied to (or renamed as) mapred-site.xml. This file specifies which framework is used for MapReduce.

unmesha@unmesha-hadoop-virtual-machine:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>
</configuration>

Create two folders for the NameNode and DataNode storage. The directories must be writable by the user that runs the Hadoop daemons (see the ownership note after the commands).


mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode
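
If /usr/local is not writable by your user, create the directories with sudo and then hand ownership to your user so the NameNode and DataNode daemons can write to them:

sudo chown -R $USER:$USER /usr/local/hadoop_store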

3. hdfs-site.xml

unmesha@unmesha-hadoop-virtual-machine:~/$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
 <property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>

4. hadoop-env.sh

unmesha@unmesha-hadoop-virtual-machine:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes. Use the same path that was set in ~/.bashrc above.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

5. yarn-site.xml


unmesha@unmesha-hadoop-virtual-machine:~/$ vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 <configuration>
 <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
 </property>
 <property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
</configuration>

Now format the NameNode (this is done only once):

unmesha@unmesha-hadoop-virtual-machine:~$ hdfs namenode -format

You will see something like this:

......14/04/30 12:37:42 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/04/30 12:37:42 INFO util.ExitUtil: Exiting with status 0
14/04/30 12:37:42 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at unmesha-virtual-machine/127.0.1.1
************************************************************/

Now we will start all the daemons:

unmesha@unmesha-hadoop-virtual-machine:~$ start-dfs.sh
unmesha@unmesha-hadoop-virtual-machine:~$ start-yarn.sh
To check which daemons are running, type "jps":
unmesha@unmesha-hadoop-virtual-machine:~$ jps
2243 NodeManager
2314 ResourceManager
1923 DataNode
2895 SecondaryNameNode
1234 Jps
1788 NameNode
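
If any daemon is missing from the jps output, its log file usually explains why; the logs live under the installation directory used above:

ls /usr/local/hadoop/logs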

In HDFS there are two locations you will commonly work with:

1. The user's HDFS home directory (/user/<your username>)

 (Optional)

 Create your HDFS home directory. Use the sudo -u hdfs form if your distribution runs HDFS as a dedicated hdfs user; otherwise the plain hadoop fs form works:

sudo -u hdfs hadoop fs -mkdir /user/<your username> 
sudo -u hdfs hadoop fs -chown <your username> /user/<your username> 
  OR
hadoop fs -mkdir -p /user/<your username> 
hadoop fs -chown <your username> /user/<your username> 

2. The HDFS root directory (/)

hadoop fs -ls /

You can put your files in either location.

To put files into your HDFS home directory, simply omit the destination parameter (it automatically defaults to /user/<your username>):



unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -put mydata 
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -ls 

Let's run an example

Every programming language has a "Hello World" program. Similarly, Hadoop has its own "Hello World" program, known as "WordCount". A Hadoop job basically works with two paths:


1. one directory or file as input

2. and a non-existing directory as the output path (Hadoop creates the output directory automatically).


So for the WordCount program, the input is a text file and the output folder contains the word counts for that file. Copy a few paragraphs of text from anywhere, save them to a file, and place the file in a folder.

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$mkdir mydata
unmesha@unmesha-hadoop-virtual-machine:~/$cd mydata
unmesha@unmesha-hadoop-virtual-machine:~/mydata$vi input
# Paste into this input file

Now your input folder is ready. Before running any Hadoop job we must place the input in HDFS, because MapReduce programs read their input from HDFS.
So now we need to put mydata into HDFS.

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -put mydata /

Hadoop filesystem shell commands are executed using "hadoop fs".

What the command above does: it puts mydata into HDFS.
hadoop fs -put local/path hdfs/path

Note: If you have any issue copying the directory to HDFS, it is usually a permission problem. Copy or move your mydata directory to /tmp:

unmesha@unmesha-hadoop-virtual-machine:~/$mv mydata /tmp

Then try the copy again, using the new location as the source, as shown below.
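
For example (paths assume mydata was moved to /tmp as above):

unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -put /tmp/mydata /
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -ls /mydata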

Now we will run the wordcount program from hadoop-mapreduce-examples-2.3.0.jar, which contains several example programs.

Any MapReduce program we write is packaged as a jar, and the job is then submitted to the cluster.

Basic command to run MapReduce Jobs

hadoop jar jarname.jar MainClass indir outdir

Run wordcount example

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /mydata /output1

After the job finishes, browse /output1 to view the result:

unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -ls -R /output1  
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -cat /output1/part-r-00000    # This shows the wordcount result

For any job the result will be stored in part files.
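
If you want the whole result as a single file on the local filesystem, getmerge concatenates the part files (the output path is the one used above):

unmesha@unmesha-hadoop-virtual-machine:~/$hadoop fs -getmerge /output1 result.txt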

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are available by default.
http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:8088/ – web UI of the YARN ResourceManager (replaces the old JobTracker UI on port 50030)
http://localhost:8042/ – web UI of the YARN NodeManager (replaces the old TaskTracker UI on port 50060)

You can also track a running job using the URL that is displayed on the console when the job is submitted.


14/04/30 12:57:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398885280814_0002
14/04/30 12:57:19 INFO impl.YarnClientImpl: Submitted application application_1398885280814_0002
14/04/30 12:57:21 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1398885280814_0002/
14/04/30 12:57:21 INFO mapreduce.Job: Running job: job_1398885280814_0002

Killing a Job

unmesha@unmesha-hadoop-virtual-machine:~/$cd
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop job -list
job_1398885280814_0002
unmesha@unmesha-hadoop-virtual-machine:~/$hadoop job -kill job_1398885280814_0002
14/04/30 14:02:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/04/30 14:03:06 INFO impl.YarnClientImpl: Killed application application_1398885280814_0002
Killed job job_1398885280814_0002
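
Because this is a YARN cluster, you can also kill the job at the application level using the application id shown in the submission log (the hadoop job command still works in Hadoop 2 but is deprecated in favor of mapred job):

unmesha@unmesha-hadoop-virtual-machine:~/$yarn application -kill application_1398885280814_0002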

To stop the single node cluster (stop-all.sh still works in Hadoop 2 but is deprecated; stopping HDFS and YARN separately is preferred):


unmesha@unmesha-hadoop-virtual-machine:~$ stop-dfs.sh
unmesha@unmesha-hadoop-virtual-machine:~$ stop-yarn.sh

Hadoop can also be installed using Cloudera's distribution, which takes fewer steps. The difference is that Cloudera packages Apache Hadoop and several ecosystem projects into one bundle, with the configuration already set for localhost, so you do not need to edit the configuration files yourself.

Installation using Cloudera Package.


Happy Hadooping ...

