Hadoop installation modes – Let’s explore


Overview: Apache Hadoop can be installed in different modes, chosen at configuration time to suit the requirement. By default, Hadoop is installed in Standalone mode; the other modes are Pseudo distributed mode and Distributed mode. The purpose of this tutorial is to explain the different installation modes simply enough that readers can follow along and set up Hadoop themselves.

In this article, I will discuss different installation modes and their details.

Introduction: We all know that Apache Hadoop is an open source framework which allows distributed processing of large data sets across clusters of computers using a simple programming model. Hadoop can scale up from a single server to thousands of machines. Under these conditions, installing Hadoop correctly becomes critical. We can install Hadoop in three different modes –

  • Standalone mode – Single Node Cluster
  • Pseudo distributed mode – Single Node Cluster
  • Distributed mode – Multi Node Cluster

Purpose of different installation modes: When Apache Hadoop is used in a production environment, multiple server nodes are used for distributed computing. But for understanding the basics and playing around with Hadoop, a single node installation is sufficient. There is another mode known as ‘pseudo distributed’ mode, which is used to simulate a multi node environment on a single server.

In this document we will discuss how to install Hadoop on Ubuntu Linux. Whichever mode is used, the system should have Java version 1.6.x installed on it.

Standalone mode installation: Now, let us check the standalone mode installation process by following the steps mentioned below.

Install Java –
Java (JDK Version 1.6.x) either from Sun/Oracle or Open Java is required.

  • Step 1 – If you cannot switch to OpenJDK and must use the proprietary Sun JDK/JRE, install sun-java6 from the Canonical Partner Repository by using the following command.

Note: The Canonical Partner Repository contains closed source third party software that is free of cost. Canonical does not have access to the source code; it only packages and tests the software.

Add the canonical partner to the apt repositories using –

[Code]

$ sudo add-apt-repository "deb http://archive.canonical.com/lucid partner"

[/Code]

  • Step 2 – Update the source list.

[Code]

$ sudo apt-get update

[/Code]

  • Step 3 – Install JDK version 1.6.x from Sun/Oracle.

[Code]

$ sudo apt-get install sun-java6-jdk

[/Code]

  • Step 4 – Once the JDK installation is over, make sure that it is correctly set up and reports version 1.6.x from Sun/Oracle –

[Code]

user@ubuntu:~# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b02)
Java HotSpot(TM) Client VM (build 16.4-b01, mixed mode, sharing)

[/Code]

Add Hadoop User

  • Step 5 – Add a dedicated Hadoop unix user to your system as shown below, to isolate this installation from other software –

[Code]

$ sudo adduser hadoop_admin

[/Code]
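As a quick sanity check (not part of the original steps), you can confirm that the new user exists and switch to it for the remaining steps:

[Code]
# show the uid, gid and groups of the new user
$ id hadoop_admin
# switch to the dedicated Hadoop user
$ su - hadoop_admin
[/Code]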

Download the Hadoop binary and install

  • Step 6 – Download Apache Hadoop from the Apache web site. Hadoop comes as a tar.gz archive. Copy this archive into the /usr/local/installables folder. The folder installables should be created under /usr/local before this step. Now run the following commands as sudo –

[Code]

$ cd /usr/local/installables
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo chown -R hadoop_admin /usr/local/installables/hadoop-0.20.2

[/Code]
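If you prefer to fetch the archive from the command line, old releases are kept on the Apache archive; the exact URL below is an assumption and should be verified before use:

[Code]
# download the Hadoop 0.20.2 release archive (URL assumed; check the Apache archive)
$ cd /usr/local/installables
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
[/Code]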

Define env variable – JAVA_HOME

  • Step 7 – Open the Hadoop configuration file (hadoop-env.sh) at /usr/local/installables/hadoop-0.20.2/conf/hadoop-env.sh and define JAVA_HOME as shown below –

[Code] export JAVA_HOME=path/where/jdk/is/installed [/Code]

(e.g. /usr/lib/jvm/java-6-sun)
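If you are not sure where the JDK lives, the path can usually be recovered from the java binary itself. This is a sketch; /usr/lib/jvm/java-6-sun is the typical location for the sun-java6-jdk package, not a guaranteed one:

[Code]
# resolve the java symlink chain to the real binary
$ readlink -f /usr/bin/java
# typically prints something like /usr/lib/jvm/java-6-sun/jre/bin/java,
# so the matching JAVA_HOME entry in hadoop-env.sh would be:
export JAVA_HOME=/usr/lib/jvm/java-6-sun
[/Code]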

Installation in Standalone mode

  • Step 8 – Now go to the HADOOP_HOME directory (the location where Hadoop was extracted) and run the following command –

[Code]

$ bin/hadoop

[/Code]

The following output will be displayed –

[Code]
Usage: hadoop [--config confdir] COMMAND

[/Code]

Some of the COMMAND options are mentioned below. Other options are available and can be listed by running the command mentioned above.

[Code]
namenode -format          format the DFS filesystem
secondarynamenode         run the DFS secondary namenode
namenode                  run the DFS namenode
datanode                  run a DFS datanode
dfsadmin                  run a DFS admin client
mradmin                   run a Map-Reduce admin client
fsck                      run a DFS filesystem checking utility

[/Code]

The above output indicates that Standalone installation is completed successfully. Now you can run the sample examples of your choice by calling –

[Code]   $  bin/hadoop jar hadoop-*-examples.jar <NAME> <PARAMS>[/Code]
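For example, the examples jar shipped with the release includes a Monte Carlo pi estimator; the two parameters below (number of maps and samples per map) are arbitrary choices:

[Code]
$ bin/hadoop jar hadoop-0.20.2-examples.jar pi 10 100
[/Code]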

Pseudo distributed mode installation: This is a simulated multi node environment running on a single server.
Here, the first step is to configure SSH in order to access and manage the different (simulated) nodes, so passwordless SSH access to localhost is mandatory.
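A minimal sketch of the usual passwordless SSH setup for the hadoop_admin user is shown below, assuming an SSH server is already installed and running:

[Code]
# generate an RSA key pair with an empty passphrase
$ ssh-keygen -t rsa -P ""
# authorize the new public key for logins to this machine
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
# verify that ssh to localhost works without a password prompt
$ ssh localhost
[/Code]

Once SSH is configured, enabled and accessible, we can start configuring Hadoop. The following configuration files need to be modified –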

  • conf/core-site.xml
  • conf/hdfs-site.xml
  • conf/mapred-site.xml

Open all the configuration files in the vi editor and update the configuration as shown below.

Configure core-site.xml file:

[Code]$ vi conf/core-site.xml[/Code]

[Code]
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
[/Code]

Configure hdfs-site.xml file:

[Code]$ vi conf/hdfs-site.xml[/Code]

[Code]
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
[/Code]

Configure mapred-site.xml file:

[Code]$ vi conf/mapred-site.xml[/Code]

[Code]
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
[/Code]

Once these changes are done, we need to format the name node by using the following command. The command prompt will show all the messages one after another and finally a success message.

[Code]
$ bin/hadoop namenode -format
[/Code]

Now our setup is done for the pseudo distributed node. Let us start the single node cluster by using the following command. It will again show a set of messages on the command prompt and start the server processes.

[Code]
$ bin/start-all.sh
[/Code]

Now we should check the status of the Hadoop processes by executing the jps command as shown below. It will show all the running processes.

[Code]
$ jps
14799 NameNode
14977 SecondaryNameNode
15183 DataNode
15596 JobTracker
15897 TaskTracker
[/Code]
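With all five daemons running, a quick way to confirm that HDFS is usable is to copy a file in and list it back. This is just a sanity check; the directory name below is an arbitrary choice:

[Code]
# create a directory in HDFS and copy a local file into it
$ bin/hadoop fs -mkdir test
$ bin/hadoop fs -put conf/core-site.xml test
# list the directory to confirm the file is in HDFS
$ bin/hadoop fs -ls test
[/Code]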

Stopping the Single node Cluster:  We can stop the single node cluster by using the following command. The command prompt will display all the stopping processes.

[Code]
$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
[/Code]

Distributed mode installation:
Before we start the distributed mode installation, we must ensure that the pseudo distributed setup is done and that we have at least two machines, one acting as master and the other acting as slave. Now we run the following commands in sequence.

  • $ bin/stop-all.sh – Make sure none of the nodes are running

  • Open the /etc/hosts file and add the following entries for master and slave –

<IP ADDRESS> master

<IP ADDRESS> slave

  • $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub slave – This command should be executed on the master to enable passwordless ssh to the slave. We should log in using the same username on all the machines. If a password is needed, we can set it manually.
  • Now we open the two files – conf/masters and conf/slaves. The conf/masters file defines the master nodes of our multi node cluster. The conf/slaves file lists the hosts where the Hadoop slave daemons will run. A sample of both files is shown after this list.
  • Edit the conf/core-site.xml file to have the following entries –

[Code]
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>
[/Code]

  • Edit the conf/mapred-site.xml file to have the following entries –

[Code]
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>
[/Code]

  • Edit the conf/hdfs-site.xml file to have the following entries –

[Code]
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
[/Code]

  • Edit the conf/mapred-site.xml file to have the following entries –

[Code]
<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>50</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>5</value>
</property>
[/Code]
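For reference, a minimal pair of conf/masters and conf/slaves files for this two machine setup might look as follows. The assumption here is that master and slave are the hostnames mapped in /etc/hosts earlier, and that the master machine also runs slave daemons (a common choice, not a requirement):

[Code]
$ cat conf/masters
master

$ cat conf/slaves
master
slave
[/Code]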

Now start the master by using the following command.

[Code]$ bin/start-dfs.sh[/Code]

Once started, check the status on the master by using the jps command. You should get the following output –

[Code]

14799 NameNode
15314 Jps
16977 SecondaryNameNode

[/Code]

On the slave the output should be as shown below.

[Code]

15183 DataNode
15616 Jps

[/Code]

Now start the MapReduce daemons by using the following command.

[Code]

$ bin/start-mapred.sh

[/Code]

Once started, check the status on the master by using the jps command. You should get the following output –

[Code]

16017 Jps

14799 NameNode

15596 JobTracker

14977 SecondaryNameNode

[/Code]

And on the slaves the output should be as shown below.

[Code]

15183 DataNode

15897 TaskTracker
16284 Jps

[/Code]
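At this point the whole cluster should be up. As a final check, the namenode can be asked for a cluster report; the exact figures will differ per setup, but both datanodes should be listed as live:

[Code]
# run on the master; summarizes HDFS capacity and lists live datanodes
$ bin/hadoop dfsadmin -report
[/Code]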

Summary: In the discussion above we have covered the different Hadoop installation modes and their technical details. We should be careful when selecting an installation mode, since each mode has its own purpose. Beginners should start with the single node installation and then proceed with the other options.
Let us summarize our discussion with the following bullets –

  • Apache Hadoop can be installed in three different modes –
    • Standalone (single node) mode
    • Pseudo distributed mode
    • Distributed mode
  • Standalone mode is the simplest way to install and get started.
  • If we need a cluster but have only one node available, then we should go for the pseudo distributed mode.
  • To install the distributed mode, we should have the pseudo distributed mode installed first.