Running Hadoop on Ubuntu Linux (Single-Node Cluster)
In this tutorial I will describe the required steps for setting up a pseudo-distributed, single-node
Hadoop cluster backed by the Hadoop Distributed File System (HDFS), running on Linux Mint. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware, and it incorporates
features similar to those of the Google File System (GFS) and of the
MapReduce computing paradigm.
Hadoop’s HDFS is a highly fault-tolerant distributed file
system and, like Hadoop in general, designed to be deployed on low-cost hardware. It provides high throughput access to
application data and is suitable for applications that have large data sets.
The main goal of this tutorial is to get a simple Hadoop installation up and running so that you can play around with
the software and learn more about it.
This tutorial has been tested with the following software versions:
Hadoop requires a working Java 1.5+ (aka Java 5) installation. However, using
Java 1.6 (aka Java 6) is recommended for running Hadoop. For the
sake of this tutorial, I will therefore describe the installation of Java 1.6.
# Add the repository to your apt repositories
# See https://launchpad.net/~ferramroberto/
$ sudo add-apt-repository ppa:webupd8team/java

# Update the source list
$ sudo apt-get update

# Install Oracle Java 6 JDK
$ sudo apt-get install oracle-java6-installer
The full JDK will be placed in /usr/lib/jvm/java-6-oracle.
After installation, make a quick check whether Oracle's JDK is correctly set up:
$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop
on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to
configure SSH access to localhost for the nitesh user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If
not, there are several online guides available.
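If SSH is not installed yet, installing the OpenSSH server through apt is usually all that is needed on Ubuntu (this command is an assumption about your setup, not a step from the original write-up):

# Install the OpenSSH server so that Hadoop can ssh into localhost
$ sudo apt-get install openssh-server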
First, we have to generate an SSH key for the nitesh user.
nitesh@nitrek:~$ su - nitesh
nitesh@nitrek$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/nitesh/.ssh/id_rsa):
Created directory '/home/nitesh/.ssh'.
Your identification has been saved in /home/nitesh/.ssh/id_rsa.
Your public key has been saved in /home/nitesh/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 nitesh@nitrek
The key's randomart image is:
[...snipp...]
nitesh@nitrek$
The second line will create an RSA key pair with an empty password. Generally, using an empty password is not
recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter
the passphrase every time Hadoop interacts with its nodes).
Second, you have to enable SSH access to your local machine with this newly created key.
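The exact command is not shown above; a minimal sketch, assuming the default key location used by ssh-keygen in the previous step, is to append the public key to the authorized_keys file:

# Authorize the newly generated key for logins to this machine
nitesh@nitrek$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys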
The final step is to test the SSH setup by connecting to your local machine with the nitesh user. The step is
also needed to save your local machine’s host key fingerprint to the nitesh user’s known_hosts file. If you
have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific
SSH options in $HOME/.ssh/config (see man ssh_config for more information).
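For instance, if your sshd listens on a non-standard port (2222 below is purely a hypothetical value), an entry along these lines in $HOME/.ssh/config would cover it:

# Hypothetical host-specific settings for connections to localhost
Host localhost
  Port 2222
  User nitesh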
nitesh@nitrek$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
nitesh@nitrek$
If the SSH connection fails, these general tips might help:
Enable debugging with ssh -vvv localhost and investigate the error in detail.
Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication
(which should be set to yes) and AllowUsers (if this option is active, add the nitesh user to it). If you
made any changes to the SSH server configuration file, you can force a configuration reload with
sudo /etc/init.d/ssh reload.
In some cases you may have to reboot your machine for the changes to take effect.
Download Hadoop from the
Apache Download Mirrors and extract the contents of the Hadoop
package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the
files to the nitesh user and hadoop group, for example:
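A minimal sketch of the download-and-extract step, assuming the Hadoop 1.1.2 tarball mentioned later in this tutorial (adjust the archive name to whatever you actually downloaded):

# Extract the Hadoop release into /usr/local and hand it over to nitesh:hadoop
$ cd /usr/local
$ sudo tar xzf hadoop-1.1.2.tar.gz
$ sudo mv hadoop-1.1.2 hadoop
$ sudo chown -R nitesh:hadoop hadoop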
(Just to give you the idea, YMMV – personally, I create a symlink from hadoop-1.1.2 to hadoop.)
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user nitesh. If you use a shell other than
bash, you should of course update its appropriate configuration files instead of .bashrc.
$HOME/.bashrc
# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin
You can repeat this exercise also for other users who want to use Hadoop.
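To apply the new variables in your current shell and sanity-check that the hadoop script is on the PATH (a quick verification, not one of the original steps):

# Reload the shell configuration and confirm the hadoop command resolves
$ source $HOME/.bashrc
$ hadoop version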
Excursus: Hadoop Distributed File System (HDFS)
Before we continue let us briefly learn a bit more about Hadoop’s distributed file system.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which started out as a subproject of Apache Lucene.
Configuration
Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.
hadoop-env.sh
The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open
conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path
is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Oracle JDK 6
directory.
Change
conf/hadoop-env.sh
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
conf/hadoop-env.sh
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
conf/*-site.xml
In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens
to, etc. Our setup will use Hadoop’s Distributed File System,
HDFS, even though our little “cluster” only contains our
single local machine.
You can leave the settings below “as is” with the exception of the hadoop.tmp.dir parameter – this parameter you
must change to a directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s
default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS,
so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
Now we create the directory and set the required ownerships and permissions:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown nitesh:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750
$ sudo chmod 750 /app/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration
XML file.
In file conf/core-site.xml:
conf/core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
In file conf/hdfs-site.xml:
conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top
of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You
need to do this the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!
To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the
command
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=nitesh,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-nitesh/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
nitesh@nitrek:/usr/local/hadoop$
Starting your single-node cluster
Run the command:
nitesh@nitrek$ /usr/local/hadoop/bin/start-all.sh
This will start a NameNode, a DataNode, a SecondaryNameNode, a JobTracker and a TaskTracker on your machine.
The output will look like this:
nitesh@nitrek:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-nitesh-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-nitesh-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-nitesh-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-nitesh-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-nitesh-tasktracker-ubuntu.out
nitesh@nitrek:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since
v1.5.0). See also How to debug MapReduce programs.
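On this single-node setup, the jps output should look roughly like the following (the process IDs will of course differ on your machine):

nitesh@nitrek:/usr/local/hadoop$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode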
We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about what happens behind the scenes is available on the Hadoop Wiki.
Download example input data
We will use three ebooks from Project Gutenberg for this example. Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a local temporary directory of your choice, for example /tmp/gutenberg.
nitesh@nitrek$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 nitesh hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 nitesh hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 nitesh hadoop 1423801 Feb 3 10:18 pg5000.txt
nitesh@nitrek$
Restart the Hadoop cluster
Start your Hadoop cluster if it is not running already.
nitesh@nitrek$ /usr/local/hadoop/bin/start-all.sh
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
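A minimal sketch of that copy step, assuming the input files are in /tmp/gutenberg as above and should end up in /user/nitesh/gutenberg in HDFS:

# Copy the local ebooks into HDFS and list them to verify
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/nitesh/gutenberg
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop dfs -ls /user/nitesh/gutenberg

With the input data in HDFS, run the WordCount example job: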
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/nitesh/gutenberg /user/nitesh/gutenberg-output
This command will read all the files in the HDFS directory /user/nitesh/gutenberg, process them, and store the result
in the HDFS directory /user/nitesh/gutenberg-output.
Note: Some people run the command above and get the following error message:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
at org.apache.hadoop.util.RunJar.main (RunJar.java: 90)
Caused by: java.util.zip.ZipException: error in opening zip file
In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.1.2.jar wordcount /user/nitesh/gutenberg /user/nitesh/gutenberg-output
Example output of the previous command in the console:
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/nitesh/gutenberg /user/nitesh/gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient: map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient: Job Counters
10/05/08 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient: Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient: Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient: Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient: Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient: Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient: Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input records=102286
Check if the result is successfully stored in HDFS directory /user/nitesh/gutenberg-output:
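One way to do this is with the same dfs commands used earlier (the listing you see will naturally differ from machine to machine):

# List the job's output directory in HDFS
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop dfs -ls /user/nitesh/gutenberg-output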
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /user/nitesh/gutenberg /user/nitesh/gutenberg-output
An important note about mapred.map.tasks: Hadoop treats mapred.map.tasks as nothing more than a hint and does not honor a value you set for it, whereas it accepts a user-specified mapred.reduce.tasks as-is. In short, you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use bin/hadoop dfs -cat to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system.
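A minimal sketch of that copy, assuming the output directory from the job above and using /tmp/gutenberg-output as an example local target:

# Merge the HDFS output files into a single local file and peek at it
nitesh@nitrek:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
nitesh@nitrek:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/nitesh/gutenberg-output /tmp/gutenberg-output
nitesh@nitrek:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output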
Note that in this specific output the quote signs (“) enclosing the words in the head output above have not been
inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they
matched the beginning of a quote in the ebook texts. Just inspect the part-00000 file further to see it for
yourself.
The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
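If you would like a count-sorted view, one option (a purely local post-processing step, not part of the Hadoop job, and using the example /tmp/gutenberg-output/gutenberg-output path from the sketch above) is to sort the merged file by its count column:

# Sort by the tab-separated count column, highest counts first
$ sort -t$'\t' -k2 -nr /tmp/gutenberg-output/gutenberg-output | head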
Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at the locations listed in the subsections below.
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give
them a try.
NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead
nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web
browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.
JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. It also gives access to the local machine's Hadoop log files (the machine the web UI is running on).
By default, it’s available at http://localhost:50030/.
TaskTracker Web Interface (MapReduce layer)
The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.
By default, it’s available at http://localhost:50060/.