Deploy and Configure a Single-Node Hadoop Cluster

Introduction

Many cloud platforms and third-party service providers offer Hadoop as a service or as a VM/container image. This lowers the barrier to entry for those wishing to get started with Hadoop. In this hands-on lab, we will have the opportunity to deploy a single-node Hadoop cluster in a pseudo-distributed configuration. Doing so walks us through deploying and configuring each individual Hadoop component, preparing us for the move to a multi-node cluster where those services are split across separate machines. In this learning activity, we will be performing the following:

  • Installing Java
  • Deploying Hadoop from an archive file
  • Configuring Hadoop's JAVA_HOME
  • Configuring the default filesystem for Hadoop
  • Configuring HDFS replication
  • Setting up passwordless SSH
  • Formatting the Hadoop Distributed File System (HDFS)
  • Starting Hadoop
  • Creating files and directories in Hadoop
  • Examining a text file with a MapReduce job

The Scenario

As data engineers for a small company that provides data platforms and analytics services, we've been tasked with installing and configuring a single-node Hadoop cluster. This will be used by our customer to perform language analysis.

For this job, we have been given a bare CentOS 7 cloud server. On it, we will deploy and configure Hadoop in the cloud_user home directory at /home/cloud_user/hadoop. The default filesystem should be set to hdfs://localhost:9000 to facilitate pseudo-distributed operation. Because it will be a single-node cluster, we must also set dfs.replication to 1.

After we have deployed, configured, and started Hadoop, we must format and prepare Hadoop to execute a MapReduce job. Specifically, we must download some Latin text from the customer at https://raw.githubusercontent.com/linuxacademy/content-hadoop-quick-start/master/latin.txt and use the hadoop-mapreduce-examples-2.9.2.jar application that ships with Hadoop to determine the average length of the words in the file. The latin.txt file should be copied to Hadoop at /user/cloud_user/latin, and the output of the MapReduce job should be written to /user/cloud_user/latin_wordmean_output.

Note: We can execute the hadoop-mapreduce-examples-2.9.2.jar application without arguments to get usage information if we aren't sure which class and class arguments to use.
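
For example, once Hadoop is deployed, running the JAR from /home/cloud_user/hadoop with no arguments lists the available example programs, and naming a program without its arguments prints that program's usage:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordmean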

Important: Don't forget that we'll need to install Java, configure JAVA_HOME for Hadoop, and set up passwordless SSH to localhost before attempting to start Hadoop.

Logging In

Use the credentials on the hands-on lab overview page to log in to the server.

Install Java

We can install the java-1.8.0-openjdk package with YUM:

sudo yum install java-1.8.0-openjdk -y
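
We can confirm the installation with a quick version check (the exact build number will vary):

java -version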

Deploy Hadoop

From the cloud_user home directory, let's download Hadoop 2.9.2 from a mirror. A list of mirrors is available on the Apache website; we're going to use Gigenet's mirror here:

curl -O http://mirrors.gigenet.com/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
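
Note: mirrors only keep recent releases, so if that URL no longer works, the same archive should still be available from the Apache archive site:

curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz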

Now we can unpack the archive in place:

tar -xzf hadoop-2.9.2.tar.gz

Let's delete the archive file:

rm -rf hadoop-2.9.2.tar.gz

Then we can rename the installation directory (getting rid of the version number):

mv hadoop-2.9.2 hadoop

Configure JAVA_HOME

Before we configure JAVA_HOME, we've got to know where Java actually lives. Run which java to see; we should get /usr/bin/java. But if we then run ls -l /usr/bin/java, we'll see that it's not a regular executable, merely a symbolic link to /etc/alternatives/java. And if we run ls -l /etc/alternatives/java, we'll find that this is also a symbolic link, this time to the real executable. It should be something like /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre/bin/java. Say that three times fast!
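
If following symlinks by hand gets tedious, readlink can resolve the whole chain in one shot (the output should match the path above, give or take the exact build number):

readlink -f $(which java)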

Now that we know where the java executable is, we can proceed. Edit /home/cloud_user/hadoop/etc/hadoop/hadoop-env.sh and change the following line:

export JAVA_HOME=${JAVA_HOME}

Change ${JAVA_HOME} to the path we just found, minus the trailing /bin/java (JAVA_HOME should point at the JRE directory, not at the java binary itself):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/jre

Now we can save and close the file.
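
As a quick sanity check that Hadoop can now find Java, we can ask it for its version from /home/cloud_user/hadoop; it should report Hadoop 2.9.2 rather than a JAVA_HOME error:

bin/hadoop version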

Configure Core Hadoop

Set the default filesystem to hdfs on localhost in /home/cloud_user/hadoop/etc/hadoop/core-site.xml by changing the following lines:

<configuration>
</configuration>

to:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

We're done here, so we can save and close the file.

Configure HDFS

Set the default block replication to 1 in /home/cloud_user/hadoop/etc/hadoop/hdfs-site.xml by changing the following lines:

<configuration>
</configuration>

to:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Then let's save and close the file.
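
If we want to confirm that both settings took effect, we can read them back from /home/cloud_user/hadoop with hdfs getconf; we should see hdfs://localhost:9000 and 1, respectively:

bin/hdfs getconf -confKey fs.defaultFS
bin/hdfs getconf -confKey dfs.replication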

Set up Passwordless SSH Access to localhost

As cloud_user, generate a public/private RSA key pair with ssh-keygen. The default option for each prompt will suffice:

cd ~/.ssh
ssh-keygen

Now we can add our newly generated public key to our authorized keys list with this:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

We can test passwordless SSH to localhost with:

ssh localhost

Add localhost to the list of known hosts by accepting its key with yes, then get out of the SSH session with exit.
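
For reference, the same setup can be scripted non-interactively; this is just a sketch and assumes there isn't already a key at ~/.ssh/id_rsa that we'd be overwriting:

ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys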

Format the Filesystem

Let's get back into the hadoop directory:

cd ~/hadoop

Now we can format the DFS:

bin/hdfs namenode -format

Start Hadoop

We're going to start the NameNode and DataNode daemons from /home/cloud_user/hadoop with this:

sbin/start-dfs.sh

There will be two prompts to accept keys: one for localhost and one for 0.0.0.0.

Once everything finishes running, we can test whether it worked:

bin/hdfs dfs -ls /

We shouldn't get any output; we haven't written anything to the filesystem yet, so an empty listing is expected.
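
Another quick check, assuming the Hadoop 2.x default NameNode web UI port of 50070, is to see whether the status page responds (a 200 here means the NameNode is up):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/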

Download and Copy the Latin Text to Hadoop

Let's make sure we're in /home/cloud_user/hadoop (and change to that directory if we're not already there), then download the latin.txt file with this:

curl -O https://raw.githubusercontent.com/linuxacademy/content-hadoop-quick-start/master/latin.txt

Now we've got to create the /user and /user/cloud_user directories in Hadoop:

bin/hdfs dfs -mkdir -p /user/cloud_user

Finally, we can copy the latin.txt file to Hadoop at /user/cloud_user/latin with:

bin/hdfs dfs -put latin.txt latin

We can check to make sure the file made the trip with this:

bin/hdfs dfs -ls /user/cloud_user
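
To spot-check the contents as well, we can peek at the last kilobyte of the file straight out of HDFS:

bin/hdfs dfs -tail latin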

Examine the latin.txt Text with MapReduce

Our last task entails using hadoop-mapreduce-examples-2.9.2.jar to calculate the average length of the words in the /user/cloud_user/latin file. We'll save the job output to /user/cloud_user/latin_wordmean_output in Hadoop with:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordmean latin latin_wordmean_output

Now let's look at our wordmean job output files:

bin/hdfs dfs -cat latin_wordmean_output/*
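
The wordmean program writes running totals rather than the mean itself; assuming the output uses the count and length keys (which is how the example typically labels them), the average word length is just length divided by count, and we can compute it directly from the job output:

bin/hdfs dfs -cat latin_wordmean_output/part-* | awk '{v[$1]=$2} END {print v["length"] / v["count"]}'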

Conclusion

We've got everything up and running, and now have a way to calculate the average word length in a text file. Congratulations!