Querying EMR Using Hive

Hands-On Lab

 

Fernando Medina Corey

Training Architect

Length

01:00:00

Difficulty

Intermediate

MapReduce and Hadoop are two of the most well-known names in big data, which is why Amazon's Elastic MapReduce (EMR) service is so important to know. In this hands-on lab, we will use EMR and Hive to query data that we import from an S3 bucket. This is a very common workflow for big data engineers, so it is worth knowing well.

Introduction

In this hands-on lab, we will use EMR and Hive to query data we import from an S3 bucket.

Solution

Log in to the live AWS environment using the credentials provided. Make sure you're in the N. Virginia (us-east-1) region.

Launch an EMR Cluster

  1. Navigate to EMR.
  2. Click Create cluster.
  3. Leave everything at the default, except for the following values:
    • Cluster name: LinuxacademyCluster
    • Instance type: m4.large
    • Number of instances: 2
  4. Click Create cluster. The cluster takes about 10 minutes to finish provisioning.
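
For readers who prefer scripting, the console steps above can be roughly sketched with boto3 (a third-party package, not part of this lab). The cluster name, instance type, and instance count come from the steps above. The release label, the application list, and the default role names are assumptions standing in for "leave everything at the default", so adjust them to match what the console shows.

```python
def build_cluster_request(name="LinuxacademyCluster",
                          instance_type="m4.large",
                          instance_count=2,
                          release_label="emr-5.36.0"):
    """Build keyword arguments for the boto3 EMR run_job_flow call."""
    return {
        "Name": name,
        # Assumption: the lab uses whatever release the console defaults to.
        "ReleaseLabel": release_label,
        # Assumption: Hive and Hue are needed for the later steps.
        "Applications": [{"Name": "Hive"}, {"Name": "Hue"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running so it can be queried interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the account already has the default EMR roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="us-east-1"):
    """Submit the request and return the new cluster (job flow) ID."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_request())
    return response["JobFlowId"]
```

Calling `launch_cluster()` with the lab credentials configured would start a billable cluster, so the request builder is kept separate to allow inspecting the parameters first.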

Enable TCP Port 8888 in the EMR Master Security Group

  1. In the Security and access section of the cluster page, right-click the link listed for Security groups for Master, and open it in a new browser tab.
  2. Select the ElasticMapReduce-master security group.
  3. Click the Inbound tab, and note which ports are already open.
  4. Click Edit.
  5. Click Add Rule.
  6. Set the following values:
    • Type: Custom TCP Rule
    • Port Range: 8888
    • Source: 0.0.0.0/0
    • Description: Hue
  7. Click Save.
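
The same inbound rule can be sketched with boto3's `authorize_security_group_ingress` call. This is a minimal illustration rather than part of the lab: the group ID in the usage note is a hypothetical placeholder, and the real ID is the one shown for the ElasticMapReduce-master group.

```python
def build_hue_ingress_rule(group_id, port=8888,
                           cidr="0.0.0.0/0", description="Hue"):
    """Build keyword arguments for authorize_security_group_ingress."""
    return {
        "GroupId": group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",   # Type: Custom TCP Rule
            "FromPort": port,      # Port Range: 8888
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": description}],
        }],
    }


def open_hue_port(group_id, region="us-east-1"):
    """Add the rule to the given security group."""
    import boto3  # imported lazily so the builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.authorize_security_group_ingress(
        **build_hue_ingress_rule(group_id)
    )
```

For example, `open_hue_port("sg-0123456789abcdef0")` would add the rule, where the ID is a made-up value to be replaced with the master group's actual ID.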

Query the Data

  1. Return to the EMR cluster page, and copy the Master public DNS.

  2. Paste it into a browser tab, and add :8888 at the end.

  3. On the Hue page, type in any username and password you'd like.

  4. Click the S3 icon in the upper left corner (it looks like three cubes).

  5. Click the S3 bucket listed (that starts with cfst-).

  6. Click to open the data folder.

  7. Right-click the videogames file, and select Open in Importer.

  8. Leave the settings as-is, and click Next.

  9. Click Submit.

  10. Click the database icon in the upper left corner (it looks like a cylinder).

  11. Right-click the videogames table, and select Open in Editor.

  12. Enter the following script in the editor:

    -- Scored games with more than 5,000 user ratings, highest user score first
    SELECT name, platform, year_of_release, user_score, user_count
    FROM `default`.`videogames`
    WHERE isnotnull(user_score) AND user_score < 10 AND user_count > 5000
    ORDER BY user_score DESC;

  13. Click the run button to execute the script.

  14. After a minute or so, you should see the results.
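
To see what the query's filter and ordering do without a cluster, the logic can be tried locally with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the four rows are invented placeholders, not data from the lab's videogames file, and SQLite's `IS NOT NULL` stands in for Hive's `isnotnull()`.

```python
import sqlite3

# Invented placeholder rows: (name, platform, year, user_score, user_count)
SAMPLE_ROWS = [
    ("Game A", "PS4", 2014, 8.3, 9000),
    ("Game B", "PC", 2012, 9.1, 12000),
    ("Game C", "X360", 2010, None, 7000),  # no user score: filtered out
    ("Game D", "Wii", 2008, 7.5, 300),     # too few ratings: filtered out
]

QUERY = """
SELECT name, platform, year_of_release, user_score, user_count
FROM videogames
WHERE user_score IS NOT NULL AND user_score < 10 AND user_count > 5000
ORDER BY user_score DESC
"""


def run_demo():
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE videogames (name TEXT, platform TEXT, "
        "year_of_release INTEGER, user_score REAL, user_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO videogames VALUES (?, ?, ?, ?, ?)", SAMPLE_ROWS
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

With these placeholder rows, only Game B and Game A pass the filter, and Game B is listed first because it has the higher user score.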

Conclusion

Congratulations on completing this hands-on lab!