
Running a Pyspark Job on Cloud Dataproc Using Google Cloud Storage

Hands-On Lab

 


Matthew Ulasien

Team Lead Google Cloud in Content

Length: 00:30:00

Difficulty: Intermediate

This hands-on lab introduces how to use Google Cloud Storage as the primary input and output location for Dataproc cluster jobs. Leveraging GCS over the Hadoop Distributed File System (HDFS) allows us to treat clusters as ephemeral entities, so we can delete clusters that are no longer in use, while still preserving our data.


Introduction

In this hands-on lab, we will cover how to use Google Cloud Storage as the primary input and output location for Dataproc cluster jobs.

Solution

On the lab page, right-click Open GCP Console and select the option to open it in a new private browser window.

This option will read differently depending on the browser being used:

  • In Chrome it says "Open Link in Incognito Window".

  • In Firefox it says "Open link in new private window."

  • In Microsoft Edge, the message will be "Open in InPrivate window."

  • In Safari, hold the Alt or Option key, then right-click to get a menu where we will choose "Open link in new private window."

This will avoid any cached login issues. Once we're at the login screen, sign in to Google Cloud Platform using the login info provided in the Credentials section of the hands-on lab page.

On the Welcome to your new account screen, review the text, and click Accept. In the "Welcome L.A.!" window that pops up once we're signed in, check the box to agree to the terms of service, choose your country of residence, and then click Agree and Continue.

Prepare Our Environment

  1. Click the Activate Cloud Shell icon (the small square command icon) in the top right of the screen.

  2. Click START CLOUD SHELL.

  3. First, we need to enable the Dataproc API, with:

    gcloud services enable dataproc.googleapis.com
  4. Next, we will create a Cloud Storage bucket:

    gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data
  5. Verify in the web console that the bucket was created by accessing the top-left menu, and then click Storage.

    Note: We should see that our bucket name is identical to the project ID, followed by -data.

  6. Now create the ephemeral Dataproc cluster:

    gcloud dataproc clusters create wordcount --zone=us-central1-f --single-node --master-machine-type=n1-standard-2
  7. Verify that the Dataproc cluster was created by accessing the top-left menu, and then clicking Dataproc underneath the BIG DATA section.

    Note: It can take a few minutes for the creation process to complete.

  8. Finally, download the lab files, including the wordcount.py script that will be used for the pyspark job:

    gsutil cp -r gs://la-gcp-labs-resources/data-engineer/dataproc/* .
  9. We can view the directory contents with the ls command.

  10. Look at the wordcount.py file directly with the following command (a sketch of what this script does appears at the end of this list):

    vim wordcount.py
  11. Use the following command to view the input text file that the job will process:

    vim romeoandjuliet.txt
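
For reference, the wordcount.py script we just opened follows the shape of the standard Dataproc PySpark wordcount example. The sketch below is an approximation (the lab's actual file may differ in details): it takes the input URI and output URI as its two command-line arguments, counts words, and writes the results back to Cloud Storage.

    # Minimal sketch of a PySpark wordcount, modeled on the standard Dataproc example.
    # The lab's wordcount.py may differ slightly.
    import sys
    from pyspark import SparkContext

    if len(sys.argv) != 3:
        raise Exception("Exactly 2 arguments are required: <inputUri> <outputUri>")

    input_uri = sys.argv[1]   # e.g. gs://.../romeoandjuliet.txt
    output_uri = sys.argv[2]  # e.g. gs://<project-id>-data/output/

    sc = SparkContext()
    lines = sc.textFile(input_uri)                    # read the text file from GCS
    words = lines.flatMap(lambda line: line.split())  # split each line into words
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(output_uri)                 # writes part-* files to the output URI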

Submit the Pyspark Job to the Dataproc Cluster

  12. In Cloud Shell, submit the job with the following command, and then hit Enter. Everything after the standalone -- is passed as an argument to wordcount.py: first the input text file, then the Cloud Storage output location.

    gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount -- \
        gs://la-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt \
        gs://$DEVSHELL_PROJECT_ID-data/output/
  13. View the progress by going back to the Dataproc page's Cluster Details section, and click Jobs on the top-left menu to access the job.

    Note: This job may take approximately 30-45 seconds to complete, and we should see a confirmation message in the Job Details section when clicking on the job.

  14. Navigate to the top-left menu, and then click Storage.

  15. Click on the data location bucket.

    Note: Do not click on the staging bucket that has dataproc in its name.

  16. Click the output folder to view the job's output files. (A quick command-line check of the output location is sketched after this list.)
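
As a quick command-line alternative to the console steps above, we can also list the output objects directly from Cloud Shell. Exact file names may vary, but once the job finishes there should be one or more part-* files (and typically a _SUCCESS marker):

    gsutil ls gs://$DEVSHELL_PROJECT_ID-data/output/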

Review the Pyspark Output

  1. In Cloud Shell, download output files from the GCS output location:

    gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .

    Note: Alternatively, we could download them to our local machine via the web console, or stream them straight from the bucket as shown after this list.

  2. We can view the directory contents again with the ls command.

  3. Use the following to view one of the output files:

    vim part-00001
  4. Use the following to view the other output file:

    vim part-00000
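
If we would rather not download anything at all, the same output can be streamed straight from the bucket with gsutil cat (this assumes the part-* naming used above; quoting the URL lets gsutil handle the wildcard rather than the local shell):

    gsutil cat "gs://$DEVSHELL_PROJECT_ID-data/output/part-*" | less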

Delete the Dataproc Cluster

  1. We don't need our cluster any longer, so let's delete it. In the web console, go to the top-left menu and into BIG DATA > Dataproc.

  2. Select the wordcount cluster, then click DELETE > OK to confirm. (An equivalent Cloud Shell command is shown after this list.)

    Our job output still remains in Cloud Storage, allowing us to delete Dataproc clusters when no longer in use to save costs, while preserving input and output resources.
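
For reference, the same cleanup can be done from Cloud Shell. This is a sketch: the command prompts for confirmation, and depending on the gcloud version a --region flag may also be required.

    gcloud dataproc clusters delete wordcount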

Conclusion

Congratulations - you've completed this hands-on lab!