Google Cloud Certified Professional Data Engineer

Course

January 18th, 2019

Intro Video

Photo of Matthew Ulasien

Matthew Ulasien

Team Lead, Google Cloud Content

Length

20:07:45

Difficulty

Advanced

Course Details

A Google Cloud Professional Data Engineer is able to harness the power of Google's big data capabilities and make data-driven decisions by collecting, transforming, and visualizing data. By designing, building, maintaining, and troubleshooting data processing systems, with a particular emphasis on the security, reliability, fault tolerance, scalability, fidelity, and efficiency of those systems, a Google Cloud data engineer puts these systems to work.

This course will prepare you for the Google Cloud Professional Data Engineer exam by diving into all of Google Cloud's data services. With interactive demonstrations and an emphasis on hands-on work, you will learn how to master each of Google's big data and machine learning services and become a certified data engineer on Google Cloud.

Download the Data Dossier: https://interactive.linuxacademy.com/diagrams/TheDataDossier.html

Syllabus

Course Introduction

Getting Started

Course Introduction

00:05:48

Lesson Description:

Welcome to the Google Cloud Professional Data Engineer course. In this lesson, we will introduce the course, go over who this course is for and its prerequisites, and explain how to prepare for the live exam. Additional study resources for external technologies are below; we will also cover them near the end of this course and in our interactive Lucidchart document.
SQL deep dive: SQL Primer course - https://linuxacademy.com/cp/modules/view/id/52
Machine Learning: Google Machine Learning Crash Course (free) - https://developers.google.com/machine-learning/crash-course/
Hadoop: Hadoop Quick Start - https://linuxacademy.com/cp/modules/view/id/294
Apache Beam (Dataflow): Google's guide to designing your pipeline with Apache Beam (using Java) - https://cloud.google.com/dataflow/docs/guides/beam-creating-a-pipeline

About the Training Architect

00:00:41

Lesson Description:

Learn more about your training architect for this course, Matthew Ulasien.

Intro to the Data Dossier - Interactive Study Guide

00:04:53

Lesson Description:

Introduction of our interactive learning tool, the Data Dossier, that we will be using throughout this course. The link to access the Data Dossier is below. You can also access it from the Downloads section of this course. https://www.lucidchart.com/documents/view/0ca44a63-4ea4-4d78-8367-2465512d21be

Course and Exam Overview

00:05:06

Lesson Description:

In this lesson, we will cover:
What we can expect to see on the exam.
How this course will be structured to match the exam.
What topics and Google Cloud services we can expect to be tested on.

What is a Data Engineer

00:03:19

Lesson Description:

What exactly is the role of a data engineer? This early lesson will go over both Google's definition of a data engineer, and a more simplified version as well. As we go through this course, think about how the role of a data engineer matches the topics we will cover.

Foundational Concepts

Data Lifecycle

00:11:41

Lesson Description:

The Data Lifecycle describes the entire start to end process of collecting, storing, analyzing, and visualizing data. We need to be familiar with the cycle, and what Google Cloud services are associated with each step.

Batch and Streaming Data

00:05:28

Lesson Description:

We are going to discuss the difference between streaming (or real-time) data ingest and batch (or bulk) data ingest. These topics will come up throughout this course.

Cloud Storage as Staging Ground

00:07:16

Lesson Description:

We will continue discussing foundational concepts by learning how Google Cloud Storage is a common staging ground for many data engineering workflows.

Database Types

00:07:21

Lesson Description:

We will conclude this section by discussing the differences between relational and non-relational databases.

Monitoring Unmanaged Databases

00:06:04

Lesson Description:

Unmanaged databases such as MySQL, MariaDB, and Cassandra will be hosted on Compute Engine instances. You will need to know how to properly monitor these database applications, which we will cover in this lesson.

Google Cloud Data Engineer - Foundational Concepts

00:30:00

Managed Databases

Cloud SQL

Choosing a Managed Database

00:07:15

Lesson Description:

We are now going to explore the different managed database services on Google Cloud. We will start by taking a big-picture perspective on what the exam will expect you to know in choosing a certain database given a set of conditions, and we will go into more detail as we look closer at each one.

Cloud SQL Basics

00:05:07

Lesson Description:

Cloud SQL is managed MySQL/PostgreSQL. What exactly does it 'manage' compared to hosting a MySQL database on Compute Engine? This lesson will go into the specific differences in what is automated and handled for you, so you can focus on your data instead of server maintenance. Note: the storage limit for Cloud SQL has recently increased from 10 TB to 30 TB; check the Cloud SQL quotas and limits documentation for the current value.

Cloud SQL Hands On

00:12:46

Lesson Description:

We will now take the previous lesson's concepts and go hands-on with creating a managed MySQL instance using Cloud SQL, viewing options for automated backups, scaling, and more.

Importing Data

00:12:10

Lesson Description:

In this lesson, we will cover how to bulk import data into a Cloud SQL instance, including SQL dump and CSV files. Detailed steps of what was performed in the lesson are below.
We will be using a sample data GitHub repo here: https://github.com/linuxacademy/googledataengineer
The original source of this data from Google's GitHub can be viewed here: https://github.com/GoogleCloudPlatform/training-data-analyst
Clone our sample data to Cloud Shell:
Open Cloud Shell and clone the data from the repo:
git clone https://github.com/linuxacademy/googledataengineer
Create the Cloud Storage bucket to copy data into:
gsutil mb -l (your_region) gs://(bucket_name)
Browse to the Cloud SQL sample data directory:
cd googledataengineer/CPB100/lab3a/cloudsql
Copy all data into your cloud bucket:
gsutil cp * gs://(bucket_name)
Import an SQL dump file into Cloud SQL:
From Cloud SQL, click on the instance and click the Import button.
Click Browse, select the bucket, browse to table_creation.sql, and click Select.
Under the "Format of import" options, make sure SQL is selected as the import format (the default).
Click Import.
Import CSV tables into the SQL database (recommendation_spark):
From Cloud SQL, click on the instance and click the Import button.
Click Browse, select the bucket, browse to accommodation.csv, and click Select.
Expand the advanced options, and from the Database drop-down menu, select recommendation_spark.
Set the Table name to Accommodation.
Click Import.
Perform the same actions for the rating.csv file, setting the Table name to Rating.
Connect to your Cloud SQL instance:
Click Connect using Cloud Shell.
In Cloud Shell, hit Enter once the command is populated.
Enter the root password when prompted.
View tables and table data:
Switch to the database: use recommendation_spark;
View tables: show tables;
View the contents of one of the tables: select * from Rating;

SQL Query Best Practices

00:02:48

Lesson Description:

You may see some general SQL query and performance best practices questions on the exam, so we wanted to be sure we covered some of the general scenarios you may run across.

Hands-On Lab

00:30:00

Hands-On Lab

00:30:00

Datastore

Datastore Overview

00:09:21

Lesson Description:

Datastore is a no-ops, highly scalable NoSQL database option. What exactly does this mean, why would you want to use it over other storage options, and how does this relate to the exam? We will cover all of that in this lesson.

Data Organization

00:16:03

Lesson Description:

In this lesson, we will go hands-on with Datastore and also cover how data is organized and how entities relate to one another.

Queries and Indexing

00:11:29

Lesson Description:

We are going to discuss how to use queries to find information in your database and how indexing works in order to deliver those results quickly. We will want to pay special attention to how to avoid 'exploding indexes', which can increase storage use and hurt performance.
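
To make these ideas concrete, here is a minimal sketch (not part of the lesson) of writing and querying a Datastore entity with the google-cloud-datastore Python client; the Task kind, its properties, and the exact filter syntax are illustrative and vary slightly between client library versions.

from google.cloud import datastore

client = datastore.Client()

# Create an entity of kind 'Task' with a few indexed properties.
key = client.key('Task')
task = datastore.Entity(key=key)
task.update({'category': 'personal', 'done': False, 'priority': 4})
client.put(task)

# Query on a single property; single-property indexes are built automatically.
query = client.query(kind='Task')
query.add_filter('done', '=', False)

# Filtering or sorting on multiple properties requires a composite index
# defined in index.yaml; indexing many values across many properties at once
# is what leads to 'exploding indexes'.
for entity in query.fetch():
    print(entity)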

Data Consistency

00:06:18

Lesson Description:

Datastore can support both strong and eventual data consistency in its operations. We will go over a conceptual understanding of these terms, which will be important for the exam.
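
A minimal sketch of the distinction, assuming a hypothetical TaskList/Task entity group: a global query is eventually consistent, while an ancestor query (or a lookup by key) is strongly consistent.

from google.cloud import datastore

client = datastore.Client()

# Global (non-ancestor) query: eventually consistent.
eventual = client.query(kind='Task')

# Ancestor query scoped to one entity group: strongly consistent.
parent_key = client.key('TaskList', 'default')
strong = client.query(kind='Task', ancestor=parent_key)

print(list(eventual.fetch()))
print(list(strong.fetch()))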

Hands-On Lab

00:30:00

Bigtable

Bigtable Overview

00:07:44

Lesson Description:

We will take a high-level look at Bigtable, its history, how it differs from other managed databases, and take a detailed look at its underlying infrastructure.

Instance Configuration

00:18:11

Lesson Description:

In this lesson, we will go hands-on with creating a Bigtable instance, and explore how to resize instances along with some basic table interaction. The steps we took to interact with our table via the cbt command line tool are below for your reference.
Install the cbt command in the Google Cloud SDK:
gcloud components update
gcloud components install cbt
Configure cbt to use your project and instance via the .cbtrc file:
echo -e "project = [PROJECT_ID]\ninstance = [INSTANCE_ID]" > ~/.cbtrc
Create a table:
cbt createtable my-table
List the table:
cbt ls
Add a column family:
cbt createfamily my-table cf1
List the column family:
cbt ls my-table
Add a value to row r1, using column family cf1 and column qualifier c1:
cbt set my-table r1 cf1:c1=test-value
Use the cbt read command to read the data you added to the table:
cbt read my-table
Delete the table (if not deleting the instance):
cbt deletetable my-table

Data Organization

00:05:28

Lesson Description:

We will take a closer look at the structure of a Bigtable table, how it is indexed and queried, and the importance of a well-designed schema, which we will explore further in the next lesson.

Schema Design

00:08:37

Lesson Description:

Proper row key schema is critical to efficient indexing and performance and may be testable as well. We will cover the basics of a well-designed schema in this lesson.
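
As an illustration of the ideas in this lesson, here is a small sketch of building a row key through field promotion and a reversed timestamp; the sensor and metric names are made up, and the exact layout should always be driven by your query patterns.

import time

MAX_TS_MS = 10**13  # arbitrary ceiling used to reverse the timestamp

def make_row_key(sensor_id: str, metric: str, event_ts_ms: int) -> str:
    # Promote identifying fields into the key (general to specific) and avoid
    # leading with a raw timestamp, which concentrates writes on one node.
    reversed_ts = MAX_TS_MS - event_ts_ms
    return f"{sensor_id}#{metric}#{reversed_ts}"

print(make_row_key("sensor-042", "temperature", int(time.time() * 1000)))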

Cloud Spanner

Cloud Spanner Overview

00:11:17

Lesson Description:

In this lesson, we will introduce ourselves to Cloud Spanner, what makes it different from the other managed database options, and how the underlying architecture works. Cloud Spanner is simply defined as a no-compromises, highly scalable, relational database.

Data Organization and Schema

00:07:12

Lesson Description:

We will take a closer look at how tables in Cloud Spanner are different from a traditional RDBMS in that tables have a parent/child relationship and tables 'interleave' with each other.
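
Here is a minimal sketch, based on the same Singers/Albums example used in the next lesson, of how an interleaved child table is declared when creating a database with the Python client; treat it as illustrative rather than a complete schema.

from google.cloud import spanner

client = spanner.Client()
instance = client.instance('test-instance')

database = instance.database('example-db', ddl_statements=[
    """CREATE TABLE Singers (
         SingerId  INT64 NOT NULL,
         FirstName STRING(1024),
         LastName  STRING(1024)
       ) PRIMARY KEY (SingerId)""",
    # The child table's primary key starts with the parent's key, and its rows
    # are physically stored (interleaved) with the parent rows.
    """CREATE TABLE Albums (
         SingerId   INT64 NOT NULL,
         AlbumId    INT64 NOT NULL,
         AlbumTitle STRING(MAX)
       ) PRIMARY KEY (SingerId, AlbumId),
         INTERLEAVE IN PARENT Singers ON DELETE CASCADE""",
])
operation = database.create()
operation.result()  # wait for the schema to be applied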

Hands On and Viewing Examples

00:11:46

Lesson Description:

It is time to go hands-on to create a Spanner instance, and we will also populate another instance using Python scripts in order to examine a more complete database example. The steps below, run from Cloud Shell, will create a Spanner instance and populate Singer and Album data for you.
Google Cloud documentation to follow along with the Python example: https://cloud.google.com/spanner/docs/getting-started/python/
Clone the GitHub repository to run the scripts, and browse to the correct directory:
git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
cd python-docs-samples/spanner/cloud-client
Create a Python environment and install dependencies:
virtualenv env
source env/bin/activate
pip install -r requirements.txt
Create a Spanner instance named test-instance:
gcloud spanner instances create test-instance --config=regional-us-central1 --description="Test Instance" --nodes=1
Create a database and insert data using the Python scripts from our GitHub clone:
python snippets.py test-instance --database-id example-db create_database
python snippets.py test-instance --database-id example-db insert_data
Run a query to read the values of all columns from the Albums table:
gcloud spanner databases execute-sql example-db --instance=test-instance --sql='SELECT SingerId, AlbumId, AlbumTitle FROM Albums'

QUIZ: Managed Databases on Google Cloud

00:30:00

Hands-On Lab

00:30:00

Data Engineering Architecture

Real Time Messaging with Cloud Pub/Sub

Streaming Data Challenges

00:08:24

Lesson Description:

In this lesson, we will cover the challenges associated with reliably capturing streaming data, tightly vs. loosely coupled systems, and how Cloud Pub/Sub helps to resolve this issue.

Cloud Pub/Sub Overview

00:12:28

Lesson Description:

We are now going to take a detailed look at what exactly Pub/Sub does, how the process of publishing and subscribing to topics works, and many other points necessary for a thorough understanding that may also be on the exam.
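
For reference, here is a minimal Python sketch of the publish/subscribe flow described in this lesson, assuming a topic and subscription already exist; the request syntax shown is for the 2.x google-cloud-pubsub library and differs slightly in older versions.

from google.cloud import pubsub_v1

project_id = 'your-project-id'  # placeholder
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, 'my-topic')
subscription_path = subscriber.subscription_path(project_id, 'mySub1')

# Publish: the message body must be bytes; attributes are optional key/value pairs.
future = publisher.publish(topic_path, b'hello', origin='demo')
print('Published message ID:', future.result())

# Pull synchronously, then acknowledge so the message is not redelivered.
response = subscriber.pull(request={'subscription': subscription_path, 'max_messages': 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={'subscription': subscription_path, 'ack_ids': ack_ids})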

Pub/Sub Hands On

00:18:24

Lesson Description:

It is now time to jump into a hands-on demonstration of Cloud Pub/Sub using both the web console and the gcloud command line. For your reference, the steps we took in the demonstration are below so you can follow along.
Simple demonstration:
Create a topic called my-topic:
gcloud pubsub topics create my-topic
Create a subscription to topic my-topic:
gcloud pubsub subscriptions create --topic my-topic mySub1
Publish a message to your topic:
gcloud pubsub topics publish my-topic --message "hello"
Retrieve the message with your subscription, acknowledging receipt and removing it from the queue:
gcloud pubsub subscriptions pull --auto-ack mySub1
Delete the subscription:
gcloud pubsub subscriptions delete mySub1
Simulated traffic ingest:
Clone the GitHub data to Cloud Shell (or another SDK environment), and browse to the publish folder:
cd ~
git clone https://github.com/linuxacademy/googledataengineer
cd ~/googledataengineer/courses/streaming/publish
Create a topic called sandiego:
gcloud pubsub topics create sandiego
Create a subscription to topic sandiego:
gcloud pubsub subscriptions create --topic sandiego mySub1
Run the script to download sensor data:
./download_data.sh
(Optional) If you need to authenticate the shell to ensure you have the right permissions:
gcloud auth application-default login
View the script info (or use the viewer of your choice):
vim ./send_sensor_data.py
Run the Python script to simulate one hour of data per minute:
./send_sensor_data.py --speedFactor=60 --project=YOUR-PROJECT-ID
If you receive an error that google.cloud.pubsub cannot be found, or 'ImportError: No module named iterator', run the below pip command to install the components and then try again:
sudo pip install -U google-cloud-pubsub
Open a new Cloud Shell tab (using the + symbol) and pull the message using the subscription mySub1:
gcloud pubsub subscriptions pull --auto-ack mySub1
Create a new subscription and pull messages with it:
gcloud pubsub subscriptions create --topic sandiego mySub2
gcloud pubsub subscriptions pull --auto-ack mySub2

Connecting Kafka to GCP

00:05:13

Lesson Description:

Existing on-premises workloads may need to connect an existing Apache Kafka cluster to GCP. This lesson will cover the basics of how to integrate Kafka with GCP, focusing on how connectors work. If you would like to learn more about Kafka connectors for GCP, you can find more info at the link below: https://cloud.google.com/blog/products/gcp/apache-kafka-for-gcp-users-connectors-for-pubsub-dataflow-and-bigquery

Monitoring Subscriber Health

00:09:30

Lesson Description:

This lesson will cover the basics of monitoring your Pub/Sub topics and subscriptions. We want to avoid a backlog of undelivered messages in Pub/Sub, which requires keeping an eye on subscriber health. We will cover those concepts in this lesson.

Data Pipelines with Cloud Dataflow

Data Processing Pipelines

00:05:24

Lesson Description:

We will now start a discussion on the data processing pipelines using Cloud Dataflow. In this first lesson, we will go over what data processing is in a general sense, as well as challenges that must be dealt with using modern data processing techniques.

Cloud Dataflow Overview

00:10:09

Lesson Description:

We will take a closer look at many concepts that are necessary to understand Cloud Dataflow.

Key Concepts

00:09:43

Lesson Description:

In this lesson, we will cover key concepts and terminology you may see on the exam. We will also discuss how to deal with late/out of order data using Dataflow windows, watermarks, and triggers.
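
As a quick illustration of this vocabulary, the sketch below windows a keyed PCollection into fixed one-minute windows with an allowed lateness, using the Apache Beam Python SDK; the element values and settings are made up for the example.

import time
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.Create([('sensor-1', 10), ('sensor-1', 20), ('sensor-2', 5)])
     | 'Add timestamps' >> beam.Map(lambda kv: window.TimestampedValue(kv, time.time()))
     | 'Window' >> beam.WindowInto(
           window.FixedWindows(60),                       # 60-second fixed windows
           trigger=AfterWatermark(),                      # fire when the watermark passes the window
           allowed_lateness=120,                          # accept data up to 2 minutes late
           accumulation_mode=AccumulationMode.DISCARDING)
     | 'Sum per key' >> beam.CombinePerKey(sum)
     | beam.Map(print))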

Template Hands On

00:11:08

Lesson Description:

In the first of our two hands-on demonstrations, we are going to use a predefined template to conduct a word count on a classic Shakespeare play using Dataflow.
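
If you are curious what a template like this does under the hood, here is a minimal local word-count sketch written with the Apache Beam Python SDK; it is only an illustration of the pipeline shape (source, transforms, aggregation, output), not the actual template code.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Create lines' >> beam.Create(['to be or not to be', 'that is the question'])
     | 'Split words'  >> beam.FlatMap(lambda line: line.split())
     | 'Count words'  >> beam.combiners.Count.PerElement()
     | 'Print'        >> beam.Map(print))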

Streaming Ingest Pipeline Hands On

00:20:03

Lesson Description:

We are going to demonstrate how to take our streaming ingest of traffic sensor data from the previous section and run it through a Dataflow pipeline to calculate average speeds and output the results to BigQuery. The command line reference for what we are demonstrating is below.
Create a BigQuery dataset for the processing pipeline output:
bq mk --dataset $DEVSHELL_PROJECT_ID:demos
Create a Cloud Storage bucket for Dataflow staging:
gsutil mb gs://$DEVSHELL_PROJECT_ID
Create a topic and publish messages:
cd ~/googledataengineer/courses/streaming/publish
gcloud pubsub topics create sandiego
./download_data.sh
sudo pip install -U google-cloud-pubsub
./send_sensor_data.py --speedFactor=60 --project=$DEVSHELL_PROJECT_ID
Open a new Cloud Shell tab, browse to the Dataflow directory, and run the script to create a pipeline, passing along our project ID, storage bucket, and the AverageSpeeds file to construct the pipeline:
cd ~/googledataengineer/courses/streaming/process/sandiego
./run_oncloud.sh $DEVSHELL_PROJECT_ID $DEVSHELL_PROJECT_ID AverageSpeeds

Additional Best Practices

00:10:11

Lesson Description:

In this lesson we will cover additional best practices and topics such as gracefully handling input errors, a further exploration of Dataflow's windows, and updating jobs.

Dataproc

Dataproc Overview

00:10:48

Lesson Description:

In this lesson, we will introduce ourselves to Google Cloud Dataproc, the Hadoop ecosystem, and how it fits in the larger picture of the Google Cloud data lifecycle.

Configure Dataproc Cluster and Submit Job – Part 1

00:15:35

Lesson Description:

In part one of this two-part hands-on demonstration, we will go over creating a Dataproc cluster and submitting a sample Spark job to the cluster.
Google-provided Dataproc initialization scripts: https://console.cloud.google.com/storage/browser/dataproc-initialization-actions/
Reference for the example job submission: https://cloud.google.com/dataproc/docs/quickstarts/quickstart-console
Creating a Dataproc cluster:
gcloud dataproc clusters create [cluster_name] --zone [zone_name]
Note: though we installed Kafka via an initialization script in this lesson as an example, it is not actually being utilized.

Configure Dataproc Cluster and Submit Job – Part 2

00:14:35

Lesson Description:

This is the second half of our hands-on demonstration. Command line notes of the actions taken in this lesson are below.
Update a cluster with new workers/preemptible machines:
gcloud dataproc clusters update [cluster_name] --num-workers [#] --num-preemptible-workers [#]
SOCKS proxy configuration:
From the local machine, SSH to the master to enable port forwarding:
gcloud compute ssh master-host-name --project=project-id --zone=master-host-zone -- -D 1080 -N
Open a new terminal window and launch the web browser with parameters (varies by OS/browser):
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --proxy-server="socks5://localhost:1080" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/cluster1-m
Browse to http://[master]:port, using port 8088 for Hadoop (YARN) and 9870 for HDFS.
Using Cloud Shell (must be repeated for each port):
gcloud compute ssh master-host-name --project=project-id --zone master-host-zone -- -4 -N -L port1:master-host-name:port2
Then use Web Preview to choose the port (8088/9870).

Migrating and Optimizing for Google Cloud

00:09:49

Lesson Description:

In this lesson, we will cover the best way to migrate your Hadoop/Spark workflows to Google Cloud, and how to take the most advantage of the cloud model.

Best Practices for Cluster Performance

00:05:42

Lesson Description:

In this lesson, we are going to explore some service-specific best practices to keep in mind to optimize your Dataproc cluster's performance.

QUIZ: Data Ingest and Processing

00:30:00

Analyzing Data and Enabling Machine Learning

BigQuery

BigQuery Overview

00:14:43

Lesson Description:

We are now going to take a look at BigQuery, which is Google's massive scale, no-ops data warehouse product. BigQuery is often the central pillar of big data solutions on Google Cloud. We will start with a general overview of BigQuery before we get into the more detailed hands-on lessons. Link to SQL Primer Course: https://linuxacademy.com/cp/modules/view/id/52 Roles Comparison Matrix: https://cloud.google.com/bigquery/docs/access-control#predefined_roles_comparison_matrix

Interacting with BigQuery

00:22:10

Lesson Description:

In this lesson, we will cover the variety of methods for interacting with BigQuery, in both the web UI and the command line. We will cover both the basics and exam-related topics, such as searching with wildcards and creating views.
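
As a companion to the demo, here is a hedged example of a wildcard table query run through the BigQuery Python client against the public NOAA GSOD dataset; the same standard SQL works in the web UI and the bq command line tool.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT AVG(temp) AS avg_temp
    FROM `bigquery-public-data.noaa_gsod.gsod19*`
    WHERE _TABLE_SUFFIX BETWEEN '29' AND '35'   -- matches gsod1929 through gsod1935
"""
for row in client.query(sql).result():  # result() blocks until the query finishes
    print(row.avg_temp)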

Load and Export Data

00:19:02

Lesson Description:

In this lesson, we will cover how to load data into BigQuery, read data from external sources, and export data out of BigQuery. For reference, the below command is valid for loading birth data into a BigQuery table: bq load names.baby_names gs://(YOUR_BUCKET)/names/yob*.txt Name:STRING,Gender:STRING,Number:INTEGER
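
For comparison, this is a hedged Python equivalent of that bq load command, using the same schema; the bucket path is a placeholder, and LoadJobConfig keyword arguments vary slightly between client library versions.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField('Name', 'STRING'),
        bigquery.SchemaField('Gender', 'STRING'),
        bigquery.SchemaField('Number', 'INTEGER'),
    ],
)
load_job = client.load_table_from_uri(
    'gs://YOUR_BUCKET/names/yob*.txt',   # wildcard matches each year's file
    'names.baby_names',                  # dataset.table in the default project
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
print(client.get_table('names.baby_names').num_rows, 'rows loaded')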

Optimize for Performance and Costs

00:15:29

Lesson Description:

Better performance and saving money are great things. Following best practices for constructing your BigQuery queries can do both. We will look at how to optimize performance and costs as well as explore how to interpret query details and the breakdown between stages.
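
One cost-control habit worth illustrating is the dry run: estimating how many bytes a query will scan (and therefore roughly what it will cost) before running it. The sketch below uses the BigQuery Python client and a public dataset; only the named columns are selected, which is itself a cost best practice.

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query_job = client.query(
    'SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_2013`',
    job_config=job_config,
)
# A dry run reads no data and incurs no cost; it only reports the scan size.
print('This query would process', query_job.total_bytes_processed, 'bytes')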

Streaming Insert Example

00:08:38

Lesson Description:

We are going to revisit our streaming Dataflow/PubSub pipeline and take a closer look at the end result, which is a streaming insert into BigQuery. For easy setup, the below commands and scripts will quickly create your pipeline so we can focus on BigQuery, as well as clean up when finished.
Quick setup:
cd
gsutil cp -r gs://gcp-course-exercise-scripts/data-engineer/* .
bash streaming-insert.sh
Clean up:
bash streaming-cleanup.sh
Manually stop the Dataflow job.
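
To show what the end of that pipeline amounts to, here is a minimal sketch of a streaming insert performed directly with the BigQuery Python client; the demos.average_speeds table name and its fields are placeholders rather than the exact schema the Dataflow job writes.

from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {'sensor_id': 'sensor-042', 'avg_speed': 61.4},
    {'sensor_id': 'sensor-043', 'avg_speed': 58.9},
]
errors = client.insert_rows_json('demos.average_speeds', rows)  # streaming insert
if errors:
    print('Some rows failed to insert:', errors)
else:
    print('Rows are available for query within seconds of insertion.')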

BigQuery Logging and Monitoring

00:08:18

Lesson Description:

This lesson will cover which Stackdriver products to use for which use case scenarios for keeping tabs on your BigQuery environment.

BigQuery Best Practices

00:14:53

Lesson Description:

This lesson will cover an assortment of best practices for working with BigQuery that build on the previous topics in this lesson to help control costs, improve performance, and protect your data.

QUIZ: BIGQUERY

00:30:00

Machine Learning

What is Machine Learning?

00:14:45

Lesson Description:

In this lesson, we will go over the basic concepts of machine learning, which will be important for the next few sections. Machine learning is the process of teaching a machine to recognize new data by training it with similar examples.

Working with Neural Networks

00:15:08

Lesson Description:

We are going to go hands-on with a demonstration neural network and learn about how all of the different pieces fit together. This is going to be a terminology-heavy lesson, so be sure to review all the concepts in the associated Lucidchart. The hands-on TensorFlow Playground we work with in this lesson is located at: https://playground.tensorflow.org

Preventing Overfitted Training Data

00:07:40

Lesson Description:

Overfitting occurs when your model is too closely fitted to your training data and is unable to generalize to new data. In this lesson, we are going to cover the causes of overfitting, solutions to it, and the difference between L1 and L2 regularization.
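
To tie the terminology to something concrete, here is a small TensorFlow/Keras sketch (not required for the exam) that applies the two regularizers discussed here; the layer sizes and penalty values are arbitrary.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation='relu', input_shape=(20,),
        kernel_regularizer=tf.keras.regularizers.l1(0.01)),  # L1 drives some weights to exactly zero
    tf.keras.layers.Dense(
        64, activation='relu',
        kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 shrinks all weights toward zero
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()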

AI Platform (Formerly Cloud ML Engine)

GCP Machine Learning Services

00:05:57

Lesson Description:

In this lesson, we will cover the two primary methods that GCP offers to assist with your machine learning needs. We will cover AI Platform, which provides fully managed resources to train and deploy machine learning models, and the pre-trained APIs, which provide a 'plug and play' solution for application developers.

AI Platform Overview

00:16:52

Lesson Description:

In this lesson, we will take a close look at AI Platform before we go into some hands-on demonstrations.

AI Platform Hands On Part 1

00:15:04

Lesson Description:

In these two hands-on demonstrations, we will go over the process of training a pre-packaged machine learning model on AI Platform, deploying the trained model, and running predictions against it. All commands used in these two demonstrations are below. If you want a set of scripts to automate much of the process, the scripts and a PDF file of the commands we used can be found at the following public link: https://console.cloud.google.com/storage/browser/gcp-course-exercise-scripts/data-engineer/ai-platform
If you want to download all scripts and files in a terminal environment, use the below command (the period at the end is required to copy to your current location):
gsutil cp gs://gcp-course-exercise-scripts/data-engineer/ai-platform/* .
Full guide below:
Set the region environment variable:
REGION=us-central1
Download and unzip the ML GitHub data for the demo:
wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip
unzip master.zip
Navigate to the cloudml-samples-master > census > estimator directory. ALL commands must be run from this directory:
cd ~/cloudml-samples-master/census/estimator
Develop and validate the trainer on the local machine:
Get the training data from the public GCS bucket:
mkdir data
gsutil -m cp gs://cloud-samples-data/ml-engine/census/data/* data/
Set path variables for the local file paths (these will change later when we use AI Platform):
TRAIN_DATA=$(pwd)/data/adult.data.csv
EVAL_DATA=$(pwd)/data/adult.test.csv
Run the sample requirements.txt to ensure we're using the same version of TensorFlow as the sample:
sudo pip install -r ~/cloudml-samples-master/census/requirements.txt
Run a local trainer. Specify the output directory, set as a variable:
MODEL_DIR=output
Best practice is to delete the contents of the output directory in case data remains from a previous training run:
rm -rf $MODEL_DIR/*
Run local training using gcloud:
gcloud ai-platform local train --module-name trainer.task --package-path trainer/ --job-dir $MODEL_DIR -- --train-files $TRAIN_DATA --eval-files $EVAL_DATA --train-steps 1000 --eval-steps 100
Run the trainer on GCP AI Platform (single instance):
Create a regional Cloud Storage bucket used for all output and staging:
gsutil mb -l $REGION gs://$DEVSHELL_PROJECT_ID-aip-demo
Upload the training and test/eval data to the bucket:
cd ~/cloudml-samples-master/census/estimator
gsutil cp -r data gs://$DEVSHELL_PROJECT_ID-aip-demo/data
Set the data variables to point to the storage bucket files:
TRAIN_DATA=gs://$DEVSHELL_PROJECT_ID-aip-demo/data/adult.data.csv
EVAL_DATA=gs://$DEVSHELL_PROJECT_ID-aip-demo/data/adult.test.csv
Copy test.json to the storage bucket:
gsutil cp ../test.json gs://$DEVSHELL_PROJECT_ID-aip-demo/data/test.json
Set TEST_JSON to point to the same storage bucket file:
TEST_JSON=gs://$DEVSHELL_PROJECT_ID-aip-demo/data/test.json
Set variables for the job name and output path:
JOB_NAME=census_single_1
OUTPUT_PATH=gs://$DEVSHELL_PROJECT_ID-aip-demo/$JOB_NAME
Submit a single-process job to AI Platform. The job name is JOB_NAME (census_single_1), the output path is our Cloud Storage bucket/job name, and the training and evaluation/test data is in our Cloud Storage bucket:
gcloud ai-platform jobs submit training $JOB_NAME --job-dir $OUTPUT_PATH --runtime-version 1.4 --module-name trainer.task --package-path trainer/ --region $REGION -- --train-files $TRAIN_DATA --eval-files $EVAL_DATA --train-steps 1000 --eval-steps 100 --verbosity DEBUG
You can view the streaming logs/output with:
gcloud ai-platform jobs stream-logs $JOB_NAME
When complete, inspect the output path with:
gsutil ls -r $OUTPUT_PATH
Run distributed training on AI Platform:
Create a variable for the distributed job name:
cd ~/cloudml-samples-master/census/estimator
JOB_NAME=census_dist_1
Set a new output path to the Cloud Storage location using the new JOB_NAME variable:
OUTPUT_PATH=gs://$DEVSHELL_PROJECT_ID-aip-demo/$JOB_NAME
Submit a distributed training job. The '--scale-tier STANDARD_1' option is the new item that initiates distributed scaling:
gcloud ai-platform jobs submit training $JOB_NAME --job-dir $OUTPUT_PATH --runtime-version 1.4 --module-name trainer.task --package-path trainer/ --region $REGION --scale-tier STANDARD_1 -- --train-files $TRAIN_DATA --eval-files $EVAL_DATA --train-steps 1000 --verbosity DEBUG --eval-steps 100
Prediction phase (testing it out):
Deploy a model for prediction, setting variables in the process:
cd ~/cloudml-samples-master/census/estimator
MODEL_NAME=census
Create the ML Engine model:
gcloud ai-platform models create $MODEL_NAME --regions=$REGION
Set the job output we want to use. This example uses census_dist_1; change census_dist_1 to use a different output from a previous job:
OUTPUT_PATH=gs://$DEVSHELL_PROJECT_ID-aip-demo/census_dist_1
IMPORTANT: Look up and set the full path for the exported trained model binaries:
gsutil ls -r $OUTPUT_PATH/export
Look for the directory $OUTPUT_PATH/export/census/ and copy/paste the timestamp value (without the colon) into the below command:
MODEL_BINARIES=gs://$DEVSHELL_PROJECT_ID-aip-demo/census_dist_1/export/census/<timestamp> ###CHANGE ME!
Create version 1 of your model:
gcloud ai-platform versions create v1 --model $MODEL_NAME --origin $MODEL_BINARIES --runtime-version 1.4
Send an online prediction request to our deployed model using the test.json file. Results come back with a direct response:
gcloud ai-platform predict --model $MODEL_NAME --version v1 --json-instances ../test.json
Send a batch prediction job using the same test.json file. Results are exported to a Cloud Storage bucket location. Set the job name and output path variables:
JOB_NAME=census_prediction_1
OUTPUT_PATH=gs://$DEVSHELL_PROJECT_ID-aip-demo/$JOB_NAME
Submit the prediction job:
gcloud ai-platform jobs submit prediction $JOB_NAME --model $MODEL_NAME --version v1 --data-format TEXT --region $REGION --input-paths $TEST_JSON --output-path $OUTPUT_PATH/predictions
View the results in the web console at gs://$DEVSHELL_PROJECT_ID-aip-demo/$JOB_NAME/predictions/

AI Platform Hands On Part 2

00:15:37

Lesson Description:

In these two hands-on demonstrations, we will go over the process of training a pre-packaged machine learning model on AI Platform, deploying the trained model, and running predictions against it. The full command reference for both parts is identical and is listed under AI Platform Hands On Part 1 above. If you want a set of scripts to automate much of the process, the scripts and a PDF file of the commands we used can be found at the following public link: https://console.cloud.google.com/storage/browser/gcp-course-exercise-scripts/data-engineer/ai-platform

Pretrained Machine Learning API's

Pre-trained ML API's

00:09:10

Lesson Description:

We are going to move on to Google's 'plug and play' machine learning APIs that you can insert into your own applications. We will follow up this lesson with a hands-on demonstration of the Vision API service.

Vision API Demo

00:13:28

Lesson Description:

We will demonstrate working with the Vision API service in this lesson. The commands we used to authenticate with our API key, create the request.json file, and call the API with curl are below for your reference.
Authenticate with your API key after creating the API credential:
export API_KEY=(your copied key)
Create your JSON file by typing:
vim request.json
Then copy/paste the following text, substituting the Cloud Storage info with your actual Cloud Storage bucket and object info:
{
  "requests": [
    {
      "image": {
        "source": {
          "gcsImageUri": "gs://(your_bucket)/(your_image_file)"
        }
      },
      "features": [
        { "type": "LABEL_DETECTION", "maxResults": 10 },
        { "type": "WEB_DETECTION", "maxResults": 10 },
        { "type": "FACE_DETECTION" }
      ]
    }
  ]
}
Use the below command to call the Vision API, authenticating with your API key and supplying the request.json file we just created:
curl -s -X POST -H "Content-Type: application/json" --data-binary @request.json https://vision.googleapis.com/v1/images:annotate?key=${API_KEY}

QUIZ: MACHINE LEARNING ON GOOGLE CLOUD

00:30:00

Datalab

Datalab Overview

00:08:45

Lesson Description:

We are going to explore Datalab, which is built on Jupyter notebooks. These are very popular for interactive data science/engineering and for visually exploring Python code. NOTE: since this course was published, the Service Account Actor role has been deprecated; the Service Account User role should be used instead (https://cloud.google.com/iam/docs/service-accounts#the_service_account_actor_role). However, the Datalab docs still state to use the Service Account Actor role, so we're treating both answers as technically correct for the time being, with the understanding that the Actor role will be phased out.

Datalab Demo

00:17:48

Lesson Description:

In this lesson, we will go hands-on with Datalab, which is actually quite fun and interesting.
To create a Datalab notebook using Cloud Shell, type:
datalab create (instance-name)
To connect to a Datalab notebook, type:
datalab connect (instance-name)

Data Visualization

Cleaning Your Data with Dataprep

What is Dataprep?

00:08:58

Lesson Description:

We are now going to discuss Cloud Dataprep, which is an intuitive web-based interface for cleaning and preparing data for use, backed by Cloud Dataflow.

Dataprep Demo Part 1

00:14:15

Lesson Description:

This is the first in a three-part demo of Dataprep. The public cloud storage bucket we used for our datasets is gs://dataprep-samples.

Dataprep Demo Part 2

00:16:32

Lesson Description:

This is the second portion of our fairly extensive Dataprep demo.

Dataprep Demo Part 3

00:11:48

Lesson Description:

This is the third part of our Dataprep demo. In this lesson, we will take our joined datasets and start a managed Dataflow job, the output of which we will then insert into a BigQuery table.

Building Data Visualizations with Data Studio

Data Studio Introduction

00:09:31

Lesson Description:

This section will focus on Data Studio, which allows anyone to create interactive report dashboards from data sources for free. We will go over the primary concepts, followed by a hands-on demonstration.

Data Studio Demo

00:28:19

Lesson Description:

In this lesson, we will demonstrate working with Data Studio. We will take the cleaned Dataprep output from the previous section, which we exported to BigQuery, and use that data source to get insights about our ratings.

QUIZ: DATALAB/DATAPREP/DATA STUDIO

00:30:00

Monitoring and Orchestration

Orchestrating Data Workflows with Cloud Composer

Cloud Composer Overview

00:08:26

Lesson Description:

Cloud Composer is a fully managed implementation of Apache Airflow, which enables you to automatically orchestrate and monitor various big data workflows. We will cover the primary concepts of Cloud Composer, to be followed up with a hands-on demonstration in the next lesson.
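
Since Cloud Composer workflows are just Apache Airflow DAGs, here is a minimal DAG sketch (not the quickstart DAG used in the next lesson) showing the basic pattern of tasks chained into a schedule; the task names and commands are placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'data-engineering',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2019, 1, 1),
}

with DAG('example_workflow', default_args=default_args, schedule_interval='@daily') as dag:
    extract = BashOperator(task_id='extract', bash_command='echo "pull data from the source"')
    transform = BashOperator(task_id='transform', bash_command='echo "run the processing job"')
    load = BashOperator(task_id='load', bash_command='echo "load results into BigQuery"')

    extract >> transform >> load  # run the tasks in order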

Hands On - Cloud Composer

00:15:25

Lesson Description:

This lesson will be a hands-on tour of Cloud Composer in action. The commands we used in this lesson are duplicated below.
Create a GCS bucket to output the Dataproc results (using the Project ID shell variable):
gsutil mb -l us-central1 gs://output-$DEVSHELL_PROJECT_ID
Set the Cloud Composer variables necessary for our workflow. The Project ID will again be represented by the shell variable to auto-resolve to your unique Project ID:
gcloud composer environments run my-environment --location us-central1 variables -- --set gcp_project $DEVSHELL_PROJECT_ID
gcloud composer environments run my-environment --location us-central1 variables -- --set gcs_bucket gs://output-$DEVSHELL_PROJECT_ID
gcloud composer environments run my-environment --location us-central1 variables -- --set gce_zone us-central1-c
Upload the example DAG file to the DAG folder for Cloud Composer. A copy of the DAG file can be found at the links below.
Direct link for local download: https://storage.googleapis.com/la-gcloud-course-resources/data-engineer/cloud-composer/quickstart.py
Public GCS location: gs://la-gcloud-course-resources/data-engineer/cloud-composer/quickstart.py
GitHub link: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/b80895ed88ba86fce223df27a48bf481007ca708/composer/workflows/quickstart.py

Course Conclusion

Final Steps

Additional Study Resources

00:02:56

Lesson Description:

Let's go over some additional study resources that cover the external technologies you are required to know for the exam.
SQL deep dive: SQL Primer course - https://linuxacademy.com/cp/modules/view/id/52
Machine Learning: Google Machine Learning Crash Course (free) - https://developers.google.com/machine-learning/crash-course/
Hadoop: Hadoop Quick Start - https://linuxacademy.com/cp/modules/view/id/294
Apache Beam (Dataflow): Google's guide to designing your pipeline with Apache Beam (using Java) - https://cloud.google.com/dataflow/docs/guides/beam-creating-a-pipeline

Additional Hands On and Practice Resources

00:04:43

Lesson Description:

Below are additional resources directly from Google that I highly recommend working with for even more hands-on practice:
Official Google Cloud Data Engineer practice exam: https://cloud.google.com/certification/practice-exam/data-engineer
Google Cloud Solutions Center: https://cloud.google.com/solutions/
Google Codelabs for hands-on practice across all products: https://codelabs.developers.google.com/
Big Data and Machine Learning Blog: https://cloud.google.com/blog/big-data/
BigQuery tutorials: https://cloud.google.com/bigquery/docs/tutorials
Dataflow tutorials: https://cloud.google.com/dataflow/examples/examples-beam
Dataproc samples and tutorials: https://cloud.google.com/dataproc/docs/tutorials
Pub/Sub tutorials: https://cloud.google.com/pubsub/docs/tutorials
Cloud ML Engine tutorials: https://cloud.google.com/ml-engine/docs/tensorflow/tutorials
Bigtable tutorials and samples: https://cloud.google.com/bigtable/docs/samples
Cloud Spanner tutorials: https://cloud.google.com/spanner/docs/tutorials

What's Next After Certification?

00:03:41

Lesson Description:

Congratulations on making it to the end of this course. Here are your next steps to get you prepared for the Data Engineer certification exam.

Get Recognized!

00:01:01

Lesson Description:

How to get recognized for your certification.

Data Engineer - Final Exam

02:00:00