Google Cloud DevOps and SREs

Course

Joseph Lowery

Google Cloud Training Architect II in Content

Length

02:00:00

Difficulty

Advanced

Videos

20

Course Details

Google Cloud DevOps and SREs

Welcome to the Google Cloud DevOps and SREs course. This course is the second in the Google Professional Cloud DevOps Engineer certification path. If you're coming from the traditional DevOps world, or even from the general computing world, you're likely not familiar with the abbreviation SRE. SRE stands for Site Reliability Engineering and it's the Google method for realizing DevOps or, in the more formal software speak, "class SRE implements DevOps."

Besides SRE, this field introduces a metric ton of abbreviations (SLI, SLO, SLA), not to mention some weird-sounding phrases such as "error budget" and "toil." During this course, I'll explain what each of these terms means, how they interconnect, and how they relate to the concept of DevOps.

The SRE approach is highly quantitative. But don't worry; I'll explore the exact formulas you'll need to calculate baseline values for each of the key criteria. I'll help you see how Google maximizes the engineering velocity of developer teams while keeping products reliable.

In order to balance development and operations, you need to keep an eagle eye on operations. We'll dive into the various SRE strategies for monitoring reliability with special attention to alerting capabilities. Critically, we'll spend a good amount of time exploring the best way to handle the inevitable issues and incidents that are part of any service lifecycle.

And it's not just me here to help you out. My colleague, Mattias Andersson, will stop by at the end of every section for a quick recap and perhaps a slightly different perspective on the topics covered.

We recommend you have an Associate Cloud Engineer level certification before taking this course.

If the world of DevOps in general or Site Reliability Engineering specifically is new to you – whether or not you're on the certification path – be sure to take this course before diving into our development and operations offerings. It's designed to lay the foundation you'll need before you get hands-on.

Syllabus

Introduction

Upcoming Lesson: About the Course and Learning Path

Lesson Description:

This short video gives you an overview of the course as well as an introduction to the Training Architect, Joseph Lowery. This course is the second one in the Google Professional Cloud DevOps Engineer certification path. If you are following this certification path and have not viewed the first course, Google Professional Cloud DevOps Engineer Certification Path Introduction, please take that course before proceeding.

About the Training Architects

00:01:08

Lesson Description:

Meet the training architect of this course, Joseph Lowery. Joe has been working with Google Cloud for over five years, transitioning websites to the cloud via App Engine, Compute Engine, Cloud Storage, Cloud Datastore, and other services. He is Linux Academy's training architect for Google Cloud Essentials, Google Kubernetes Engine Deep Dive, Google App Engine Deep Dive, Google Cloud Functions Deep Dive, and Google Cloud Apigee Certified API Engineer, as well as a full slate of hands-on labs for Google Cloud.

Upcoming Lesson: Milestone: Getting Started?

Lesson Description:

What is the context for SRE that we should already know? And what's coming up?

Balancing Change, Velocity, and Service Reliability with SREs

Big Picture: What Is Site Reliability Engineering?

00:13:06

Lesson Description:

When you first start investigating DevOps on Google Cloud, especially if you are preparing for the Google Professional Cloud DevOps Engineer certification, you'll almost immediately come across the term Site Reliability Engineering (SRE). SRE is the practice, originated and followed by Google, that implements the DevOps philosophy. This lesson explores the fundamentals of SRE and how it relates to DevOps, setting the stage for a deeper dive into the key aspects of the SRE methodology.

Understanding SLIs

00:12:42

Lesson Description:

As we discovered in the previous lessons, measurements are critical in both DevOps and SRE. In this lesson, you'll see how those measurements are put to use in the creation of Service Level Indicators, or SLIs. In a very real sense, SLIs are at the heart of Site Reliability Engineering, as they form the baseline for targets that drive operations and thus allow for new features to be developed. So let's dive right in!
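The lesson description doesn't spell out how an SLI is computed, so here is a minimal sketch of one widely used formulation: the percentage of valid events that were good. The function name `availability_sli` and the sample request counts are assumptions made up for illustration, not material from the course.

```python
def availability_sli(good_events: int, valid_events: int) -> float:
    """Return an SLI as the percentage of valid events that were good.

    Hypothetical helper: "good" might mean HTTP requests that returned
    a non-5xx status, and "valid" all requests that count toward the SLI.
    """
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100.0 * good_events / valid_events


# Made-up sample numbers: 100,000 valid requests, 50 of them failed.
sli = availability_sli(good_events=99_950, valid_events=100_000)
print(f"Availability SLI: {sli:.2f}%")
```

Expressing the SLI as a ratio keeps it comparable across services with very different traffic volumes.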

Understanding SLOs

00:06:59

Lesson Description:

Now that we've found a way, with SLIs, to measure indicators that reflect our users' journeys through our application, we need to transform those indicators into achievable targets. This process involves creating Service Level Objectives (SLOs). In this lesson, I'll explore why SLOs are an important tool for every organization employing SRE principles and show you more specifically how they relate to SLIs.
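To make the SLI-to-SLO relationship concrete, here is a small sketch that treats an SLO as a target value for an SLI over a measurement window. The `Slo` class, its field names, and the 99.9% over 28 days figures are illustrative assumptions, not values taken from the course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: a target for one SLI over a window."""

    name: str
    target_percent: float  # e.g. 99.9 means "99.9% of events are good"
    window_days: int       # length of the measurement window

    def is_met(self, measured_sli_percent: float) -> bool:
        """An SLO is met when the measured SLI is at or above the target."""
        return measured_sli_percent >= self.target_percent


# Assumed example objective and two made-up measurements.
availability = Slo(name="availability", target_percent=99.9, window_days=28)
print(availability.is_met(99.95))  # measured SLI above the target
print(availability.is_met(99.50))  # measured SLI below the target
```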

Understanding SLAs

00:08:11

Lesson Description:

We come now to explore the third element of the SRE triumvirate: Service Level Agreements, or SLAs. An SLA builds upon the SLO, which, in turn, builds upon an SLI, but remains quite distinct. The "agreement" in a Service Level Agreement is between the company providing the service and its customers. And while this connection may seem to be outside the scope of a cloud computing engineer's duties, it has a direct effect on them. In this lesson, I'll delve into SLAs, their relationship to SLOs, and why they matter to any site reliability engineer.

Upcoming Lesson: Milestone: Oh My!

Lesson Description:

To achieve effective teamwork and intentional improvement, you need to agree on definitions and targets. SLIs, SLOs, and SLAs, indeed! And all for what? To drive at the core purpose of SRE, of course!

Making the Most of Risk

Upcoming Lesson: Setting Error Budgets

Lesson Description:

In the SRE world, error budgets are a key link between dev and ops teams: a well-kept error budget keeps the service running smoothly while allowing time for enhancements to be pushed live throughout the year. On the other hand, a depleted error budget will stop product launches dead in their tracks. In this lesson, I'll explain exactly what an error budget is, how it is calculated, and how it fits into sound Site Reliability Engineering principles.
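As a rough sketch of the calculation, an error budget is commonly taken to be the complement of the SLO: a 99.9% objective leaves 0.1% of events, or of time, for failure. The function names and sample figures below are assumptions for illustration; the lesson itself may present the formula differently.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of events (or time) allowed to fail under the SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int) -> float:
    """Time-based view: minutes of full outage the budget permits."""
    return error_budget_fraction(slo_percent) * window_days * 24 * 60


def budget_remaining(slo_percent: float, good: int, valid: int) -> float:
    """Event-based view: share of the budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if valid <= 0:
        raise ValueError("valid must be positive")
    allowed_bad = error_budget_fraction(slo_percent) * valid
    actual_bad = valid - good
    return 1.0 - actual_bad / allowed_bad


# Assumed example: a 99.9% SLO over a 30-day window.
print(f"{allowed_downtime_minutes(99.9, 30):.1f} minutes of downtime allowed")
# Made-up traffic: 1,000,000 valid requests, 500 of them failed.
print(f"{budget_remaining(99.9, 999_500, 1_000_000):.0%} of the budget left")
```

A negative result from `budget_remaining` corresponds to the depleted budget described above, the point at which launches would be paused.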

Defining and Reducing Toil

00:08:05

Lesson Description:

The word "toil" is very evocative and brings up images of grungy, almost pointless, work. In the realm of the SRE, toil is definitely work, but it's work that presents an opportunity: reduce toil and you improve the operation of your service. In this lesson, we'll spend some time clearly defining what is and what isn't toil. I'll also go over the benefits of reducing toil as well as the best tactics for accomplishing that goal.

Upcoming Lesson: Milestone: Risky Business

Lesson Description:

Risk can be a good thing when it's properly managed. Let's recap this section and see what it means for the bigger picture of SRE. Let's also look ahead to how measuring gives us the information we need to do that proper risk management.

Generating SRE Metrics

Monitoring Reliability

00:06:55

Lesson Description:

A big part of any Site Reliability Engineer's job is to keep an eye on the status of the current system, not only to track SLIs, but also to make sure the site is indeed reliable. In this lesson, we'll examine why monitoring is critical, the various aspects of the process, and how it is implemented on Google Cloud.

Alerting Principles

00:08:40

Lesson Description:

Alerts are an absolute must-have to keep a reliable site reliable. Problems, whether with the underlying infrastructure or inherent in the code, are bound to arise, and the sooner they are dealt with, the better. An alert literally sounds the alarm, but you don't want to ring that bell unless it's necessary, because too many pagers going off in the middle of the night can make for some seriously grumpy operations personnel. In this lesson, we'll discuss how to strike the right balance with your alerts to keep both users and staff reliably happy.

Investigating SRE Tools

00:04:01

Lesson Description:

Understanding the theory of Site Reliability Engineering and how it relates to DevOps is one thing, but how do you make it happen? You might remember that both DevOps and SRE put a heavy emphasis on unifying the whole team by using common tooling. In this lesson, we'll take a quick look at the various services commonly employed by Google SREs to achieve their goals.

Upcoming Lesson: Milestone: I See You!

Lesson Description:

You can't know what's going on if you're not paying attention, right? And for your system, that means measuring things. This gives you the info you need to make better decisions and to handle the inevitable incidents that will happen.

Reacting to Incidents

Handling Incident Response

00:09:44

Lesson Description:

As has been noted throughout this course, problems arise, and sometimes these problems are significant and can bring your service to a halt. When issues escalate, you and your team need to be ready to respond in a prescribed and detailed manner. In this lesson, we'll look at the right and the wrong way to react to an evolving incident, point out when to escalate an issue, and cover best practices for responding.

Managing Service Lifecycle

00:06:20

Lesson Description:

Services don't appear out of nowhere, ready to use. There's a clear pattern that starts with the initial idea and goes all the way to end of life. And, despite what you might think, Site Reliability Engineers aren't just involved when the service is live; they have a role to play throughout the entire span of the service's existence. In this lesson, we'll cover the potential engagement scenarios for SREs from the architecture and design phase all the way to deprecation, detailing their options and possibilities.

Ensuring Healthy Operations Collaboration

00:06:39

Lesson Description:

Operations is definitely a team sport. It's important that every member of the team be able to function to their fullest capacity and to report honestly on their actions at all times, but especially in response to an incident. In this lesson, we'll look closely at how postmortems, a key SRE pillar, are integrated into the ongoing collaborative efforts of the operations team.

Upcoming Lesson: Milestone: Incidents R Us

Lesson Description:

How can we avoid having any incidents? Haha! Trick question! We will have incidents, and that's ok. But we need to have a solid process for both handling them in the moment and for improving our own systems and processes based on the feedback they represent about how things can go wrong.

Next Steps

Upcoming Lesson: Milestone and Continuity

Lesson Description:

Is that it? Are we all done? Well, I guess it really depends on your perspective. Let's take a look at your options.
