
DevOps Monitoring Deep Dive



Elle Krout

Content Team Lead in Content

Elle is a Course Author at Linux Academy and Cloud Assessments with a focus on DevOps and Linux. She's a SaltStack Certified Engineer, and particularly enjoys working with configuration management. Prior to working as a Course Author, she was Linux Academy's technical writer for two years, producing and editing written content; before that, she worked in cloud hosting and infrastructure. Outside of tech, she likes cats, video games, and writing fiction.









Course Details

In the DevOps Monitoring Deep Dive, we use Prometheus, Alertmanager, and Grafana to demonstrate monitoring concepts that we can use on any monitoring stack. We start by building a foundation of general monitoring concepts, then get hands-on by working with common metrics across all levels of our platform. We'll explore infrastructure monitoring by using Prometheus's Node Exporter and viewing statistics about our CPU, memory, disk, file system, basic networking, and load metrics. We'll also take a look at how to monitor any containers we may be using on our virtual machine. Once our infrastructure monitoring is up and running, we'll take a look at a basic Node.js application and use a Prometheus client library to track metrics across our application. Finally, we look at how we can get the most out of our metrics by learning how to add recording and alerting rules, then building out a series of routes so any alerts we create can get to their desired endpoint. We'll also look at creating persistent dashboards with Grafana and use its various graphing options to better track our data.


Welcome to the Course!

About the Course


Lesson Description:

Welcome to the DevOps Monitoring Deep Dive! In this course, we'll be using a popular monitoring stack to learn the concepts behind setting up successful monitoring. From considering whether to use a pull or push solution, to understanding the various metric types, to thinking about scale, we'll be taking a look at monitoring at both the infrastructure and application level, as well as how we can best use the metrics we're collecting to gain insight into our system and make data-driven decisions.

About the Training Architect


Lesson Description:

Meet the training architect in this short video!

Environment Overview


Lesson Description:

Even though this course aims to teach practical concepts behind monitoring, we still need the tools to monitor things with! We'll be using a combination of Prometheus, Alertmanager, and Grafana — Prometheus being a pull-based monitoring and alerting solution, with Alertmanager collecting any alerts from Prometheus and pushing notifications, and Grafana compiling and collecting all our metrics to create visualizations.

Creating an Environment

Deploying the Demo Application


Lesson Description:

If we're going to have a monitoring course, we need something to monitor! Part of that is going to be our Ubuntu 18.04 host, but another equally important part is going to be a web application that already exists on the provided Playground server for this course. The application is a simple to-do list program called Forethought that uses the Express web framework to do most of the hard work for us. The application has also been Dockerized and saved as an image (also called `forethought`) and is ready for us to deploy.

## Steps in This Video

1. List the contents of the `forethought` directory and subdirectories:

        $ ls -d

2. Confirm the creation of the existing Docker image:

        $ docker image list

3. Deploy the web application to a container. Map port 8080 on the container to port 80 on the host:

        $ docker run --name ft-app -p 80:8080 -d forethought

4. Check that the application is working correctly by visiting the server's provided URL.

## Using a Custom Environment

### Vagrantfile

Use the following Vagrantfile to spin up an Ubuntu 18.04 server:

    # -*- mode: ruby -*-
    # vi: set ft=ruby :
    Vagrant.configure("2") do |config|
      config.vm.define "app" do |app|
        app.vm.box = "bento/ubuntu-18.04"
        app.vm.hostname = "app"
        app.vm.network "private_network", ip: ""
      end
    end

### Preparing the Environment

Whether or not you use Vagrant, follow these steps to set up an environment that mimics our Cloud Playground:

1. Install Docker and related packages:

        sudo apt-get install apt-transport-https ca-certificates curl gnupg2 software-properties-common
        curl -fsSL | sudo apt-key add
        sudo apt-key fingerprint 0EBFCD88
        sudo add-apt-repository "deb [arch=amd64] bionic stable"
        sudo apt-get install docker-ce

2. Enable sudo-less Docker:

        sudo usermod -aG docker vagrant

   Substitute `vagrant` with whatever user you intend on using. Refresh your Bash session before continuing.

3. Install Node.js and NPM:

        curl -sL -o
        sudo chmod +x
        sudo ./
        sudo apt-get install nodejs
        sudo apt-get install build-essential

4. Add the `forethought` application to the home directory (or whatever directory you wish to work from):

        sudo apt-get install git -y
        git clone forethought

5. Create an image, tagged `forethought` so it matches the image name used in the `docker run` step above:

        cd forethought
        docker build -t forethought .

You can now pick up from the videos!

Prometheus Setup


Lesson Description:

Now that we have _what_ we're monitoring set up, we need to get our monitoring tool itself up and running, complete with a service file. Prometheus is a pull-based monitoring system that scrapes various metrics set up across our system and stores them in a time-series database, where we can use a web UI and the PromQL language to view trends in our data. Prometheus provides its own web UI, but we'll also be pairing it with Grafana later, as well as an alerting system.

## Steps in This Video

1. Create a system user for Prometheus:

        sudo useradd --no-create-home --shell /bin/false prometheus

2. Create the directories in which we'll be storing our configuration files and libraries:

        sudo mkdir /etc/prometheus
        sudo mkdir /var/lib/prometheus

3. Set the ownership of the `/var/lib/prometheus` directory:

        sudo chown prometheus:prometheus /var/lib/prometheus

4. Pull down the `tar.gz` file from the Prometheus downloads page:

        cd /tmp/
        wget

5. Extract the files:

        tar -xvf prometheus-2.7.1.linux-amd64.tar.gz

6. Move the console files and the configuration file, and set the owner to the `prometheus` user:

        cd prometheus-2.7.1.linux-amd64
        sudo mv console* /etc/prometheus
        sudo mv prometheus.yml /etc/prometheus
        sudo chown -R prometheus:prometheus /etc/prometheus

7. Move the binaries and set the owner:

        sudo mv prometheus /usr/local/bin/
        sudo mv promtool /usr/local/bin/
        sudo chown prometheus:prometheus /usr/local/bin/prometheus
        sudo chown prometheus:prometheus /usr/local/bin/promtool

8. Create the service file:

        sudo vim /etc/systemd/system/prometheus.service

   Add:

        [Unit]
        Description=Prometheus

        [Service]
        User=prometheus
        Group=prometheus
        Type=simple
        ExecStart=/usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus/ --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries

        [Install]
        WantedBy=multi-user.target

   Save and exit.

9. Reload systemd:

        sudo systemctl daemon-reload

10. Start Prometheus, and make sure it automatically starts on boot:

        sudo systemctl start prometheus
        sudo systemctl enable prometheus

11. Visit Prometheus in your web browser at `PUBLICIP:9090`.

Alertmanager Setup


Lesson Description:

Monitoring is never just monitoring. Ideally, we'll be recording all these metrics and looking for trends so we can better react when things go wrong and make smart decisions. And once we have an idea of what we need to look for when things go wrong, we need to make sure we know about it. This is where alerting applications like Prometheus's standalone Alertmanager come in.

## Steps in This Video

1. Create the `alertmanager` system user:

        sudo useradd --no-create-home --shell /bin/false alertmanager

2. Create the `/etc/alertmanager` directory:

        sudo mkdir /etc/alertmanager

3. Download Alertmanager from the Prometheus downloads page:

        cd /tmp/
        wget

4. Extract the files:

        tar -xvf alertmanager-0.16.1.linux-amd64.tar.gz

5. Move the binaries:

        cd alertmanager-0.16.1.linux-amd64
        sudo mv alertmanager /usr/local/bin/
        sudo mv amtool /usr/local/bin/

6. Set the ownership of the binaries:

        sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
        sudo chown alertmanager:alertmanager /usr/local/bin/amtool

7. Move the configuration file into the `/etc/alertmanager` directory:

        sudo mv alertmanager.yml /etc/alertmanager/

8. Set the ownership of the `/etc/alertmanager` directory:

        sudo chown -R alertmanager:alertmanager /etc/alertmanager/

9. Create the `alertmanager.service` file for systemd:

        sudo $EDITOR /etc/systemd/system/alertmanager.service

   Add:

        [Unit]
        Description=Alertmanager

        [Service]
        User=alertmanager
        Group=alertmanager
        Type=simple
        WorkingDirectory=/etc/alertmanager/
        ExecStart=/usr/local/bin/alertmanager --config.file=/etc/alertmanager/alertmanager.yml

        [Install]
        WantedBy=multi-user.target

   Save and exit.

10. Stop Prometheus, and then update the Prometheus configuration file to use Alertmanager:

        sudo systemctl stop prometheus
        sudo $EDITOR /etc/prometheus/prometheus.yml

    Set the `alerting` section to:

        alerting:
          alertmanagers:
          - static_configs:
            - targets:
              - localhost:9093

11. Reload systemd, and then start the `prometheus` and `alertmanager` services:

        sudo systemctl daemon-reload
        sudo systemctl start prometheus
        sudo systemctl start alertmanager

12. Make sure `alertmanager` starts on boot:

        sudo systemctl enable alertmanager

13. Visit `PUBLICIP:9093` in your browser to confirm Alertmanager is working.

Grafana Setup


Lesson Description:

While Prometheus provides us with a web UI to view our metrics and craft charts, the web UI alone is often not the best solution for visualizing our data. Grafana is a robust visualization platform that will allow us to better see trends in our metrics and give us insight into what's going on with our applications and servers. It also lets us use multiple data sources, not just Prometheus, which gives us a full view of what's happening.

## Steps in This Video

1. Install the prerequisite package:

        sudo apt-get install libfontconfig

2. Download and install Grafana using the `.deb` package provided on the Grafana download page:

        sudo dpkg -i grafana_5.4.3_amd64.deb

3. Start Grafana and ensure it starts at boot:

        sudo systemctl enable --now grafana-server

4. Access Grafana's web UI by going to `IPADDRESS:3000`.

5. Log in with the username `admin` and the password `admin`. Reset the password when prompted.

### Add a Data Source

1. Click **Add data source** on the homepage.
2. Select **Prometheus**.
3. Set the **URL** to `http://localhost:9090`.
4. Click **Save & Test**.

### Add a Dashboard

1. From the left menu, return to **Home**.
2. Click **New dashboard**. The dashboard is automatically created.
3. Click on the gear icon to the upper right.
4. Set the **Name** of the dashboard to `Forethought`.
5. Save the changes.

Monitoring Basics

Push or Pull


Lesson Description:

Within monitoring there is an age-old battle that puts the debate between Vim versus Emacs to shame: whether to use a push- or pull-based monitoring solution. And while Prometheus is a pull-based monitoring system, it's important to know your options before actually implementing your monitoring. After all, this is a course about gathering and using your monitoring data, not a course on Prometheus itself.

## Pull-Based Monitoring

When using a pull system to monitor our environments and applications, we're having the monitoring solution itself query our metrics endpoints, such as the one located at `:3000/metrics` on our Playground server. That endpoint serves our Grafana metrics specifically, but the format looks the same regardless of the endpoint.

Pull-based systems allow us to better check the status of our targets, let us run monitoring from virtually anywhere, and provide us with web endpoints we can check for our metrics. That said, they are not without their concerns: Since a pull-based system is doing the scraping, the metrics might not be as "live" as an event-based push system, and if you have a particularly complicated network setup, then it might be difficult to grant the monitoring solution access to all the endpoints it needs to connect with.

## Push-Based Monitoring

Push-based monitoring solutions offload a lot of the "work" from the monitoring platform to the endpoints themselves: The endpoints are the ones that push their metrics up to the monitoring application. Push systems are especially useful when you need event-based monitoring and can't wait 15 or so seconds for the data to be pulled in. They also allow for greater modularity, offloading most of the difficult work to the clients they serve.

That said, many push-based systems have greater setup requirements and overhead than pull-based ones, and much of the management happens on the clients rather than on the monitoring server alone.

## Which to Choose

Despite the debate, one system is not necessarily better than the other, and a lot of it will depend on your individual needs. Not sure which is best for you? I would suggest taking the time to set up a system of either type on a dev environment and note the pain points, because anything causing trouble on a test environment is going to cause bigger problems on production, and those issues will most likely dictate which system works best for you.
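
To make the difference in control flow concrete, here is a minimal in-memory sketch of both models. Everything in it is hypothetical (the `scrapeAll` function, the `PushReceiver` class, and the target names are invented for illustration, and no networking is involved); it only shows who initiates the transfer in each case.

```javascript
// Pull: the monitor owns the target list and asks each target for its metrics.
function scrapeAll(targets) {
  const results = {};
  for (const [name, readMetrics] of Object.entries(targets)) {
    try {
      results[name] = { up: 1, metrics: readMetrics() };
    } catch (err) {
      // A failed scrape is itself a signal that the target is unhealthy.
      results[name] = { up: 0, metrics: {} };
    }
  }
  return results;
}

// Push: each client decides when to send; the monitor only records what arrives.
class PushReceiver {
  constructor() {
    this.received = [];
  }
  push(source, event) {
    this.received.push({ source, ...event });
  }
}

// Hypothetical targets: one healthy, one unreachable.
const targets = {
  app: () => ({ requests_total: 42 }),
  db: () => {
    throw new Error("connection refused");
  },
};
const pulled = scrapeAll(targets);
console.log(pulled.app.up, pulled.db.up);

const receiver = new PushReceiver();
receiver.push("app", { name: "deploy_finished", at: 1700000000 });
console.log(receiver.received.length);
```

Note how the pull side learns that `db` is down simply because the scrape failed, while the push side only ever knows about events a client chose to send.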

Patterns and Anti-Patterns


Lesson Description:

Unfortunately for us, there are a lot of ways to do inefficient monitoring. From monitoring the wrong thing to spending too much time setting up the coolest new monitoring tool, monitoring can often become a relentless series of broken and screaming alerts for problems we're not sure how to fix. In this lesson, we'll address some of the most common monitoring issues and think about how to avoid them.

## Thinking It's About the Tools

While finding the right tool is important, having a small number of carefully curated monitoring tools that suit your needs will take you much farther than simply using a tool because you heard it was the best. Never try to force your needs to fit a tool's abilities.

## Falling into Cargo Cults

Just because Google does it doesn't mean we should! Just as we need to think about our needs when we select our tools, we also need to think about our needs when we set them up. Ask yourself _why_ you're monitoring something the way you are, and consider how that monitoring affects your alerting. Is the CPU alarm going off because of an unknown CPU problem, or should the "application spun up too many processes" alarm be going off instead?

## Not Embracing Automation

No one should be manually enrolling their services into Prometheus, or any monitoring solution! Automating the process of enrollment from the start will allow monitoring to happen more naturally and prevent tedious, easily forgotten tasks. We also want to take the time to look at our runbooks and see what problems can have automated solutions.

## Leaving One Person in Charge

Monitoring is something everyone should be at least a little mindful of, and it definitely shouldn't be the job of just one person. Instead, monitoring should be considered from the very start of a project, and any work needed to monitor a service should be planned.

Service Discovery


Lesson Description:

We've used a lot of terms interchangeably in this course up until now (_client_, _service_, _endpoint_, _target_), but all these things are just _something we are monitoring_. And the process of our monitoring system discovering what we're monitoring is called _service discovery_. While we'll be doing it manually throughout this course (since we only have a very minimal system), in practice, we'd want to consider automating the task by using some kind of service discovery tool.

## Tool Options

+ Consul
+ Zookeeper
+ Nerve
+ Any service discovery tool native to your existing platform:
    + AWS
    + Azure
    + GCP
    + Kubernetes
    + Marathon
    + ... and more!
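
As one illustration of what automating enrollment can look like, the sketch below turns a list of discovered instances into the target-group JSON shape that Prometheus's file-based service discovery reads (a list of objects with `targets` and `labels`). This is an assumption-laden example rather than part of the course setup: the `discovered` data, the IP addresses, and the `toTargetGroups` helper are all hypothetical, and a real tool from the list above would supply the instance list.

```javascript
// Hypothetical output of a service discovery tool: one entry per running instance.
const discovered = [
  { service: "node_exporter", host: "10.0.0.5", port: 9100 },
  { service: "node_exporter", host: "10.0.0.6", port: 9100 },
  { service: "cadvisor", host: "10.0.0.5", port: 8000 },
];

// Group instances by service into target groups: [{ targets: [...], labels: {...} }].
function toTargetGroups(instances) {
  const groups = new Map();
  for (const { service, host, port } of instances) {
    if (!groups.has(service)) {
      groups.set(service, { targets: [], labels: { job: service } });
    }
    groups.get(service).targets.push(`${host}:${port}`);
  }
  return [...groups.values()];
}

const targetGroups = toTargetGroups(discovered);
// This JSON could be written to a file that Prometheus watches for changes.
console.log(JSON.stringify(targetGroups, null, 2));
```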

Hands-on Labs are real live environments that put you in a real scenario to practice what you have learned without any other extra charge or account to manage.


Infrastructure Monitoring

Using the Node Exporter


Lesson Description:

Right now, our monitoring system only monitors itself, which, while beneficial, is not the most helpful when it comes to maintaining and monitoring all our systems as a whole. We instead have to add endpoints that will allow Prometheus to scrape data for our application, container, and infrastructure. In this lesson, we'll be starting with infrastructure monitoring by introducing Prometheus's _Node Exporter_. The Node Exporter exposes system data to Prometheus via a metrics page with minimal setup on our part, leaving us to focus on more practical tasks.

Much like Prometheus and Alertmanager, to add an exporter to our server, we need to do a little bit of leg work.

## Steps in This Video

1. Create a system user:

        $ sudo useradd --no-create-home --shell /bin/false node_exporter

2. Download the Node Exporter from Prometheus's download page:

        $ cd /tmp/
        $ wget

3. Extract its contents; note that the versioning of the Node Exporter may be different:

        $ tar -xvf node_exporter-0.17.0.linux-amd64.tar.gz

4. Move into the newly created directory:

        $ cd node_exporter-0.17.0.linux-amd64/

5. Move the provided binary:

        $ sudo mv node_exporter /usr/local/bin/

6. Set the ownership:

        $ sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

7. Create a systemd service file:

        $ sudo vim /etc/systemd/system/node_exporter.service

   Add:

        [Unit]
        Description=Node Exporter

        [Service]
        User=node_exporter
        Group=node_exporter
        Type=simple
        ExecStart=/usr/local/bin/node_exporter

        [Install]
        WantedBy=multi-user.target

   Save and exit when done.

8. Start the Node Exporter:

        $ sudo systemctl daemon-reload
        $ sudo systemctl start node_exporter

9. Add the endpoint to the `scrape_configs` section of the Prometheus configuration file:

        $ sudo $EDITOR /etc/prometheus/prometheus.yml

        - job_name: 'node_exporter'
          static_configs:
          - targets: ['localhost:9100']

10. Restart Prometheus:

        $ sudo systemctl restart prometheus

11. Navigate to the Prometheus web UI. Using the expression editor, search for `cpu`, `meminfo`, and related system terms to view the newly added metrics.

12. Search for `node_memory_MemFree_bytes` in the expression editor; shorten the time span for the graph to be about 30 minutes of data.

13. Back on the terminal, download and run `stress` to cause some memory spikes:

        $ sudo apt-get install stress
        $ stress -m 2

14. Wait for about one minute, and then view the graph to see the difference in activity.

## References

- Node Exporter Metrics

CPU Metrics


Lesson Description:

_Run `stress -c 5` on your server before starting this lesson._

With the Node Exporter up and running, we now have access to a number of infrastructure metrics on Prometheus, including data about our CPU. The processing power of our server determines how well basically everything on our server runs, so keeping track of its cycles can be invaluable for diagnosing problems and reviewing trends in how our applications and services are running.

For almost all monitoring solutions, including Prometheus, data for this metric is pulled from the `/proc/stat` file on the host itself, and in Prometheus these metrics are provided to us in expressions that start with `node_cpu`. Assuming we're not running any guests on our host, the core expression we want to review is the `node_cpu_seconds_total` metric.

`node_cpu_seconds_total` works as a counter: it keeps track of how long the CPU spends in each mode, in seconds, and adds it to a persistent count. Counters might not seem especially helpful on their own, but combined with the power of math, we can actually get a lot of information out of them.

Most of the time, what would be helpful here is viewing the percentages and averages that our CPU spends in either the idle mode or any working modes. In Prometheus, we can do this with the `rate` and `irate` functions, which calculate the per-second change of a time series over a given range: `rate` averages over the whole range, while `irate` uses only the last two data points in it. `irate` is specifically for fast-moving counters (like our CPU); both should be used with counter-based metrics only.

We can see what amount of time our server spends in each mode by running `irate(node_cpu_seconds_total[30s]) * 100` in the expression editor with a suggested graph range of `30m`, assuming you're using a cloud playground server.

Additionally, we can check for things like the percentage of time the CPU is performing userland processes:

    irate(node_cpu_seconds_total{mode="user"}[1m]) * 100

Or we can determine averages across our entire fleet with the `avg` aggregation operator:

    avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

Other metrics to consider include the `node_cpu_guest_seconds_total` metric, which works similarly to `node_cpu_seconds_total` but is especially useful for any machine running guest virtual machines.

> Remember to kill the `stress` process you started at the beginning of this lesson!
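
To see why a counter of CPU seconds turns into a percentage, it can help to do the arithmetic by hand. The sketch below uses simplified stand-ins for `rate` and `irate` over a single series; the sample values are made up, and unlike Prometheus itself this version does not handle counter resets or extrapolate to the edges of the range.

```javascript
// Each sample is { t: timestamp in seconds, v: counter value in CPU seconds }.
// rate: average per-second increase between the first and last sample in the range.
function rate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.v - first.v) / (last.t - first.t);
}

// irate: per-second increase using only the two most recent samples.
function irate(samples) {
  return rate(samples.slice(-2));
}

// Hypothetical idle-mode samples for one CPU, scraped every 15 seconds.
// The CPU becomes busy in the final interval, so idle time barely grows.
const idleSeconds = [
  { t: 0, v: 1000.0 },
  { t: 15, v: 1013.5 },
  { t: 30, v: 1027.0 },
  { t: 45, v: 1030.0 },
];

const avgIdlePercent = rate(idleSeconds) * 100;
const latestIdlePercent = irate(idleSeconds) * 100;
console.log(avgIdlePercent.toFixed(1), latestIdlePercent.toFixed(1));
```

Because one CPU can accumulate at most one second of mode time per second, the per-second rate is a fraction between 0 and 1, and multiplying by 100 gives the percentage. The averaged value smooths over the sudden drop in idle time, while the instant value reacts to it right away.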

Memory Metrics


Lesson Description:

_Run `stress -m 1` on your server before starting this lesson._

When it comes to looking at our memory metrics, there are a few core metrics we want to consider. Memory metrics for Prometheus and other monitoring systems are retrieved from the `/proc/meminfo` file; in Prometheus in particular, these metrics are prefixed with `node_memory` in the expression editor, and quite a number of them exist. However, of the vast array of memory information we have access to, there are only a few core ones we will have to concern ourselves with much of the time:

- `node_memory_MemTotal_bytes`
- `node_memory_MemFree_bytes`
- `node_memory_MemAvailable_bytes`
- `node_memory_Buffers_bytes`
- `node_memory_Cached_bytes`

Those who do a bit of systems administration, incident response, and the like have probably used `free` before to check the memory of a system. The metric expressions listed above provide us with what is essentially the same data as `free`, but in a time series where we can witness trends over time or compare memory between multiple system builds.

`node_memory_MemTotal_bytes` provides us with the amount of memory on the server as a whole. In other words, if we have 64 GB of memory, then this would always be 64 GB of memory, until we allocate more. While on its own this is not the most helpful number, it helps us calculate the amount of in-use memory:

    node_memory_MemTotal_bytes - node_memory_MemFree_bytes

Here, `node_memory_MemFree_bytes` denotes the amount of free memory left on the system, not including caches and buffers that can be cleared. To see the amount of _available_ memory, including caches and buffers that can be opened up, we would use `node_memory_MemAvailable_bytes`. And if we wanted to see the cache and buffer data itself, we would use `node_memory_Cached_bytes` and `node_memory_Buffers_bytes`, respectively.
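
The same calculation can be worked through by hand against the source file. The sketch below parses a made-up excerpt in `/proc/meminfo` format and computes the in-use figure from the expression above, plus the amount that could not be freed up. The sample numbers and the `parseMeminfo` helper are hypothetical, and the conversion from kB to bytes (multiplying by 1024) is an assumption about how the `_bytes` metrics are derived.

```javascript
// Hypothetical excerpt in /proc/meminfo format (values in kB).
const meminfo = `MemTotal:        8167848 kB
MemFree:         1409696 kB
MemAvailable:    5242880 kB
Buffers:          204800 kB
Cached:          3145728 kB`;

// Parse "Name:   value kB" lines into an object of byte values.
function parseMeminfo(text) {
  const bytes = {};
  for (const line of text.split("\n")) {
    const match = line.match(/^(\w+):\s+(\d+)\s+kB$/);
    if (match) {
      bytes[match[1]] = Number(match[2]) * 1024;
    }
  }
  return bytes;
}

const mem = parseMeminfo(meminfo);

// Equivalent of: node_memory_MemTotal_bytes - node_memory_MemFree_bytes
const inUseBytes = mem.MemTotal - mem.MemFree;

// Memory that is still unavailable once clearable caches and buffers are counted as usable.
const unavailableBytes = mem.MemTotal - mem.MemAvailable;

console.log(inUseBytes, unavailableBytes);
```

The gap between the two results is the memory held by caches and buffers that the kernel can release, which is why `MemFree` alone tends to make a healthy system look short on memory.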

Disk Metrics


Lesson Description:

_Run `stress -i 40` on your server before starting this lesson._

Disk metrics are specifically related to the performance of reads and writes to our disks, and are most commonly pulled from `/proc/diskstats`. Prefixed with `node_disk`, these metrics track both the amount of data being processed during I/O operations and the amount of time these operations take, among some other features.

The Node Exporter filters out any loop devices automatically, so when we view our metric data in the expression editor, we get only the information we need without a lot of noise. By contrast, if we run `iostat -x` on our terminal, we'll receive detailed information about our `xvda` device on top of five `loop` devices.

Now, we can collect information similar to `iostat -x` itself across a time series via our expression editor. This includes using `irate` to view the share of time the disk spends performing I/O across our host:

    irate(node_disk_io_time_seconds_total[30s])

Additionally, we can use the `node_disk_io_time_seconds_total` metric alongside our `node_disk_read_time_seconds_total` and `node_disk_write_time_seconds_total` metrics to calculate the percentage of time spent on each kind of I/O operation:

    irate(node_disk_read_time_seconds_total[30s]) / irate(node_disk_io_time_seconds_total[30s])

    irate(node_disk_write_time_seconds_total[30s]) / irate(node_disk_io_time_seconds_total[30s])

We're also provided with a gauge-based metric that lets us see how many I/O operations are occurring at a point in time:

    node_disk_io_now

Other metrics include:

- `node_disk_read_bytes_total` and `node_disk_written_bytes_total`, which track the number of bytes read or written, respectively
- `node_disk_reads_completed_total` and `node_disk_writes_completed_total`, which track the _number_ of reads and writes
- `node_disk_reads_merged_total` and `node_disk_writes_merged_total`, which track read and write merges

File System Metrics


Lesson Description:

File system metrics contain information about our _mounted_ file systems. These metrics are taken from a few different sources, but all use the `node_filesystem` prefix when we view them in Prometheus.

Although most of the seven metrics we're provided here are fairly straightforward, there are some caveats we want to address, the first being the difference between `node_filesystem_avail_bytes` and `node_filesystem_free_bytes`. While for some systems these two metrics may be the same, in many Unix systems a portion of the disk is reserved for the _root_ user. In this case, `node_filesystem_free_bytes` contains the amount of free space, including the space reserved for root, while `node_filesystem_avail_bytes` contains only the space available to non-root users.

Let's go ahead and look at the `node_filesystem_avail_bytes` metric in our expression editor. Notice how we have a number of file systems mounted that we can view: our main `xvda` disk, the LXC file system for our container, and various temporary file systems. If we want to limit which file systems we view on the graph, we can uncheck the systems we're not interested in.

The file system collector also supplies us with more _labels_ than we've previously seen. Labels are the key-value pairs we see in the curly brackets next to the metric. We can use these to further manipulate our data, as we saw in previous lessons. So, if we wanted to view only our temporary file systems, we can use:

    node_filesystem_avail_bytes{fstype="tmpfs"}

Of course, these features can be used across all metrics and are not just limited to the file system. Other metrics may also have their own specific labels, much like the `fstype` and `mountpoint` labels here.
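
A label matcher such as `{fstype="tmpfs"}` is essentially a filter over the set of series that share a metric name. The sketch below mimics that with plain objects; the series values, device names, and the `selectByLabels` helper are hypothetical and only cover exact-match selectors.

```javascript
// Hypothetical series for node_filesystem_avail_bytes, one per mounted file system.
const filesystemSeries = [
  { labels: { device: "/dev/xvda1", fstype: "ext4", mountpoint: "/" }, value: 5.2e9 },
  { labels: { device: "tmpfs", fstype: "tmpfs", mountpoint: "/run" }, value: 4.1e8 },
  { labels: { device: "tmpfs", fstype: "tmpfs", mountpoint: "/run/lock" }, value: 5.2e6 },
];

// Keep only the series whose labels equal every requested key-value pair.
function selectByLabels(series, matchers) {
  return series.filter((s) =>
    Object.entries(matchers).every(([key, value]) => s.labels[key] === value)
  );
}

// Equivalent in spirit to: node_filesystem_avail_bytes{fstype="tmpfs"}
const tmpfsOnly = selectByLabels(filesystemSeries, { fstype: "tmpfs" });
console.log(tmpfsOnly.map((s) => s.labels.mountpoint));

// Matchers combine, as in {fstype="tmpfs", mountpoint="/run"}.
const runOnly = selectByLabels(filesystemSeries, { fstype: "tmpfs", mountpoint: "/run" });
console.log(runOnly.length);
```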

Networking Metrics


Lesson Description:

When we discuss network monitoring through the Node Exporter, we're talking about viewing networking data from a systems administration or engineering viewpoint: The Node Exporter provides us with networking device information pulled both from `/proc/net/dev` and `/sys/class/net/INTERFACE`, with `INTERFACE` being the name of the interface itself, such as `eth0`. All network metrics are prefixed with the `node_network` name.

Should we take a look at `node_network` in the expression editor, we can see quite a number of options. Many of these are information gauges whose data is pulled from that `/sys/class/net/INTERFACE` directory. So, when we look at `node_network_dormant`, we're seeing point-in-time data from the `/sys/class/net/INTERFACE/dormant` file.

But with regard to metrics that the average user will need for day-to-day monitoring, we really want to look at the metrics prefixed with either `node_network_transmit` or `node_network_receive`, as these contain information about the amount of data and packets that pass through our network interfaces, both outbound (transmit) and inbound (receive). Specifically, we want to look at the `node_network_receive_bytes_total` and `node_network_transmit_bytes_total` metrics, because these are what will help us calculate our network bandwidth:

    rate(node_network_transmit_bytes_total[30s])
    rate(node_network_receive_bytes_total[30s])

The above expressions will show us the 30-second average of bytes either transmitted or received per second across our time series, allowing us to see when our network bandwidth has spiked or dropped.
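
The bandwidth figure is just the difference between two readings of a byte counter divided by the time between them. As a rough sketch, the example below parses two made-up snapshots in `/proc/net/dev` format taken 30 seconds apart and computes bytes per second in each direction. The snapshot contents and the `parseNetDev` helper are hypothetical.

```javascript
// Parse /proc/net/dev-style text into per-interface byte counters.
function parseNetDev(text) {
  const counters = {};
  // The first two lines are column headers.
  for (const line of text.trim().split("\n").slice(2)) {
    const [iface, rest] = line.split(":");
    const fields = rest.trim().split(/\s+/).map(Number);
    // Field 0 is bytes received; field 8 is bytes transmitted.
    counters[iface.trim()] = { receiveBytes: fields[0], transmitBytes: fields[8] };
  }
  return counters;
}

const netDevHeader =
  "Inter-|   Receive                                                |  Transmit\n" +
  " face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed\n";

// Hypothetical snapshots taken 30 seconds apart.
const snapshotA = netDevHeader + "  eth0: 1000000 900 0 0 0 0 0 0 400000 800 0 0 0 0 0 0\n";
const snapshotB = netDevHeader + "  eth0: 1600000 1400 0 0 0 0 0 0 700000 1200 0 0 0 0 0 0\n";

const before = parseNetDev(snapshotA).eth0;
const after = parseNetDev(snapshotB).eth0;
const intervalSeconds = 30;

const receiveBytesPerSecond = (after.receiveBytes - before.receiveBytes) / intervalSeconds;
const transmitBytesPerSecond = (after.transmitBytes - before.transmitBytes) / intervalSeconds;
console.log(receiveBytesPerSecond, transmitBytesPerSecond); // 20000 10000
```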

Load Metrics


Lesson Description:

When we talk about load, we're referencing the number of processes waiting to be served by the CPU. You've probably seen these metrics before: They're sitting at the top of any `top` command run, and are available for us to view in the `/proc/loadavg` file. Averaged over the past 1, 5, and 15 minutes, the load average gives us a snapshot of how hard our system is working. We can view these statistics in Prometheus at `node_load1`, `node_load5`, and `node_load15`.

That said, load metrics are mostly useless from a monitoring standpoint. What is a heavy load to one server can be an easy load for another, and beyond looking at any trends in load in the time series, there is nothing we can alert on here nor any real data we can extract through queries or any kind of math.

Using cAdvisor to Monitor Containers


Lesson Description:

Although we have our host monitored for various common metrics at this time, the Node Exporter doesn't cross the threshold into monitoring our containers. Instead, if we want to monitor anything we have in Docker, including our application, we need to add a container monitoring solution.

Lucky for us, Google's cAdvisor is an open-source solution that works out of the box with most container platforms, including Docker. And once we have cAdvisor installed, we can see many of the same metrics we see for our host on our container, only these are provided to us through the prefix `container`.

cAdvisor also monitors _all_ our containers automatically. That means when we view a metric, we're seeing it for everything that cAdvisor monitors. Should we want to target specific containers, we can do so by using the `name` label, which pulls the container name from the name it uses in Docker itself.

## Steps in This Video

1. Launch cAdvisor:

        $ sudo docker run \
            --volume=/:/rootfs:ro \
            --volume=/var/run:/var/run:ro \
            --volume=/sys:/sys:ro \
            --volume=/var/lib/docker/:/var/lib/docker:ro \
            --volume=/dev/disk/:/dev/disk:ro \
            --publish=8000:8080 \
            --detach=true \
            --name=cadvisor \
            google/cadvisor:latest

2. List available containers to confirm it's working:

        $ docker ps

3. Update the Prometheus config:

        $ sudo $EDITOR /etc/prometheus/prometheus.yml

        - job_name: 'cadvisor'
          static_configs:
          - targets: ['localhost:8000']

4. Restart Prometheus:

        $ sudo systemctl restart prometheus



Application Monitoring

Using a Client Library


Lesson Description:

To add metrics to a custom application that has no available exporters or tools, we first need to add a Prometheus client library to our application. Prometheus itself supports four client libraries (Go, Java, Python, and Ruby), but third-party client libraries are provided for a number of other languages, including Node.js, which is what our own application uses.

To get started with instrumenting metrics on our app, we're going to need to add this library. But the Node.js `prom-client` is not restricted to just allowing us to write new metrics; it also includes some default application metrics we can enable, mostly centered around the application's use of memory.

Of course, adding this library isn't enough in and of itself: We also need to make sure we have a `/metrics` endpoint generated that Prometheus can scrape, which we'll be creating using the Express framework our application already utilizes.

## Steps in This Video

1. Move into the `forethought` directory:

        cd forethought

2. Install the `prom-client` via `npm`, Node.js's package manager:

        npm install prom-client --save

3. Open the `index.js` file, where we'll be adding all of our metrics code:

        $EDITOR index.js

4. Require the use of the `prom-client` by adding it to our variable list:

        var express = require('express');
        var bodyParser = require('body-parser');
        var app = express();
        const prom = require('prom-client');

   Here, `prom` is the name we'll use when calling the client library.

5. Enable default metrics scraping:

        const collectDefaultMetrics = prom.collectDefaultMetrics;
        collectDefaultMetrics({ prefix: 'forethought' });

6. Use Express to create the `/metrics` endpoint and call in the Prometheus data:

        app.get('/metrics', function (req, res) {
          res.set('Content-Type', prom.register.contentType);
          res.end(prom.register.metrics());
        });
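
To get a feel for what a `/metrics` endpoint actually returns, the sketch below renders a metric in the plain-text format Prometheus scrapes and serves it from an Express-style handler. It deliberately does not use `prom-client`: the `renderMetrics` function, the `exampleRegistry` contents, the metric name, and the stub response object are all hypothetical stand-ins so the example can run on its own.

```javascript
// Render metrics as "# HELP", "# TYPE", and sample lines, one block per metric.
function renderMetrics(metrics) {
  return metrics
    .map(
      (m) =>
        `# HELP ${m.name} ${m.help}\n` +
        `# TYPE ${m.name} ${m.type}\n` +
        `${m.name} ${m.value}\n`
    )
    .join("");
}

// Hypothetical registry contents; a client library would maintain this for us.
const exampleRegistry = [
  {
    name: "forethought_example_requests_total",
    help: "Example counter of handled requests",
    type: "counter",
    value: 3,
  },
];

// Handler with the (req, res) shape Express passes to route callbacks.
function metricsHandler(req, res) {
  res.set("Content-Type", "text/plain; version=0.0.4");
  res.end(renderMetrics(exampleRegistry));
}

// Stub response object standing in for Express's `res`, for demonstration only.
const fakeRes = {
  headers: {},
  body: "",
  set(key, value) {
    this.headers[key] = value;
  },
  end(body) {
    this.body = body;
  },
};

metricsHandler({}, fakeRes);
console.log(fakeRes.body);
```

In the real application, the client library's registry takes the place of `exampleRegistry` and `renderMetrics`, and the handler is attached to the `/metrics` route.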



Lesson Description:

Counters are the most common metric type we've come across thus far. We already know counters measure how many times something happens or how long something happens: anything where we have to record an increasing amount of something. This can be useful for tracking requests, page hits, or how many times someone has used one particular API function, whatever we need to track.

In the case of our application, we're going to use the Prometheus client library to create a counter that will keep track of how many tasks are added to our to-do list over the course of its life. To do this, we first need to define our metric and give it a name, then we need to call that metric in the part of our code we want to keep track of, such as the `/addtask` post function that adds our task.

## Steps in This Video

1. Open up the `index.js` file:

        cd forethought
        $EDITOR index.js

2. Define a new metric called `forethought_number_of_todos_total` that works as a counter:

        // Prometheus metric definitions
        const todocounter = new prom.Counter({
          name: 'forethought_number_of_todos_total',
          help: 'The number of items added to the to-do list, total'
        });

3. Call the new metric in the `addtask` post function so it increases by one every time the function is called while adding a task:

        // add a task
        app.post("/addtask", function(req, res) {
          var newTask = req.body.newtask;
          task.push(newTask);
          res.redirect("/");
          todocounter.inc();
        });

   Save and exit.

4. Test the application:

        node index.js

5. While the application is running, visit `MYLABSERVER:8080` and add a few tasks to the to-do list.

6. Visit `MYLABSERVER:8080/metrics` to view your newly created metric!
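The defining property of a counter is that it only ever increases. The sketch below illustrates that idea with a small stand-in class; `SketchCounter` and the task names are hypothetical and exist only for this example, while the real `prom.Counter` from `prom-client` handles registration and exposition for us.

```javascript
// Illustrative stand-in for a counter metric. This is not prom-client's
// implementation; it only models the "can only go up" behaviour.
class SketchCounter {
  constructor(name) {
    this.name = name;
    this.value = 0;
  }

  // Counters only ever go up, so negative increments are rejected.
  inc(amount = 1) {
    if (amount < 0) {
      throw new Error('A counter cannot be decreased');
    }
    this.value += amount;
  }
}

const todoCounterSketch = new SketchCounter('forethought_number_of_todos_total');

// Simulate three tasks being added to the to-do list.
['buy milk', 'write report', 'feed cat'].forEach(() => todoCounterSketch.inc());

console.log(todoCounterSketch.value); // 3
```

Because the value never goes down, a counter answers "how many tasks were ever added?" but not "how many tasks are open right now?".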



Lesson Description:

We know gauges track the state of our metric across the time series, whether that be an amount of memory being used, how long a request or response takes, or how often an event is happening. Within our own application, we want to use a gauge so that we not only track how _many_ tasks are being added (as we did in the previous lesson), but also how many tasks are being completed, giving us a look at how many active tasks are left and the velocity at which our users are completing their to-dos.

## Steps in This Video

1. Define the new gauge metric for tracking tasks added and completed:

        const todogauge = new prom.Gauge ({
          name: 'forethought_current_todos',
          help: 'Amount of incomplete tasks'
        });

2. Add a gauge `.inc()` to the `/addtask` method:

        // add a task
        app.post("/addtask", function(req, res) {
          var newTask = req.body.newtask;
          task.push(newTask);
          res.redirect("/");
          todocounter.inc();
          todogauge.inc();
        });

3. Add a gauge `.dec()` to the `/removetask` method:

        // remove a task
        app.post("/removetask", function(req, res) {
          var completeTask = req.body.check;
          if (typeof completeTask === "string") {
            // a single task was checked off
            complete.push(completeTask);
            task.splice(task.indexOf(completeTask), 1);
            todogauge.dec();
          } else if (typeof completeTask === "object") {
            // several tasks were checked off at once
            for (var i = 0; i < completeTask.length; i++) {
              complete.push(completeTask[i]);
              task.splice(task.indexOf(completeTask[i]), 1);
              todogauge.dec();
            }
          }
          res.redirect("/");
        });

   Save and exit the file.

4. Test the application:

        node index.js

5. While the application is running, visit `MYLABSERVER:8080` and add a few tasks to the to-do list.

6. Visit `MYLABSERVER:8080/metrics` to view your newly created metric!
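To see why the application needs both metric types, the following sketch tracks the same to-do activity with a running total and a gauge. `SketchGauge`, `addTask`, and `completeTask` are hypothetical names used only for illustration; they are not the application's route handlers or `prom-client` classes.

```javascript
// Illustrative stand-in for a gauge; unlike a counter, it can go down.
class SketchGauge {
  constructor(name) {
    this.name = name;
    this.value = 0;
  }
  inc(amount = 1) { this.value += amount; }
  dec(amount = 1) { this.value -= amount; }
}

const currentTodos = new SketchGauge('forethought_current_todos');
let todosAddedTotal = 0; // plays the role of the counter from the previous lesson

function addTask() {
  todosAddedTotal += 1; // the counter side: never decreases
  currentTodos.inc();   // the gauge side: one more open task
}

function completeTask() {
  currentTodos.dec();   // only the gauge reflects completed work
}

addTask();
addTask();
addTask();
completeTask();

console.log(todosAddedTotal);    // 3 tasks were ever added
console.log(currentTodos.value); // 2 tasks are still open
```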

Summaries and Histograms


Lesson Description:

Both summaries and histograms are where our more advanced metric types come in. Both work similarly, working across multiple time series to keep track of sample observations, as well as the sum of the values of these observations. Each summary and histogram contains not only the metrics we want to record, such as request time, but also the number of requests (`_count`) and the sum of the total observations (`_sum`).

## Steps in This Video

1. Move into the `forethought` directory:

        cd forethought

2. Install the Node.js module `response-time`:

        npm install response-time --save

3. Open the `index.js` file:

        $EDITOR index.js

4. Define both the summary and histogram metrics:

        const tasksumm = new prom.Summary ({
          name: 'forethought_requests_summ',
          help: 'Latency in percentiles',
        });
        const taskhisto = new prom.Histogram ({
          name: 'forethought_requests_hist',
          help: 'Latency in history form',
        });

5. Call the `response-time` module with the other variables:

        var responseTime = require('response-time');

6. Around where we define our website code, add the `response-time` function, calling the `time` parameter within our `.observe` metrics:

        app.use(responseTime(function (req, res, time) {
          tasksumm.observe(time);
          taskhisto.observe(time);
        }));

7. Save and exit the file.

8. Run the demo application:

        node index.js

9. View the demo application on port 8080, and add tasks to generate metrics.

10. View the `/metrics` endpoint. Notice how our response times are automatically sorted into percentiles for our summary. Also notice how we're not using all of our buckets in the histogram.

11. Return to the command line and use **CTRL+C** to close the demo application.

12. Reopen the `index.js` file:

        $EDITOR index.js

13. Add the `buckets` parameter to the histogram definition. We're going to adjust our buckets based on the response times collected:

        const taskhisto = new prom.Histogram ({
          name: 'forethought_requests_hist',
          help: 'Latency in history form',
          buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        });

14. Save and exit. Run `node index.js` again to test.
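To make the `_bucket`, `_sum`, and `_count` series more concrete, here is a simplified sketch of how a histogram sorts observations into cumulative buckets. The `summarizeObservations` function and the observed values are made up for illustration, and this is not how `prom-client` is implemented internally; the bucket bounds are the ones from the histogram definition above.

```javascript
// Builds the cumulative bucket counts a histogram exposes, plus the
// values behind _sum and _count. Simplified sketch for illustration only.
function summarizeObservations(observations, upperBounds) {
  // Every histogram has an implicit final bucket that catches everything.
  const buckets = [...upperBounds, Infinity].map((bound) => ({
    le: bound === Infinity ? '+Inf' : String(bound),
    // A bucket counts every observation less than or equal to its bound,
    // so counts never decrease from one bucket to the next.
    count: observations.filter((value) => value <= bound).length,
  }));

  return {
    buckets,
    sum: observations.reduce((total, value) => total + value, 0),
    count: observations.length,
  };
}

// Made-up response times, in the same unit as the bucket bounds.
const observed = [0.25, 0.5, 2, 12];
const histogramSketch = summarizeObservations(
  observed,
  [0.1, 0.25, 0.5, 1, 2.5, 5, 10]
);

console.log(histogramSketch.buckets);
console.log(histogramSketch.sum, histogramSketch.count);
```

The `response-time` module reports durations in milliseconds, so it is worth checking that the chosen bucket bounds are expressed in the same unit as the values being observed.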

Redeploying the Application


Lesson Description:

In this video, we're updating our Docker image to use our newly instrumented version of our application. No Docker knowledge needed; just follow these steps!

1. Stop the current Docker container for our application:

        docker stop ft-app

2. Remove the container:

        docker rm ft-app

3. Remove the image:

        docker image rm forethought

4. Rebuild the image:

        docker build -t forethought .

5. Deploy the new container:

        docker run --name ft-app -p 80:8080 -d forethought

6. Add the application as an endpoint to Prometheus:

        sudo $EDITOR /etc/prometheus/prometheus.yml

        - job_name: 'forethought'
          static_configs:
            - targets: ['localhost:80']

   Save and exit.

7. Restart Prometheus:

        sudo systemctl restart prometheus



Managing Alerts

Recording Rules


Lesson Description:

With our monitoring set up across multiple levels of our stack, we can now start doing more with the metrics we're recording. Namely, we can begin to use these metrics to record "rules" in Prometheus; this essentially lets us pre-run a common PromQL expression and record the results. We can then alert on these results as needed.

But what metrics do we even _want_ to alert on? We've already discussed that in an ideal world, we'll be alerting on the issue itself, not the symptom of the issue. But chances are we won't have the insight to know what the issues are until we have our monitoring up and running for a bit and learn the nuances of our system. Instead, in instances where we aren't sure what to alert on, we want to alert based on what the end user experiences.

## Steps in This Video

1. Using the expression editor, view the uptime of all targets:

        up

2. Since we don't want to alert on each individual job and instance we have, let's take the average of our uptime instead:

        avg (up)

3. We do not want an average of _everything_, however. Next, use the `without` clause to ensure we're not merging our targets by instance:

        avg without (instance) (up)

4. Further refine the expression so we only see the uptime for our `forethought` jobs:

        avg without (instance) (up{job="forethought"})

5. Now that we have our expression written, we can look into how to add this as a rule. Switch to your terminal.

6. Open the Prometheus configuration file:

        $ sudo $EDITOR /etc/prometheus/prometheus.yml

7. Locate the `rule_files` parameter. Add a rule file at `rules.yml`:

        rule_files:
          - "rules.yml"

   Save and exit the file.

8. Create the `rules.yml` file in `/etc/prometheus`:

        $ sudo $EDITOR /etc/prometheus/rules.yml

9. Every rule needs to be contained in a group. Define a group called `uptime`, which will track the uptime of anything that affects the end user:

        groups:
          - name: uptime

10. We're first going to define a _recording_ rule, which will keep track of the results of a PromQL expression, without performing any kind of alerting:

        groups:
          - name: uptime
            rules:
              - record: job:uptime:average:ft
                expr: avg without (instance) (up{job="forethought"})

    Notice the format of `record`: this is the setup we need to use to define the name of our recording rule. Once defined, we can call this metric directly in PromQL.

    The `expr` is just the expression, as we would normally write it in the expression editor.

    Save and exit the file.

11. Restart Prometheus for the rules changes to take effect:

        $ sudo systemctl restart prometheus

12. Return to the web UI and navigate to **Status** > **Rules**.

13. Click on the provided rule; it will take us to the expression editor! Return to the **Rules** page when done.
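To illustrate what `avg without (instance) (up{job="forethought"})` computes, the sketch below reproduces the aggregation on a handful of in-memory samples. The `avgWithoutInstance` function, the instance names, and the sample values are all hypothetical (the course environment runs a single application container); Prometheus evaluates the real expression itself.

```javascript
// Simplified model of `avg without (instance) (up{job="..."})`.
// Not Prometheus's evaluation engine, just the same arithmetic.
function avgWithoutInstance(samples, job) {
  const groups = new Map();

  for (const { labels, value } of samples) {
    if (labels.job !== job) continue; // the {job="..."} selector

    // Drop the instance label; whatever labels remain define the group.
    const { instance, ...rest } = labels;
    const key = JSON.stringify(Object.entries(rest).sort());

    const group = groups.get(key) || { labels: rest, sum: 0, count: 0 };
    group.sum += value;
    group.count += 1;
    groups.set(key, group);
  }

  return [...groups.values()].map(({ labels, sum, count }) => ({
    labels,
    value: sum / count,
  }));
}

// Made-up scrape results: four application instances, one of them down,
// plus an unrelated job that the selector filters out.
const upSamples = [
  { labels: { job: 'forethought', instance: 'app1:8080' }, value: 1 },
  { labels: { job: 'forethought', instance: 'app2:8080' }, value: 1 },
  { labels: { job: 'forethought', instance: 'app3:8080' }, value: 1 },
  { labels: { job: 'forethought', instance: 'app4:8080' }, value: 0 },
  { labels: { job: 'node', instance: 'host1:9100' }, value: 1 },
];

const averaged = avgWithoutInstance(upSamples, 'forethought');
console.log(averaged);
```

With three of four instances up, the recorded value is 0.75: a ratio between 0 and 1 rather than a percentage, which matters when choosing alert thresholds.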

Alerting Rules


Lesson Description:

With our recording rule created, we can start thinking about creating an alert based on this. For this specific example, we're going to write rules that will alert us when 25% of our application containers are down. While we only have one application container at this time, on actual production systems, we know we'll be running multiple instances of our application, and one of these containers going down isn't going to devastate us. Instead, we want to watch for the point where our end users _will_ start noticing.

## Steps in This Video

1. Now that we have a recording rule, we can build our alerting rule based on this. We know we want to alert when we have less than 75% of our application containers up, so we'll use the `job:uptime:average:ft < .75` expression:

        groups:
          - name: uptime
            rules:
              - record: job:uptime:average:ft
                expr: avg without (instance) (up{job="forethought"})
              - alert: ForethoughtApplicationDown
                expr: job:uptime:average:ft < .75

   Notice how we define this rule with `alert` instead of `record` and that the name does not have to follow the previously defined format.

   Save and exit when done.

2. Restart Prometheus:

        $ sudo systemctl restart prometheus

3. Refresh the **Rules** page to view the second rule.



Lesson Description:

With our basic alerting rules set up, we now want to establish our alerts so they work best for us. A major part of this is ensuring we're only alerting when needed. Consider our `ForethoughtApplicationDown` alert: What if we're restarting a percentage of our application instances to provide updates? We might very well end up causing an alert based on only one instance of 25% of our applications being down. And remember, these expressions are run every 10 seconds.

To prevent alerts from firing in instances such as this, we use the `for` parameter:

    groups:
      - name: uptime
        rules:
          - record: job:uptime:average:ft
            expr: avg without (instance) (up{job="forethought"})
          - alert: ForethoughtApplicationDown
            expr: job:uptime:average:ft < .75
            for: 5m

When we set `for` to `5m`, we're telling Prometheus to hold the alert in a `pending` state until it's been down for five minutes. Then, and only then, will it fire the alert to Alertmanager. This prevents any unnecessary alerting from issues like the above. As your monitoring system matures, you'll often find yourself adjusting this number based on frequency and severity of alerts. Don't be surprised if you end up with `for` times up to an hour or more!
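The effect of `for` can be pictured as a small state machine. The sketch below is a simplified model, not Prometheus's own code: `alertStates` is a hypothetical function, times are plain seconds, and the evaluation times and values are invented to show a brief dip that recovers followed by a sustained outage.

```javascript
// Simplified model of how `for` delays an alert.
// evaluations: [{ time (seconds), value }], threshold: fire when value < threshold.
function alertStates(evaluations, threshold, forSeconds) {
  let activeSince = null; // time the condition first became true

  return evaluations.map(({ time, value }) => {
    if (!(value < threshold)) {
      activeSince = null; // condition cleared, so the timer resets
      return 'inactive';
    }
    if (activeSince === null) {
      activeSince = time;
    }
    // The alert stays pending until the condition has held for the full window.
    return time - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Made-up evaluation results for job:uptime:average:ft.
const states = alertStates(
  [
    { time: 0, value: 1 },     // healthy
    { time: 60, value: 0.5 },  // brief dip during a rolling restart
    { time: 120, value: 1 },   // recovered before five minutes passed
    { time: 180, value: 0.5 }, // a real outage begins
    { time: 360, value: 0.5 }, // still down, only three minutes so far
    { time: 480, value: 0.5 }, // five minutes down
  ],
  0.75,
  300 // for: 5m
);

console.log(states);
```

The short dip at 60 seconds never reaches Alertmanager, while the outage that starts at 180 seconds fires once it has lasted the full five minutes.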



Lesson Description:

Annotations let us pass in additional information to our alerts. These are written as key-value pairs in the YAML itself and can make use of Go's templating language to pull in special values. Generally, we want to provide any relevant information we can in the annotations, including information about the issue itself, links to any documentation, and debugging information.

Two variables are provided for us to use: `$value`, which calls the value of the expression that triggered the alert (`job:uptime:average:ft < .75` for our example alert), and `$labels.NAME`, which lets us call a label by its name. So if we wanted to call our `job` label, we would use `$labels.job`.

At the very least, we generally want to include an overview of the issue at hand, ensuring whoever is addressing the issue knows both what the problem is and which parts of your platform are affected:

    groups:
      - name: uptime
        rules:
          - record: job:uptime:average:ft
            expr: avg without (instance) (up{job="forethought"})
          - alert: ForethoughtApplicationDown
            expr: job:uptime:average:ft < .75
            for: 5m
            annotations:
              overview: '{{printf "%.2f" $value}}% instances are up for {{ $labels.job }}'
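To preview what the `overview` annotation will say once rendered, here is a rough JavaScript approximation of the template. Prometheus renders annotations with Go templates, so `renderOverview` and `renderOverviewAsPercent` are hypothetical helpers that only mimic the resulting string; the value `0.5` assumes half of the instances are up.

```javascript
// Mimics '{{printf "%.2f" $value}}% instances are up for {{ $labels.job }}'.
// Approximation only: Prometheus itself uses Go templates.
function renderOverview(value, labels) {
  return `${value.toFixed(2)}% instances are up for ${labels.job}`;
}

// Variant that scales the 0-to-1 ratio to a percentage before formatting.
function renderOverviewAsPercent(value, labels) {
  return `${(value * 100).toFixed(2)}% instances are up for ${labels.job}`;
}

const asWritten = renderOverview(0.5, { job: 'forethought' });
const asPercent = renderOverviewAsPercent(0.5, { job: 'forethought' });

console.log(asWritten); // 0.50% instances are up for forethought
console.log(asPercent); // 50.00% instances are up for forethought
```

Since the recording rule stores a ratio between 0 and 1, the annotation as written prints `0.50%` when half of the instances are up. Whoever reads the alert should keep that in mind, or the value can be scaled to a percentage before it is formatted.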



Lesson Description:

Up until this point, much of the configuration for our alerts has been directly for our benefit (clear annotations, a `for` value to make sure we don't get alerted unnecessarily), but we also want to include labels for better routing to Alertmanager.

Labels are key-value pairs that will eventually let us sort through our tickets by what we deem important. These should be consistent across your alerts and generally contain information such as severity and which team will take over to address the issue. This way, once we get our alerts into the Alertmanager, we can sort them by these labels, ensuring we're funneling our alerts to the right place.

We can set these labels via the `labels` parameter, just as we did for our annotations:

    groups:
      - name: uptime
        rules:
          - record: job:uptime:average:ft
            expr: avg without (instance) (up{job="forethought"})
          - alert: ForethoughtApplicationDown
            expr: job:uptime:average:ft < .75
            for: 5m
            labels:
              severity: page
              team: devops
            annotations:
              overview: '{{printf "%.2f" $value}}% instances are up for {{ $labels.job }}'

Preparing Our Receiver


Lesson Description:

Before we can actually set up the Alertmanager to pass on alerts, we want to set up a place where we can receive them. For this, we're going to set up a Slack account. If you already have access to a workspace you can add applications and test on, just use that and skip this section. Otherwise, we're going to walk through setting up a workspace, adding an app, and configuring it to use webhooks.

## Steps in This Video

1. Go to the Slack website and create a new workspace, following the step-by-step instructions on screen until you are given your workspace. Be sure to add a `prometheus` channel!

2. From your chat, use the workspace menu to go to **Administration** and then **Manage apps**.

3. Select **Build** on the top menu.

4. Press **Start Building**, then **Create New App**. Give your application a name, and then select the workspace you just created. Click **Create App** when done.

5. Select **Incoming Webhooks** from the menu.

6. Turn webhooks on.

7. Click **Add New Webhook to Workspace**, setting the channel name to the `prometheus` channel. **Authorize** the webhook.

8. Make note of the webhook URL.
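Before wiring the webhook into Alertmanager, it can help to confirm the webhook itself works. The sketch below only builds the request for a test message and does not send anything. `buildWebhookRequest` is a hypothetical helper, and the URL is a placeholder that must be replaced with the webhook URL noted in the last step; the JSON body with a `text` field is the basic payload Slack incoming webhooks accept.

```javascript
// Builds (but does not send) an HTTPS request for a Slack incoming webhook.
function buildWebhookRequest(webhookUrl, text) {
  const url = new URL(webhookUrl);
  const body = JSON.stringify({ text });

  return {
    options: {
      method: 'POST',
      hostname: url.hostname,
      path: url.pathname,
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    },
    body,
  };
}

// Placeholder URL for illustration; substitute your own webhook URL.
const webhookRequest = buildWebhookRequest(
  'https://hooks.slack.com/services/T0000/B0000/XXXXXXXX',
  'Test message from the monitoring course'
);

console.log(webhookRequest.options.path);
console.log(webhookRequest.body);
```

To actually send the message, the `options` object can be passed to `https.request` from Node's standard library, writing `body` to the request before ending it. A message appearing in the `prometheus` channel confirms the webhook is ready for Alertmanager.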

Using Alertmanager


Lesson Description:

Now that we have our alerts up and we know they're working, we need to get our actual Alertmanager set up. This is where we'll use our labels to route our alerts to the people and places they need to go. Specifically, we'll route our alert to our Slack channel.

Our Alertmanager configuration is divided into four sections: `global`, which stores any configuration that will remain consistent across the entire file, such as our email; `route`, which defines how we want to sort our alerts to our receivers; `receivers`, which define the endpoints where we want to receive our alerts; and `inhibit_rules`, which let us define rules to suppress related alerts so we don't spam messages.

## Steps in This Video

1. Open the Alertmanager configuration:

        $ sudo $EDITOR /etc/alertmanager/alertmanager.yml

2. Set the default route's `repeat_interval` to one minute and update the receiver to use our Slack endpoint:

        route:
          receiver: 'slack'
          group_by: ['alertname']
          group_wait: 10s
          group_interval: 10s
          repeat_interval: 1m

3. Create a secondary route that will send `severity: page` alerts to the Slack receiver; group by the `team` label:

        route:
          receiver: 'slack'
          group_by: ['alertname']
          group_wait: 10s
          group_interval: 10s
          repeat_interval: 1m
          routes:
            - match:
                severity: page
              group_by: ['team']
              receiver: 'slack'

4. Add a tertiary route that sends all alerts for the `devops` team to Slack:

        route:
          receiver: 'slack'
          group_by: ['alertname']
          repeat_interval: 1m
          routes:
            - match:
                severity: page
              group_by: ['team']
              receiver: 'slack'
              routes:
                - match:
                    team: devops
                  receiver: 'slack'

5. Update the receiver to use Slack, replacing `APIKEY` with the webhook URL you noted when preparing the receiver:

        receivers:
          - name: 'slack'
            slack_configs:
              - channel: "#prometheus"
                api_url: APIKEY
                text: "Overview: {{ .CommonAnnotations.overview }}"

6. Update the `inhibit_rules` so that any alerts with the severity of `ticket` for the DevOps team are suppressed when a `page`-level alert is happening:

        inhibit_rules:
          - source_match:
              severity: 'page'
            target_match:
              severity: 'ticket'
            equal: ['team']

   Save and exit.

7. Restart Alertmanager:

        $ sudo systemctl restart alertmanager

8. View your Slack chat and wait to see the firing alert.
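The nested routes above form a tree, and an alert ends up at the receiver of the deepest route whose matchers it satisfies. The sketch below is a simplified model of that walk, assuming the default behaviour where the first matching child route wins. `resolveReceiver` is a hypothetical function, and the three distinct receiver names are invented so the result of each lookup is visible (the course configuration sends everything to `slack`).

```javascript
// True when every label required by the route's `match` block is present
// on the alert with the same value. A route without `match` matches anything.
function matches(route, labels) {
  const wanted = route.match || {};
  return Object.keys(wanted).every((name) => labels[name] === wanted[name]);
}

// Returns the receiver of the deepest matching route, taking the first
// matching child on each level. Simplified: ignores options such as `continue`.
function resolveReceiver(route, labels, inherited) {
  const receiver = route.receiver || inherited;
  for (const child of route.routes || []) {
    if (matches(child, labels)) {
      return resolveReceiver(child, labels, receiver);
    }
  }
  return receiver; // no child matched, so this route handles the alert
}

// Same shape as the lesson's route tree, with made-up receiver names.
const routeTree = {
  receiver: 'default-receiver',
  routes: [
    {
      match: { severity: 'page' },
      receiver: 'page-receiver',
      routes: [{ match: { team: 'devops' }, receiver: 'devops-receiver' }],
    },
  ],
};

const devopsPage = resolveReceiver(routeTree, { severity: 'page', team: 'devops' });
const otherPage = resolveReceiver(routeTree, { severity: 'page', team: 'web' });
const devopsTicket = resolveReceiver(routeTree, { severity: 'ticket', team: 'devops' });

console.log(devopsPage, otherPage, devopsTicket);
```

Note that the `team: devops` route sits underneath the `severity: page` route, so a `ticket`-level alert for the DevOps team never reaches it and falls back to the top-level receiver.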



Lesson Description:

Once we have our alert firing, we need to know how to pause the alert for the time it takes us to fix the issue. We can do this from the Alertmanager web UI at port 9093. From there, we have the option to either select **Silence** next to the alert itself, or click **Silences** at the top of the screen and select **New Silence**.

Once we have the silence window open, we're presented with a number of options.

First, we want to set the length of the silence. For an existing alert, set it to the amount of time you think it will take you to troubleshoot the issue. If you're setting a silence for expected downtime, give yourself enough time to complete any downtime tasks and solve unexpected issues.

Finally, we need to make note of who is making the silence and add a comment regarding the silence. For this, go ahead and note that the Docker container is down and you're silencing until you restart.

At this point, you may also want to restart your Docker container:

    $ docker start ft-app




Adding a Dashboard


Lesson Description:

When it comes to setting up visualizations that will remain up and running persistently on our Grafana setup, it might seem overwhelming to consider which of our countless metrics we need to rely on. One way to mitigate this problem is using one of the prebuilt dashboards shared to the Grafana community. These dashboards are pre-created setups based on common tools and configurations we frequently see when monitoring various infrastructures, containers, and applications. As much as we want to believe our platform is completely unique (and, technically, it most likely is), there will be enough similarities between our setup and ones created by others that pulling in some prebuilt dashboards is a practical way to start.

## Steps in This Video

1. Log in to your Grafana dashboard at `PUBLICIP:3000`.

2. In another tab, go to the Grafana dashboard website.

3. Search for the "Node Exporter Full" dashboard, and copy the dashboard ID.

4. Back on your Grafana instance, select the plus sign on the side menu and click **Import**. Paste in the dashboard ID.

5. Create a new folder called "Prometheus", and also select the Prometheus data source. **Import**.

6. To edit a panel, select the name of a panel and click **Edit**. Singlestat panels are the panels that display a single number, while graphs are the primary panels that display data over time via an x- and y-axis.

Building a Panel


Lesson Description:

Once we get some of the simpler parts of our Grafana setup configured, we'll want to start looking into adding metrics we won't find any pre-created dashboards for, such as metrics related specifically to our application. While using a default Node Exporter dashboard saves us a lot of time and works well for monitoring our infrastructure metrics, we don't want to leave our applications in the dark.

## Steps in This Video

1. Return to your Grafana instance at port 3000.

2. Switch to the *Forethought* dashboard.

3. Click **Add Panel**. Select **Heatmap**.

4. When the panel appears, click on the name and then **Edit**.

5. Switch to the **General** tab, and set the name of the chart to *Response Time Distribution*.

6. Return to the **Metrics** tab. We're going to calculate the rate at which requests fall into each of our response-time buckets over time:

        sum(rate(forethought_requests_hist_bucket[30s])) by (le)

   Set the **Legend** to `{{le}}`.

7. From the **Axes** tab, switch the **Data format** to *Time series buckets*.

8. If desired, further alter the graph's colors and appearance by using the **Display** tab.

9. Return to the dashboard.

10. Click the **Save** icon, add a comment, and **Save**.
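To get a feel for what `rate(forethought_requests_hist_bucket[30s])` yields for each `le` bucket, the sketch below computes a per-second rate from the first and last sample in a window. It is deliberately simplified: Prometheus's `rate()` also extrapolates to the window boundaries and accounts for counter resets. The `simpleRate` function and all sample values are made up for illustration.

```javascript
// Per-second increase between the first and last sample in a window.
// Simplified: no extrapolation and no counter-reset handling.
function simpleRate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  return (last.value - first.value) / (last.time - first.time);
}

// Made-up cumulative bucket counters sampled 30 seconds apart.
const bucketSeries = [
  { le: '1', samples: [{ time: 0, value: 10 }, { time: 30, value: 40 }] },
  { le: '+Inf', samples: [{ time: 0, value: 10 }, { time: 30, value: 70 }] },
];

// One rate per bucket, which is what `by (le)` keeps separate in the query.
const rates = bucketSeries.map(({ le, samples }) => ({
  le,
  perSecond: simpleRate(samples),
}));

console.log(rates);
```

Because histogram buckets are cumulative, each series counts the requests at or below its `le` bound. Setting the data format to *Time series buckets* tells Grafana to treat each series as a bucket when drawing the heatmap.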



Final Thoughts

Congratulations and Next Steps


Lesson Description:

You did it! Congratulations: you've taken the time to learn how to use monitoring concepts at all levels of your platform. You're now prepared to not just set up a monitoring stack but use metrics, write metrics into code, and manipulate those metrics across various tools, such as Grafana, to keep track of just what is happening "under the hood" of your system.

Not ready to take a break yet? That's okay! We have plenty of courses that will sate that need to learn:

+ YAML Essentials
+ Monitoring Kubernetes with Prometheus
+ Elastic Stack Essentials
+ Elasticsearch Deep Dive
