
Monitoring Kubernetes With Prometheus

Course

Intro Video


Travis Thomsen

Course Development Director in Content

I have over 17 years of experience in all phases of the software development life cycle, which includes software analysis, design, development, testing, implementation, debugging, maintenance and documentation. I am passionate about learning new technologies, methodologies, languages and automation.

Length

05:00:00

Difficulty

Intermediate

Videos

24

Hands-on Labs

3

Course Details

Are you interested in deploying Prometheus to Kubernetes? If so, this is the course for you.

This course covers the basics of Prometheus, which includes its architecture and components, such as exporters, client libraries, and alerting.

From there, you will learn how to deploy Prometheus to Kubernetes and configure Prometheus to monitor the cluster as well as applications deployed to it.

You will also learn the basics of PromQL, which includes the syntax, functions, and creating recording rules.

Finally, the course will close out by talking about the Alertmanager and creating alerting rules.

Download the Interactive Diagrams here:

https://interactive.linuxacademy.com/diagrams/MonitoringKubernetswithPrometheus.html

https://interactive.linuxacademy.com/diagrams/ApplicationMetrics.html

https://interactive.linuxacademy.com/diagrams/ExporterMetrics.html

https://interactive.linuxacademy.com/diagrams/NodeExporter.html

Syllabus

Introduction

About This Course

00:01:57

Lesson Description:

This video will go over the highlights of this course:

- Prometheus Architecture
- Run Prometheus on Kubernetes
- Application Monitoring
- PromQL
- Alerting

I will also discuss the prerequisites for this course.

About the Instructor

00:00:55

Lesson Description:

Before we get started on the course, let's learn a little about who is teaching it!

What is Prometheus?

00:01:39

Lesson Description:

Before we jump into the technical details of this course, we will take a five-thousand-foot view of what Prometheus is.

Setting Up Your Environment

Using Cloud Playground

00:06:16

Lesson Description:

In this video, you will learn how to use Cloud Playground to create the Cloud Servers you will need to complete this course. You will also be shown how to use the web terminal as an alternative to using SSH.

Setting Up a Kubernetes Cluster

00:07:57

Lesson Description:

In this lesson, you will set up your Kubernetes cluster. We will start by installing the master node.

#### Setting up the Kubernetes Master

The following actions will be executed on the Kubernetes master.

1. Disable swap

swapoff -a
2. Edit: `/etc/fstab`
vi /etc/fstab
3. Comment out swap
#/root/swap swap swap sw 0 0
4. Add the Kubernetes repo
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF
5. Disable SELinux
setenforce 0
6. Permanently disable SELinux:
vi /etc/selinux/config
7. Change enforcing to disabled
SELINUX=disabled
8. Install Kubernetes 1.11.3
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
9. Start and enable the Kubernetes service
systemctl start kubelet && systemctl enable kubelet
10. Create the `k8s.conf` file:
cat << EOF >  /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
11. Create `kube-config.yml`:
vi kube-config.yml
12. Add the following to `kube-config.yml`:
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: "v1.11.3"
networking:
  podSubnet: 10.244.0.0/16
apiServerExtraArgs:
  service-node-port-range: 8000-31274
13. Initialize Kubernetes
kubeadm init --config kube-config.yml
14. Copy admin.conf to your home directory
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
15. Install flannel
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
16. Patch flannel
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Add the following to kube-controller-manager.yaml:
--allocate-node-cidrs=true
--cluster-cidr=10.244.0.0/16
17. Reload the kubelet:
systemctl restart kubelet
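If you lose the join command printed by kubeadm init, you can generate a fresh one on the master (a handy check; the flag exists in kubeadm 1.11):
kubeadm token create --print-join-command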

#### Setting up the Kubernetes Worker

Now that the setup for the Kubernetes master is complete, we will begin the process of configuring the worker node. The following actions will be executed on the Kubernetes worker.

1. Disable swap
swapoff -a
2. Edit: `/etc/fstab`
vi /etc/fstab
3. Comment out swap
#/root/swap swap swap sw 0 0
4. Add the Kubernetes repo
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF
5. Disable SELinux
setenforce 0
6. Permanently disable SELinux:
vi /etc/selinux/config
7. Change enforcing to disabled
SELINUX=disabled
8. Install Kubernetes 1.11.3
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
9. Start and enable the Kubernetes service
systemctl start kubelet && systemctl enable kubelet
10. Create the `k8s.conf` file:
cat << EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
11. Use the join token to add the Worker Node to the cluster:
kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
12. On the master node, test to see if the cluster was created properly.

13. Get a listing of the nodes:
kubectl get nodes
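As an optional sanity check, confirm the flannel and DNS pods in kube-system are Running:
kubectl get pods -n kube-system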

Prometheus Architecture

Prometheus Architecture Diagram

00:02:57

Lesson Description:

In this lesson, we will review the Prometheus architecture diagram and go over the various components.

Client Libraries

00:01:21

Lesson Description:

You use client libraries and instrumentation to gather metrics for Prometheus to scrape. Prometheus scrapes your application's HTTP endpoint, and the client library responds with the current state of all tracked metrics. You can develop your own client library if one doesn't exist for your language. This is the code used to instrument the app using the NodeJS library prom-client:

// Metric types from prom-client, plus response-time middleware for timing responses
var Register = require('prom-client').register;
var Counter = require('prom-client').Counter;
var Histogram = require('prom-client').Histogram;
var Summary = require('prom-client').Summary;
var ResponseTime = require('response-time');


module.exports.totalNumOfRequests = totalNumOfRequests = new Counter({
    name: 'totalNumOfRequests',
    help: 'Total number of requests made',
    labelNames: ['method']
});

module.exports.pathsTaken = pathsTaken = new Counter({
    name: 'pathsTaken',
    help: 'Paths taken in the app',
    labelNames: ['path']
});

module.exports.responses = responses = new Summary({
    name: 'responses',
    help: 'Response time in millis',
    labelNames: ['method', 'path', 'status']
});

// Start collecting prom-client's default Node.js process metrics
module.exports.startCollection = function () {
    require('prom-client').collectDefaultMetrics();
};

module.exports.requestCounters = function (req, res, next) {
    if (req.path != '/metrics') {
        totalNumOfRequests.inc({ method: req.method });
        pathsTaken.inc({ path: req.path });
    }
    next();
}

module.exports.responseCounters = ResponseTime(function (req, res, time) {
    if(req.url != '/metrics') {
        responses.labels(req.method, req.url, res.statusCode).observe(time);
    }
})

// Expose the /metrics endpoint that Prometheus will scrape
module.exports.injectMetricsRoute = function (App) {
    App.get('/metrics', (req, res) => {
        res.set('Content-Type', Register.contentType);
        res.end(Register.metrics());
    });
};
Prometheus supported libraries:

- Go
- Java or Scala
- Python
- Ruby

Third-party libraries:

- Bash
- C++
- Common Lisp
- Elixir
- Erlang
- Haskell
- Lua for Nginx
- Lua for Tarantool
- .NET / C#
- Node.js
- Perl
- PHP
- Rust
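With the instrumented app running, you can inspect the scrape output by hand; a quick sketch, assuming the app listens on port 8080 locally (adjust to your app's port):
curl http://localhost:8080/metrics | head -n 20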

Exporters

00:01:41

Lesson Description:

Exporters are software deployed next to an application from which you want metrics collected. Instrumentation for exporters is known as custom collectors or ConstMetrics.

How exporters work:

- Take requests
- Gather the data
- Format the data
- Return the data to Prometheus

Databases:

- Consul exporter
- Memcached exporter
- MySQL server exporter

Hardware:

- Node/system metrics exporter

HTTP:

- HAProxy exporter

Other monitoring systems:

- AWS CloudWatch exporter
- Collectd exporter
- Graphite exporter
- InfluxDB exporter
- JMX exporter
- SNMP exporter
- StatsD exporter

Miscellaneous:

- Blackbox exporter

Service Discovery

00:02:42

Lesson Description:

In this lesson, you will learn about Service Discovery. Service Discovery is a way for Prometheus to find targets without having to statically configure them.

Scraping

00:01:26

Lesson Description:

In this lesson, you will learn the difference between push and pull monitoring systems. We will also discuss how Prometheus defines scraping.

Run Prometheus on Kubernetes

Setting Up Prometheus

00:15:57

Lesson Description:

In this lesson, we will set up Prometheus on the Kubernetes cluster. We will be creating:

- A monitoring namespace for our environment to live in
- A ClusterRole to give Prometheus access to targets using Service Discovery
- A ConfigMap that will be used to generate the Prometheus config file
- A Prometheus Deployment and Service
- Kube State Metrics to get access to metrics on the Kubernetes API

You can clone the YAML files from GitHub. Create a file called namespaces.yml. This file will be used to create the monitoring namespace.

namespaces.yml:

{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "monitoring",
    "labels": {
      "name": "monitoring"
    }
  }
}
Apply the namespace:
kubectl apply -f namespaces.yml
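Verify the namespace was created (optional):
kubectl get namespace monitoring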
Create a file called clusterRole.yml. This will be used to set up the cluster's roles.

clusterRole.yml:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring
Apply the cluster roles to the Kubernetes cluster:
kubectl apply -f clusterRole.yml
Create config-map.yml. Kubernetes will use this file to manage the prometheus.yml configuration file.

config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s

    scrape_configs:
      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics


      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
        - role: pod

        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-cadvisor'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
Create the ConfigMap:
kubectl apply -f config-map.yml
Create prometheus-deployment.yml. This file will be used to create the Prometheus deployment, which will include the pods, replica sets, and volumes.

prometheus-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - ""--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf

        - name: prometheus-storage-volume
          emptyDir: {}
Deploy the Prometheus environment:
kubectl apply -f prometheus-deployment.yml
Finally, we will finish off the Prometheus environment by creating a service to make it publicly accessible. Create prometheus-service.yml.

prometheus-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9090'

spec:
  selector:
    app: prometheus-server
  type: NodePort
  ports:
    - port: 8080
      targetPort: 9090
      nodePort: 8080
Create the service that will make Prometheus publicly accessible:
kubectl apply -f prometheus-service.yml
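Before browsing to Prometheus, you can confirm the pod and service are up:
kubectl get pods,svc -n monitoring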
Create the Kube State Metrics pod to get access to metrics on the Kubernetes API.

kube-state-metrics.yml:
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      name: kube-state-metrics-main
      labels:
        app: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          image: quay.io/coreos/kube-state-metrics:latest
          ports:
          - containerPort: 8080
            name: metrics
Access Prometheus by visiting http://<MASTER_IP>:8080
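You can also confirm Prometheus is up from the command line; Prometheus 2.x exposes a health endpoint:
curl http://<MASTER_IP>:8080/-/healthy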

Configuring Prometheus

00:13:10

Lesson Description:

In this lesson, you will learn about the Prometheus configuration file, how to configure static targets, and how to use service discovery to find Kubernetes endpoints. Below are the contents of prometheus.yml that were created by the ConfigMap.

prometheus.yml:

global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'kubernetes-apiservers'

    kubernetes_sd_configs:
    - role: endpoints
    scheme: https

    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'

    scheme: https

    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    kubernetes_sd_configs:
    - role: node

    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics


  - job_name: 'kubernetes-pods'

    kubernetes_sd_configs:
    - role: pod

    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

  - job_name: 'kubernetes-cadvisor'

    scheme: https

    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    kubernetes_sd_configs:
    - role: node

    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

  - job_name: 'kubernetes-service-endpoints'

    kubernetes_sd_configs:
    - role: endpoints

    relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
Prometheus Configuration Documentation
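Because the deployment passes --web.enable-lifecycle, you can ask Prometheus to re-read this configuration without restarting the pod (a sketch, using the NodePort from earlier):
curl -X POST http://<MASTER_IP>:8080/-/reload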

Setting Up Grafana

00:04:40

Lesson Description:

In this lesson, you will learn how to deploy a Grafana pod and service to Kubernetes. Create grafana-deployment.yml. This file will be used to create the Grafana deployment. Be sure to change the password. grafana-deployment.yml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
        - image: grafana/grafana:3.1.1
          name: grafana
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: password
          ports:
            - containerPort: 3000
          volumeMounts:
          - name: grafana-persistent-storage
            mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        emptyDir: {}
Deploy Grafana:
kubectl apply -f grafana-deployment.yml
Create grafana-service.yml. This file will be used to make the pod publicly accessible.

grafana-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring

spec:
  selector:
    app: grafana
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 8000
Create the Grafana service:
kubectl apply -f grafana-service.yml
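Grafana should now be reachable at http://<MASTER_IP>:8000 with the admin password you set above. To confirm the service was created:
kubectl get svc grafana-service -n monitoring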

NodeExporter

00:05:13

Lesson Description:

Repeat these steps on both your master and worker nodes. Create the Prometheus user:

adduser prometheus
Download Node Exporter:
cd /home/prometheus
curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz"
tar -xvzf node_exporter-0.16.0.linux-amd64.tar.gz
mv node_exporter-0.16.0.linux-amd64 node_exporter
cd node_exporter
chown prometheus:prometheus node_exporter
vi /etc/systemd/system/node_exporter.service
/etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/node_exporter/node_exporter

[Install]
WantedBy=default.target
Reload systemd:
systemctl daemon-reload
Enable the node_exporter service:
systemctl enable node_exporter.service
Start the node_exporter service:
systemctl start node_exporter.service
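Confirm the exporter is serving metrics on its default port, 9100 (optional check):
curl http://localhost:9100/metrics | head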

Expression Browser

00:04:40

Lesson Description:

In this lesson, you will learn how to use the expression browser to execute queries, view your Prometheus configuration, and view your Prometheus targets. Container CPU load average:

container_cpu_load_average_10s
Memory usage query:
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
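The same queries can also be run against Prometheus's HTTP API instead of the expression browser (a sketch, using the NodePort from earlier; longer queries must be URL-encoded):
curl 'http://<MASTER_IP>:8080/api/v1/query?query=container_cpu_load_average_10s'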

Adding a Grafana Dashboard

00:03:54

Lesson Description:

In this lesson, you will import a Grafana dashboard that will be used to visualize metrics imported from the NodeExporter. Below are the links to the dashboards:

- Content Kubernetes Prometheus Env Repository
- Kubernetes Nodes Dashboard

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, without any extra charge or account to manage.

01:00:00

Application Monitoring

Instrumenting Applications

00:05:37

Lesson Description:

This lesson discusses how to instrument an application by using a Prometheus client library. Though we will be talking about a NodeJS application, there are client libraries available for a wide variety of programming languages. You can clone the Comic Box App here.

Collecting Metrics from Applications

00:06:08

Lesson Description:

In this lesson, you will deploy a NodeJS application to Kubernetes that will be monitored by Prometheus. GitHub link: https://github.com/linuxacademy/content-kubernetes-prometheus-app

Build a Docker image:

docker build -t rivethead42/comicbox .
Login to Docker Hub:
docker login
Push the image to Docker Hub:
docker push <USERNAME>/comicbox
Create a deployment using the image above:
kubectl apply -f deployment.yml
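Once the deployment is applied, you can watch the pod come up and then confirm Prometheus picked it up as a target (assuming the deployment's pod template carries the prometheus.io/scrape annotation used by the kubernetes-pods job):
kubectl get pods
Then check the Targets page in the Prometheus expression browser.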

PromQL

PromQL Basics

00:04:42

Lesson Description:

In this lesson, you will learn the basics of Prometheus' query language—PromQL. This includes queries using the metric name and then filtering it using labels. Return all time series with the metric node_cpu_seconds_total:

node_cpu_seconds_total
Return all time series with the metric node_cpu_seconds_total and the given job and mode labels:
node_cpu_seconds_total{job="node-exporter", mode="idle"}
Return a whole range of time (in this case 5 minutes) for the same vector, making it a range vector:
node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]
Query jobs that end with -exporter:
node_cpu_seconds_total{job=~".*-exporter"}
Query jobs that begin with kube:
container_cpu_load_average_10s{job=~"^kube.*"}

PromQL Operations and Functions

00:03:32

Lesson Description:

In this lesson, you will learn how to add operations and functions to your PromQL expressions.

Arithmetic binary operators:

- `+` (addition)
- `-` (subtraction)
- `*` (multiplication)
- `/` (division)
- `%` (modulo)
- `^` (power/exponentiation)

Comparison binary operators:

- `==` (equal)
- `!=` (not-equal)
- `>` (greater-than)
- `<` (less-than)
- `>=` (greater-or-equal)
- `<=` (less-or-equal)

Logical/set binary operators:

- and (intersection)
- or (union)
- unless (complement)

Aggregation operators:

- sum (calculate sum over dimensions)
- min (select minimum over dimensions)
- max (select maximum over dimensions)
- avg (calculate the average over dimensions)
- stddev (calculate population standard deviation over dimensions)
- stdvar (calculate population standard variance over dimensions)
- count (count number of elements in the vector)
- count_values (count number of elements with the same value)
- bottomk (smallest k elements by sample value)
- topk (largest k elements by sample value)
- quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

Get the total memory in bytes:

node_memory_MemTotal_bytes
Get a sum of the total memory in bytes:
sum(node_memory_MemTotal_bytes)
Get a percentage of total memory used:
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
Using a function with your query:
irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])
Using an operation and a function with your query:
avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]))
Grouping your queries:
avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance)

Recording Rules

00:07:03

Lesson Description:

Create prometheus-read-rules-map.yml. This file will be used to create recording rules for Prometheus.

prometheus-read-rules-map.yml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-read-rules-conf
  labels:
    name: prometheus-read-rules-conf
  namespace: monitoring
data:
  node_rules.yml: |-
    groups:
    - name: node_rules
      interval: 10s
      rules:
        - record: instance:node_cpu:avg_rate5m
          expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
        - record: instance:node_memory_usage:percentage
          expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
        - record: instance:root:node_filesystem_usage:percentage
          expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
Apply the recording rule:
kubectl apply -f prometheus-read-rules-map.yml
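Before wiring the rules into Prometheus, you can confirm the ConfigMap landed in the monitoring namespace (optional check):
kubectl get configmap prometheus-read-rules-conf -n monitoring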
Update prometheus-config-map.yml with the recording rules.

prometheus-config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s

    rule_files:
    - rules/*_rules.yml

    scrape_configs:
      - job_name: 'node-exporter'
        static_configs:
        - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100']

      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
        - role: pod

        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-cadvisor'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
Apply the updated configuration file:
kubectl apply -f prometheus-config-map.yml
Add a new volume for the recording rules:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
            - name: prometheus-read-rules-volume
              mountPath: /etc/prometheus/rules
        - name: watch
          image: weaveworks/watch:master-5b2a6e5
          imagePullPolicy: IfNotPresent
          args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf

        - name: prometheus-read-rules-volume
          configMap:
            defaultMode: 420
            name: prometheus-read-rules-conf

        - name: prometheus-storage-volume
          emptyDir: {}
Apply the updates to the Prometheus deployment:
kubectl apply -f prometheus-deployment.yml
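Once Prometheus reloads with the rules volume mounted, the recorded series should be queryable by name; a quick check via the HTTP API (assuming the NodePort from earlier):
curl 'http://<MASTER_IP>:8080/api/v1/query?query=instance:node_cpu:avg_rate5m'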

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, without any extra charge or account to manage.

01:00:00

Alerting

Alertmanager

00:12:28

Lesson Description:

In this lesson, you will learn how to set up Alertmanager to work with Prometheus. Below are the files that will be used to complete this task. Create a ConfigMap that will be used to set up the Alertmanager config file.

alertmanager-configmap.yml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-conf
  labels:
    name: alertmanager-conf
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@linuxacademy.org'
      smtp_require_tls: false
    route:
      receiver: slack_receiver
    receivers:
    - name: slack_receiver
      slack_configs:
      - send_resolved: true
        username: '<SLACK_USER>'
        api_url: '<APP_URL>'
        channel: '#<CHANNEL>'
Create a deployment file that will be used to stand up the Alertmanager deployment.

alertmanager-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: prometheus-alertmanager
        image: prom/alertmanager:v0.14.0
        args:
          - --config.file=/etc/config/alertmanager.yml
          - --storage.path=/data
          - --web.external-url=/
        ports:
          - containerPort: 9093
        volumeMounts:
          - mountPath: /etc/config
            name: config-volume
          - mountPath: /data
            name: storage-volume
      - name: prometheus-alertmanager-configmap-reload
        image: jimmidyson/configmap-reload:v0.1
        args:
          - --volume-dir=/etc/config
          - --webhook-url=http://localhost:9093/-/reload
        volumeMounts:
          - mountPath: /etc/config
            name: config-volume
            readOnly: true
      volumes:
        - configMap:
            defaultMode: 420
            name: alertmanager-conf
          name: config-volume
        - emptyDir: {}
          name: storage-volume
Create a service that will make Alertmanager publicly accessible.

alertmanager-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
  labels:
    app: alertmanager
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9093'
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
  - port: 9093
    targetPort: 9093
    nodePort: 8081
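After the service is applied, the Alertmanager UI should be reachable at http://<MASTER_IP>:8081; you can also hit its health endpoint from the command line (optional check):
curl http://<MASTER_IP>:8081/-/healthy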
Update the Prometheus config to include changes to rules and add the Alertmanager.

prometheus-config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s

    alerting:
      alertmanagers:
      - kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_namespace]
          regex: monitoring
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          action: keep
          regex: 9093

    rule_files:
      - "/var/prometheus/rules/*_rules.yml"
      - "/var/prometheus/rules/*_alerts.yml"

    scrape_configs:
      - job_name: 'node-exporter'
        static_configs:
        - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100']

      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
        - role: pod

        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-cadvisor'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
Create a ConfigMap that will be used to manage the recording and alerting rules.

prometheus-rules-config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
      - name: kubernetes_alerts
        rules:
        - alert: DeploymentGenerationOff
          expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment is outdated
        - alert: DeploymentReplicasNotUpdated
          expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
            or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
            unless (kube_deployment_spec_paused == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment replicas are outdated
        - alert: PodzFrequentlyRestarting
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
            summary: Pod is restarting frequently
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 1h
          labels:
            severity: warning
          annotations:
            description: The Kubelet on {{ $labels.node }} has not checked in with the API,
              or has set itself to NotReady, for more than an hour
            summary: Node status is NotReady
        - alert: KubeManyNodezNotReady
          expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
            > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
            0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
          for: 1m
          labels:
            severity: critical
          annotations:
            description: '{{ $value }}% of Kubernetes nodes are not ready'
        - alert: APIHighLatency
          expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
          for: 10m
          labels:
            severity: critical
          annotations:
            description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
        - alert: APIServerErrorsHigh
          expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
          for: 10m
          labels:
            severity: critical
          annotations:
            description: API server returns errors for {{ $value }}% of requests
        - alert: KubernetesAPIServerDown
          expr: up{job="kubernetes-apiservers"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: Apiserver {{ $labels.instance }} is down!
        - alert: KubernetesAPIServersGone
          expr:  absent(up{job="kubernetes-apiservers"})
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: No Kubernetes apiservers are reporting!
            description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{$labels.instance}}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value}}% for 1 hour
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Pod {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr:  absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  redis_alerts.yml: |
    groups:
    - name: redis_alerts
      rules:
      - alert: RedisCacheMissesHigh
        expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Redis Server {{ $labels.instance }} Cache Misses are high.
      - alert: RedisRejectedConnectionsHigh
        expr: redis_connected_clients{} > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit."
          description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit."
      - alert: RedisServerDown
        expr: redis_up{app="media-redis"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Redis Server {{ $labels.instance }} is down!
      - alert: RedisServerGone
        expr:  absent(redis_up{app="media-redis"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: No Redis servers are reporting!
          description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone.
  kubernetes_rules.yml: |
    groups:
      - name: kubernetes_rules
        rules:
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.99"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.9"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.5"
  node_rules.yml: |
    groups:
    - name: node_rules
      rules:
        - record: instance:node_cpu:avg_rate5m
          expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
        - record: instance:node_memory_usage:percentage
          expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
        - record: instance:root:node_filesystem_usage:percentage
          expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
  redis_rules.yml: |
    groups:
    - name: redis_rules
      rules:
      - record: redis:command_call_duration_seconds_count:rate2m
        expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment)
      - record: redis:total_requests:rate2m
        expr: rate(redis_commands_processed_total[2m])
Update the volumes used by the Prometheus deployment.

prometheus-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-rules-volume
              mountPath: /var/prometheus/rules
            - name: prometheus-storage-volume
              mountPath: /prometheus/
        - name: watch
          image: weaveworks/watch:master-5b2a6e5
          imagePullPolicy: IfNotPresent
          args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus
            - name: prometheus-rules-volume
              mountPath: /var/prometheus/rules
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-rules-volume
          configMap:
            name: prometheus-rules-conf
        - name: prometheus-storage-volume
          emptyDir: {}
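With the rules volume mounted, you can verify the alert rules loaded on the Alerts page of the Prometheus UI, or query one of the metrics the rules rely on (a quick sketch against the HTTP API):
curl 'http://<MASTER_IP>:8080/api/v1/query?query=prometheus_notifications_alertmanagers_discovered'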

Alerting Rules

00:06:49

Lesson Description:

In this lesson, you will learn how to create alerting rules that will be used to send alerts to Alertmanager. Below are the rules that were created in the previous lesson:

apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
      - name: kubernetes_alerts
        rules:
        - alert: DeploymentGenerationOff
          expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment is outdated
        - alert: DeploymentReplicasNotUpdated
          expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
            or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
            unless (kube_deployment_spec_paused == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment replicas are outdated
        - alert: PodzFrequentlyRestarting
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
            summary: Pod is restarting frequently
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 1h
          labels:
            severity: warning
          annotations:
            description: The Kubelet on {{ $labels.node }} has not checked in with the API,
              or has set itself to NotReady, for more than an hour
            summary: Node status is NotReady
        - alert: KubeManyNodezNotReady
          expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
            > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
            0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
          for: 1m
          labels:
            severity: critical
          annotations:
            description: '{{ $value }} Kubernetes nodes (more than 20% of the cluster) are not ready'
        - alert: APIHighLatency
          expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
          for: 10m
          labels:
            severity: critical
          annotations:
            description: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
        - alert: APIServerErrorsHigh
          expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
          for: 10m
          labels:
            severity: critical
          annotations:
            description: API server returns errors for {{ $value }}% of requests
        - alert: KubernetesAPIServerDown
          expr: up{job="kubernetes-apiservers"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: Apiserver {{ $labels.instance }} is down!
        - alert: KubernetesAPIServersGone
          expr:  absent(up{job="kubernetes-apiservers"})
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: No Kubernetes apiservers are reporting!
            description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{ $labels.instance }}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance }} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value }}%
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Service endpoint {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr:  absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Instance {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  redis_alerts.yml: |
    groups:
    - name: redis_alerts
      rules:
      - alert: RedisCacheMissesHigh
        expr: redis_keyspace_misses_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Redis Server {{ $labels.instance }} Cache Misses are high.
      - alert: RedisRejectedConnectionsHigh
        expr: increase(redis_rejected_connections_total[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit."
          description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit."
      - alert: RedisServerDown
        expr: redis_up{app="media-redis"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Redis Server {{ $labels.instance }} is down!
      - alert: RedisServerGone
        expr: absent(redis_up{app="media-redis"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: No Redis servers are reporting!
          description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone.
  kubernetes_rules.yml: |
    groups:
      - name: kubernetes_rules
        rules:
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.99"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.9"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.5"
  node_rules.yml: |
    groups:
    - name: node_rules
      rules:
        - record: instance:node_cpu:avg_rate5m
          expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
        - record: instance:node_memory_usage:percentage
          expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
        - record: instance:root:node_filesystem_usage:percentage
          expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
  redis_rules.yml: |
    groups:
    - name: redis_rules
      rules:
      - record: redis:command_call_duration_seconds_count:rate2m
        expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment)
      - record: redis:total_requests:rate2m
        expr: rate(redis_commands_processed_total[2m])
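
Before (re)applying this ConfigMap, the rule files can be checked for syntax errors with promtool. A minimal sketch, assuming the manifest above was saved as prometheus-rules-config-map.yml (a filename chosen here for illustration) and that promtool is installed locally:

kubectl apply -f prometheus-rules-config-map.yml

# Pull one of the rule files back out of the ConfigMap and validate it
kubectl -n monitoring get configmap prometheus-rules-conf -o jsonpath='{.data.kubernetes_alerts\.yml}' > kubernetes_alerts.yml
promtool check rules kubernetes_alerts.yml

Once the watch sidecar has triggered a reload, the loaded rules appear in the Prometheus web UI under Status > Rules, and any firing alerts under Alerts.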

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, with no extra charge or account to manage.

01:00:00

Final Steps

Next Steps

00:01:05

Lesson Description:

Not sure what to take next? Maybe these courses will pique your interest.
