
Monitoring Kubernetes With Prometheus

Course

Intro Video

Photo of Travis Thomsen

Travis Thomsen

Course Development Director in Content

I have over 17 years of experience in all phases of the software development life cycle, which includes software analysis, design, development, testing, implementation, debugging, maintenance and documentation. I am passionate about learning new technologies, methodologies, languages and automation.

Length

05:03:57

Difficulty

Intermediate

Videos

24

Hands-on Labs

3

Course Details

Are you interested in deploying Prometheus to Kubernetes? If so, this is the course for you.

This course covers the basics of Prometheus, which includes its architecture and components, such as exporters, client libraries, and alerting.

From there, you will learn how to deploy Prometheus to Kubernetes and configure Prometheus to monitor the cluster as well as applications deployed to it.

You will also learn the basics of PromQL, which includes the syntax, functions, and creating recording rules.

Finally, the course will close out by talking about the Alertmanager and creating alerting rules.

Download the Interactive Diagrams here:

https://interactive.linuxacademy.com/diagrams/MonitoringKubernetswithPrometheus.html

https://interactive.linuxacademy.com/diagrams/ApplicationMetrics.html

https://interactive.linuxacademy.com/diagrams/ExporterMetrics.html

https://interactive.linuxacademy.com/diagrams/NodeExporter.html

Syllabus

Introduction

Introduction

About This Course

00:01:57

Lesson Description:

This video will go over the highlights of this course:

- Prometheus Architecture
- Run Prometheus on Kubernetes
- Application Monitoring
- PromQL
- Alerting

I will also discuss the prerequisites for this course.

About the Instructor

00:00:55

Lesson Description:

Before we get started on the course, let's learn a little about who is teaching it!

What is Prometheus?

00:01:39

Lesson Description:

Before we jump into the technical details of this course, we will take a high-level view of what Prometheus is.

Setting Up Your Environment

Using Cloud Playground

00:06:16

Lesson Description:

In this video, you will learn how to use Cloud Playground to create the Cloud Servers you will need to complete this course. You will also be shown how to use the web terminal as an alternative to using SSH.

Setting Up a Kubernetes Cluster

00:07:57

Lesson Description:

In this lesson, you will set up your Kubernetes cluster. We will start by installing the master node.

Setting Up the Kubernetes Master

The following actions will be executed on the Kubernetes master.

Disable swap:

swapoff -a
Edit: /etc/fstab
vi /etc/fstab
Comment out swap
#/root/swap swap swap sw 0 0
Add the Kubernetes repo
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF
Disable SELinux
setenforce 0
Permanently disable SELinux:
vi /etc/selinux/config
Change enforcing to disabled
SELINUX=disabled
Install Kubernetes 1.11.3
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
Start and enable the Kubernetes service
systemctl start kubelet && systemctl enable kubelet
Create the k8s.conf file:
cat << EOF >  /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
Create kube-config.yml:
vi kube-config.yml
Add the following to kube-config.yml:
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: "v1.11.3"
networking:
  podSubnet: 10.244.0.0/16
apiServerExtraArgs:
  service-node-port-range: 8000-31274
Initialize Kubernetes
kubeadm init --config kube-config.yml
Copy admin.conf to your home directory
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
Install flannel
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
Patch flannel
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
Add the following to kube-controller-manager.yaml:
--allocate-node-cidrs=true
--cluster-cidr=10.244.0.0/16
Then reload kubelet:
systemctl restart kubelet
Setting Up the Kubernetes Worker

Now that the setup for the Kubernetes master is complete, we will begin the process of configuring the worker node. The following actions will be executed on the Kubernetes worker.

Disable swap:
swapoff -a
Edit: /etc/fstab
vi /etc/fstab
Comment out swap
#/root/swap swap swap sw 0 0
Add the Kubernetes repo
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF
Disable SELinux
setenforce 0
Permanently disable SELinux:
vi /etc/selinux/config
Change enforcing to disabled
SELINUX=disabled
Install Kubernetes 1.11.3
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
Start and enable the Kubernetes service
systemctl start kubelet && systemctl enable kubelet
Create the k8s.conf file:
cat << EOF >  /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
Use the join token to add the Worker Node to the cluster:
kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
On the master node, test to see if the cluster was created properly. Get a listing of the nodes:
kubectl get nodes
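Before moving on, it is worth confirming that every node reports Ready. A small Python sketch of that check against captured `kubectl get nodes` output (the sample output below is illustrative, not from a real cluster):

```python
# Sketch: verify every node reports Ready, given captured `kubectl get nodes` output.
sample = """NAME         STATUS   ROLES    AGE   VERSION
master       Ready    master   10m   v1.11.3
worker       Ready    <none>   5m    v1.11.3"""

def all_nodes_ready(output):
    rows = output.strip().splitlines()[1:]          # skip the header row
    return all(row.split()[1] == "Ready" for row in rows)

print(all_nodes_ready(sample))  # True when every node is Ready
```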

Monitoring Kubernetes with Prometheus

Prometheus Architecture

Prometheus Architecture Diagram

00:02:57

Lesson Description:

In this lesson, we will review the Prometheus architecture diagram and go over its various components.

Client Libraries

00:01:21

Lesson Description:

You use client libraries and instrumentation to gather metrics for Prometheus to scrape. Prometheus scrapes your application's HTTP endpoint, and the client library responds with the current state of all tracked metrics. You can develop your own client library if one doesn't exist. This is the code used to instrument the app using the NodeJS library prom-client:

var Register = require('prom-client').register;
var Counter = require('prom-client').Counter;
var Histogram = require('prom-client').Histogram;
var Summary = require('prom-client').Summary;
var ResponseTime = require('response-time');


module.exports.totalNumOfRequests = totalNumOfRequests = new Counter({
    name: 'totalNumOfRequests',
    help: 'Total number of requests made',
    labelNames: ['method']
});

module.exports.pathsTaken = pathsTaken = new Counter({
    name: 'pathsTaken',
    help: 'Paths taken in the app',
    labelNames: ['path']
});

module.exports.responses = responses = new Summary({
    name: 'responses',
    help: 'Response time in millis',
    labelNames: ['method', 'path', 'status']
});

module.exports.startCollection = function () {
    require('prom-client').collectDefaultMetrics();
};

module.exports.requestCounters = function (req, res, next) {
    if (req.path != '/metrics') {
        totalNumOfRequests.inc({ method: req.method });
        pathsTaken.inc({ path: req.path });
    }
    next();
}

module.exports.responseCounters = ResponseTime(function (req, res, time) {
    if(req.url != '/metrics') {
        responses.labels(req.method, req.url, res.statusCode).observe(time);
    }
})

module.exports.injectMetricsRoute = function (App) {
    App.get('/metrics', (req, res) => {
        res.set('Content-Type', Register.contentType);
        res.end(Register.metrics());
    });
};
Prometheus supported libraries:

- Go
- Java or Scala
- Python
- Ruby

Third-party libraries:

- Bash
- C++
- Common Lisp
- Elixir
- Erlang
- Haskell
- Lua for Nginx
- Lua for Tarantool
- .NET / C#
- Node.js
- Perl
- PHP
- Rust
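The prom-client counters above are ultimately rendered in the Prometheus text exposition format when the /metrics endpoint is scraped. A minimal Python sketch of that behavior (this Counter class is a toy for illustration, not a real client library):

```python
class Counter:
    """Toy counter mimicking a client library: labeled, monotonically increasing."""
    def __init__(self, name, help_text):
        self.name, self.help_text = name, help_text
        self.values = {}                         # label tuple -> count

    def inc(self, **labels):
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0) + 1

    def render(self):
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, count in sorted(self.values.items()):
            labels = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{labels}}} {count}")
        return "\n".join(lines)

requests = Counter("totalNumOfRequests", "Total number of requests made")
requests.inc(method="GET")
requests.inc(method="GET")
requests.inc(method="POST")
print(requests.render())
```

When scraped, Prometheus parses each `name{labels} value` line into a time series.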

Exporters

00:01:41

Lesson Description:

Exporters are pieces of software deployed alongside the applications you want to collect metrics from. Instrumentation for exporters is known as custom collectors or ConstMetrics.

How exporters work:

- Take requests
- Gather the data
- Format the data
- Return the data to Prometheus

Databases:

- Consul exporter
- Memcached exporter
- MySQL server exporter

Hardware:

- Node/system metrics exporter

HTTP:

- HAProxy exporter

Other monitoring systems:

- AWS CloudWatch exporter
- Collectd exporter
- Graphite exporter
- InfluxDB exporter
- JMX exporter
- SNMP exporter
- StatsD exporter

Miscellaneous:

- Blackbox exporter
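The request/gather/format/return flow above boils down to a small translation step; a Python sketch, where the stat names and metric prefix are made up for illustration:

```python
def format_exporter_metrics(stats, prefix="memcached"):
    """Translate a third-party system's raw stats into Prometheus exposition
    lines. The stat names and prefix are illustrative, not a real exporter."""
    lines = []
    for name, value in sorted(stats.items()):
        lines.append(f"{prefix}_{name} {value}")
    return "\n".join(lines)

# Raw stats as a hypothetical backend might report them:
raw = {"current_connections": 10, "get_hits": 4231}
print(format_exporter_metrics(raw))
```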

Service Discovery

00:02:42

Lesson Description:

In this lesson, you will learn about Service Discovery, which is a way for Prometheus to find targets without having to statically configure them.
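Conceptually, service discovery hands Prometheus a list of candidate targets with metadata labels, and relabeling rules then filter them. A rough Python sketch of a `keep`-style filter (the target data is invented for illustration):

```python
import re

def keep(targets, source_label, regex):
    """Mimic a relabel_config 'keep' action: retain targets whose source
    label fully matches the regex (Prometheus anchors matcher regexes)."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]

discovered = [
    {"__address__": "10.244.1.5:8080",
     "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"},
    {"__address__": "10.244.1.6:9090",
     "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "false"},
]
kept = keep(discovered, "__meta_kubernetes_pod_annotation_prometheus_io_scrape", "true")
print([t["__address__"] for t in kept])  # ['10.244.1.5:8080']
```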

Scraping

00:01:26

Lesson Description:

In this lesson, you will learn the difference between push and pull monitoring systems. We will also discuss how Prometheus defines scraping.
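The pull model can be sketched end to end with Python's standard library: the app serves /metrics, and the scraper fetches it on a schedule. The metric name and value below are illustrative:

```python
# Sketch of the pull model: the application exposes /metrics over HTTP and
# the monitoring side fetches ("scrapes") it.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"app_requests_total 42\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "scrape": Prometheus performs this pull on every scrape_interval.
url = f"http://127.0.0.1:{server.server_port}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
server.shutdown()
print(scraped)
```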

Run Prometheus on Kubernetes

Setting Up Prometheus

00:15:57

Lesson Description:

In this lesson, we will set up Prometheus on the Kubernetes cluster. We will be creating:

- A metrics namespace for our environment to live in
- A ClusterRole to give Prometheus access to targets using Service Discovery
- A ConfigMap that will be used to generate the Prometheus config file
- A Prometheus Deployment and Service
- Kube State Metrics to get access to metrics on the Kubernetes API

You can clone the YAML files from GitHub. Create a file called namespaces.yml. This file will be used to create the monitoring namespace.

namespaces.yml:

{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "monitoring",
    "labels": {
      "name": "monitoring"
    }
  }
}
Apply the namespace:
kubectl apply -f namespaces.yml
Create a file called clusterRole.yml. This will be used to set up the cluster roles.

clusterRole.yml:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring
Apply the cluster roles to the Kubernetes cluster:
kubectl apply -f clusterRole.yml
Create config-map.yml. Kubernetes will use this file to manage the prometheus.yml configuration file.

config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s

    scrape_configs:
      - job_name: 'kubernetes-apiservers'

        kubernetes_sd_configs:
        - role: endpoints
        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics


      - job_name: 'kubernetes-pods'

        kubernetes_sd_configs:
        - role: pod

        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-cadvisor'

        scheme: https

        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        kubernetes_sd_configs:
        - role: node

        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'

        kubernetes_sd_configs:
        - role: endpoints

        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
Create the ConfigMap:
kubectl apply -f config-map.yml
Create prometheus-deployment.yml. This file will be used to create the Prometheus deployment, which will include the pods, replica sets, and volumes.

prometheus-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - ""--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf

        - name: prometheus-storage-volume
          emptyDir: {}
Deploy the Prometheus environment:
kubectl apply -f prometheus-deployment.yml
Finally, we will finish off the Prometheus environment by creating a service to make Prometheus publicly accessible. Create prometheus-service.yml.

prometheus-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9090'

spec:
  selector:
    app: prometheus-server
  type: NodePort
  ports:
    - port: 8080
      targetPort: 9090
      nodePort: 8080
Create the service that will make Prometheus publicly accessible:
kubectl apply -f prometheus-service.yml
Create the Kube State Metrics pod to get access to metrics on the Kubernetes API.

kube-state-metrics.yml:
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      name: kube-state-metrics-main
      labels:
        app: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          image: quay.io/coreos/kube-state-metrics:latest
          ports:
          - containerPort: 8080
            name: metrics
Access Prometheus by visiting http://<MASTER_IP>:8080

Configuring Prometheus

00:13:10

Lesson Description:

In this lesson, you will learn about the Prometheus configuration file, how to configure static targets, and how to use service discovery to find Kubernetes endpoints. Below are the contents of the prometheus.yml that was created by the ConfigMap.

prometheus.yml:

global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'kubernetes-apiservers'

    kubernetes_sd_configs:
    - role: endpoints
    scheme: https

    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'

    scheme: https

    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    kubernetes_sd_configs:
    - role: node

    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics


  - job_name: 'kubernetes-pods'

    kubernetes_sd_configs:
    - role: pod

    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

  - job_name: 'kubernetes-cadvisor'

    scheme: https

    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    kubernetes_sd_configs:
    - role: node

    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

  - job_name: 'kubernetes-service-endpoints'

    kubernetes_sd_configs:
    - role: endpoints

    relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
Prometheus Configuration Documentation
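One detail worth unpacking from the configuration above is the `__address__` rewrite: the source labels are joined with `;`, matched against a regex like `([^:]+)(?::\d+)?;(\d+)`, and replaced with `$1:$2`, so the annotation port overrides the discovered port. A Python sketch of the same substitution (Python's re module spells `$1` as `\g<1>`; the sample address is illustrative):

```python
import re

# Prometheus joins source_labels with ';' before applying the regex.
pattern = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

# __address__ joined with the prometheus.io/port annotation:
joined = "10.244.1.5:8080;9102"
result = pattern.sub(r"\g<1>:\g<2>", joined)
print(result)  # 10.244.1.5:9102
```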

Setting Up Grafana

00:04:40

Lesson Description:

In this lesson, you will learn how to deploy a Grafana pod and service to Kubernetes. Create grafana-deployment.yml. This file will be used to create the Grafana deployment. Be sure to change the password. grafana-deployment.yml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
        - image: grafana/grafana:3.1.1
          name: grafana
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: password
          ports:
            - containerPort: 3000
          volumeMounts:
          - name: grafana-persistent-storage
            mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        emptyDir: {}
Deploy Grafana:
kubectl apply -f grafana-deployment.yml
Create grafana-service.yml. This file will be used to make the pod publicly accessible. grafana-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring

spec:
  selector:
    app: grafana
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 8000
Create the Grafana service:
kubectl apply -f grafana-service.yml

NodeExporter

00:05:13

Lesson Description:

Repeat these steps on both your master and worker nodes. Create the Prometheus user:

adduser prometheus
Download Node Exporter:
cd /home/prometheus
curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz"
tar -xvzf node_exporter-0.16.0.linux-amd64.tar.gz
mv node_exporter-0.16.0.linux-amd64 node_exporter
cd node_exporter
chown prometheus:prometheus node_exporter
vi /etc/systemd/system/node_exporter.service
/etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/node_exporter/node_exporter

[Install]
WantedBy=default.target
Reload systemd:
systemctl daemon-reload
Enable the node_exporter service:
systemctl enable node_exporter.service
Start the node_exporter service:
systemctl start node_exporter.service

Expression Browser

00:04:40

Lesson Description:

In this lesson, you will learn how to use the Expression Browser to execute queries, view your Prometheus configuration, and inspect your Prometheus targets.

Container CPU load average:

container_cpu_load_average_10s
Memory usage query:
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
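The memory-usage expression above is plain arithmetic over gauge values; evaluated here in Python with illustrative byte counts (the numbers are made up):

```python
# The memory-usage expression above, with illustrative byte counts.
mem_total   = 8_000_000_000
mem_free    = 2_000_000_000
mem_buffers =   500_000_000
mem_cached  = 1_500_000_000

used_pct = (mem_total - mem_free - mem_buffers - mem_cached) / mem_total * 100
print(used_pct)  # 50.0
```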

Adding a Grafana Dashboard

00:03:54

Lesson Description:

In this lesson, you will import a Grafana dashboard that will be used to visualize metrics imported from the NodeExporter. Below are the links to the dashboard:

- Content Kubernetes Prometheus Env Repository
- Kubernetes Nodes Dashboard

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, without any extra charge or accounts to manage.

01:00:00

Application Monitoring

Instrumenting Applications

00:05:37

Lesson Description:

This lesson discusses how to instrument an application by using a Prometheus client library. Though we will be talking about a NodeJS application, there are client libraries available for a wide variety of programming languages. You can clone the Comic Box App here.

Collecting Metrics from Applications

00:06:08

Lesson Description:

In this lesson, you will deploy a NodeJS application to Kubernetes that will be monitored by Prometheus. GitHub link: https://github.com/linuxacademy/content-kubernetes-prometheus-app

Build a Docker image:

docker build -t rivethead42/comicbox .
Login to Docker Hub:
docker login
Push the image to Docker Hub:
docker push <USERNAME>/comicbox
Create a deployment using the image above:
kubectl apply -f deployment.yml

PromQL

PromQL Basics

00:04:42

Lesson Description:

In this lesson, you will learn the basics of Prometheus' query language—PromQL. This includes queries using the metric name and then filtering it using labels. Return all time series with the metric node_cpu_seconds_total:

node_cpu_seconds_total
Return all time series with the metric node_cpu_seconds_total and the given job and mode labels:
node_cpu_seconds_total{job="node-exporter", mode="idle"}
Return a whole range of time (in this case 5 minutes) for the same vector, making it a range vector:
node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]
Query jobs that end with -exporter:
node_cpu_seconds_total{job=~".*-exporter"}
Query jobs that begin with kube:
container_cpu_load_average_10s{job=~"^kube.*"}
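Prometheus anchors label-matcher regexes at both ends, which is why `.*-exporter` must cover the entire label value and why the `^`/`.*` anchors in the last query are redundant but harmless. A Python sketch of that matching behavior:

```python
import re

# Prometheus anchors label-matcher regexes, i.e. =~"X" behaves like ^(?:X)$.
def label_matches(regex, value):
    return re.fullmatch(regex, value) is not None

print(label_matches(r".*-exporter", "node-exporter"))      # True
print(label_matches(r".*-exporter", "node-exporter-old"))  # False
print(label_matches(r"^kube.*", "kubernetes-cadvisor"))    # True
```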

PromQL Operations and Functions

00:03:32

Lesson Description:

In this lesson, you will learn how to add operations and functions to your PromQL expressions.

Arithmetic binary operators:

- + (addition)
- - (subtraction)
- * (multiplication)
- / (division)
- % (modulo)
- ^ (power/exponentiation)

Comparison binary operators:

- == (equal)
- != (not-equal)
- > (greater-than)
- < (less-than)
- >= (greater-or-equal)
- <= (less-or-equal)

Logical/set binary operators:

- and (intersection)
- or (union)
- unless (complement)

Aggregation operators:

- sum (calculate sum over dimensions)
- min (select minimum over dimensions)
- max (select maximum over dimensions)
- avg (calculate the average over dimensions)
- stddev (calculate population standard deviation over dimensions)
- stdvar (calculate population standard variance over dimensions)
- count (count number of elements in the vector)
- count_values (count number of elements with the same value)
- bottomk (smallest k elements by sample value)
- topk (largest k elements by sample value)
- quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)

Get the total memory in bytes:

node_memory_MemTotal_bytes
Get a sum of the total memory in bytes:
sum(node_memory_MemTotal_bytes)
Get a percentage of total memory used:
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
Using a function with your query:
irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])
Using an operation and a function with your query:
avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]))
Grouping your queries:
avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance)
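irate() computes a per-second instant rate from the last two samples in the range vector. A simplified Python sketch (counter resets are ignored here; real irate() handles them, and the sample values are illustrative):

```python
def irate(samples):
    """Per-second instant rate from the last two (timestamp, value) samples,
    as irate() does over a range vector. Counter resets are ignored here."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)

# node_cpu_seconds_total samples five seconds apart (illustrative values):
samples = [(100, 2000.0), (105, 2004.0)]
print(irate(samples))  # 0.8 idle CPU-seconds per second
```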

Recording Rules

00:07:03

Lesson Description:

Create prometheus-read-rules-map.yml. This file will be used to create a recording rule for Prometheus.

prometheus-read-rules-map.yml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-read-rules-conf
  labels:
    name: prometheus-read-rules-conf
  namespace: monitoring
data:
  node_rules.yml: |-
    groups:
    - name: node_rules
      interval: 10s
      rules:
        - record: instance:node_cpu:avg_rate5m
          expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
        - record: instance:node_memory_usage:percentage
          expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
        - record: instance:root:node_filesystem_usage:percentage
          expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
Apply the recording rule:
kubectl apply -f prometheus-read-rules-map.yml
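The instance:node_cpu:avg_rate5m rule converts an average idle rate into a busy percentage per instance. Evaluated in Python with illustrative idle rates (fraction of each second spent idle, averaged over an instance's CPUs):

```python
# The recorded expression 100 - avg(irate(idle)) by (instance) * 100,
# evaluated per instance with illustrative idle rates.
idle_rate_by_instance = {"master:9100": 0.90, "worker:9100": 0.75}

cpu_busy_pct = {inst: 100 - rate * 100
                for inst, rate in idle_rate_by_instance.items()}
print(cpu_busy_pct)
```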
Update the prometheus-config-map.yml with the recording rules.

prometheus-config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-server-conf
labels:
 name: prometheus-server-conf
namespace: monitoring
data:
prometheus.yml: |-
 global:
   scrape_interval: 5s
   evaluation_interval: 5s

 rule_files:
 - rules/*_rules.yml

 scrape_configs:
   - job_name: 'node-exporter'
     static_configs:
     - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100']

   - job_name: 'kubernetes-apiservers'

     kubernetes_sd_configs:
     - role: endpoints
     scheme: https

     tls_config:
       ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

     relabel_configs:
     - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
       action: keep
       regex: default;kubernetes;https

   - job_name: 'kubernetes-nodes'

     scheme: https

     tls_config:
       ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

     kubernetes_sd_configs:
     - role: node

     relabel_configs:
     - action: labelmap
       regex: __meta_kubernetes_node_label_(.+)
     - target_label: __address__
       replacement: kubernetes.default.svc:443
     - source_labels: [__meta_kubernetes_node_name]
       regex: (.+)
       target_label: __metrics_path__
       replacement: /api/v1/nodes/${1}/proxy/metrics

   - job_name: 'kubernetes-pods'

     kubernetes_sd_configs:
     - role: pod

     relabel_configs:
     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
       action: keep
       regex: true
     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
       action: replace
       target_label: __metrics_path__
       regex: (.+)
     - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
       action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
       replacement: $1:$2
       target_label: __address__
     - action: labelmap
       regex: __meta_kubernetes_pod_label_(.+)
     - source_labels: [__meta_kubernetes_namespace]
       action: replace
       target_label: kubernetes_namespace
     - source_labels: [__meta_kubernetes_pod_name]
       action: replace
       target_label: kubernetes_pod_name

   - job_name: 'kubernetes-cadvisor'

     scheme: https

     tls_config:
       ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

     kubernetes_sd_configs:
     - role: node

     relabel_configs:
     - action: labelmap
       regex: __meta_kubernetes_node_label_(.+)
     - target_label: __address__
       replacement: kubernetes.default.svc:443
     - source_labels: [__meta_kubernetes_node_name]
       regex: (.+)
       target_label: __metrics_path__
       replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

   - job_name: 'kubernetes-service-endpoints'

     kubernetes_sd_configs:
     - role: endpoints

     relabel_configs:
     - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
       action: keep
       regex: true
     - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
       action: replace
       target_label: __scheme__
       regex: (https?)
     - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
       action: replace
       target_label: __metrics_path__
       regex: (.+)
     - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
       action: replace
       target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
       replacement: $1:$2
     - action: labelmap
       regex: __meta_kubernetes_service_label_(.+)
     - source_labels: [__meta_kubernetes_namespace]
       action: replace
       target_label: kubernetes_namespace
     - source_labels: [__meta_kubernetes_service_name]
       action: replace
       target_label: kubernetes_name
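Several jobs above rewrite `__address__` by joining it with the `prometheus.io/port` annotation. Note that the pattern needs escaped digit classes, `([^:]+)(?::\d+)?;(\d+)`, which are easy to lose when copying configs. A small Python sketch of what that relabeling does, with a hypothetical pod address:

```python
import re

# Source labels are concatenated with ';' before the regex is applied;
# the replacement \1:\2 swaps the discovered port for the annotated one.
PATTERN = re.compile(r'([^:]+)(?::\d+)?;(\d+)')

def relabel_address(address: str, port_annotation: str) -> str:
    """Mimic the __address__ rewrite for one 'address;port' pair."""
    return PATTERN.sub(r'\1:\2', f'{address};{port_annotation}')

# A pod discovered on 8080 but annotated to be scraped on 9100:
print(relabel_address('10.32.0.5:8080', '9100'))  # 10.32.0.5:9100
```

The optional `(?::\d+)?` group is what lets the rule work whether or not service discovery supplied a port in the first place.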
Apply the updated configuration file:
kubectl apply -f prometheus-config-map.yml
Add a new volume for the recording rules to prometheus-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
            - name: prometheus-read-rules-volume
              mountPath: /etc/prometheus/rules
        - name: watch
          image: weaveworks/watch:master-5b2a6e5
          imagePullPolicy: IfNotPresent
          args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf

        - name: prometheus-read-rules-volume
          configMap:
            defaultMode: 420
            name: prometheus-read-rules-conf

        - name: prometheus-storage-volume
          emptyDir: {}
Apply the updates to the Prometheus deployment:
kubectl apply -f prometheus-deployment.yml
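The deployment above pairs Prometheus with a `watch` sidecar that re-triggers configuration loading whenever the mounted directories change. A minimal change-detection sketch in Python (content hashing; the real weaveworks/watch image reacts to filesystem events instead):

```python
import hashlib
import pathlib

# Hash every file under a watched directory so that any edit - including
# a ConfigMap update propagating into the volume - changes the digest.
def config_digest(directory: str) -> str:
    """Return a digest covering the names and contents of all files."""
    digest = hashlib.sha256()
    for path in sorted(pathlib.Path(directory).rglob('*')):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

When a change is detected, the sidecar's only job is to POST to http://localhost:9090/-/reload, which works because the Prometheus container is started with `--web.enable-lifecycle`.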

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, without any extra charge or account to manage.

01:00:00

Alerting

Alertmanager

00:12:28

Lesson Description:

In this lesson, you will learn how to set up Alertmanager to work with Prometheus. Below are the files that will be used to complete this task.

Create a Config Map that will be used to set up the Alertmanager config file. alertmanager-configmap.yml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-conf
  labels:
    name: alertmanager-conf
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@linuxacademy.org'
      smtp_require_tls: false
    route:
      receiver: slack_receiver
    receivers:
    - name: slack_receiver
      slack_configs:
      - send_resolved: true
        username: '<SLACK_USER>'
        api_url: '<APP_URL>'
        channel: '#<CHANNEL>'
Create a deployment file that will be used to stand up the Alertmanager deployment. alertmanager-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: prometheus-alertmanager
        image: prom/alertmanager:v0.14.0
        args:
          - --config.file=/etc/config/alertmanager.yml
          - --storage.path=/data
          - --web.external-url=/
        ports:
          - containerPort: 9093
        volumeMounts:
          - mountPath: /etc/config
            name: config-volume
          - mountPath: /data
            name: storage-volume
      - name: prometheus-alertmanager-configmap-reload
        image: jimmidyson/configmap-reload:v0.1
        args:
          - --volume-dir=/etc/config
          - --webhook-url=http://localhost:9093/-/reload
        volumeMounts:
          - mountPath: /etc/config
            name: config-volume
            readOnly: true
      volumes:
        - configMap:
            defaultMode: 420
            name: alertmanager-conf
          name: config-volume
        - emptyDir: {}
          name: storage-volume
Create a service that exposes Alertmanager on a node port. alertmanager-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
  labels:
    app: alertmanager
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9093'
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
  - port: 9093
    targetPort: 9093
    nodePort: 8081
Update the Prometheus config to include the changes to the rules and add the Alertmanager. prometheus-config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-server-conf
labels:
 name: prometheus-server-conf
namespace: monitoring
data:
prometheus.yml: |-
 global:
   scrape_interval: 5s
   evaluation_interval: 5s

 alerting:
   alertmanagers:
   - kubernetes_sd_configs:
     - role: endpoints
     relabel_configs:
     - source_labels: [__meta_kubernetes_service_name]
       regex: alertmanager
       action: keep
     - source_labels: [__meta_kubernetes_namespace]
       regex: monitoring
       action: keep
     - source_labels: [__meta_kubernetes_pod_container_port_number]
       action: keep
       regex: 9093

 rule_files:
   - "/var/prometheus/rules/*_rules.yml"
   - "/var/prometheus/rules/*_alerts.yml"

 scrape_configs:
   - job_name: 'node-exporter'
     static_configs:
     - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100']

   - job_name: 'kubernetes-apiservers'

     kubernetes_sd_configs:
     - role: endpoints
     scheme: https

     tls_config:
       ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

     relabel_configs:
     - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
       action: keep
       regex: default;kubernetes;https

   - job_name: 'kubernetes-nodes'

     scheme: https

     tls_config:
       ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

     kubernetes_sd_configs:
     - role: node

     relabel_configs:
     - action: labelmap
       regex: __meta_kubernetes_node_label_(.+)
     - target_label: __address__
       replacement: kubernetes.default.svc:443
     - source_labels: [__meta_kubernetes_node_name]
       regex: (.+)
       target_label: __metrics_path__
       replacement: /api/v1/nodes/${1}/proxy/metrics

   - job_name: 'kubernetes-pods'

     kubernetes_sd_configs:
     - role: pod

     relabel_configs:
     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
       action: keep
       regex: true
     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
       action: replace
       target_label: __metrics_path__
       regex: (.+)
     - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
       action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
       replacement: $1:$2
       target_label: __address__
     - action: labelmap
       regex: __meta_kubernetes_pod_label_(.+)
     - source_labels: [__meta_kubernetes_namespace]
       action: replace
       target_label: kubernetes_namespace
     - source_labels: [__meta_kubernetes_pod_name]
       action: replace
       target_label: kubernetes_pod_name

   - job_name: 'kubernetes-cadvisor'

     scheme: https

     tls_config:
       ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

     kubernetes_sd_configs:
     - role: node

     relabel_configs:
     - action: labelmap
       regex: __meta_kubernetes_node_label_(.+)
     - target_label: __address__
       replacement: kubernetes.default.svc:443
     - source_labels: [__meta_kubernetes_node_name]
       regex: (.+)
       target_label: __metrics_path__
       replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

   - job_name: 'kubernetes-service-endpoints'

     kubernetes_sd_configs:
     - role: endpoints

     relabel_configs:
     - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
       action: keep
       regex: true
     - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
       action: replace
       target_label: __scheme__
       regex: (https?)
     - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
       action: replace
       target_label: __metrics_path__
       regex: (.+)
     - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
       action: replace
       target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
       replacement: $1:$2
     - action: labelmap
       regex: __meta_kubernetes_service_label_(.+)
     - source_labels: [__meta_kubernetes_namespace]
       action: replace
       target_label: kubernetes_namespace
     - source_labels: [__meta_kubernetes_service_name]
       action: replace
       target_label: kubernetes_name
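The alerting block above discovers Alertmanager by applying three successive `keep` relabels: the service name must be `alertmanager`, the namespace `monitoring`, and the container port `9093`. A small Python sketch of that filtering, with hypothetical discovered targets (Prometheus anchors relabel regexes, hence `fullmatch`):

```python
import re

# Hypothetical stand-ins for the __meta_kubernetes_* labels used above.
KEEP_RULES = [
    ('service_name', re.compile('alertmanager')),
    ('namespace', re.compile('monitoring')),
    ('port_number', re.compile('9093')),
]

def keep_target(labels: dict) -> bool:
    """A target survives only if every 'keep' regex fully matches."""
    return all(rule.fullmatch(str(labels.get(name, '')))
               for name, rule in KEEP_RULES)

discovered = [
    {'service_name': 'alertmanager', 'namespace': 'monitoring', 'port_number': 9093},
    {'service_name': 'prometheus', 'namespace': 'monitoring', 'port_number': 9090},
]
print([t['service_name'] for t in discovered if keep_target(t)])  # ['alertmanager']
```

Because this runs against endpoint discovery rather than a hard-coded address, Prometheus keeps tracking Alertmanager even if its pod is rescheduled.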
Create a Config Map that will be used to manage the recording and alerting rules. prometheus-rules-config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
creationTimestamp: null
name: prometheus-rules-conf
namespace: monitoring
data:
kubernetes_alerts.yml: |
 groups:
   - name: kubernetes_alerts
     rules:
     - alert: DeploymentGenerationOff
       expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
       for: 5m
       labels:
         severity: warning
       annotations:
         description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
         summary: Deployment is outdated
     - alert: DeploymentReplicasNotUpdated
       expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
         or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
         unless (kube_deployment_spec_paused == 1)
       for: 5m
       labels:
         severity: warning
       annotations:
         description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
         summary: Deployment replicas are outdated
     - alert: PodzFrequentlyRestarting
       expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
       for: 10m
       labels:
         severity: warning
       annotations:
         description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
         summary: Pod is restarting frequently
     - alert: KubeNodeNotReady
       expr: kube_node_status_condition{condition="Ready",status="true"} == 0
       for: 1h
       labels:
         severity: warning
       annotations:
         description: The Kubelet on {{ $labels.node }} has not checked in with the API,
           or has set itself to NotReady, for more than an hour
         summary: Node status is NotReady
     - alert: KubeManyNodezNotReady
       expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
         > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
         0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
       for: 1m
       labels:
         severity: critical
       annotations:
         description: '{{ $value }}% of Kubernetes nodes are not ready'
     - alert: APIHighLatency
       expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
       for: 10m
       labels:
         severity: critical
       annotations:
         description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
     - alert: APIServerErrorsHigh
       expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
       for: 10m
       labels:
         severity: critical
       annotations:
         description: API server returns errors for {{ $value }}% of requests
     - alert: KubernetesAPIServerDown
       expr: up{job="kubernetes-apiservers"} == 0
       for: 10m
       labels:
         severity: critical
       annotations:
         summary: Apiserver {{ $labels.instance }} is down!
     - alert: KubernetesAPIServersGone
       expr:  absent(up{job="kubernetes-apiservers"})
       for: 10m
       labels:
         severity: critical
       annotations:
         summary: No Kubernetes apiservers are reporting!
         description: Werner Heisenberg says - OMG Where are my apiserverz?
prometheus_alerts.yml: |
 groups:
 - name: prometheus_alerts
   rules:
   - alert: PrometheusConfigReloadFailed
     expr: prometheus_config_last_reload_successful == 0
     for: 10m
     labels:
       severity: warning
     annotations:
       description: Reloading Prometheus configuration has failed on {{$labels.instance}}.
   - alert: PrometheusNotConnectedToAlertmanagers
     expr: prometheus_notifications_alertmanagers_discovered < 1
     for: 1m
     labels:
       severity: warning
     annotations:
       description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers
node_alerts.yml: |
 groups:
 - name: node_alerts
   rules:
   - alert: HighNodeCPU
     expr: instance:node_cpu:avg_rate5m > 80
     for: 10s
     labels:
       severity: warning
     annotations:
       summary: High Node CPU of {{ humanize $value}}% for 1 hour
   - alert: DiskWillFillIn4Hours
     expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
     for: 5m
     labels:
       severity: critical
     annotations:
       summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
   - alert: KubernetesServiceDown
     expr: up{job="kubernetes-service-endpoints"} == 0
     for: 10m
     labels:
       severity: critical
     annotations:
       summary: Pod {{ $labels.instance }} is down!
   - alert: KubernetesServicesGone
     expr:  absent(up{job="kubernetes-service-endpoints"})
     for: 10m
     labels:
       severity: critical
     annotations:
       summary: No Kubernetes services are reporting!
       description: Werner Heisenberg says - OMG Where are my servicez?
   - alert: CriticalServiceDown
     expr: node_systemd_unit_state{state="active"} != 1
     for: 2m
     labels:
       severity: critical
     annotations:
       summary: Service {{ $labels.name }} failed to start.
       description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
redis_alerts.yml: |
 groups:
 - name: redis_alerts
   rules:
   - alert: RedisCacheMissesHigh
     expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
     for: 10m
     labels:
       severity: warning
     annotations:
       summary: Redis Server {{ $labels.instance }} Cache Misses are high.
   - alert: RedisRejectedConnectionsHigh
     expr: redis_connected_clients{} > 100
     for: 10m
     labels:
       severity: warning
     annotations:
       summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit."
       description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit."
   - alert: RedisServerDown
     expr: redis_up{app="media-redis"} == 0
     for: 10m
     labels:
       severity: critical
     annotations:
       summary: Redis Server {{ $labels.instance }} is down!
   - alert: RedisServerGone
     expr:  absent(redis_up{app="media-redis"})
     for: 1m
     labels:
       severity: critical
     annotations:
       summary: No Redis servers are reporting!
       description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone.
kubernetes_rules.yml: |
 groups:
   - name: kubernetes_rules
     rules:
     - record: apiserver_latency_seconds:quantile
       expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
       labels:
         quantile: "0.99"
     - record: apiserver_latency_seconds:quantile
       expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
       labels:
         quantile: "0.9"
     - record: apiserver_latency_seconds:quantile
       expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
       labels:
         quantile: "0.5"
node_rules.yml: |
 groups:
 - name: node_rules
   rules:
     - record: instance:node_cpu:avg_rate5m
       expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
     - record: instance:node_memory_usage:percentage
       expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
     - record: instance:root:node_filesystem_usage:percentage
        expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
redis_rules.yml: |
 groups:
 - name: redis_rules
   rules:
   - record: redis:command_call_duration_seconds_count:rate2m
     expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment)
   - record: redis:total_requests:rate2m
     expr: rate(redis_commands_processed_total[2m])
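The DiskWillFillIn4Hours alert above relies on `predict_linear()`, which fits a least-squares line through the samples in the range vector and extrapolates it forward. A minimal Python sketch of that behavior, with hypothetical samples where free space falls by 1 MiB every minute:

```python
# samples: list of (timestamp_seconds, value) pairs from the range vector.
def predict_linear(samples, t_ahead):
    """Least-squares fit over the samples, extrapolated t_ahead seconds."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + t_ahead)

MIB = 2 ** 20
# One hour of samples: 240 MiB free, shrinking 1 MiB per minute.
samples = [(60 * i, (240 - i) * MIB) for i in range(61)]
print(predict_linear(samples, 4 * 3600) < 0)  # True: disk fills within 4h
```

The alert fires when the predicted free bytes four hours out drop below zero, which gives operators lead time instead of alerting only once the disk is already full.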
Update the volumes used by the Prometheus deployment. prometheus-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-rules-volume
              mountPath: /var/prometheus/rules
            - name: prometheus-storage-volume
              mountPath: /prometheus/
        - name: watch
          image: weaveworks/watch:master-5b2a6e5
          imagePullPolicy: IfNotPresent
          args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus
            - name: prometheus-rules-volume
              mountPath: /var/prometheus/rules
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-rules-volume
          configMap:
            name: prometheus-rules-conf
        - name: prometheus-storage-volume
          emptyDir: {}

Alerting Rules

00:06:49

Lesson Description:

In this lesson, you will learn how to create alerting rules that will be used to send alerts to Alertmanager. Below are the rules that were created in the previous lesson:

apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
      - name: kubernetes_alerts
        rules:
        - alert: DeploymentGenerationOff
          expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment is outdated
        - alert: DeploymentReplicasNotUpdated
          expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
            or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
            unless (kube_deployment_spec_paused == 1)
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
            summary: Deployment replicas are outdated
        - alert: PodzFrequentlyRestarting
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
            summary: Pod is restarting frequently
        - alert: KubeNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 1h
          labels:
            severity: warning
          annotations:
            description: The Kubelet on {{ $labels.node }} has not checked in with the API,
              or has set itself to NotReady, for more than an hour
            summary: Node status is NotReady
        - alert: KubeManyNodezNotReady
          expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
            > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
            0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
          for: 1m
          labels:
            severity: critical
          annotations:
            description: '{{ $value }}% of Kubernetes nodes are not ready'
        - alert: APIHighLatency
          expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
          for: 10m
          labels:
            severity: critical
          annotations:
            description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
        - alert: APIServerErrorsHigh
          expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
          for: 10m
          labels:
            severity: critical
          annotations:
            description: API server returns errors for {{ $value }}% of requests
        - alert: KubernetesAPIServerDown
          expr: up{job="kubernetes-apiservers"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: Apiserver {{ $labels.instance }} is down!
        - alert: KubernetesAPIServersGone
          expr:  absent(up{job="kubernetes-apiservers"})
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: No Kubernetes apiservers are reporting!
            description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{$labels.instance}}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value}}% for 1 hour
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Pod {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr:  absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  redis_alerts.yml: |
    groups:
    - name: redis_alerts
      rules:
      - alert: RedisCacheMissesHigh
        expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Redis Server {{ $labels.instance }} Cache Misses are high.
      - alert: RedisRejectedConnectionsHigh
        expr: redis_connected_clients{} > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit."
          description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit."
      - alert: RedisServerDown
        expr: redis_up{app="media-redis"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Redis Server {{ $labels.instance }} is down!
      - alert: RedisServerGone
        expr:  absent(redis_up{app="media-redis"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: No Redis servers are reporting!
          description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone.
  kubernetes_rules.yml: |
    groups:
      - name: kubernetes_rules
        rules:
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.99"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.9"
        - record: apiserver_latency_seconds:quantile
          expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
          labels:
            quantile: "0.5"
  node_rules.yml: |
    groups:
    - name: node_rules
      rules:
        - record: instance:node_cpu:avg_rate5m
          expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
        - record: instance:node_memory_usage:percentage
          expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
        - record: instance:root:node_filesystem_usage:percentage
          expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
  redis_rules.yml: |
    groups:
    - name: redis_rules
      rules:
      - record: redis:command_call_duration_seconds_count:rate2m
        expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment)
      - record: redis:total_requests:rate2m
        expr: rate(redis_commands_processed_total[2m])
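Each alert above only fires after its expression has stayed true for the full `for:` duration; until then it sits in a pending state and sends nothing to Alertmanager. A minimal sketch of that state logic (not Prometheus's actual implementation), using hypothetical timestamps:

```python
# active_since: timestamp when the expression first became (and stayed)
# true, or None if it is currently false.
def alert_state(active_since, now, for_seconds):
    """Return the rule's state: inactive, pending, or firing."""
    if active_since is None:
        return 'inactive'
    return 'firing' if now - active_since >= for_seconds else 'pending'

# HighNodeCPU uses `for: 10s`: true for only 5s is still pending.
print(alert_state(0, 5, 10))    # pending
print(alert_state(0, 10, 10))   # firing
print(alert_state(None, 10, 10))  # inactive
```

This is why the `for:` values above vary so widely: a 10s window on CPU catches short spikes, while KubeNodeNotReady waits a full hour to avoid paging on transient kubelet hiccups.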


01:00:00

Conclusion

Final Steps

Next Steps

00:01:05

Lesson Description:

Not sure what to take next? Maybe these courses will pique your interest.
