
Monitoring Kubernetes With Prometheus

Course

Intro Video

Photo of Travis Thomsen

Travis Thomsen

Course Development Director in Content

I have over 17 years of experience in all phases of the software development life cycle, which includes software analysis, design, development, testing, implementation, debugging, maintenance and documentation. I am passionate about learning new technologies, methodologies, languages and automation.

Length

05:00:00

Difficulty

Intermediate

Videos

24

Hands-on Labs

3

Course Details

Are you interested in deploying Prometheus to Kubernetes? If so, this is the course for you. This course covers the basics of Prometheus, including its architecture and components, such as exporters, client libraries, and alerting. From there, you will learn how to deploy Prometheus to Kubernetes and configure it to monitor the cluster as well as applications deployed to it. You will also learn the basics of PromQL, including the syntax, functions, and creating recording rules. Finally, the course closes out with the Alertmanager and creating alerting rules.

Download the interactive diagrams here:

* https://interactive.linuxacademy.com/diagrams/MonitoringKubernetswithPrometheus.html
* https://interactive.linuxacademy.com/diagrams/ApplicationMetrics.html
* https://interactive.linuxacademy.com/diagrams/ExporterMetrics.html
* https://interactive.linuxacademy.com/diagrams/NodeExporter.html

Syllabus

Introduction

About This Course

00:01:57

Lesson Description:

This video will go over the highlights of this course:

* Prometheus Architecture
* Run Prometheus on Kubernetes
* Application Monitoring
* PromQL
* Alerting

I will also discuss the prerequisites for this course.

About the Instructor

00:00:55

Lesson Description:

Before we get started on the course, let's learn a little about who is teaching it!

What is Prometheus?

00:01:39

Lesson Description:

Before we jump into the technical details of this course, we will take a five-thousand-foot view of what Prometheus is.

Setting Up Your Environment

Using Cloud Playground

00:06:16

Lesson Description:

In this video, you will learn how to use Cloud Playground to create the Cloud Servers you will need to complete this course. You will also be shown how to use the web terminal as an alternative to using SSH.

Setting Up a Kubernetes Cluster

00:07:57

Lesson Description:

In this lesson, you will set up your Kubernetes cluster. We will start by installing the master node.

#### Setting up the Kubernetes Master

The following actions will be executed on the Kubernetes master.

1. Disable swap:

```
swapoff -a
```

2. Edit `/etc/fstab`:

```
vi /etc/fstab
```

3. Comment out swap:

```
#/root/swap swap swap sw 0 0
```

4. Add the Kubernetes repo:

```
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF
```

5. Disable SELinux:

```
setenforce 0
```

6. Permanently disable SELinux:

```
vi /etc/selinux/config
```

7. Change `enforcing` to `disabled`:

```
SELINUX=disabled
```

8. Install Kubernetes 1.11.3:

```
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes
```

9. Start and enable the kubelet service:

```
systemctl start kubelet && systemctl enable kubelet
```

10. Create the `k8s.conf` file and load it:

```
cat << EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
```

11. Create `kube-config.yml`:

```
vi kube-config.yml
```

12. Add the following to `kube-config.yml`:

```
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: "v1.11.3"
networking:
  podSubnet: 10.244.0.0/16
apiServerExtraArgs:
  service-node-port-range: 8000-31274
```

13. Initialize Kubernetes:

```
kubeadm init --config kube-config.yml
```

14. Copy `admin.conf` to your home directory:

```
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config
```

15. Install flannel:

```
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
```

16. Patch flannel:

```
vi /etc/kubernetes/manifests/kube-controller-manager.yaml
```

Add the following to `kube-controller-manager.yaml`:

```
--allocate-node-cidrs=true
--cluster-cidr=10.244.0.0/16
```

Then reload kubelet:

```
systemctl restart kubelet
```

#### Setting up the Kubernetes Worker

Now that the setup for the Kubernetes master is complete, we will configure the worker node. The following actions will be executed on the Kubernetes worker.

1. Repeat steps 1 through 10 from the master setup above (disable swap, add the Kubernetes repo, disable SELinux, install Kubernetes 1.11.3, start and enable kubelet, and create the `k8s.conf` file).

2. Use the join token to add the worker node to the cluster:

```
kubeadm join < MASTER_IP >:6443 --token < TOKEN > --discovery-token-ca-cert-hash sha256:< HASH >
```

3. On the master node, verify the cluster was created properly by getting a listing of the nodes:

```
kubectl get nodes
```

Prometheus Architecture

Prometheus Architecture Diagram

00:02:57

Lesson Description:

In this lesson, we will review the Prometheus Architecture Diagram and go over the various components.

Client Libraries

00:01:21

Lesson Description:

You use client libraries and instrumentation to gather metrics for Prometheus to scrape. Prometheus scrapes your application's HTTP endpoint, and the client library sends the current state of all tracked metrics to the Prometheus server. You can develop your own client library if one doesn't exist for your language.

This is the code used to instrument the app using the NodeJS library `prom-client`:

```
var Register = require('prom-client').register;
var Counter = require('prom-client').Counter;
var Histogram = require('prom-client').Histogram;
var Summary = require('prom-client').Summary;
var ResponseTime = require('response-time');

module.exports.totalNumOfRequests = totalNumOfRequests = new Counter({
  name: 'totalNumOfRequests',
  help: 'Total number of requests made',
  labelNames: ['method']
});

module.exports.pathsTaken = pathsTaken = new Counter({
  name: 'pathsTaken',
  help: 'Paths taken in the app',
  labelNames: ['path']
});

module.exports.responses = responses = new Summary({
  name: 'responses',
  help: 'Response time in millis',
  labelNames: ['method', 'path', 'status']
});

module.exports.startCollection = function () {
  require('prom-client').collectDefaultMetrics();
};

module.exports.requestCounters = function (req, res, next) {
  if (req.path != '/metrics') {
    totalNumOfRequests.inc({ method: req.method });
    pathsTaken.inc({ path: req.path });
  }
  next();
}

module.exports.responseCounters = ResponseTime(function (req, res, time) {
  if (req.url != '/metrics') {
    responses.labels(req.method, req.url, res.statusCode).observe(time);
  }
})

module.exports.injectMetricsRoute = function (App) {
  App.get('/metrics', (req, res) => {
    res.set('Content-Type', Register.contentType);
    res.end(Register.metrics());
  });
};
```

Prometheus supported libraries:

* Go
* Java or Scala
* Python
* Ruby

Third-party libraries:

* Bash
* C++
* Common Lisp
* Elixir
* Erlang
* Haskell
* Lua for Nginx
* Lua for Tarantool
* .NET / C#
* Node.js
* Perl
* PHP
* Rust
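To make the idea concrete, here is a minimal sketch (in Python, for illustration only; it is not an official client library) of what a counter metric looks like once a client library renders it in the plain-text exposition format that Prometheus scrapes from `/metrics`. The metric name and labels mirror the `totalNumOfRequests` counter above.

```python
# Minimal sketch of a counter and the Prometheus text exposition format.
# Illustrative only -- real applications should use an official client
# library such as prom-client (Node.js) or the Python client.

class Counter:
    """A cumulative metric: its value only ever goes up."""
    def __init__(self, name, help_text, label_names):
        self.name, self.help_text = name, help_text
        self.label_names = label_names
        self.values = {}  # tuple of label values -> current count

    def inc(self, *label_values, amount=1):
        self.values[label_values] = self.values.get(label_values, 0) + amount

    def render(self):
        """Render in the text format Prometheus scrapes."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for label_values, value in self.values.items():
            labels = ",".join(f'{n}="{v}"' for n, v in
                              zip(self.label_names, label_values))
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines)

requests = Counter("totalNumOfRequests",
                   "Total number of requests made", ["method"])
requests.inc("GET")
requests.inc("GET")
requests.inc("POST")
print(requests.render())
```

The rendered output contains one line per label combination, e.g. `totalNumOfRequests{method="GET"} 2`, preceded by the `# HELP` and `# TYPE` comment lines.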

Exporters

00:01:41

Lesson Description:

Exporters are pieces of software deployed next to the application you want to collect metrics from. Instrumentation for exporters is known as **custom collectors** or **ConstMetrics**.

How exporters work:

* Take a request
* Gather the data
* Format the data
* Return the data to Prometheus

Databases:

* Consul exporter
* Memcached exporter
* MySQL server exporter

Hardware:

* Node/system metrics exporter

HTTP:

* HAProxy exporter

Other monitoring systems:

* AWS CloudWatch exporter
* Collectd exporter
* Graphite exporter
* InfluxDB exporter
* JMX exporter
* SNMP exporter
* StatsD exporter

Miscellaneous:

* Blackbox exporter
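The request/gather/format/return cycle above can be sketched in a few lines. This is a hypothetical stand-in (the stats source and metric names are made up, not a real Memcached or MySQL client), shown only to illustrate that an exporter gathers fresh data at scrape time and formats it into exposition text.

```python
# Sketch of the exporter request cycle: on each scrape, gather data from
# the target system, format it as Prometheus text, and return it.

def gather_stats():
    # In a real exporter this would query the target system at scrape
    # time (e.g. a database status command). Values here are made up.
    return {"connections": 12, "queries_total": 3450}

def format_metrics(stats, prefix="myapp"):
    """Format a dict of readings as Prometheus exposition text."""
    lines = []
    for key, value in stats.items():
        lines.append(f"# TYPE {prefix}_{key} gauge")
        lines.append(f"{prefix}_{key} {value}")
    return "\n".join(lines) + "\n"

def handle_scrape():
    """What runs when Prometheus hits the exporter's /metrics endpoint."""
    return format_metrics(gather_stats())

print(handle_scrape())
```

Because the gather step runs per request, the data Prometheus receives is always current at scrape time rather than cached.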

Service Discovery

00:02:42

Lesson Description:

In this lesson, you will learn about Service Discovery. Service Discovery is a way for Prometheus to find targets without having to statically configure them.

Scraping

00:01:26

Lesson Description:

In this lesson, you will learn the difference between push and pull monitoring systems. We will also discuss how Prometheus defines scraping.

Run Prometheus on Kubernetes

Setting Up Prometheus

00:15:57

Lesson Description:

In this lesson, we will set up Prometheus on the Kubernetes cluster. We will be creating:

* A metrics namespace for our environment to live in
* A ClusterRole to give Prometheus access to targets using Service Discovery
* A ConfigMap that will be used to generate the Prometheus config file
* A Prometheus Deployment and Service
* Kube State Metrics to get access to metrics on the Kubernetes API

You can clone the YAML files from [GitHub](https://github.com/linuxacademy/content-kubernetes-prometheus-env).

Create a file called `namespaces.yml`. This file will be used to create the monitoring namespace.

*namespaces.yml*:

```
{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "monitoring",
    "labels": {
      "name": "monitoring"
    }
  }
}
```

Apply the namespace:

```
kubectl apply -f namespaces.yml
```

Create a file called `clusterRole.yml`. This will be used to set up the cluster roles so Prometheus can access metrics using Service Discovery.

*clusterRole.yml*:

```
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring
```

Apply the cluster roles to the Kubernetes cluster:

```
kubectl apply -f clusterRole.yml
```

Create `config-map.yml`. Kubernetes will use this file to manage the `prometheus.yml` configuration file.

*config-map.yml*:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s

    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
```

Create the ConfigMap:

```
kubectl apply -f config-map.yml
```

Create `prometheus-deployment.yml`. This file will be used to create the Prometheus deployment, which will include the pods, replica sets, and volumes.

*prometheus-deployment.yml*:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus/"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
      volumes:
      - name: prometheus-config-volume
        configMap:
          defaultMode: 420
          name: prometheus-server-conf
      - name: prometheus-storage-volume
        emptyDir: {}
```

Deploy the Prometheus environment:

```
kubectl apply -f prometheus-deployment.yml
```

Finally, we will finish off the Prometheus environment by creating a service to make Prometheus publicly accessible. Create `prometheus-service.yml`.

*prometheus-service.yml*:

```
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9090'
spec:
  selector:
    app: prometheus-server
  type: NodePort
  ports:
  - port: 8080
    targetPort: 9090
    nodePort: 8080
```

Create the service that will make Prometheus publicly accessible:

```
kubectl apply -f prometheus-service.yml
```

Create the Kube State Metrics pod to get access to metrics on the Kubernetes API:

*kube-state-metrics.yml*:

```
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      name: kube-state-metrics-main
      labels:
        app: kube-state-metrics
    spec:
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:latest
        ports:
        - containerPort: 8080
          name: metrics
```

Access Prometheus by visiting `http://< MASTER_IP >:8080`.

Configuring Prometheus

00:13:10

Lesson Description:

In this lesson, you will learn about the Prometheus configuration file, how to configure static targets, and how to use service discovery to find Kubernetes endpoints. Below is the content of `prometheus.yml` that was created by the ConfigMap.

*prometheus.yml*:

```
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
    - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
      action: keep
      regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_pod_name]
      action: replace
      target_label: kubernetes_pod_name

  - job_name: 'kubernetes-cadvisor'
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
    - role: node
    relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

  - job_name: 'kubernetes-service-endpoints'
    kubernetes_sd_configs:
    - role: endpoints
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - action: labelmap
      regex: __meta_kubernetes_service_label_(.+)
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
    - source_labels: [__meta_kubernetes_service_name]
      action: replace
      target_label: kubernetes_name
```

[Prometheus Configuration Documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
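The `relabel_configs` in this file rely on two actions: `keep` drops a target unless the joined source-label values match the (anchored) regex, and `replace` writes a regex capture into a target label. The sketch below models that semantics in Python with a made-up pod target; it is an illustration of the behavior, not Prometheus source code.

```python
# Sketch of Prometheus relabeling semantics for the `keep` and `replace`
# actions used in the scrape config above. Illustrative only.
import re

def relabel(labels, rule):
    """Apply one relabel rule; return the updated labels, or None if dropped."""
    value = ";".join(labels.get(l, "") for l in rule.get("source_labels", []))
    m = re.fullmatch(rule.get("regex", "(.*)"), value)  # regex is anchored
    action = rule.get("action", "replace")
    if action == "keep":
        return labels if m else None        # drop the target unless it matches
    if action == "replace" and m:
        out = dict(labels)
        # Prometheus writes $1, $2 for capture groups; re's expand wants \1, \2
        out[rule["target_label"]] = m.expand(rule["replacement"].replace("$", "\\"))
        return out
    return labels

pod = {"__address__": "10.244.1.7",
       "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
       "__meta_kubernetes_pod_annotation_prometheus_io_port": "8080"}

# keep: only scrape pods annotated prometheus.io/scrape=true
kept = relabel(pod, {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
                     "action": "keep", "regex": "true"})

# replace: rewrite __address__ to host:annotated-port
readdressed = relabel(kept, {
    "source_labels": ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    "action": "replace", "regex": r"([^:]+)(?::\d+)?;(\d+)",
    "replacement": "$1:$2", "target_label": "__address__"})
print(readdressed["__address__"])   # 10.244.1.7:8080
```

A pod without the `prometheus.io/scrape: "true"` annotation fails the `keep` rule and is dropped before it is ever scraped.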

Setting Up Grafana

00:04:40

Lesson Description:

In this lesson, you will learn how to deploy a Grafana pod and service to Kubernetes.

Create `grafana-deployment.yml`. This file will be used to create the Grafana deployment. Be sure to change the password.

*grafana-deployment.yml*:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
      - image: grafana/grafana:3.1.1
        name: grafana
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: password
        ports:
        - containerPort: 3000
        volumeMounts:
        - name: grafana-persistent-storage
          mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        emptyDir: {}
```

Deploy Grafana:

```
kubectl apply -f grafana-deployment.yml
```

Create `grafana-service.yml`. This file will be used to make the pod publicly accessible.

*grafana-service.yml*:

```
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 8000
```

Create the Grafana service:

```
kubectl apply -f grafana-service.yml
```

NodeExporter

00:05:13

Lesson Description:

Repeat these steps on both your master and worker nodes.

Create the Prometheus user:

```
adduser prometheus
```

Download Node Exporter:

```
cd /home/prometheus
curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz"
tar -xvzf node_exporter-0.16.0.linux-amd64.tar.gz
mv node_exporter-0.16.0.linux-amd64 node_exporter
cd node_exporter
chown prometheus:prometheus node_exporter
```

Create the systemd unit file:

```
vi /etc/systemd/system/node_exporter.service
```

*/etc/systemd/system/node_exporter.service*:

```
[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/node_exporter/node_exporter

[Install]
WantedBy=default.target
```

Reload systemd:

```
systemctl daemon-reload
```

Enable the node_exporter service:

```
systemctl enable node_exporter.service
```

Start the node_exporter service:

```
systemctl start node_exporter.service
```

Expression Browser

00:04:40

Lesson Description:

In this lesson, you will learn how to use the Expression browser to execute queries, view your Prometheus configuration, and list Prometheus targets.

Container CPU load average:

```
container_cpu_load_average_10s
```

Memory usage query:

```
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
```
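The memory usage query above is plain arithmetic over gauge values: used memory is total minus free, buffers, and cached, expressed as a percentage of total. With made-up sample values (standing in for the `node_memory_*` gauges on an 8 GiB node), the calculation looks like this:

```python
# The memory-usage PromQL above reduces to this arithmetic. The byte
# counts are made-up sample values for an 8 GiB node.
mem_total   = 8 * 1024**3   # node_memory_MemTotal_bytes
mem_free    = 2 * 1024**3   # node_memory_MemFree_bytes
mem_buffers = 1 * 1024**3   # node_memory_Buffers_bytes
mem_cached  = 1 * 1024**3   # node_memory_Cached_bytes

used_pct = (mem_total - mem_free - mem_buffers - mem_cached) / mem_total * 100
print(used_pct)   # 50.0
```

Buffers and cache are subtracted because the kernel can reclaim them on demand, so counting them as "used" would overstate memory pressure.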

Adding a Grafana Dashboard

00:03:54

Lesson Description:

In this lesson, you will import a Grafana Dashboard that will be used to visualize metrics imported from the NodeExporter. Below are the links to the dashboard.[Content Kubernetes Prometheus Env Repository](https://github.com/linuxacademy/content-kubernetes-prometheus-env)[Kubernetes Nodes Dashboard](https://github.com/linuxacademy/content-kubernetes-prometheus-env/blob/master/grafana/dashboard/Kubernetes%20All%20Nodes.json)

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, at no extra charge and with no account to manage.

01:00:00

Application Monitoring

Instrumenting Applications

00:05:37

Lesson Description:

This lesson discusses how to instrument an application by using a Prometheus client library. Though we will be talking about a NodeJS application, there are client libraries available for a wide variety of programming languages.You can clone the [Comic Box App here](https://github.com/linuxacademy/content-kubernetes-prometheus-app).

Collecting Metrics from Applications

00:06:08

Lesson Description:

In this lesson, you will deploy a NodeJS application to Kubernetes that will be monitored by Prometheus.

GitHub link: https://github.com/linuxacademy/content-kubernetes-prometheus-app

Build a Docker image:

```
docker build -t rivethead42/comicbox .
```

Log in to Docker Hub:

```
docker login
```

Push the image to Docker Hub:

```
docker push < USERNAME >/comicbox
```

Create a deployment using the image above:

```
kubectl apply -f deployment.yml
```

PromQL

PromQL Basics

00:04:42

Lesson Description:

In this lesson, you will learn the basics of Prometheus' query language, PromQL. This includes queries using the metric name and then filtering them by labels.

Return all time series with the metric `node_cpu_seconds_total`:

```
node_cpu_seconds_total
```

Return all time series with the metric `node_cpu_seconds_total` and the given `job` and `mode` labels:

```
node_cpu_seconds_total{job="node-exporter", mode="idle"}
```

Return a whole range of time (in this case 5 minutes) for the same vector, making it a range vector:

```
node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]
```

Query jobs whose names end with `-exporter`:

```
node_cpu_seconds_total{job=~".*-exporter"}
```

Query jobs whose names begin with `kube`:

```
container_cpu_load_average_10s{job=~"^kube.*"}
```
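The selectors above can be modeled as filters over a set of labeled series: `=` is an exact match, and `=~` is a regex match that PromQL anchors to the full label value. This sketch uses made-up sample series to show both matcher kinds; it illustrates the matching semantics only, not the Prometheus implementation.

```python
# Sketch of PromQL label-matcher semantics: `=` is exact, `=~` is a
# fully anchored regex. Sample series are made up for illustration.
import re

series = [
    {"__name__": "node_cpu_seconds_total", "job": "node-exporter", "mode": "idle"},
    {"__name__": "node_cpu_seconds_total", "job": "node-exporter", "mode": "user"},
    {"__name__": "node_cpu_seconds_total", "job": "kube-state-metrics", "mode": "idle"},
]

def select(series, name, matchers):
    """Return series matching the metric name and every label matcher."""
    out = []
    for s in series:
        if s["__name__"] != name:
            continue
        if all(re.fullmatch(rx, s.get(label, "")) if is_regex
               else s.get(label) == rx
               for (label, is_regex, rx) in matchers):
            out.append(s)
    return out

# node_cpu_seconds_total{job="node-exporter", mode="idle"}
exact = select(series, "node_cpu_seconds_total",
               [("job", False, "node-exporter"), ("mode", False, "idle")])

# node_cpu_seconds_total{job=~".*-exporter"}
regex = select(series, "node_cpu_seconds_total", [("job", True, ".*-exporter")])

print(len(exact), len(regex))   # 1 2
```

Because the regex is anchored, `job=~".*-exporter"` only matches jobs that end in `-exporter`; `kube-state-metrics` is excluded even though it contains a hyphen.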

PromQL Operations and Functions

00:03:32

Lesson Description:

In this lesson, you will learn how to add operations and functions to your PromQL expressions.

Arithmetic binary operators:

* `+` (addition)
* `-` (subtraction)
* `*` (multiplication)
* `/` (division)
* `%` (modulo)
* `^` (power/exponentiation)

Comparison binary operators:

* `==` (equal)
* `!=` (not-equal)
* `>` (greater-than)
* `<` (less-than)
* `>=` (greater-or-equal)
* `<=` (less-or-equal)

Recording Rules

00:07:03

Lesson Description:

1. Create `prometheus-read-rules-map.yml`. This file will be used to create a recording rule for Prometheus.

*prometheus-read-rules-map.yml*:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-read-rules-conf
  labels:
    name: prometheus-read-rules-conf
  namespace: monitoring
data:
  node_rules.yml: |-
    groups:
    - name: node_rules
      interval: 10s
      rules:
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
      - record: instance:node_memory_usage:percentage
        expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
      - record: instance:root:node_filesystem_usage:percentage
        expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
```

2. Apply the recording rule:

```
kubectl apply -f prometheus-read-rules-map.yml
```

3. Update `prometheus-config-map.yml` with the rule files.

*prometheus-config-map.yml*:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s

    rule_files:
      - rules/*_rules.yml

    scrape_configs:
      - job_name: 'node-exporter'
        static_configs:
        - targets: [':9100', ':9100']

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name

      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
```

4. Apply the updated configuration file:

```
kubectl apply -f prometheus-config-map.yml
```

5. Add a new volume for the recording rules to `prometheus-deployment.yml`:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus/"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
        - name: prometheus-read-rules-volume
          mountPath: /etc/prometheus/rules
      - name: watch
        image: weaveworks/watch:master-5b2a6e5
        imagePullPolicy: IfNotPresent
        args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus
      volumes:
      - name: prometheus-config-volume
        configMap:
          defaultMode: 420
          name: prometheus-server-conf
      - name: prometheus-read-rules-volume
        configMap:
          defaultMode: 420
          name: prometheus-read-rules-conf
      - name: prometheus-storage-volume
        emptyDir: {}
```

6. Apply the updates to the Prometheus deployment:

```
kubectl apply -f prometheus-deployment.yml
```
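Conceptually, a recording rule just evaluates its `expr` on every evaluation interval and stores the result as a new series named by `record`. This sketch mirrors the `instance:node_memory_usage:percentage` rule above with made-up sample values; it illustrates the mechanism, not Prometheus internals.

```python
# Sketch of what a recording rule does: each evaluation interval,
# evaluate `expr` and store the result under the `record` name.
# Sample byte counts are made up.

rules = [{
    "record": "instance:node_memory_usage:percentage",
    "expr": lambda m: (m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"])
                      / m["MemTotal"] * 100,
}]

samples = {"MemTotal": 8e9, "MemFree": 2e9, "Buffers": 1e9, "Cached": 1e9}

storage = {}
for rule in rules:                     # runs once per `interval` (10s above)
    storage[rule["record"]] = rule["expr"](samples)

print(storage["instance:node_memory_usage:percentage"])   # 50.0
```

Queries and dashboards can then read the precomputed series by its `record` name instead of re-evaluating the full expression on every refresh.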

Hands-on Labs are real live environments that put you in a real scenario to practice what you have learned without any other extra charge or account to manage.

01:00:00

Alerting

Alertmanager

00:12:28

Lesson Description:

In this lesson, you will learn how to set up Alertmanager to work with Prometheus. Below are the files that will be used to complete this task:

1. Create a Config Map that will be used to set up the Alertmanager config file. *alertmanager-configmap.yml*:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-conf
  labels:
    name: alertmanager-conf
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@linuxacademy.org'
      smtp_require_tls: false
    route:
      receiver: slack_receiver
    receivers:
    - name: slack_receiver
      slack_configs:
      - send_resolved: true
        username: ''
        api_url: ''
        channel: '#'
```

2. Create a deployment file that will be used to stand up the Alertmanager deployment. *alertmanager-deployment.yml*:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: prometheus-alertmanager
        image: prom/alertmanager:v0.14.0
        args:
        - --config.file=/etc/config/alertmanager.yml
        - --storage.path=/data
        - --web.external-url=/
        ports:
        - containerPort: 9093
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
        - mountPath: /data
          name: storage-volume
      - name: prometheus-alertmanager-configmap-reload
        image: jimmidyson/configmap-reload:v0.1
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9093/-/reload
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
          readOnly: true
      volumes:
      - configMap:
          defaultMode: 420
          name: alertmanager-conf
        name: config-volume
      - emptyDir: {}
        name: storage-volume
```

*alertmanager-service.yml*:

```
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
  labels:
    app: alertmanager
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9093'
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
  - port: 9093
    targetPort: 9093
    nodePort: 8081
```

3. Update the Prometheus config to include changes to rules and add the Alertmanager. *prometheus-config-map.yml*:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
    alerting:
      alertmanagers:
      - kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_namespace]
          regex: monitoring
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          action: keep
          regex: 9093
    rule_files:
      - "/var/prometheus/rules/*_rules.yml"
      - "/var/prometheus/rules/*_alerts.yml"
    scrape_configs:
      - job_name: 'node-exporter'
        static_configs:
        - targets: [':9100', ':9100']
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
```

4. Create a Config Map that will be used to manage the recording and alerting rules.
*prometheus-rules-config-map.yml*:

```
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
    - name: kubernetes_alerts
      rules:
      - alert: DeploymentGenerationOff
        expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment is outdated
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas)) unless (kube_deployment_spec_paused == 1)
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment replicas are outdated
      - alert: PodzFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
          summary: Pod is restarting frequently
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour
          summary: Node status is NotReady
      - alert: KubeManyNodezNotReady
        expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
        for: 1m
        labels:
          severity: critical
        annotations:
          description: '{{ $value }}% of Kubernetes nodes are not ready'
      - alert: APIHighLatency
        expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
        for: 10m
        labels:
          severity: critical
        annotations:
          description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
      - alert: APIServerErrorsHigh
        expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          description: API server returns errors for {{ $value }}% of requests
      - alert: KubernetesAPIServerDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Apiserver {{ $labels.instance }} is down!
      - alert: KubernetesAPIServersGone
        expr: absent(up{job="kubernetes-apiservers"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes apiservers are reporting!
          description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{ $labels.instance }}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance }} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value }}% for 1 hour
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Pod {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr: absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  redis_alerts.yml: |
    groups:
    - name: redis_alerts
      rules:
      - alert: RedisCacheMissesHigh
        expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Redis Server {{ $labels.instance }} Cache Misses are high.
      - alert: RedisRejectedConnectionsHigh
        expr: redis_connected_clients{} > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit."
          description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit."
      - alert: RedisServerDown
        expr: redis_up{app="media-redis"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Redis Server {{ $labels.instance }} is down!
      - alert: RedisServerGone
        expr: absent(redis_up{app="media-redis"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: No Redis servers are reporting!
          description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone.
  kubernetes_rules.yml: |
    groups:
    - name: kubernetes_rules
      rules:
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.99"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.9"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.5"
  node_rules.yml: |
    groups:
    - name: node_rules
      rules:
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
      - record: instance:node_memory_usage:percentage
        expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
      - record: instance:root:node_filesystem_usage:percentage
        expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
  redis_rules.yml: |
    groups:
    - name: redis_rules
      rules:
      - record: redis:command_call_duration_seconds_count:rate2m
        expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment)
      - record: redis:total_requests:rate2m
        expr: rate(redis_commands_processed_total[2m])
```

5. Update the volumes used by the Prometheus deployment. *prometheus-deployment.yml*:

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.1
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus/"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-rules-volume
          mountPath: /var/prometheus/rules
        - name: prometheus-storage-volume
          mountPath: /prometheus/
      - name: watch
        image: weaveworks/watch:master-5b2a6e5
        imagePullPolicy: IfNotPresent
        args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"]
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus
        - name: prometheus-rules-volume
          mountPath: /var/prometheus/rules
      volumes:
      - name: prometheus-config-volume
        configMap:
          defaultMode: 420
          name: prometheus-server-conf
      - name: prometheus-rules-volume
        configMap:
          name: prometheus-rules-conf
      - name: prometheus-storage-volume
        emptyDir: {}
```
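The `watch` sidecar in the deployment above re-triggers Prometheus whenever a mounted ConfigMap changes, by POSTing to the lifecycle endpoint that `--web.enable-lifecycle` exposes. A minimal Python sketch of the same call (the `reload_prometheus` helper is illustrative and only succeeds against a reachable Prometheus at `localhost:9090`):

```python
from urllib.request import Request, urlopen

# Equivalent of the sidecar's: curl -X POST http://localhost:9090/-/reload
req = Request("http://localhost:9090/-/reload", method="POST")

def reload_prometheus(request: Request) -> None:
    """Hypothetical helper: fire the reload and report the HTTP status."""
    with urlopen(request) as resp:
        print("reload status:", resp.status)

# Inspect the request without sending it:
print(req.get_method(), req.full_url)
```

Without the lifecycle flag, Prometheus returns an error for this endpoint and a config change would require restarting the pod instead.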

Alerting Rules

00:06:49

Lesson Description:

In this lesson, you will learn how to create alerting rules that will be used to send alerts to Alertmanager. Below are the rules that were created in the previous lesson:

```
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
    - name: kubernetes_alerts
      rules:
      - alert: DeploymentGenerationOff
        expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment is outdated
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas)) unless (kube_deployment_spec_paused == 1)
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment replicas are outdated
      - alert: PodzFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
          summary: Pod is restarting frequently
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour
          summary: Node status is NotReady
      - alert: KubeManyNodezNotReady
        expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
        for: 1m
        labels:
          severity: critical
        annotations:
          description: '{{ $value }}% of Kubernetes nodes are not ready'
      - alert: APIHighLatency
        expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
        for: 10m
        labels:
          severity: critical
        annotations:
          description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
      - alert: APIServerErrorsHigh
        expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          description: API server returns errors for {{ $value }}% of requests
      - alert: KubernetesAPIServerDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Apiserver {{ $labels.instance }} is down!
      - alert: KubernetesAPIServersGone
        expr: absent(up{job="kubernetes-apiservers"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes apiservers are reporting!
          description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{ $labels.instance }}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance }} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value }}% for 1 hour
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Pod {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr: absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  redis_alerts.yml: |
    groups:
    - name: redis_alerts
      rules:
      - alert: RedisCacheMissesHigh
        expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Redis Server {{ $labels.instance }} Cache Misses are high.
      - alert: RedisRejectedConnectionsHigh
        expr: redis_connected_clients{} > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit."
          description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit."
      - alert: RedisServerDown
        expr: redis_up{app="media-redis"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Redis Server {{ $labels.instance }} is down!
      - alert: RedisServerGone
        expr: absent(redis_up{app="media-redis"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: No Redis servers are reporting!
          description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone.
  kubernetes_rules.yml: |
    groups:
    - name: kubernetes_rules
      rules:
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.99"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.9"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.5"
  node_rules.yml: |
    groups:
    - name: node_rules
      rules:
      - record: instance:node_cpu:avg_rate5m
        expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100
      - record: instance:node_memory_usage:percentage
        expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100
      - record: instance:root:node_filesystem_usage:percentage
        expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) / node_filesystem_size_bytes{mountpoint="/rootfs"} * 100
  redis_rules.yml: |
    groups:
    - name: redis_rules
      rules:
      - record: redis:command_call_duration_seconds_count:rate2m
        expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment)
      - record: redis:total_requests:rate2m
        expr: rate(redis_commands_processed_total[2m])
```
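The `DiskWillFillIn4Hours` alert above relies on `predict_linear`, which fits a least-squares line to the samples in the range window and extrapolates it forward. A rough Python sketch of that idea, using made-up free-space samples that drop by about 1 MB per second:

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit over (timestamp, value) samples, extrapolated
    seconds_ahead past the last sample -- a sketch of PromQL's predict_linear."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    now = samples[-1][0]
    return slope * (now + seconds_ahead) + intercept

# Hypothetical node_filesystem_free_bytes samples: 10 GB free, shrinking 1 MB/s,
# one sample every 5 minutes over the last hour.
samples = [(t, 10_000_000_000 - 1_000_000 * t) for t in range(0, 3600, 300)]
prediction = predict_linear(samples, 4 * 3600)
print(prediction < 0)  # True: the DiskWillFillIn4Hours condition would hold
```

Because the prediction is a linear extrapolation, a short burst of writes inside the window can trigger the alert; the `for: 5m` clause is what filters out such blips.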


01:00:00

Final Steps

Next Steps

00:01:05

Lesson Description:

Not sure what to take next? Maybe these courses will pique your interest.
