Monitoring Kubernetes With Prometheus

Course

January 21st, 2019

Intro Video


Travis Thomsen

Course Development Director in Content

I have over 17 years of experience in all phases of the software development life cycle, which includes software analysis, design, development, testing, implementation, debugging, maintenance and documentation. I am passionate about learning new technologies, methodologies, languages and automation.

Length

05:04:05

Difficulty

Intermediate

Course Details

Are you interested in deploying Prometheus to Kubernetes? If so, this is the course for you.

This course covers the basics of Prometheus, which includes its architecture and components, such as exporters, client libraries, and alerting.

From there, you will learn how to deploy Prometheus to Kubernetes and configure Prometheus to monitor the cluster as well as applications deployed to it.

You will also learn the basics of PromQL, which includes the syntax, functions, and creating recording rules.

Finally, the course will close out by talking about the Alertmanager and creating alerting rules.

Download the Interactive Diagrams here:

https://interactive.linuxacademy.com/diagrams/MonitoringKubernetswithPrometheus.html

https://interactive.linuxacademy.com/diagrams/ApplicationMetrics.html

https://interactive.linuxacademy.com/diagrams/ExporterMetrics.html

https://interactive.linuxacademy.com/diagrams/NodeExporter.html

Syllabus

Introduction

Introduction

About This Course

00:01:57

Lesson Description:

This video will go over the highlights of this course: Prometheus architecture, running Prometheus on Kubernetes, application monitoring, PromQL, and alerting. I will also discuss the prerequisites for this course.

About the Instructor

00:00:55

Lesson Description:

Before we get started on the course, let's learn a little about who is teaching it!

What is Prometheus?

00:01:39

Lesson Description:

Before we jump into the technical details of this course, we will take a five thousand foot view of what Prometheus is.

Setting Up Your Environment

Using Cloud Playground

00:06:16

Lesson Description:

In this video, you will learn how to use Cloud Playground to create the Cloud Servers you will need to complete this course. You will also be shown how to use the web terminal as an alternative to using SSH.

Setting Up a Kubernetes Cluster

00:07:57

Lesson Description:

In this lesson, you will set up your Kubernetes cluster. We will start by installing the master node.

Setting up the Kubernetes Master

The following actions will be executed on the Kubernetes master.

Disable swap:
swapoff -a

Edit /etc/fstab:
vi /etc/fstab

Comment out swap:
#/root/swap swap swap sw 0 0

Add the Kubernetes repo:
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF

Disable SELinux:
setenforce 0

Permanently disable SELinux:
vi /etc/selinux/config

Change enforcing to disabled:
SELINUX=disabled

Install Kubernetes 1.11.3:
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes

Start and enable the Kubernetes service:
systemctl start kubelet && systemctl enable kubelet

Create the k8s.conf file:
cat << EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system

Create kube-config.yml:
vi kube-config.yml

Add the following to kube-config.yml:
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: "v1.11.3"
networking:
  podSubnet: 10.244.0.0/16
apiServerExtraArgs:
  service-node-port-range: 8000-31274

Initialize Kubernetes:
kubeadm init --config kube-config.yml

Copy admin.conf to your home directory:
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config

Install flannel:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml

Patch flannel:
vi /etc/kubernetes/manifests/kube-controller-manager.yaml

Add the following to kube-controller-manager.yaml:
--allocate-node-cidrs=true
--cluster-cidr=10.244.0.0/16

Then reload the kubelet:
systemctl restart kubelet

Setting up the Kubernetes Worker

Now that the setup for the Kubernetes master is complete, we will begin the process of configuring the worker node. The following actions will be executed on the Kubernetes worker.

Disable swap:
swapoff -a

Edit /etc/fstab:
vi /etc/fstab

Comment out swap:
#/root/swap swap swap sw 0 0

Add the Kubernetes repo:
cat << EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
exclude=kube*
EOF

Disable SELinux:
setenforce 0

Permanently disable SELinux:
vi /etc/selinux/config

Change enforcing to disabled:
SELINUX=disabled

Install Kubernetes 1.11.3:
yum install -y kubelet-1.11.3 kubeadm-1.11.3 kubectl-1.11.3 kubernetes-cni-0.6.0 --disableexcludes=kubernetes

Start and enable the Kubernetes service:
systemctl start kubelet && systemctl enable kubelet

Create the k8s.conf file:
cat << EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system

Use the join token to add the worker node to the cluster:
kubeadm join < MASTER_IP >:6443 --token < TOKEN > --discovery-token-ca-cert-hash sha256:< HASH >

On the master node, test to see if the cluster was created properly. Get a listing of the nodes:
kubectl get nodes

Monitoring Kubernetes with Prometheus

Prometheus Architecture

Prometheus Architecture Diagram

00:02:57

Lesson Description:

In this lesson, we will review the Prometheus architecture diagram and go over the various components.

Client Libraries

00:01:21

Lesson Description:

You use client libraries and instrumentation to gather metrics for Prometheus to scrape. Prometheus scrapes your application's HTTP endpoint, and the client library sends the current state of all tracked metrics to the Prometheus server. You can develop your own client library if one doesn't exist for your language.

This is the code used to instrument the app using the NodeJS library prom-client:

var Register = require('prom-client').register;
var Counter = require('prom-client').Counter;
var Histogram = require('prom-client').Histogram;
var Summary = require('prom-client').Summary;
var ResponseTime = require('response-time');

module.exports.totalNumOfRequests = totalNumOfRequests = new Counter({
  name: 'totalNumOfRequests',
  help: 'Total number of requests made',
  labelNames: ['method']
});

module.exports.pathsTaken = pathsTaken = new Counter({
  name: 'pathsTaken',
  help: 'Paths taken in the app',
  labelNames: ['path']
});

module.exports.responses = responses = new Summary({
  name: 'responses',
  help: 'Response time in millis',
  labelNames: ['method', 'path', 'status']
});

module.exports.startCollection = function () {
  require('prom-client').collectDefaultMetrics();
};

module.exports.requestCounters = function (req, res, next) {
  if (req.path != '/metrics') {
    totalNumOfRequests.inc({ method: req.method });
    pathsTaken.inc({ path: req.path });
  }
  next();
};

module.exports.responseCounters = ResponseTime(function (req, res, time) {
  if (req.url != '/metrics') {
    responses.labels(req.method, req.url, res.statusCode).observe(time);
  }
});

module.exports.injectMetricsRoute = function (App) {
  App.get('/metrics', (req, res) => {
    res.set('Content-Type', Register.contentType);
    res.end(Register.metrics());
  });
};

Prometheus supported libraries: Go, Java or Scala, Python, Ruby

Third-party libraries: Bash, C++, Common Lisp, Elixir, Erlang, Haskell, Lua for Nginx, Lua for Tarantool, .NET / C#, Node.js, Perl, PHP, Rust

Exporters

00:01:41

Lesson Description:

Exporters are pieces of software deployed alongside the application you want to collect metrics from. Instrumentation for exporters is known as custom collectors or ConstMetrics.

How exporters work: take requests, gather the data, format the data, and return the data to Prometheus.

Databases: Consul exporter, Memcached exporter, MySQL server exporter

Hardware: Node/system metrics exporter

HTTP: HAProxy exporter

Other monitoring systems: AWS CloudWatch exporter, Collectd exporter, Graphite exporter, InfluxDB exporter, JMX exporter, SNMP exporter, StatsD exporter

Miscellaneous: Blackbox exporter
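To make the flow above concrete, the sketch below shows the usual way an exporter ends up in Prometheus: a scrape job pointed at the exporter's HTTP port. The job name, target address, and port are illustrative assumptions (9150 is commonly used by the Memcached exporter), not part of the course files.

scrape_configs:
  - job_name: 'memcached-exporter'   # hypothetical job for a Memcached exporter
    scrape_interval: 15s             # how often Prometheus pulls metrics from the exporter
    static_configs:
      - targets: ['10.0.0.11:9150']  # example host:port where the exporter listens

On every scrape, Prometheus requests the exporter's /metrics endpoint and stores whatever the exporter returns.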

Service Discovery

00:02:42

Lesson Description:

In this lesson, you will learn about Service Discovery. Service Discovery is a way for Prometheus to find targets without having to statically configure them.
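As a rough sketch of how this looks in practice, the job below asks the Kubernetes API for every pod and keeps only the pods annotated with prometheus.io/scrape: 'true'. It mirrors the kubernetes-pods job used later in this course, but treat the details here as an illustrative example rather than the final configuration.

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod                  # discover every pod through the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep               # drop discovered pods that are not annotated for scraping
        regex: true

As pods come and go, the target list updates automatically; nothing has to be reconfigured by hand.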

Scraping

00:01:26

Lesson Description:

In this lesson, you will learn the difference between push and pull monitoring systems. We will also discuss how Prometheus defines scraping.
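To show what "Prometheus defines scraping" means in configuration terms, here is a minimal sketch of a pull-based scrape job. The interval, timeout, and target are illustrative assumptions; the point is that the server, not the application, decides when and where to pull.

global:
  scrape_interval: 15s          # default pull frequency for every job
  scrape_timeout: 10s           # abandon any scrape that takes longer than this
scrape_configs:
  - job_name: 'example-app'     # hypothetical job; Prometheus issues GET /metrics to each target
    metrics_path: /metrics
    static_configs:
      - targets: ['app.example.com:8080']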

Run Prometheus on Kubernetes

Setting Up Prometheus

00:15:57

Lesson Description:

In this lesson, we will set up Prometheus on the Kubernetes cluster. We will be creating:

A namespace for our monitoring environment to live in
A ClusterRole to give Prometheus access to targets using Service Discovery
A ConfigMap that will be used to generate the Prometheus config file
A Prometheus Deployment and Service
Kube State Metrics to get access to metrics on the Kubernetes API

You can clone the YAML files from GitHub.

Create a file called namespaces.yml. This file will be used to create the monitoring namespace.

namespaces.yml:
{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "monitoring",
    "labels": {
      "name": "monitoring"
    }
  }
}

Apply the namespace: kubectl apply -f namespaces.yml

Create a file called clusterRole.yml. This will be used to set up the cluster roles and give Prometheus the access it needs to discover targets using Service Discovery.

clusterRole.yml:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: default
  namespace: monitoring

Apply the cluster roles to the Kubernetes cluster: kubectl apply -f clusterRole.yml

Create config-map.yml. Kubernetes will use this file to manage the prometheus.yml configuration file.

config-map.yml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
    scrape_configs:
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name
      - job_name: 'kubernetes-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name

Create the ConfigMap: kubectl apply -f config-map.yml

Create prometheus-deployment.yml. This file will be used to create the Prometheus deployment, which will include the pods, replica sets, and volumes.

prometheus-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-server
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.2.1
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus/
            - name: prometheus-storage-volume
              mountPath: /prometheus/
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-storage-volume
          emptyDir: {}

Deploy the Prometheus environment: kubectl apply -f prometheus-deployment.yml

Finally, we will finish off the Prometheus environment by creating a service to make Prometheus publicly accessible.

Create prometheus-service.yml.

prometheus-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9090'
spec:
  selector:
    app: prometheus-server
  type: NodePort
  ports:
    - port: 8080
      targetPort: 9090
      nodePort: 8080

Create the service that will make Prometheus publicly accessible: kubectl apply -f prometheus-service.yml

Create the Kube State Metrics Service and Deployment to get access to metrics on the Kubernetes API.

kube-state-metrics.yml:
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: metrics
    protocol: TCP
  selector:
    app: kube-state-metrics
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  template:
    metadata:
      name: kube-state-metrics-main
      labels:
        app: kube-state-metrics
    spec:
      containers:
        - name: kube-state-metrics
          image: quay.io/coreos/kube-state-metrics:latest
          ports:
            - containerPort: 8080
              name: metrics

Deploy Kube State Metrics: kubectl apply -f kube-state-metrics.yml

Access Prometheus by visiting http://<MASTER_IP>:8080

Configuring Prometheus

00:13:18

Lesson Description:

In this lesson you will learn about the Prometheus configuration file, how to configure static targets, as well as how to use service discovery to find Kubernetes endpoints. Below is the contents of prometheus.conf that was created by the Config Map. prometheus.conf: global: scrape_interval: 5s evaluation_interval: 5s scrape_configs: - job_name: 'kubernetes-apiservers' kubernetes_sd_configs: - role: endpoints scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] action: keep regex: default;kubernetes;https - job_name: 'kubernetes-nodes' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::d+)?;(d+) replacement: $1:$2 target_label: __address__ - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - job_name: 'kubernetes-cadvisor' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor - job_name: 'kubernetes-service-endpoints' kubernetes_sd_configs: - role: endpoints relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::d+)?;(d+) replacement: $1:$2 - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] action: replace target_label: kubernetes_name Prometheus Configuration Documentation
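The lesson also mentions static targets, which the configuration above does not include. A static job looks like the sketch below; the job name matches the node-exporter job added later in the course, and the placeholder addresses would be replaced with your own node IPs.

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100']   # replace with your master and worker IPs
        labels:
          env: 'lab'   # optional: a static label attached to every series from these targets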

Setting Up Grafana

00:04:40

Lesson Description:

In this lesson, you will learn how to deploy a Grafana pod and service to Kubernetes.

Create grafana-deployment.yml. This file will be used to create the Grafana deployment. Be sure to change the password.

grafana-deployment.yml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
        - image: grafana/grafana:3.1.1
          name: grafana
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: password
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var
      volumes:
        - name: grafana-persistent-storage
          emptyDir: {}

Deploy Grafana: kubectl apply -f grafana-deployment.yml

Create grafana-service.yml. This file will be used to make the pod publicly accessible.

grafana-service.yml:
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 8000

Create the Grafana service: kubectl apply -f grafana-service.yml
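One caveat worth noting: the deployment above stores Grafana data in an emptyDir volume, so dashboards and users are lost whenever the pod is rescheduled. If your cluster has a storage class available, a PersistentVolumeClaim is one way to keep that data around; the sketch below is an assumption-laden example, not part of the course files.

grafana-pvc.yml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-pvc
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

You would then replace the emptyDir volume in grafana-deployment.yml with a persistentVolumeClaim volume referencing claimName: grafana-pvc.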

NodeExporter

00:05:13

Lesson Description:

Repeat these steps on both your master and worker nodes.

Create the Prometheus user:
adduser prometheus

Download Node Exporter:
cd /home/prometheus
curl -LO "https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz"
tar -xvzf node_exporter-0.16.0.linux-amd64.tar.gz
mv node_exporter-0.16.0.linux-amd64 node_exporter
cd node_exporter
chown prometheus:prometheus node_exporter

Create the systemd unit file:
vi /etc/systemd/system/node_exporter.service

/etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter

[Service]
User=prometheus
ExecStart=/home/prometheus/node_exporter/node_exporter

[Install]
WantedBy=default.target

Reload systemd:
systemctl daemon-reload

Enable the node_exporter service:
systemctl enable node_exporter.service

Start the node_exporter service:
systemctl start node_exporter.service
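The course runs Node Exporter as a systemd service directly on each host. An alternative you may see elsewhere is running it as a Kubernetes DaemonSet so that every node automatically gets an exporter pod; the sketch below illustrates that approach, and its values (image tag, labels, annotations) are assumptions rather than part of this course's setup.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: 'true'   # lets an annotation-based pod scrape job pick these pods up
        prometheus.io/port: '9100'
    spec:
      hostNetwork: true                # report the host's network metrics, not the pod's
      hostPID: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v0.16.0
          ports:
            - containerPort: 9100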

Expression Browser

00:04:40

Lesson Description:

In this lesson, you will learn how to use the Expression Browser to execute queries and to view your Prometheus configuration and targets.

Container CPU load average:
container_cpu_load_average_10s

Memory usage query:
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100

Adding a Grafana Dashboard

00:03:54

Lesson Description:

In this lesson, you will import a Grafana dashboard that will be used to visualize metrics from the Node Exporter. Below are the links to the dashboards:

Content Kubernetes Prometheus Env Repository
Kubernetes Nodes Dashboard

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, with no extra charge and no accounts to manage.

01:00:00

Application Monitoring

Instrumenting Applications

00:05:37

Lesson Description:

This lesson discusses how to instrument an application by using a Prometheus client library. Though we will be talking about a NodeJS application, there are client libraries available for a wide variety of programming languages. You can clone the Comic Box App here.

Collecting Metrics from Applications

00:06:08

Lesson Description:

In this lesson, you will deploy a NodeJS application to Kubernetes that will be monitored by Prometheus.

GitHub link: https://github.com/linuxacademy/content-kubernetes-prometheus-app

Build a Docker image:
docker build -t rivethead42/comicbox .

Log in to Docker Hub:
docker login

Push the image to Docker Hub:
docker push < USERNAME >/comicbox

Create a deployment using the image above:
kubectl apply -f deployment.yml
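For Prometheus to find the application through the annotation-based kubernetes-pods job shown earlier, the pod template in deployment.yml has to carry the scrape annotations. The sketch below shows the general shape; the labels, port, and image reference are placeholders rather than the exact file from the course repository.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: comicbox
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: comicbox
      annotations:
        prometheus.io/scrape: 'true'    # keep this pod as a scrape target
        prometheus.io/port: '3000'      # port the prom-client /metrics route is served on (assumed)
        prometheus.io/path: '/metrics'  # path exposed by injectMetricsRoute
    spec:
      containers:
        - name: comicbox
          image: <USERNAME>/comicbox:latest
          ports:
            - containerPort: 3000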

PromQL

PromQL Basics

00:04:42

Lesson Description:

In this lesson, you will learn the basics of PromQL, the Prometheus query language. This includes querying by metric name and then filtering by labels.

Return all time series with the metric node_cpu_seconds_total:
node_cpu_seconds_total

Return all time series with the metric node_cpu_seconds_total and the given job and mode labels:
node_cpu_seconds_total{job="node-exporter", mode="idle"}

Return a whole range of time (in this case 5 minutes) for the same vector, making it a range vector:
node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]

Query jobs that end with -exporter:
node_cpu_seconds_total{job=~".*-exporter"}

Query jobs that begin with kube:
container_cpu_load_average_10s{job=~"^kube.*"}

PromQL Operations and Functions

00:03:32

Lesson Description:

In this lesson, you will learn how to add operations and functions to your PromQL expressions.

Arithmetic binary operators: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (power/exponentiation)

Comparison binary operators: == (equal), != (not-equal), > (greater-than), < (less-than), >= (greater-or-equal), <= (less-or-equal)

Logical/set binary operators: and (intersection), or (union), unless (complement)

Aggregation operators: sum (calculate sum over dimensions), min (select minimum over dimensions), max (select maximum over dimensions), avg (calculate the average over dimensions), stddev (calculate population standard deviation over dimensions), stdvar (calculate population standard variance over dimensions), count (count number of elements in the vector), count_values (count number of elements with the same value), bottomk (smallest k elements by sample value), topk (largest k elements by sample value), quantile (calculate the φ-quantile (0 ≤ φ ≤ 1) over dimensions)

Get the total memory in bytes:
node_memory_MemTotal_bytes

Get a sum of the total memory in bytes:
sum(node_memory_MemTotal_bytes)

Get a percentage of total memory used:
((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100

Using a function with your query:
irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])

Using an operation and a function with your query:
avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m]))

Grouping your queries:
avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance)

Recording Rules

00:07:03

Lesson Description:

Create prometheus-read-rules-map.yml. This file will be used to create a recording rule for Prometheus.prometheus-read-rules-map.yml: apiVersion: v1 kind: ConfigMap metadata: name: prometheus-read-rules-conf labels: name: prometheus-read-rules-conf namespace: monitoring data: node_rules.yml: |- groups: - name: node_rules interval: 10s rules: - record: instance:node_cpu:avg_rate5m expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100 - record: instance:node_memory_usage:percentage expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100 - record: instance:root:node_filesystem_usage:percentage expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) /node_filesystem_size_bytes{mountpoint="/rootfs"} * 100 Apply the recording rule: kubectl apply -f prometheus-read-rules-map.yml Update the prometheus-config-map.yml with record rules.prometheus-config-map.yml: apiVersion: v1 kind: ConfigMap metadata: name: prometheus-server-conf labels: name: prometheus-server-conf namespace: monitoring data: prometheus.yml: |- global: scrape_interval: 5s evaluation_interval: 5s rule_files: - rules/*_rules.yml scrape_configs: - job_name: 'node-exporter' static_configs: - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100'] - job_name: 'kubernetes-apiservers' kubernetes_sd_configs: - role: endpoints scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] action: keep regex: default;kubernetes;https - job_name: 'kubernetes-nodes' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::d+)?;(d+) replacement: $1:$2 target_label: __address__ - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - job_name: 'kubernetes-cadvisor' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: 
__metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor - job_name: 'kubernetes-service-endpoints' kubernetes_sd_configs: - role: endpoints relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::d+)?;(d+) replacement: $1:$2 - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] action: replace target_label: kubernetes_name Apply the update configuration file: kubectl apply -f prometheus-config-map.yml Add a new volume for the recording rules: apiVersion: extensions/v1beta1 kind: Deployment metadata: name: prometheus-deployment namespace: monitoring spec: replicas: 1 template: metadata: labels: app: prometheus-server spec: containers: - name: prometheus image: prom/prometheus:v2.2.1 args: - "--config.file=/etc/prometheus/prometheus.yml" - "--storage.tsdb.path=/prometheus/" - "--web.enable-lifecycle" ports: - containerPort: 9090 volumeMounts: - name: prometheus-config-volume mountPath: /etc/prometheus/ - name: prometheus-storage-volume mountPath: /prometheus/ - name: prometheus-read-rules-volume mountPath: /etc/prometheus/rules - name: watch image: weaveworks/watch:master-5b2a6e5 imagePullPolicy: IfNotPresent args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"] volumeMounts: - name: prometheus-config-volume mountPath: /etc/prometheus volumes: - name: prometheus-config-volume configMap: defaultMode: 420 name: prometheus-server-conf - name: prometheus-read-rules-volume configMap: defaultMode: 420 name: prometheus-read-rules-conf - name: prometheus-storage-volume emptyDir: {} Apply the updates to the Prometheus deployment: kubectl apply -f prometheus-deployment.yml

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, with no extra charge and no accounts to manage.

01:00:00

Alerting

Alertmanager

00:12:28

Lesson Description:

In this lesson, you will learn how to set up Alertmanager to work with Prometheus. Below are the files that will be used to complete this task: Create a Config Map that will be used to set up the Alertmanager config file.alertmanager-configmap.yml: apiVersion: v1 kind: ConfigMap metadata: name: alertmanager-conf labels: name: alertmanager-conf namespace: monitoring data: alertmanager.yml: | global: smtp_smarthost: 'localhost:25' smtp_from: 'alertmanager@linuxacademy.org' smtp_require_tls: false route: receiver: slack_receiver receivers: - name: slack_receiver slack_configs: - send_resolved: true username: '<SLACK_USER>' api_url: '<APP_URL>' channel: '#<CHANNEL>' Create a deployment file that will be used to stand up the Alertmanager deployment.alertmanager-depoloyment.yml: apiVersion: extensions/v1beta1 kind: Deployment metadata: name: alertmanager namespace: monitoring spec: replicas: 1 template: metadata: labels: app: alertmanager spec: containers: - name: prometheus-alertmanager image: prom/alertmanager:v0.14.0 args: - --config.file=/etc/config/alertmanager.yml - --storage.path=/data - --web.external-url=/ ports: - containerPort: 9093 volumeMounts: - mountPath: /etc/config name: config-volume - mountPath: /data name: storage-volume - name: prometheus-alertmanager-configmap-reload image: jimmidyson/configmap-reload:v0.1 args: - --volume-dir=/etc/config - --webhook-url=http://localhost:9093/-/reload volumeMounts: - mountPath: /etc/config name: config-volume readOnly: true volumes: - configMap: defaultMode: 420 name: alertmanager-conf name: config-volume - emptyDir: {} name: storage-volume alertmanager-service.yml: apiVersion: v1 kind: Service metadata: name: alertmanager namespace: monitoring labels: app: alertmanager annotations: prometheus.io/scrape: 'true' prometheus.io/port: '9093' spec: selector: app: alertmanager type: NodePort ports: - port: 9093 targetPort: 9093 nodePort: 8081 Update the Prometheus config to include changes to rules and add the Alertmanager.prometheus-config-map.yml: apiVersion: v1 kind: ConfigMap metadata: name: prometheus-server-conf labels: name: prometheus-server-conf namespace: monitoring data: prometheus.yml: |- global: scrape_interval: 5s evaluation_interval: 5s alerting: alertmanagers: - kubernetes_sd_configs: - role: endpoints relabel_configs: - source_labels: [__meta_kubernetes_service_name] regex: alertmanager action: keep - source_labels: [__meta_kubernetes_namespace] regex: monitoring action: keep - source_labels: [__meta_kubernetes_pod_container_port_number] action: keep regex: 9093 rule_files: - "/var/prometheus/rules/*_rules.yml" - "/var/prometheus/rules/*_alerts.yml" scrape_configs: - job_name: 'node-exporter' static_configs: - targets: ['<KUBERNETES_IP>:9100', '<KUBERNETES_IP>:9100'] - job_name: 'kubernetes-apiservers' kubernetes_sd_configs: - role: endpoints scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] action: keep regex: default;kubernetes;https - job_name: 'kubernetes-nodes' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: 
kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::d+)?;(d+) replacement: $1:$2 target_label: __address__ - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name - job_name: 'kubernetes-cadvisor' scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token kubernetes_sd_configs: - role: node relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor - job_name: 'kubernetes-service-endpoints' kubernetes_sd_configs: - role: endpoints relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::d+)?;(d+) replacement: $1:$2 - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] action: replace target_label: kubernetes_name Create a Config Map that will be used to manage the recording and alerting rules.prometheus-rules-config-map.yml: apiVersion: v1 kind: ConfigMap metadata: creationTimestamp: null name: prometheus-rules-conf namespace: monitoring data: kubernetes_alerts.yml: | groups: - name: kubernetes_alerts rules: - alert: DeploymentGenerationOff expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 5m labels: severity: warning annotations: description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }} summary: Deployment is outdated - alert: DeploymentReplicasNotUpdated expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas)) unless (kube_deployment_spec_paused == 1) for: 5m labels: severity: warning annotations: description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }} summary: Deployment replicas are outdated - alert: PodzFrequentlyRestarting expr: increase(kube_pod_container_status_restarts_total[1h]) > 5 for: 10m labels: severity: warning annotations: description: Pod 
{{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour summary: Pod is restarting frequently - alert: KubeNodeNotReady expr: kube_node_status_condition{condition="Ready",status="true"} == 0 for: 1h labels: severity: warning annotations: description: The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour summary: Node status is NotReady - alert: KubeManyNodezNotReady expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2 for: 1m labels: severity: critical annotations: description: '{{ $value }}% of Kubernetes nodes are not ready' - alert: APIHighLatency expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4 for: 10m labels: severity: critical annotations: description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }} - alert: APIServerErrorsHigh expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5 for: 10m labels: severity: critical annotations: description: API server returns errors for {{ $value }}% of requests - alert: KubernetesAPIServerDown expr: up{job="kubernetes-apiservers"} == 0 for: 10m labels: severity: critical annotations: summary: Apiserver {{ $labels.instance }} is down! - alert: KubernetesAPIServersGone expr: absent(up{job="kubernetes-apiservers"}) for: 10m labels: severity: critical annotations: summary: No Kubernetes apiservers are reporting! description: Werner Heisenberg says - OMG Where are my apiserverz? prometheus_alerts.yml: | groups: - name: prometheus_alerts rules: - alert: PrometheusConfigReloadFailed expr: prometheus_config_last_reload_successful == 0 for: 10m labels: severity: warning annotations: description: Reloading Prometheus configuration has failed on {{$labels.instance}}. - alert: PrometheusNotConnectedToAlertmanagers expr: prometheus_notifications_alertmanagers_discovered < 1 for: 1m labels: severity: warning annotations: description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers node_alerts.yml: | groups: - name: node_alerts rules: - alert: HighNodeCPU expr: instance:node_cpu:avg_rate5m > 80 for: 10s labels: severity: warning annotations: summary: High Node CPU of {{ humanize $value}}% for 1 hour - alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0 for: 5m labels: severity: critical annotations: summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours. - alert: KubernetesServiceDown expr: up{job="kubernetes-service-endpoints"} == 0 for: 10m labels: severity: critical annotations: summary: Pod {{ $labels.instance }} is down! - alert: KubernetesServicesGone expr: absent(up{job="kubernetes-service-endpoints"}) for: 10m labels: severity: critical annotations: summary: No Kubernetes services are reporting! description: Werner Heisenberg says - OMG Where are my servicez? - alert: CriticalServiceDown expr: node_systemd_unit_state{state="active"} != 1 for: 2m labels: severity: critical annotations: summary: Service {{ $labels.name }} failed to start. description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}. 
redis_alerts.yml: | groups: - name: redis_alerts rules: - alert: RedisCacheMissesHigh expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8 for: 10m labels: severity: warning annotations: summary: Redis Server {{ $labels.instance }} Cache Misses are high. - alert: RedisRejectedConnectionsHigh expr: redis_connected_clients{} > 100 for: 10m labels: severity: warning annotations: summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit." description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit." - alert: RedisServerDown expr: redis_up{app="media-redis"} == 0 for: 10m labels: severity: critical annotations: summary: Redis Server {{ $labels.instance }} is down! - alert: RedisServerGone expr: absent(redis_up{app="media-redis"}) for: 1m labels: severity: critical annotations: summary: No Redis servers are reporting! description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone. kubernetes_rules.yml: | groups: - name: kubernetes_rules rules: - record: apiserver_latency_seconds:quantile expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06 labels: quantile: "0.99" - record: apiserver_latency_seconds:quantile expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06 labels: quantile: "0.9" - record: apiserver_latency_seconds:quantile expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06 labels: quantile: "0.5" node_rules.yml: | groups: - name: node_rules rules: - record: instance:node_cpu:avg_rate5m expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100 - record: instance:node_memory_usage:percentage expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100 - record: instance:root:node_filesystem_usage:percentage expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) /node_filesystem_size_bytes{mountpoint="/rootfs"} * 100 redis_rules.yml: | groups: - name: redis_rules rules: - record: redis:command_call_duration_seconds_count:rate2m expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment) - record: redis:total_requests:rate2m expr: rate(redis_commands_processed_total[2m]) Update the volumes by the Prometheus deployment.prometheus-deployment.yml: apiVersion: extensions/v1beta1 kind: Deployment metadata: name: prometheus-deployment namespace: monitoring spec: replicas: 1 template: metadata: labels: app: prometheus-server spec: containers: - name: prometheus image: prom/prometheus:v2.2.1 args: - "--config.file=/etc/prometheus/prometheus.yml" - "--storage.tsdb.path=/prometheus/" - "--web.enable-lifecycle" ports: - containerPort: 9090 volumeMounts: - name: prometheus-config-volume mountPath: /etc/prometheus/ - name: prometheus-rules-volume mountPath: /var/prometheus/rules - name: prometheus-storage-volume mountPath: /prometheus/ - name: watch image: weaveworks/watch:master-5b2a6e5 imagePullPolicy: IfNotPresent args: ["-v", "-t", "-p=/etc/prometheus", "-p=/var/prometheus", "curl", "-X", "POST", "--fail", "-o", "-", "-sS", "http://localhost:9090/-/reload"] volumeMounts: - name: prometheus-config-volume mountPath: /etc/prometheus - name: prometheus-rules-volume mountPath: 
/var/prometheus/rules volumes: - name: prometheus-config-volume configMap: defaultMode: 420 name: prometheus-server-conf - name: prometheus-rules-volume configMap: name: prometheus-rules-conf - name: prometheus-storage-volume emptyDir: {}
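The route in the Alertmanager config above sends everything straight to the Slack receiver. If you want to control how alerts are batched and repeated, grouping and timing options can be added to the route; the values below are illustrative assumptions rather than part of the course files.

route:
  receiver: slack_receiver
  group_by: ['alertname', 'severity']   # alerts sharing these labels are batched into one notification
  group_wait: 30s                       # wait this long for more alerts in a new group before notifying
  group_interval: 5m                    # minimum gap between notifications for an existing group
  repeat_interval: 3h                   # re-send a notification if alerts are still firing after this long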

Alerting Rules

00:06:49

Lesson Description:

In this lesson, you will learn how to create alerting rules that will be used to send alerts to Alertmanager. Below are the rules that were created in the previous lesson: apiVersion: v1 kind: ConfigMap metadata: creationTimestamp: null name: prometheus-rules-conf namespace: monitoring data: kubernetes_alerts.yml: | groups: - name: kubernetes_alerts rules: - alert: DeploymentGenerationOff expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation for: 5m labels: severity: warning annotations: description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }} summary: Deployment is outdated - alert: DeploymentReplicasNotUpdated expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas) or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas)) unless (kube_deployment_spec_paused == 1) for: 5m labels: severity: warning annotations: description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }} summary: Deployment replicas are outdated - alert: PodzFrequentlyRestarting expr: increase(kube_pod_container_status_restarts_total[1h]) > 5 for: 10m labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour summary: Pod is restarting frequently - alert: KubeNodeNotReady expr: kube_node_status_condition{condition="Ready",status="true"} == 0 for: 1h labels: severity: warning annotations: description: The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour summary: Node status is NotReady - alert: KubeManyNodezNotReady expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0) > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} == 0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2 for: 1m labels: severity: critical annotations: description: '{{ $value }}% of Kubernetes nodes are not ready' - alert: APIHighLatency expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4 for: 10m labels: severity: critical annotations: description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }} - alert: APIServerErrorsHigh expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5 for: 10m labels: severity: critical annotations: description: API server returns errors for {{ $value }}% of requests - alert: KubernetesAPIServerDown expr: up{job="kubernetes-apiservers"} == 0 for: 10m labels: severity: critical annotations: summary: Apiserver {{ $labels.instance }} is down! - alert: KubernetesAPIServersGone expr: absent(up{job="kubernetes-apiservers"}) for: 10m labels: severity: critical annotations: summary: No Kubernetes apiservers are reporting! description: Werner Heisenberg says - OMG Where are my apiserverz? prometheus_alerts.yml: | groups: - name: prometheus_alerts rules: - alert: PrometheusConfigReloadFailed expr: prometheus_config_last_reload_successful == 0 for: 10m labels: severity: warning annotations: description: Reloading Prometheus configuration has failed on {{$labels.instance}}. 
- alert: PrometheusNotConnectedToAlertmanagers expr: prometheus_notifications_alertmanagers_discovered < 1 for: 1m labels: severity: warning annotations: description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers node_alerts.yml: | groups: - name: node_alerts rules: - alert: HighNodeCPU expr: instance:node_cpu:avg_rate5m > 80 for: 10s labels: severity: warning annotations: summary: High Node CPU of {{ humanize $value}}% for 1 hour - alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0 for: 5m labels: severity: critical annotations: summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours. - alert: KubernetesServiceDown expr: up{job="kubernetes-service-endpoints"} == 0 for: 10m labels: severity: critical annotations: summary: Pod {{ $labels.instance }} is down! - alert: KubernetesServicesGone expr: absent(up{job="kubernetes-service-endpoints"}) for: 10m labels: severity: critical annotations: summary: No Kubernetes services are reporting! description: Werner Heisenberg says - OMG Where are my servicez? - alert: CriticalServiceDown expr: node_systemd_unit_state{state="active"} != 1 for: 2m labels: severity: critical annotations: summary: Service {{ $labels.name }} failed to start. description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}. redis_alerts.yml: | groups: - name: redis_alerts rules: - alert: RedisCacheMissesHigh expr: redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total) > 0.8 for: 10m labels: severity: warning annotations: summary: Redis Server {{ $labels.instance }} Cache Misses are high. - alert: RedisRejectedConnectionsHigh expr: redis_connected_clients{} > 100 for: 10m labels: severity: warning annotations: summary: "Redis instance {{ $labels.addr }} may be hitting maxclient limit." description: "The Redis instance at {{ $labels.addr }} had {{ $value }} rejected connections during the last 10m and may be hitting the maxclient limit." - alert: RedisServerDown expr: redis_up{app="media-redis"} == 0 for: 10m labels: severity: critical annotations: summary: Redis Server {{ $labels.instance }} is down! - alert: RedisServerGone expr: absent(redis_up{app="media-redis"}) for: 1m labels: severity: critical annotations: summary: No Redis servers are reporting! description: Werner Heisenberg says - there is no uncertainty about the Redis server being gone. 
kubernetes_rules.yml: | groups: - name: kubernetes_rules rules: - record: apiserver_latency_seconds:quantile expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06 labels: quantile: "0.99" - record: apiserver_latency_seconds:quantile expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06 labels: quantile: "0.9" - record: apiserver_latency_seconds:quantile expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06 labels: quantile: "0.5" node_rules.yml: | groups: - name: node_rules rules: - record: instance:node_cpu:avg_rate5m expr: 100 - avg(irate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[5m])) by (instance) * 100 - record: instance:node_memory_usage:percentage expr: ((sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes)) * 100 - record: instance:root:node_filesystem_usage:percentage expr: (node_filesystem_size_bytes{mountpoint="/rootfs"} - node_filesystem_free_bytes{mountpoint="/rootfs"}) /node_filesystem_size_bytes{mountpoint="/rootfs"} * 100 redis_rules.yml: | groups: - name: redis_rules rules: - record: redis:command_call_duration_seconds_count:rate2m expr: sum(irate(redis_command_call_duration_seconds_count[2m])) by (cmd, environment) - record: redis:total_requests:rate2m expr: rate(redis_commands_processed_total[2m])
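A rule file boils down to a few fields per rule. The sketch below is a minimal, hypothetical alerting rule showing that anatomy; it is not one of the rules used in the course.

groups:
  - name: example_alerts
    rules:
      - alert: InstanceDown                       # name the alert will carry in Alertmanager
        expr: up == 0                             # PromQL expression evaluated each evaluation_interval
        for: 5m                                   # must stay true this long before the alert fires
        labels:
          severity: warning                       # extra labels, typically used for routing
        annotations:
          summary: "{{ $labels.instance }} of job {{ $labels.job }} is down"

Before rolling a change like this into the ConfigMap, you can sanity-check the rule file locally with promtool, which ships with Prometheus, for example: promtool check rules node_alerts.yml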

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, with no extra charge and no accounts to manage.

01:00:00

Conclusion

Final Steps

Next Steps

00:01:05

Lesson Description:

Not sure what to take next? Maybe these courses will pique your interest.