While Machine Learning is just a subset of true Artificial Intelligence, vendors of infrastructure automation have coined a new buzz acronym: AIOps. On the heels of the still-fresh DevOps movement, we are introduced to a new era of DevOps that reaches beyond pipeline automation and into the realm of pipeline evolution. The popularity of agile development, continuous integration, and continuous delivery has brought levels of automation that rival anything previously known. The shift-left mentality has given development teams and product owners far more control over their release management. Fast and frequent releases of production-ready software now slide through automated tests, staging, and into production infrastructures. Fail-fast has become learn-fast, where monitoring and scanning provide the feedback we need to keep systems safe.

This whole paradigm is reaching a scale and magnitude at which it may no longer depend on mere mortals to maintain it. A big reason is the proliferation of microservices-based applications in highly redundant, highly available cloud infrastructures. Add to that the desire of most enterprises to integrate cloud-based workloads with legacy on-premises applications, and the result is complex hybrid cloud deployments. So the question is no longer whether to deploy, but when, where, why, and how.

Once complex deployments are initially made, scaling becomes the next hurdle. N-tier architectures and microservice applications must be tuned for performance. The days of scaling one tier beyond its required capacity are over when pay-as-you-use cloud billing models are to be used optimally. Many organizations run on-premises systems until demand forces them to burst into the public cloud for added capacity. This bursting is intentional and guided by state-of-the-art monitoring and metrics, so that exactly the right tiers of the application are scaled to maintain SLAs (Service Level Agreements). AIOps is the culmination of DevOps pipeline automation: it accommodates this need for automated, continuous deployment as well as automated, continuous scaling.

Scaling Multi-Cluster Kubernetes Infrastructure

The horizontal and vertical cluster scaling provided by Kubernetes is a cluster-centric approach to scaling. By using the Kubernetes Metrics Server, or metrics from tools such as Prometheus, a cluster may respond to resource demands when pre-programmed thresholds are surpassed. Adding CPU or memory to nodes may be done by initializing new nodes and vacating old ones. Horizontal 'scale out' approaches merely require the provisioning of additional worker nodes in an existing cluster. These solutions are proven to provide elasticity within clusters and a great buffer against outages and resource-overrun conditions.
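As a concrete illustration, the sketch below defines a HorizontalPodAutoscaler that scales a Deployment out when average CPU utilization crosses a threshold. The Deployment name, replica bounds, and the 70% target are placeholders rather than values from any particular environment; newer clusters use the autoscaling/v2 API version with the same fields.

```yaml
# Minimal HorizontalPodAutoscaler sketch; "web-frontend" and the
# thresholds below are illustrative placeholders.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70%
```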

However, scaling within an existing cluster may not always be the most suitable approach. In many cases, it becomes necessary in enterprise-grade production environments to add additional nodes elsewhere. High-speed, low-latency networks now allow us to add these nodes anywhere in a cloud infrastructure and configure them under existing load balancers. This means that if storage is cheaper in one cloud datacenter, or user proximity warrants placement on foreign shores, new clusters may be strategically placed in those chosen locales.

This type of scaling, in support of hybrid and multi-cloud environments, requires that the scaling metrics and Kubernetes components be run from somewhere other than the clusters themselves. The concept of a bastion host, or jump server, allows a scheduler and controller unaffiliated with any one Kubernetes control plane to be run on behalf of an entire cloud infrastructure. This is known as cloud orchestration, and it works beyond Kubernetes cluster scaling to scale infrastructures, not just clusters.

Federating Metrics

Aggregating metrics from diverse nodes is feasible with tooling such as Prometheus. Prometheus federation is as simple as running a hierarchy of Prometheus servers in which higher-level servers scrape metrics from lower-level servers at regular intervals. The time-series data gathered by these servers may be organized into a data taxonomy through relabeling rules on each of the targeted systems. Once federated, storage solutions other than the Prometheus time-series database may be used to retain metrics over longer periods of time. The granularity and comprehensiveness of the metrics available from the Prometheus Node Exporter and Google's cAdvisor are sufficient to monitor just about any node-level metric that is meaningful. These metrics make up a great data source for further analytics.
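To make the hierarchy concrete, here is a minimal sketch of a scrape job on a top-level Prometheus server that federates from two lower-level servers. The target hostnames and the match[] selectors are placeholders; honor_labels preserves the labels assigned lower in the hierarchy so the taxonomy survives aggregation.

```yaml
# prometheus.yml on a top-level (federating) Prometheus server.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true          # keep labels set by the lower-level servers
    metrics_path: '/federate'
    params:
      'match[]':                # which series to pull up the hierarchy
        - '{job="node-exporter"}'
        - '{job="cadvisor"}'
    static_configs:
      - targets:                # placeholder lower-level Prometheus servers
          - 'prometheus-us-east.example.com:9090'
          - 'prometheus-eu-west.example.com:9090'
```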

Machine Learning

Once the metrics have been organized and aggregated, they may be used as training data for a variety of Machine Learning experiments. Most metrics are linear in nature, as utilization of systems increases over time with increases in workload. This linear nature of Prometheus metrics lends itself to a linear regression or multilinear regression approach to predictive analytics. Correlations between pod replicas, simultaneous sessions, network response times, CPU, memory, and storage may all be analyzed to forecast required capacity. Add to that the goal of utilizing the lowest-cost available infrastructure, and a Machine Learning algorithm can choose to deploy workloads to the most economical environment. When criteria such as high availability or reliability are involved, environments with superior fault tolerance may be chosen. A virtually unlimited set of correlations and possibilities may be developed that allow AIOps systems to deploy workloads in ways that improve performance while saving money.
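A minimal sketch of such a forecast, using scikit-learn, might look like the following. The CSV export of federated metrics and its column names (sessions, replicas, cpu_utilization) are hypothetical, as is the 70% threshold.

```python
# Multilinear regression sketch: forecast CPU utilization from
# simultaneous sessions and running pod replicas.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV exported from the federated Prometheus data.
metrics = pd.read_csv("federated_metrics.csv")

features = metrics[["sessions", "replicas"]]
target = metrics["cpu_utilization"]
model = LinearRegression().fit(features, target)

# Forecast utilization at an expected peak of 50,000 sessions served
# by 40 replicas; flag the need for capacity if it breaches 70%.
peak = pd.DataFrame({"sessions": [50_000], "replicas": [40]})
predicted = model.predict(peak)[0]
if predicted > 70.0:
    print(f"Forecast CPU {predicted:.1f}% - provision additional capacity")
```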

Automated Kubernetes Deployments

The final piece of the puzzle is the ability for systems to deploy systems. OpenStack is a proven Infrastructure-as-a-Service (IaaS) solution that has the ability to scale virtual machines based on the metrics of existing nodes. In the public cloud, the mainline vendors provide command-line interfaces and APIs (Application Programming Interfaces) that allow systems to instantiate servers in an automated way. Take these cloud IaaS capabilities, implement them within a Kubernetes installer, and you have the ability to increase the number of active nodes in a cluster and configure the Kubernetes components once each server is instantiated. Kubernetes Operations (kops) is one such installer. When a cluster is first configured, kops allows a minimum and maximum number of nodes to be specified, which is what enables Kubernetes cluster autoscaling.
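Those bounds live in the kops InstanceGroup spec. The sketch below is illustrative only; the cluster name, machine type, and subnet are placeholders.

```yaml
# Sketch of a kops InstanceGroup spec with autoscaling bounds.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
  labels:
    kops.k8s.io/cluster: demo.example.com   # placeholder cluster name
spec:
  role: Node
  machineType: t3.medium                    # placeholder instance type
  minSize: 3                                # floor for the node count
  maxSize: 10                               # ceiling for scale-out
  subnets:
    - us-east-1a                            # placeholder subnet/zone
```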

Taken one step further, kops may be called by a Python module to add nodes to an existing cluster or to create whole new clusters in other cloud environments. Since kops operates from a bastion host, it can serve as a control node that responds to the predictive analytics produced by the Machine Learning programs. Responding to metrics and predicting needed capacity in this way makes it possible to forecast time-of-day growth and contraction, or even seasonal workloads such as those experienced by retailers and payment providers in the United States on Black Friday.
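A minimal sketch of such a control-node module follows, assuming a kops-managed cluster. The cluster name, state store, and fixed target size are hypothetical; a real system would derive the target from the regression forecast above.

```python
# Hypothetical bastion-host module that resizes a kops-managed
# instance group in response to a capacity forecast.
import subprocess
import tempfile

import yaml  # PyYAML

CLUSTER = "demo.example.com"            # placeholder cluster name
STATE = "s3://example-kops-state"       # placeholder kops state store

def scale_node_group(target_size: int) -> None:
    """Resize the 'nodes' instance group to the forecast capacity."""
    # Fetch the current InstanceGroup spec from the kops state store.
    spec = subprocess.run(
        ["kops", "get", "ig", "nodes", "--name", CLUSTER,
         "--state", STATE, "-o", "yaml"],
        check=True, capture_output=True, text=True,
    ).stdout

    ig = yaml.safe_load(spec)
    ig["spec"]["minSize"] = target_size
    ig["spec"]["maxSize"] = max(target_size, ig["spec"]["maxSize"])

    # Push the modified spec back and apply it to the cloud provider.
    with tempfile.NamedTemporaryFile("w", suffix=".yaml") as f:
        yaml.safe_dump(ig, f)
        f.flush()
        subprocess.run(["kops", "replace", "-f", f.name,
                        "--state", STATE], check=True)
    subprocess.run(["kops", "update", "cluster", "--name", CLUSTER,
                    "--state", STATE, "--yes"], check=True)

# Example: scale to the node count predicted for a Black Friday peak.
scale_node_group(12)
```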

Since scaling a cloud infrastructure with additional machine instances, and laying Kubernetes clusters down on them, takes tens of minutes, having proactive governors of capacity may be the difference between outages and blackouts on the one hand and satisfied customers on the other. To achieve this capacity with conventional virtualization, it has been necessary to overbuild web farms and instantiate excess capacity. AIOps optimizes this by providing the opportunity to scale out and retract as needed, eliminating the server sprawl that has created the entropy of virtualization.

Conclusions

At the time of this writing, Gartner has only recently begun to publish its quadrants of AIOps solutions. Most of these are proprietary, and few open-source solutions exist. However, with open-source tooling such as Prometheus, Machine-Learning-friendly languages such as Python, and the availability of open-source installers such as kops, it is only a matter of time before we all enjoy the smartest and most efficient systems administrator of our lifetime: an AI.

Linux Academy has recently published courses covering the AIOps and Python technologies mentioned in this article.
