Kubernetes HPA with Custom Metrics from Prometheus

A worked example and step-by-step guide for autoscaling Kubernetes pods based on custom metrics.

Cezar Romaniuc
Towards Data Science


I decided to write these steps because I was recently involved in migrating a complex application from AWS to GCP, and, as with other systems of this kind, its SLAs could not be met by relying on CPU or memory usage metrics alone.

Autoscaling is an approach to automatically scaling workloads up or down based on resource usage. The K8s Horizontal Pod Autoscaler (HPA):

  • is implemented as a control loop that periodically queries the Resource Metrics API (metrics.k8s.io) for core metrics such as CPU and memory, and the Custom and External Metrics APIs (custom.metrics.k8s.io and external.metrics.k8s.io) for application-specific metrics. The latter are served by “adapter” API servers provided by metrics solution vendors; several implementations exist, but none of them is officially part of Kubernetes. (A quick way to check which of these APIs are registered in a cluster is shown right after this list.)
  • automatically scales the number of pods in a deployment or replica set based on the observed metrics.
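
Listing the corresponding APIService objects shows which metrics APIs a cluster actually serves. The group versions below are the usual ones, but what is present depends on which metrics components are installed:

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl get apiservice v1beta1.custom.metrics.k8s.io
kubectl get apiservice v1beta1.external.metrics.k8s.io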

In what follows we’ll focus on custom metrics, because the Custom Metrics API is what makes it possible for monitoring systems like Prometheus to expose application-specific metrics to the HPA controller.

In order to scale based on custom metrics we need two components:

  • One that collects metrics from our applications and stores them in the Prometheus time-series database (a scrape-configuration sketch for this follows the list).
  • One that extends the Kubernetes Custom Metrics API with the metrics supplied by the collector: the k8s-prometheus-adapter, an implementation of the Custom Metrics API that aims to support arbitrary metrics.
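
For the first component, one common setup (assuming the community Prometheus Helm chart with its default pod scrape job) is to let Prometheus discover the application pods through annotations on the pod template. The port below is only an assumption matching the example metrics later in this post:

# Illustrative pod template metadata for the application Deployment
metadata:
  annotations:
    prometheus.io/scrape: "true"   # ask Prometheus to scrape this pod
    prometheus.io/port: "9102"     # port where the /metrics endpoint is served
    prometheus.io/path: "/metrics" # path of the metrics endpoint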

Step-by-step guide on configuring HPA

1. Let’s assume that our application (named myapplication) publishes the following two application-specific metrics to Prometheus, which is reachable in our cluster at http://prometheus-server.prometheus:
myapplication_api_response_time_count{endpoint="api/users",environment="test",environment_type="development",instance="10.4.66.85:9102",job="myapplication-pods",namespace="myapplication",pod="myapplication-85cfb49cf6-kvl2v",status_code="2xx",verb="GET"}

and

myapplication_api_response_time_sum{endpoint="api/users",environment="test",environment_type="development",instance="10.4.66.85:9102",job="myapplication-pods",namespace="myapplication",pod="myapplication-85cfb49cf6-kvl2v",status_code="2xx",verb="GET"}

We would like to scale our application pods based on the endpoint latency.
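
Because Prometheus stores the cumulative sum and count of response times, the per-pod average latency over a time window can be derived with a query along these lines (a sketch: the 5-minute window and per-pod grouping are choices for this example, not requirements):

sum(rate(myapplication_api_response_time_sum[5m])) by (pod)
  / sum(rate(myapplication_api_response_time_count[5m])) by (pod)

This is essentially the query that the adapter will run on our behalf in step 3.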

2. Since we’ve got Prometheus metrics, it makes sense to use the Prometheus adapter to serve them through the Custom Metrics API. A Helm chart is listed on the Kubeapps Hub as stable/prometheus-adapter and can be used to install the adapter:

helm install --name my-release-name stable/prometheus-adapter
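
Note that the --name flag is Helm 2 syntax and the stable repository has since been deprecated; with Helm 3 and the prometheus-community repository the equivalent install would look roughly like this:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install my-release-name prometheus-community/prometheus-adapter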

3. Configure the adapter with the myapplication_api_response_time_avg custom metric:

prometheus-adapter:
  prometheus:
    url: http://prometheus-server.prometheus
    port: 80

  rules:
    custom:
      - seriesQuery: '{__name__=~"myapplication_api_response_time_.*",namespace!="",pod!=""}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: ^(.*)
          as: "myapplication_api_response_time_avg"
        metricsQuery: 1000 * (sum(rate(myapplication_api_response_time_sum[5m]) > 0) by (<<.GroupBy>>) / sum(rate(myapplication_api_response_time_count[5m]) > 0) by (<<.GroupBy>>))

We are exposing myapplication_api_response_time_avg, and this is what the HPA will query. Each rule has to specify a few resource overrides, and metricsQuery tells the adapter which Prometheus query to execute when retrieving data.
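
To confirm that the adapter has picked up the rule, we can list everything it exposes under the Custom Metrics API (jq is only used here for readability); myapplication_api_response_time_avg should appear in the output:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'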

4. Check the value of the metric using the following command, which sends a raw GET request to the Kubernetes API server:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/myapplication/pods/*/myapplication_api_response_time_avg" | jq .

Response:

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/myapplication/pods/*/myapplication_api_response_time_avg"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "myapplication",
        "name": "myapplication-85cfb49cf6-54hhf",
        "apiVersion": "/v1"
      },
      "metricName": "myapplication_api_response_time_avg",
      "timestamp": "2020-06-24T07:24:13Z",
      "value": "10750m",
      "selector": null
    },
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "myapplication",
        "name": "myapplication-85cfb49cf6-kvl2v",
        "apiVersion": "/v1"
      },
      "metricName": "myapplication_api_response_time_avg",
      "timestamp": "2020-06-24T07:24:13Z",
      "value": "12",
      "selector": null
    }
  ]
}

Notice that the API uses Kubernetes-style quantities to describe metric values. The most common one to see in the metrics API is the m suffix, which means milli-units, or thousandths of a unit. If the metric is an exact whole number of units, there may be no suffix at all.

For example, here 10750m would be 10.75 ms and 12 would be 12 ms.

5. Create an HPA that will scale up myapplication-deployment if the latency exposed by myapplication_api_response_time_avg goes over 500 ms. After a couple of seconds, the HPA fetches the myapplication_api_response_time_avg value from the Custom Metrics API.

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: myapplication-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapplication-deployment
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metricName: myapplication_api_response_time_avg
      targetAverageValue: "500"
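
Assuming the manifest above is saved to a file such as myapplication-hpa.yaml (the file name is just an example), it can be applied in the application namespace with:

kubectl apply -f myapplication-hpa.yaml -n myapplication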

6. Check the newly created HPA. We may notice that the autoscaler doesn’t react immediately to latency spikes. By default, the metrics sync happens once every 30 seconds, and scaling up or down can only happen if there was no rescaling within the last 3–5 minutes. In this way, the HPA prevents rapid execution of conflicting decisions and gives the Cluster Autoscaler time to kick in.

kubectl describe hpa myapplication-hpa -n myapplication
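
The sync and stabilization intervals mentioned above are controlled by kube-controller-manager flags; their exact defaults vary by Kubernetes version, and on managed clusters they are usually not adjustable:

--horizontal-pod-autoscaler-sync-period
--horizontal-pod-autoscaler-downscale-stabilization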

Conclusion

Autoscaling is a common requirement in virtually every production-ready system, and that was the case with the application mentioned in the introduction, where we had to autoscale based on latency in order to handle traffic bursts. By instrumenting the application and exposing the right metrics through Prometheus, we could fine-tune the autoscaling to better handle bursts and ensure high availability.
