@max-rocket-internet
Last active December 16, 2024 02:47
How to display Kubernetes request and limit in Grafana / Prometheus properly

CPU: percentage of limit

A lot of people land here when trying to find out how to calculate CPU usage correctly in Prometheus, myself included! So I'll post what I eventually ended up using, as I think it's still a little difficult to tie together all the snippets of info here and elsewhere.

This is specific to k8s and containers that have CPU limits set.

To show CPU usage as a percentage of the limit given to the container, this is the Prometheus query we used to create nice graphs in Grafana:

sum(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", container_name!="POD"}[5m])) by (pod_name, container_name) /
sum(container_spec_cpu_quota{name!~".*prometheus.*", image!="", container_name!="POD"}/container_spec_cpu_period{name!~".*prometheus.*", image!="", container_name!="POD"}) by (pod_name, container_name)

It returns a number between 0 and 1, so either format the left Y axis as percent (0.0-1.0) or multiply by 100 to get the CPU usage percentage.
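
To make the arithmetic concrete, here is a minimal Python sketch of what the query computes for a single container. The function name and the sample numbers are made up for illustration, and it ignores the extrapolation that rate() does at the window edges. The one thing carried over from the query is that container_spec_cpu_quota divided by container_spec_cpu_period gives the limit in cores.

```python
def cpu_fraction_of_limit(cpu_seconds_start, cpu_seconds_end, window_seconds,
                          quota_us, period_us):
    """Return CPU usage as a fraction of the CPU limit (hypothetical helper)."""
    # rate(container_cpu_usage_seconds_total[5m]): CPU-seconds used per second,
    # i.e. the average number of cores in use over the window.
    cores_used = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    # container_spec_cpu_quota / container_spec_cpu_period: the limit in cores.
    limit_cores = quota_us / period_us
    return cores_used / limit_cores


# Illustrative numbers: 75 CPU-seconds used in a 5 minute (300 s) window,
# with a 50000 us quota per 100000 us period (a limit of 0.5 cores).
fraction = cpu_fraction_of_limit(1000.0, 1075.0, 300, 50_000, 100_000)
print(fraction)        # 0.5
print(fraction * 100)  # 50.0
```

So a container averaging 0.25 cores against a 0.5 core limit shows up as 0.5 on the graph, or 50 once multiplied by 100.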

Note that we added some filtering here to get rid of some noise: name!~".*prometheus.*", image!="", container_name!="POD". The name!~".*prometheus.*" is just because we aren't interested in the CPU usage of all the prometheus exporters running in our k8s cluster.
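
If it helps to see what those three matchers keep and drop, here is a small Python sketch that mimics them. The label sets are invented for illustration. It relies on two PromQL behaviours: regex matchers are anchored at both ends (so re.fullmatch is the closest Python equivalent), and a missing label is treated as an empty string.

```python
import re

# PromQL anchors regex matchers on both ends, so re.fullmatch is the
# closest Python equivalent of name!~".*prometheus.*".
EXCLUDE = re.compile(r".*prometheus.*")


def keep_series(labels):
    """Mimic {name!~".*prometheus.*", image!="", container_name!="POD"}."""
    return (
        EXCLUDE.fullmatch(labels.get("name", "")) is None
        and labels.get("image", "") != ""
        and labels.get("container_name", "") != "POD"
    )


# Hypothetical label sets, not taken from a real cluster.
series = [
    {"name": "k8s_app_web-1", "image": "nginx:1.25", "container_name": "app"},
    {"name": "k8s_prometheus-node-exporter_x", "image": "prom/node-exporter",
     "container_name": "node-exporter"},
    {"name": "k8s_POD_web-1", "image": "pause:3.9", "container_name": "POD"},
    {"name": "", "image": "", "container_name": ""},
]
kept = [s["name"] for s in series if keep_series(s)]
print(kept)  # ['k8s_app_web-1']
```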

Screen Shot 2019-04-24 at 10 58 31 (Title on this image is wrong)

CPU: show as cores with request/limit lines

Some applications have a small request and a large limit (to save money) or have an HPA, so showing only a percentage of the limit is sometimes not useful.

So what we do now is display the CPU usage in cores and then add a horizontal line for each of the request and limit. This shows more information and also shows the usage in the same metric that is used in k8s: CPU cores.

CPU usage

Legend: {{container_name}} in {{pod_name}}
Query: sum(rate(container_cpu_usage_seconds_total{pod_name=~"deployment-name-[^-]*-[^-]*$", image!="", container_name!="POD"}[5m])) by (pod_name, container_name)

CPU limit

Legend: limit
Query: sum(kube_pod_container_resource_limits_cpu_cores{pod=~"deployment-name-[^-]*-[^-]*$"}) by (pod)

CPU request

Legend: request
Query: sum(kube_pod_container_resource_requests_cpu_cores{pod=~"deployment-name-[^-]*-[^-]*$"}) by (pod)

You will need to edit these 3 queries for your environment so that only pods from a single deployment are returned, e.g. replace deployment-name.
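
As a rough illustration of why the pod regex ends in -[^-]*-[^-]*$, here is a Python sketch that builds the pattern for a deployment and checks it against a few pod names. The helper name and the pod names are hypothetical, and it assumes the deployment name contains no regex metacharacters.

```python
import re


def pod_pattern(deployment_name):
    """Build the pod regex used above for one deployment (hypothetical helper)."""
    # Deployment pods are named <deployment>-<replicaset hash>-<pod suffix>,
    # so exactly two hyphen-free segments follow the deployment name.
    # Assumes deployment_name has no regex metacharacters.
    return deployment_name + r"-[^-]*-[^-]*$"


pattern = re.compile(pod_pattern("web"))
pods = [
    "web-7d9f8b6c5d-x2k4q",         # pod of deployment "web"
    "web-worker-5c6d7e8f9a-abcde",  # pod of a different deployment, "web-worker"
    "webapp-7d9f8b6c5d-x2k4q",      # pod of a different deployment, "webapp"
]
# PromQL regex matchers are fully anchored, hence fullmatch.
matching = [p for p in pods if pattern.fullmatch(p)]
print(matching)  # ['web-7d9f8b6c5d-x2k4q']
```

The two [^-]* segments are what stop a deployment called web from also picking up pods of web-worker.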

The pod request/limit metrics come from kube-state-metrics.

We then add 2 series overrides to hide the request and limit in the tooltip and legend:

Screen Shot 2020-01-13 at 17 05 03

The result looks like this:

Screen Shot 2020-01-14 at 10 05 20

Queries to show memory and CPU as percentage of both request and limit

Percentage of CPU request:

round(
  100 *
    sum(
      rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m])
    ) by (pod, container_name, namespace, slave)
      /
    sum(
      kube_pod_container_resource_requests_cpu_cores{container_name!="POD"}
    ) by (pod, container_name, namespace, slave)
)

Percentage of CPU limit:

round(
  100 *
    sum(
      rate(container_cpu_usage_seconds_total{image!="", container_name!="POD"}[5m])
    ) by (pod_name, container_name, namespace, slave)
      /
    sum(
      container_spec_cpu_quota{image!="", container_name!="POD"} / container_spec_cpu_period{image!="", container_name!="POD"}
    ) by (pod_name, container_name, namespace, slave)
)

Percentage of memory request:

round(
  100 *
    sum(container_memory_working_set_bytes{image!="", container_name!="POD"}) by (container, pod, namespace, slave)
      /
    sum(kube_pod_container_resource_requests_memory_bytes{container_name!="POD"} > 0) by (container, pod, namespace, slave)
)

Percentage of memory limit:

round(
  100 *
    sum(container_memory_working_set_bytes{image!="", container_name!="POD"}) by (container, pod_name, namespace, slave)
      /
    sum(container_spec_memory_limit_bytes{image!="", container_name!="POD"} > 0) by (container, pod_name, namespace, slave)
)
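
For a feel of the numbers these queries produce, here is a small Python sketch of the memory calculation for one container. The helper and the values are made up. My assumption, not something stated above, is that the > 0 filter is there to drop series reporting 0, which as far as I know is what container_spec_memory_limit_bytes shows for a container with no limit, so that the division gives no result instead of +Inf.

```python
import math


def percent_of(usage, reference):
    """Mimic round(100 * usage / (reference > 0)) for one series (hypothetical helper)."""
    # The `> 0` filter drops the series, so there is nothing to divide by.
    if reference <= 0:
        return None
    # PromQL round() rounds to the nearest integer, with ties rounding up.
    return math.floor(100 * usage / reference + 0.5)


MiB = 1024 * 1024
working_set = 384 * MiB  # illustrative container_memory_working_set_bytes value

print(percent_of(working_set, 256 * MiB))  # 150 (above the request)
print(percent_of(working_set, 512 * MiB))  # 75 (below the limit)
print(percent_of(working_set, 0))          # None (no limit set)
```

A value above 100 for the request is normal whenever the limit is larger than the request.
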
@athreyapatel

athreyapatel commented Mar 15, 2021

Hey,
It's a great article.
Have you tried a query for getting CPU usage % and memory usage % based on the machine's number of CPU cores and its memory, respectively?
It would be great if you could add that too.

@ntantri

ntantri commented Jun 22, 2021

Great article!

@ctouil

ctouil commented Nov 29, 2021

Hey!!
Can someone help me with a query that shows each pod's CPU and memory usage per node?
I've tested these two queries, but they don't include the node variable.
CPU: sum(node_namespace_pod:container_cpu_usage_seconds_total:sum_rate) by (pod_name)
Memory: sum(namespace_pod_name_container_name:container_memory_usage_bytes:sum_rate) by (pod_name)
I want to add the node variable so that the results change whenever a different node name is selected.

@Lemon-le

Lemon-le commented Jun 1, 2022

Great article!

@sillyfrog

sillyfrog commented Jul 21, 2022

Thanks for posting such a concise and useful article - for a newbie this is a real help!

For my setup (running in AWS with Fargate), I found I needed to adapt the above to use the kube_pod_container_resource_requests key, for example, Percentage of CPU request:

round(
  100 *
    sum(
      rate(container_cpu_usage_seconds_total{container_name!="POD"}[1m])
    ) by (pod, container_name, namespace, slave)
      /
    sum(
      kube_pod_container_resource_requests{container_name!="POD",resource="cpu"}
    ) by (pod, container_name, namespace, slave)
)

Or of CPU limit:

round(
  100 *
    sum(
      rate(container_cpu_usage_seconds_total{container_name!="POD"}[1m])
    ) by (pod, container_name, namespace, slave)
      /
    sum(
      kube_pod_container_resource_limits{container_name!="POD",resource="cpu"}
    ) by (pod, container_name, namespace, slave)
)

And memory request:

round(
  100 *
    sum(container_memory_working_set_bytes{image!="", container_name!="POD"}) by (container, pod, namespace, slave)
      /
    sum(kube_pod_container_resource_requests{container_name!="POD",resource="memory"} > 0) by (container, pod, namespace, slave)
)

As mentioned, I'm new to this, so I may not have the correct setup to get the keys mentioned above, but this looks like it'll work for me.

@mgfnv9

mgfnv9 commented Jan 25, 2023

Thanks for posting this useful article. Could you share this as a dashboard?

@wwwhaoxu

Why does the percentage of CPU limit exceed 100%?

@dongjiang1989

Thanks for posting. With kube-state-metrics v2.6.0, I found I needed to adapt the above to use kube_pod_container_resource_requests or kube_pod_container_resource_limits.

Percentage of CPU request:

round(
  100 *
    sum(
      rate(container_cpu_usage_seconds_total[1m])
    ) by (pod, container, namespace)
      /
    sum(
      kube_pod_container_resource_requests{resource="cpu"}
    ) by (pod, container, namespace)
)

Percentage of CPU limit:

round(
  100 *
    sum(
      rate(container_cpu_usage_seconds_total[1m])
    ) by (pod, container, namespace)
      /
    sum(
      kube_pod_container_resource_limits{resource="cpu"}
    ) by (pod, container, namespace)
)

Percentage of memory request:

round(
  100 *
    sum(container_memory_working_set_bytes{image!=""}) by (pod, container, namespace)
      /
    sum(kube_pod_container_resource_requests{resource="memory"} > 0) by (pod, container, namespace)
)

Percentage of memory limits:

round(
  100 *
    sum(container_memory_working_set_bytes{image!=""}) by (pod, container, namespace)
      /
    sum(kube_pod_container_resource_limits{resource="memory"} > 0) by (pod, container, namespace)
)
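
Since the four queries above differ only in the resource and in requests versus limits, they can be generated from one template. This is just a sketch in Python: the function names are made up, and the metric names, the label names and the 1m window are copied from the queries above.

```python
GROUP_BY = "(pod, container, namespace)"


def usage_expr(resource):
    """Usage side of the ratio, as written in the queries above."""
    if resource == "cpu":
        return "rate(container_cpu_usage_seconds_total[1m])"
    return 'container_memory_working_set_bytes{image!=""}'


def percent_query(resource, kind):
    """Build one query. resource: 'cpu' or 'memory'; kind: 'requests' or 'limits'."""
    reference = f'kube_pod_container_resource_{kind}{{resource="{resource}"}}'
    if resource == "memory":
        # The memory queries above filter out zero values before dividing.
        reference += " > 0"
    return (
        f"round(100 * sum({usage_expr(resource)}) by {GROUP_BY}"
        f" / sum({reference}) by {GROUP_BY})"
    )


for resource in ("cpu", "memory"):
    for kind in ("requests", "limits"):
        print(percent_query(resource, kind))
```

Each printed line is one of the four queries collapsed onto a single line, ready to paste into a Grafana panel.
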

@dongjiang1989

dongjiang1989 commented Nov 17, 2023

Why does the percentage of CPU limit exceed 100%?

Maybe the CPU burst feature is enabled (cpu.cfs_burst_us).

@lutz108

lutz108 commented Oct 10, 2024

Thank you very much for this article, it really helps me get and calculate stats for our gitlab-runner pods. However, there is one point I keep stumbling over when trying to find a good expression for the CPU usage of a container or machine.
What would you say about load versus utilization? In general, I would rank load as the more useful metric for seeing whether a system is being slowed down by heavy demand. High utilization, however, does not necessarily mean slowed-down processes; it can also reflect good use of the provided resources.
At the node/machine level, I would tend to use load. The queries discussed here are utilization-based. Any hints on how to get container load, or reasons why it's not a useful metric?
