Prometheus Cheatsheet - PromQL Queries & Configuration

Comprehensive Prometheus cheatsheet with PromQL query examples, scrape configurations, recording rules, and external resources for effective monitoring.

Prometheus Cheatsheet Overview

Prometheus Metric Types and Basics
  • Counter: A metric that only increases. Useful for tracking counts of events like requests or errors.
  • Gauge: A metric that can increase or decrease. Ideal for measuring current values like memory usage or temperature.
  • Histogram: A metric that samples observations (e.g., request durations) and counts them in configurable buckets. Also provides a sum and count of all observations.
  • Summary: Similar to a histogram, but calculates quantiles on the client side. Less common than histograms in modern Prometheus setups.
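
Example queries for each metric type above; a hedged sketch where http_requests_total and http_request_duration_seconds are illustrative metric names rather than metrics from a specific exporter:

# Counter: wrap in rate()/increase() rather than graphing the raw value
rate(http_requests_total[5m])

# Gauge: use the value directly, or smooth it over a time window
avg_over_time(node_memory_MemAvailable_bytes[5m])

# Histogram: compute quantiles server-side from the _bucket series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Summary: quantiles are pre-computed client-side and exposed via the "quantile" label
http_request_duration_seconds{quantile="0.95"}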
PromQL Query Functions
  • rate: Calculates the per-second average rate of increase for a counter over a specified time window.
  • irate: Calculates the *instantaneous* per-second rate of increase for a counter, using only the last two data points. Best for volatile counters.
  • increase: Calculates the total increase of a counter over a specified time frame.
  • resets: Counts the number of times a counter has been reset within a given time window.
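
Example usage of each function above, assuming an illustrative counter named http_requests_total:

# rate: per-second average rate of increase over the last 5 minutes
rate(http_requests_total[5m])

# irate: instantaneous rate based on the last two samples in the window
irate(http_requests_total[5m])

# increase: total increase over the last hour
increase(http_requests_total[1h])

# resets: number of counter resets (e.g. process restarts) in the last hour
resets(http_requests_total[1h])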
Curated Examples by Exporter

Example queries categorized by common exporters:

  • Node Exporter Metrics
PromQL Example Queries

General Queries

Show all metric names for the job "app":

group ({job="app"}) by (__name__)

Show the up (1) or down (0) status of every target:

up

Count targets per job:

count by (job) (up)
Combining and Manipulating Metrics

Combining values from 2 different vectors (Hostname with a Metric):

up * on(instance) group_left(nodename) (node_uname_info)

Exclude labels:

sum without(job) (up * on(instance) group_left(nodename) (node_uname_info))
Memory and CPU Usage

Amount of Memory Available:

node_memory_MemAvailable_bytes

Amount of Memory Available in MB:

node_memory_MemAvailable_bytes / 1024 / 1024

Amount of Memory Available in MB 10 minutes ago:

node_memory_MemAvailable_bytes / 1024 / 1024 offset 10m

Average Free Memory in MB over the Last 5 Minutes:

avg_over_time(node_memory_MemFree_bytes[5m]) / 1024 / 1024

Memory Usage in Percent:

100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))

CPU Utilization:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 )

CPU Utilization offset with 24 hours ago:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m] offset 24h)) * 100 )

CPU Utilization per Core:

( (1 - rate(node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}[$__interval])) / ignoring(cpu) group_left() count without (cpu)( node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}) )

CPU Utilization by Node:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m]) * 100) * on(instance) group_left(nodename) (node_uname_info))

Memory Available by Node:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Or if you rely on labels from other metrics:

(node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"} - node_memory_Buffers_bytes{job="node-exporter"} - node_memory_Cached_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$nodename"})

Load Average in percentage:

avg(node_load1{instance=~"$name", job=~"$job"}) / count(count(node_cpu_seconds_total{instance=~"$name", job=~"$job"}) by (cpu)) * 100

Load Average per Instance:

sum(node_load5{}) by (instance) / count(node_cpu_seconds_total{mode="user"}) by (instance) * 100

Load Average, averaged per instance_id (useful when two series share the same instance label value but differ in instance_id):

avg by (instance_id, instance) (node_load1{job=~"node-exporter", aws_environment="dev", instance="debug-dev"})
Disk and Network IO

Disk Available by Node:

node_filesystem_free_bytes{mountpoint="/"} * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Reads:

sum(rate(node_disk_read_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Writes:

sum(rate(node_disk_written_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Network IO per Node:

sum(rate(node_network_receive_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
sum(rate(node_network_transmit_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
Process and Container Metrics

Process Restarts:

changes(process_start_time_seconds{job=~".+"}[15m])

Container Cycling:

(time() - container_start_time_seconds{job=~".+"}) < 60
Histogram and Time-Series Analysis

Histogram Quantile (e.g., 95th percentile request duration, converted from seconds to milliseconds):

histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (le, handler)) * 1e3

Metrics 24 hours ago (useful for comparison):

# query a
total_number_of_errors{instance="my-instance", region="eu-west-1"}
# query b
total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h
Container Orchestration (Kubernetes/Swarm)

Number of Nodes (Up):

count(up{job="cadvisor_my-swarm"} == 1)

Running Containers per Node:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id)

Running Containers per Node, include corresponding hostnames:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id) * ON (container_label_com_docker_swarm_node_id) GROUP_LEFT(node_name) node_meta
HAProxy Metrics

HAProxy Response Codes:

haproxy_server_http_responses_total{backend=~"$backend", server=~"$server", code=~"$code", alias=~"$alias"} > 0
Resource Usage and Top Metrics

Metrics with the most resources (time series count):

topk(10, count by (__name__)({__name__=~".+"}))

Top 10 metrics per job:

topk(10, count by (__name__, job)({__name__=~".+"}))

Jobs with the most time series:

topk(10, count by (job)({__name__=~".+"}))

Top 5 per value (e.g., highest costs):

sort_desc(topk(5, aws_service_costs))

Table - Top 5 (enable the "Instant" query option in Grafana as well):

sort(topk(5, aws_service_costs))

Most metrics per job, sorted:

sort_desc (sum by (job) (count by (__name__, job)({job=~".+"})))

Group per Day (Table) - WIP

aws_service_costs{service=~"$service"} + ignoring(year, month, day) group_right
  count_values without() ("year", year(timestamp(
    count_values without() ("month", month(timestamp(
      count_values without() ("day", day_of_month(timestamp(
        aws_service_costs{service=~"$service"}
      )))
    )))
  ))) * 0

Group Metrics per node hostname:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Subtract two gauge metrics (exclude the label that doesn't match):

polkadot_block_height{instance="polkadot", chain=~"$chain", status="sync_target"} - ignoring(status) polkadot_block_height{instance="polkadot", chain=~"$chain", status="finalized"}

Conditional joins when labels exist:

(
    # For all sensors that have a name (label "label"), join them with `node_hwmon_sensor_label` to get that name.
    (node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label)
  or
    # For all sensors that do NOT have a name (label "label") in `node_hwmon_sensor_label`, assign them `label="unknown-sensor-name"`.
    (label_replace((node_hwmon_temp_celsius unless ignoring(label) node_hwmon_sensor_label), "label", "unknown-sensor-name", "", ".*"))
)

Container CPU Average for 5m:

(sum by(instance, container_label_com_amazonaws_ecs_container_name, container_label_com_amazonaws_ecs_cluster) (rate(container_cpu_usage_seconds_total[5m])) * 100)

Container Memory Usage: Total:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})

Container Memory, per Task, Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id)

Container Memory per Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_node_id)

Memory Usage per Stack:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_stack_namespace)

Remove metrics from results that do not contain a specific label:

container_cpu_usage_seconds_total{container_label_com_amazonaws_ecs_cluster!=""}

Remove labels from a metric:

sum without (age, country) (people_metrics)

View all metrics for a specific job:

{__name__=~".+", job="node-exporter"}

View all metrics for more than one job using vector selectors:

{__name__=~".+", job=~"traefik|cadvisor|prometheus"}

Website uptime with blackbox-exporter:

# https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for
avg_over_time(probe_success{job="node"}[15m]) * 100

Remove / Replace label:
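
A minimal sketch using label_replace(), which writes a new label derived from an existing one (the "host" label name and the regex here are illustrative); to remove a label at query time, aggregate it away with sum without (...) as shown above:

# Copy the hostname portion of the instance label into a new "host" label
label_replace(up, "host", "$1", "instance", "(.*):.*")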

Client and Server Request Metrics

Client Request Counts:

irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Client Response Time:

irate(http_client_requests_seconds_sum{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m]) /
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Requests per Minute:

sum(increase(http_server_requests_seconds_count{service="my-service", env="dev"}[1m])) by (uri)

is the same as:

sum(rate(http_server_requests_seconds_count{service="my-service", env="dev"}[1m]) * 60 ) by (uri)

Both return requests per minute; increase() over a range is equivalent to rate() multiplied by the number of seconds in that range.

p95 Request Latencies with histogram_quantile (the value in seconds below which 95% of request durations fall):

histogram_quantile(0.95, sum by (le, store) (rate(myapp_latency_seconds_bucket{application="product-service", category=~".+"}[5m])))
Resource Requests and Limits (Kubernetes)

For CPU: average rate of CPU usage over 15 minutes:

rate(container_cpu_usage_seconds_total{job="kubelet",container="my-application"}[15m])

For Memory: shows usage in MB:

container_memory_usage_bytes{job="kubelet",container="my-application"} / (1024 * 1024)
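
The section heading also mentions resource requests and limits; a hedged sketch for comparing usage against configured limits, assuming kube-state-metrics v2+ is deployed and exposes kube_pod_container_resource_limits with resource and container labels (label names vary between versions):

# CPU usage as a fraction of the container's CPU limit
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container="my-application"}[15m]))
  /
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="cpu", container="my-application"})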
Prometheus Scrape Configurations

Relabeling Configurations

Example of relabeling to set the `instance` label to the target hostname (dropping the port):

scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
    - targets: ['ip-192-168-64-29.multipass:9100']
      labels:
        env: test
    - targets: ['ip-192-168-64-30.multipass:9100']
      labels:
        env: test
    relabel_configs:
    - source_labels: [__address__]
      separator: ':'
      regex: '(.*):(.*)'
      replacement: '${1}'
      target_label: instance
Static Configurations

Scraping a static list of targets:

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          region: 'eu-west-1'
Service Discovery (DNS SD)

Using DNS service discovery for MySQL exporter:

scrape_configs:
  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    dns_sd_configs:
    - names:
      - 'tasks.mysql-exporter'
      type: 'A'
      port: 9104
    relabel_configs:
    - source_labels: [__address__]
      regex: '.*'
      target_label: instance
      replacement: 'mysqld-exporter'
Grafana with Prometheus

Customizing legend labels in Grafana:

# If your output is like: {instance="10.0.2.66:9100",job="node",nodename="rpi-02"}
# Use this in the Grafana "Legend" input:
{{nodename}}

Displaying specific labels:

# If you want to show 'exported_instance'
sum(exporter_memory_usage{exported_instance="myapp"})
# You might need to group by it:
sum by (exported_instance) (exporter_memory_usage{exported_instance="myapp"})
# Then in Grafana Legend:
{{exported_instance}}
Grafana Variables
  • Hostname Variable:

    Name: node

    Label: node

    Query: label_values(node_uname_info, nodename)

    Usage in query: node_uname_info{nodename=~"$node"}

  • Node Exporter Address Variable:

    Type: Query

    Query: label_values(node_network_up, instance)

  • MySQL Exporter Address Variable:

    Type: Query

    Query: label_values(mysql_up, instance)

  • Static Values Variable:

    Type: Custom

    Name: dc

    Label: dc

    Values: eu-west-1a,eu-west-1b,eu-west-1c

  • Docker Swarm Stack Names Variable:

    Name: stack

    Label: stack

    Query: label_values(container_last_seen,container_label_com_docker_stack_namespace)

  • Docker Swarm Service Names Variable:

    Name: service_name

    Label: service_name

    Query: label_values(container_last_seen,container_label_com_docker_swarm_service_name)

  • Docker Swarm Manager NodeId Variable:

    Name: manager_node_id

    Label: manager_node_id

    Query: label_values(container_last_seen{container_label_com_docker_swarm_service_name=~"proxy_traefik", container_label_com_docker_swarm_node_id=~".+"}, container_label_com_docker_swarm_node_id)

  • Docker Swarm Stacks Running on Managers Variable:

    Name: stack_on_manager

    Label: stack_on_manager

    Query: label_values(container_last_seen{container_label_com_docker_swarm_node_id=~"$manager_node_id"},container_label_com_docker_stack_namespace)

Prometheus Recording Rules

Application Instrumentation (Python Flask)

External Prometheus Resources