Prometheus Cheatsheet - PromQL Queries & Configuration

Comprehensive Prometheus cheatsheet with PromQL query examples, scrape configurations, recording rules, and external resources for effective monitoring.

Prometheus Cheatsheet Overview

Prometheus Metric Types and Basics
  • Counter: A metric that only increases. Useful for tracking counts of events like requests or errors.
  • Gauge: A metric that can increase or decrease. Ideal for measuring current values like memory usage or temperature.
  • Histogram: A metric that samples observations (e.g., request durations) and counts them in configurable buckets. Also provides a sum and count of all observations.
  • Summary: Similar to a histogram, but calculates quantiles on the client side. Less common than histograms in modern Prometheus setups.
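
Example queries for each metric type above; a hedged sketch where http_requests_total and http_request_duration_seconds are illustrative metric names rather than metrics from a specific exporter:

# Counter: wrap in rate()/increase() rather than graphing the raw value
rate(http_requests_total[5m])

# Gauge: use the value directly, or smooth it over a time window
avg_over_time(node_memory_MemAvailable_bytes[5m])

# Histogram: compute quantiles server-side from the _bucket series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Summary: quantiles are pre-computed client-side and exposed via the "quantile" label
http_request_duration_seconds{quantile="0.95"}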
PromQL Query Functions
  • rate: Calculates the per-second average rate of increase for a counter over a specified time window.
  • irate: Calculates the *instantaneous* per-second rate of increase for a counter, using only the last two data points. Best for volatile counters.
  • increase: Calculates the total increase of a counter over a specified time frame.
  • resets: Counts the number of times a counter has been reset within a given time window.
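
Example usage of each function above, assuming an illustrative counter named http_requests_total:

# rate: per-second average rate of increase over the last 5 minutes
rate(http_requests_total[5m])

# irate: instantaneous rate based on the last two samples in the window
irate(http_requests_total[5m])

# increase: total increase over the last hour
increase(http_requests_total[1h])

# resets: number of counter resets (e.g. process restarts) in the last hour
resets(http_requests_total[1h])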
Curated Examples by Exporter

Example queries categorized by common exporters:

  • Node Exporter Metrics
PromQL Example Queries

General Queries

Show all metric names for the job "app":

group ({job="app"}) by (__name__)

Show the up (1) or down (0) status of every target:

up

Count targets per job:

count by (job) (up)
Combining and Manipulating Metrics

Combining values from 2 different vectors (Hostname with a Metric):

up * on(instance) group_left(nodename) (node_uname_info)

Exclude labels:

sum without(job) (up * on(instance) group_left(nodename) (node_uname_info))
Memory and CPU Usage

Amount of Memory Available:

node_memory_MemAvailable_bytes

Amount of Memory Available in MB:

node_memory_MemAvailable_bytes / 1024 / 1024

Amount of Memory Available in MB 10 minutes ago:

node_memory_MemAvailable_bytes / 1024 / 1024 offset 10m

Average Free Memory in MB over the Last 5 Minutes:

avg_over_time(node_memory_MemFree_bytes[5m]) / 1024 / 1024

Memory Usage in Percent:

100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))

CPU Utilization:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 )

CPU Utilization offset with 24 hours ago:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m] offset 24h)) * 100 )

CPU Utilization per Core:

( (1 - rate(node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}[$__interval])) / ignoring(cpu) group_left() count without (cpu)( node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}) )

CPU Utilization by Node:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m]) * 100) * on(instance) group_left(nodename) (node_uname_info))

Memory Available by Node:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Or if you rely on labels from other metrics:

(node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"} - node_memory_Buffers_bytes{job="node-exporter"} - node_memory_Cached_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$nodename"})

Load Average in percentage:

avg(node_load1{instance=~"$name", job=~"$job"}) / count(count(node_cpu_seconds_total{instance=~"$name", job=~"$job"}) by (cpu)) * 100

Load Average per Instance:

sum(node_load5{}) by (instance) / count(node_cpu_seconds_total{mode="user"}) by (instance) * 100

Load Average, averaged per instance_id (useful when two series share the same instance label value but differ in instance_id):

avg by (instance_id, instance) (node_load1{job=~"node-exporter", aws_environment="dev", instance="debug-dev"})
Disk and Network IO

Disk Available by Node:

node_filesystem_free_bytes{mountpoint="/"} * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Reads:

sum(rate(node_disk_read_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Writes:

sum(rate(node_disk_written_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Network IO per Node:

sum(rate(node_network_receive_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
sum(rate(node_network_transmit_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
Process and Container Metrics

Process Restarts:

changes(process_start_time_seconds{job=~".+"}[15m])

Container Cycling:

(time() - container_start_time_seconds{job=~".+"}) < 60
Histogram and Time-Series Analysis

Histogram Quantile (e.g., 95th percentile request duration, converted from seconds to milliseconds):

histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (le, handler)) * 1e3

Metrics 24 hours ago (useful for comparison):

# query a
total_number_of_errors{instance="my-instance", region="eu-west-1"}
# query b
total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h
Container Orchestration (Kubernetes/Swarm)

Number of Nodes (Up):

count(up{job="cadvisor_my-swarm"} == 1)

Running Containers per Node:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id)

Running Containers per Node, include corresponding hostnames:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id) * ON (container_label_com_docker_swarm_node_id) GROUP_LEFT(node_name) node_meta
HAProxy Metrics

HAProxy Response Codes:

haproxy_server_http_responses_total{backend=~"$backend", server=~"$server", code=~"$code", alias=~"$alias"} > 0
Resource Usage and Top Metrics

Metrics with the most resources (time series count):

topk(10, count by (__name__)({__name__=~".+"}))

Top 10 metrics per job:

topk(10, count by (__name__, job)({__name__=~".+"}))

Jobs with the most time series:

topk(10, count by (job)({__name__=~".+"}))

Top 5 per value (e.g., highest costs):

sort_desc(topk(5, aws_service_costs))

Table - Top 5 (enable the "Instant" query option in Grafana as well):

sort(topk(5, aws_service_costs))

Most metrics per job, sorted:

sort_desc (sum by (job) (count by (__name__, job)({job=~".+"})))

Group per Day (Table) - WIP

aws_service_costs{service=~"$service"} + ignoring(year, month, day) group_right
  count_values without() ("year", year(timestamp(
    count_values without() ("month", month(timestamp(
      count_values without() ("day", day_of_month(timestamp(
        aws_service_costs{service=~"$service"}
      )))
    )))
  ))) * 0

Group Metrics per node hostname:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Subtract two gauge metrics (exclude the label that doesn't match):

polkadot_block_height{instance="polkadot", chain=~"$chain", status="sync_target"} - ignoring(status) polkadot_block_height{instance="polkadot", chain=~"$chain", status="finalized"}

Conditional joins when labels exist:

(
    # For all sensors that have a name (label "label"), join them with `node_hwmon_sensor_label` to get that name.
    (node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label)
  or
    # For all sensors that do NOT have a name (label "label") in `node_hwmon_sensor_label`, assign them `label="unknown-sensor-name"`.
    (label_replace((node_hwmon_temp_celsius unless ignoring(label) node_hwmon_sensor_label), "label", "unknown-sensor-name", "", ".*"))
)

Container CPU Average for 5m:

(sum by(instance, container_label_com_amazonaws_ecs_container_name, container_label_com_amazonaws_ecs_cluster) (rate(container_cpu_usage_seconds_total[5m])) * 100)

Container Memory Usage: Total:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})

Container Memory, per Task, Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id)

Container Memory per Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_node_id)

Memory Usage per Stack:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_stack_namespace)

Remove metrics from results that do not contain a specific label:

container_cpu_usage_seconds_total{container_label_com_amazonaws_ecs_cluster!=""}

Remove labels from a metric:

sum without (age, country) (people_metrics)

View all metrics for a specific job:

{__name__=~".+", job="node-exporter"}

View all metrics for more than one job using vector selectors:

{__name__=~".+", job=~"traefik|cadvisor|prometheus"}

Website uptime with blackbox-exporter:

# https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for
avg_over_time(probe_success{job="node"}[15m]) * 100

Remove / Replace label:
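
A minimal sketch using label_replace(), which writes a new label derived from an existing one (the "host" label name and the regex here are illustrative); to remove a label at query time, aggregate it away with sum without (...) as shown above:

# Copy the hostname portion of the instance label into a new "host" label
label_replace(up, "host", "$1", "instance", "(.*):.*")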

Client and Server Request Metrics

Client Request Counts:

irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Client Response Time:

irate(http_client_requests_seconds_sum{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m]) /
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Requests per Minute:

sum(increase(http_server_requests_seconds_count{service="my-service", env="dev"}[1m])) by (uri)

is the same as:

sum(rate(http_server_requests_seconds_count{service="my-service", env="dev"}[1m]) * 60 ) by (uri)

Both return requests per minute; increase() over a range is equivalent to rate() multiplied by the number of seconds in that range.

p95 Request Latencies with histogram_quantile (the value in seconds below which 95% of request durations fall):

histogram_quantile(0.95, sum by (le, store) (rate(myapp_latency_seconds_bucket{application="product-service", category=~".+"}[5m])))
Resource Requests and Limits (Kubernetes)

For CPU: average rate of CPU usage over 15 minutes:

rate(container_cpu_usage_seconds_total{job="kubelet",container="my-application"}[15m])

For Memory: shows usage in MB:

container_memory_usage_bytes{job="kubelet",container="my-application"} / (1024 * 1024)
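
The section heading also mentions resource requests and limits; a hedged sketch for comparing usage against configured limits, assuming kube-state-metrics v2+ is deployed and exposes kube_pod_container_resource_limits with resource and container labels (label names vary between versions):

# CPU usage as a fraction of the container's CPU limit
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container="my-application"}[15m]))
  /
sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="cpu", container="my-application"})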
Prometheus Scrape Configurations

Relabeling Configurations

Example of relabeling to set the `instance` label to the target hostname (dropping the port):

scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
    - targets: ['ip-192-168-64-29.multipass:9100']
      labels:
        env: test
    - targets: ['ip-192-168-64-30.multipass:9100']
      labels:
        env: test
    relabel_configs:
    - source_labels: [__address__]
      separator: ':'
      regex: '(.*):(.*)'
      replacement: '${1}'
      target_label: instance
Static Configurations

Scraping a static list of targets:

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          region: 'eu-west-1'
Service Discovery (DNS SD)

Using DNS service discovery for MySQL exporter:

scrape_configs:
  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    dns_sd_configs:
    - names:
      - 'tasks.mysql-exporter'
      type: 'A'
      port: 9104
    relabel_configs:
    - source_labels: [__address__]
      regex: '.*'
      target_label: instance
      replacement: 'mysqld-exporter'
Grafana with Prometheus

Customizing legend labels in Grafana:

# If your output is like: {instance="10.0.2.66:9100",job="node",nodename="rpi-02"}
# Use this in the Grafana "Legend" input:
{{nodename}}

Displaying specific labels:

# If you want to show 'exported_instance'
sum(exporter_memory_usage{exported_instance="myapp"})
# You might need to group by it:
sum by (exported_instance) (exporter_memory_usage{exported_instance="myapp"})
# Then in Grafana Legend:
{{exported_instance}}
Grafana Variables
  • Hostname Variable:

    Name: node

    Label: node

    Query: label_values(node_uname_info, nodename)

    Usage in query: node_uname_info{nodename=~"$node"}

  • Node Exporter Address Variable:

    Type: Query

    Query: label_values(node_network_up, instance)

  • MySQL Exporter Address Variable:

    Type: Query

    Query: label_values(mysql_up, instance)

  • Static Values Variable:

    Type: Custom

    Name: dc

    Label: dc

    Values: eu-west-1a,eu-west-1b,eu-west-1c

  • Docker Swarm Stack Names Variable:

    Name: stack

    Label: stack

    Query: label_values(container_last_seen,container_label_com_docker_stack_namespace)

  • Docker Swarm Service Names Variable:

    Name: service_name

    Label: service_name

    Query: label_values(container_last_seen,container_label_com_docker_swarm_service_name)

  • Docker Swarm Manager NodeId Variable:

    Name: manager_node_id

    Label: manager_node_id

    Query: label_values(container_last_seen{container_label_com_docker_swarm_service_name=~"proxy_traefik", container_label_com_docker_swarm_node_id=~".+"}, container_label_com_docker_swarm_node_id)

  • Docker Swarm Stacks Running on Managers Variable:

    Name: stack_on_manager

    Label: stack_on_manager

    Query: label_values(container_last_seen{container_label_com_docker_swarm_node_id=~"$manager_node_id"},container_label_com_docker_stack_namespace)

Prometheus Recording Rules

Application Instrumentation (Python Flask)

External Prometheus Resources