- Prometheus Metric Types and Basics
- Curated Examples by Exporter
- PromQL Example Queries
- Prometheus Scrape Configurations
- Prometheus Recording Rules
- External Prometheus Resources
- Counter: A metric that only increases. Useful for tracking counts of events like requests or errors.
- Gauge: A metric that can increase or decrease. Ideal for measuring current values like memory usage or temperature.
- Histogram: A metric that samples observations (e.g., request durations) and counts them in configurable buckets. Also provides a sum and count of all observations.
- Summary: Similar to a histogram, but calculates quantiles on the client side. Less common than histograms in modern Prometheus setups.
- Source and Statistics 101
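As a rough sketch of how each type is typically queried (http_requests_total and http_request_duration_seconds are placeholder metric names, not taken from the exporters below):
# Counter: wrap in rate()/increase() instead of graphing the raw value
rate(http_requests_total[5m])
# Gauge: use the raw value or a *_over_time aggregation
avg_over_time(node_memory_MemAvailable_bytes[5m])
# Histogram: derive quantiles from the _bucket series on the server side
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Summary: quantiles are pre-computed by the client and exposed via the "quantile" label
http_request_duration_seconds{quantile="0.95"}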
- rate: Calculates the per-second average rate of increase for a counter over a specified time window.
- irate: Calculates the *instantaneous* per-second rate of increase for a counter, using only the last two data points. Best for volatile counters.
- increase: Calculates the total increase of a counter over a specified time frame.
- resets: Counts the number of times a counter has been reset within a given time window.
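To illustrate the difference, here is the same counter with each function (node_network_receive_bytes_total is used only as an example counter):
# average per-second rate over the last 5 minutes
rate(node_network_receive_bytes_total[5m])
# instantaneous rate based on the last two samples in the window
irate(node_network_receive_bytes_total[5m])
# total bytes received over the last hour
increase(node_network_receive_bytes_total[1h])
# number of counter resets (e.g., process restarts) over the last day
resets(node_network_receive_bytes_total[1d])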
Example queries categorized by common exporters:
- Node Exporter Metrics
Show all metric names for the job "app":
group ({job="app"}) by (__name__)
Which targets are up? (1 = up, 0 = down):
up
Count targets per job:
count by (job) (up)
Combining values from 2 different vectors (Hostname with a Metric):
up * on(instance) group_left(nodename) (node_uname_info)
Exclude labels:
sum without(job) (up * on(instance) group_left(nodename) (node_uname_info))
Amount of Memory Available:
node_memory_MemAvailable_bytes
Amount of Memory Available in MB:
node_memory_MemAvailable_bytes / 1024 / 1024
Amount of Memory Available in MB 10 minutes ago:
node_memory_MemAvailable_bytes / 1024 / 1024 offset 10m
Average Memory Free in MB over the Last 5 Minutes:
avg_over_time(node_memory_MemFree_bytes[5m]) / 1024 / 1024
Memory Usage in Percent:
100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))
CPU Utilization:
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 )
CPU Utilization 24 hours ago (using offset):
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m] offset 24h)) * 100 )
CPU Utilization per Core:
( (1 - rate(node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}[$__interval])) / ignoring(cpu) group_left() count without (cpu)( node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}) )
CPU Utilization by Node:
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m]) * 100) * on(instance) group_left(nodename) (node_uname_info))
Memory Available by Node:
node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)
Or if you rely on labels from other metrics:
(node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"} - node_memory_Buffers_bytes{job="node-exporter"} - node_memory_Cached_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$nodename"})
Load Average as a percentage of available cores:
avg(node_load1{instance=~"$name", job=~"$job"}) / count(count(node_cpu_seconds_total{instance=~"$name", job=~"$job"}) by (cpu)) * 100
Load Average per Instance:
sum(node_load5{}) by (instance) / count(node_cpu_seconds_total{mode="user"}) by (instance) * 100
Load Average, averaged per instance_id and instance (useful when two series share the same instance label value but differ in instance_id):
avg by (instance_id, instance) (node_load1{job=~"node-exporter", aws_environment="dev", instance="debug-dev"})
Disk Available by Node:
node_filesystem_free_bytes{mountpoint="/"} * on(instance) group_left(nodename) (node_uname_info)
Disk IO per Node: read bytes:
sum(rate(node_disk_read_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
Disk IO per Node: written bytes:
sum(rate(node_disk_written_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
Network IO per Node:
sum(rate(node_network_receive_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
sum(rate(node_network_transmit_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
Process Restarts:
changes(process_start_time_seconds{job=~".+"}[15m])
Container Cycling (containers that started within the last 60 seconds):
(time() - container_start_time_seconds{job=~".+"}) < 60
Histogram Quantile (e.g., 95th percentile request duration, in milliseconds):
histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (le, handler)) * 1e3
Metrics 24 hours ago (useful for comparison):
# query a
total_number_of_errors{instance="my-instance", region="eu-west-1"}
# query b
total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h
Number of Nodes (Up):
count(up{job="cadvisor_my-swarm"} == 1)
Running Containers per Node:
count(container_last_seen) BY (container_label_com_docker_swarm_node_id)
Running Containers per Node, include corresponding hostnames:
count(container_last_seen) BY (container_label_com_docker_swarm_node_id) * ON (container_label_com_docker_swarm_node_id) GROUP_LEFT(node_name) node_meta
HAProxy Response Codes:
haproxy_server_http_responses_total{backend=~"$backend", server=~"$server", code=~"$code", alias=~"$alias"} > 0
Metrics with the most resources (time series count):
topk(10, count by (__name__)({__name__=~".+"}))
Top 10 metrics per job:
topk(10, count by (__name__, job)({__name__=~".+"}))
Jobs with the most time series:
topk(10, count by (job)({__name__=~".+"}))
Top 5 per value (e.g., highest costs):
sort_desc(topk(5, aws_service_costs))
Table - Top 5 (enable the "Instant" query option in Grafana as well):
sort(topk(5, aws_service_costs))
Most metrics per job, sorted:
sort_desc (sum by (job) (count by (__name__, job)({job=~".+"})))
Group per Day (Table) - WIP
aws_service_costs{service=~"$service"} + ignoring(year, month, day) group_right
count_values without() ("year", year(timestamp(
count_values without() ("month", month(timestamp(
count_values without() ("day", day_of_month(timestamp(
aws_service_costs{service=~"$service"}
)))
)))
))) * 0
Group Metrics per node hostname:
node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)
Subtract two gauge metrics (exclude the label that doesn't match):
polkadot_block_height{instance="polkadot", chain=~"$chain", status="sync_target"} - ignoring(status) polkadot_block_height{instance="polkadot", chain=~"$chain", status="finalized"}
Conditional joins when labels exist:
(
# For all sensors that have a name (label "label"), join them with `node_hwmon_sensor_label` to get that name.
(node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label)
or
# For all sensors that do NOT have a name (label "label") in `node_hwmon_sensor_label`, assign them `label="unknown-sensor-name"`.
(label_replace((node_hwmon_temp_celsius unless ignoring(label) node_hwmon_sensor_label), "label", "unknown-sensor-name", "", ".*"))
)
Container CPU Average for 5m:
(sum by(instance, container_label_com_amazonaws_ecs_container_name, container_label_com_amazonaws_ecs_cluster) (rate(container_cpu_usage_seconds_total[5m])) * 100)
Container Memory Usage: Total:
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})
Container Memory, per Task, Node:
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id)
Container Memory per Node:
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_node_id)
Memory Usage per Stack:
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_stack_namespace)
Remove metrics from results that do not contain a specific label:
container_cpu_usage_seconds_total{container_label_com_amazonaws_ecs_cluster!=""}
Remove labels from a metric:
sum without (age, country) (people_metrics)
View top 10 biggest metrics by name:
topk(10, count by (__name__)({__name__=~".+"}))
View top 10 biggest metrics by name, job:
topk(10, count by (__name__, job)({__name__=~".+"}))
View all metrics for a specific job:
{__name__=~".+", job="node-exporter"}
View all metrics for more than one job using vector selectors:
{__name__=~".+", job=~"traefik|cadvisor|prometheus"}
Website uptime with blackbox-exporter:
# https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for
avg_over_time(probe_success{job="node"}[15m]) * 100
Remove / Replace label:
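A minimal sketch using label_replace (the new "host" label below is only an illustrative name; dropping labels via aggregation is shown in the sum without example above):
# copy the host part of the "instance" label into a new "host" label
label_replace(up, "host", "$1", "instance", "(.*):.*")
# overwrite an existing label value (here, rewrite job="node" to job="node-exporter")
label_replace(up{job="node"}, "job", "node-exporter", "job", ".*")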
Client Request Counts:
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])
Client Response Time:
irate(http_client_requests_seconds_sum{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m]) /
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])
Requests per Second:
sum(increase(http_server_requests_seconds_count{service="my-service", env="dev"}[1m])) by (uri)
is the same as:
sum(rate(http_server_requests_seconds_count{service="my-service", env="dev"}[1m]) * 60 ) by (uri)
See this SO thread for more details.
p95 Request Latencies with histogram_quantile (the latency experienced by the slowest 5% of requests in seconds):
histogram_quantile(0.95, sum by (le, store) (rate(myapp_latency_seconds_bucket{application="product-service", category=~".+"}[5m])))
For CPU: average rate of CPU usage over 15 minutes:
rate(container_cpu_usage_seconds_total{job="kubelet",container="my-application"}[15m])
For Memory: shows usage in MB:
container_memory_usage_bytes{job="kubelet",container="my-application"} / (1024 * 1024)
Example of relabeling to set the `instance` label:
scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
      - targets: ['ip-192-168-64-29.multipass:9100']
        labels:
          env: test
      - targets: ['ip-192-168-64-30.multipass:9100']
        labels:
          env: test
    relabel_configs:
      - source_labels: [__address__]
        separator: ':'
        regex: '(.*):(.*)'
        replacement: '${1}'
        target_label: instance
- Full Relabeling Example
- Prometheus Relabeling Tricks
- Relabel Rules and Action Parameter
- Relabel Configs vs Metric Relabel Configs
Scraping a static list of targets:
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          region: 'eu-west-1'
Using DNS service discovery for MySQL exporter:
scrape_configs:
  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    dns_sd_configs:
      - names:
          - 'tasks.mysql-exporter'
        type: 'A'
        port: 9104
    relabel_configs:
      - source_labels: [__address__]
        regex: '.*'
        target_label: instance
        replacement: 'mysqld-exporter'
Customizing legend labels in Grafana:
# If your output is like: {instance="10.0.2.66:9100",job="node",nodename="rpi-02"}
# Use this in the Grafana "Legend" input:
{{nodename}}
Displaying specific labels:
# If you want to show 'exported_instance'
sum(exporter_memory_usage{exported_instance="myapp"})
# You might need to group by it:
sum by (exported_instance) (exporter_memory_usage{exported_instance="myapp"})
# Then in Grafana Legend:
{{exported_instance}}
- Hostname Variable:
  Name: node
  Label: node
  Query: label_values(node_uname_info, nodename)
  Usage in query: node_uname_info{nodename=~"$node"}
- Node Exporter Address Variable:
  Type: Query
  Query: label_values(node_network_up, instance)
- MySQL Exporter Address Variable:
  Type: Query
  Query: label_values(mysql_up, instance)
- Static Values Variable:
  Type: Custom
  Name: dc
  Label: dc
  Values: eu-west-1a,eu-west-1b,eu-west-1c
- Docker Swarm Stack Names Variable:
  Name: stack
  Label: stack
  Query: label_values(container_last_seen, container_label_com_docker_stack_namespace)
- Docker Swarm Service Names Variable:
  Name: service_name
  Label: service_name
  Query: label_values(container_last_seen, container_label_com_docker_swarm_service_name)
- Docker Swarm Manager NodeId Variable:
  Name: manager_node_id
  Label: manager_node_id
  Query: label_values(container_last_seen{container_label_com_docker_swarm_service_name=~"proxy_traefik", container_label_com_docker_swarm_node_id=~".+"}, container_label_com_docker_swarm_node_id)
- Docker Swarm Stacks Running on Managers Variable:
  Name: stack_on_manager
  Label: stack_on_manager
  Query: label_values(container_last_seen{container_label_com_docker_swarm_node_id=~"$manager_node_id"}, container_label_com_docker_stack_namespace)
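As a sketch of how a panel query might then reference one of these variables (the metric and label match the Swarm memory examples above; $stack is the Stack Names variable):
sum(container_memory_rss{container_label_com_docker_stack_namespace=~"$stack"}) by (container_label_com_docker_stack_namespace)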
- Prometheus Querying Basics
- PromQL Tutorial for Beginners
- Prometheus 101
- Section.io: Prometheus Querying
- InnoQ: Prometheus Counters
- RobustPerception: Biggest Metrics
- Top Metrics Discussion
- Ordina-Jworks: Monitoring with Prometheus
- Infinity Works: Example Queries
- Prometheus Relabeling Tricks
- Timber: PromQL for Humans
- RobustPerception: Machine CPU Usage
- RobustPerception: Common Query Patterns
- RobustPerception: Website Uptime
- RobustPerception: Prometheus Histogram
- RobustPerception: Prometheus Counter
- RobustPerception: Prometheus Gauge
- RobustPerception: Prometheus Summary
- DevConnected: Definitive Guide to Prometheus
- @showmax Prometheus Introduction
- Rancher: Cluster Monitoring
- Prometheus CPU Stats
- AWS Prometheus Rewrite Rules for k8s
- Prometheus AWS Cross Account ec2_sd_config
- Prometheus AWS ec2_sd_config Role
- Fabianlee: Kubernetes Scrape Configs
- MetricFire: Understanding the Rate Function
- Alerting on Missing Labels and Metrics
- @devconnected Disk IO Dashboarding
- CPU and Memory Requests
- Prometheus Counter Metrics
- last9.io PromQL Cheatsheet
- Simulating AWS Tags in Local Prometheus