Prometheus & Grafana: The Complete Monitoring Guide for 2026
You cannot fix what you cannot measure. Prometheus and Grafana form the backbone of modern infrastructure monitoring — Prometheus collects and stores metrics as time-series data, while Grafana turns those numbers into actionable dashboards and alerts. Together they give you full visibility into your applications, servers, databases, and Kubernetes clusters.
This guide covers both tools from installation through production deployment, including PromQL queries, alerting rules, dashboard patterns, client libraries for Node.js, Python, and Go, and battle-tested best practices.
Table of Contents
- What is Prometheus? Architecture Overview
- Installing Prometheus
- Prometheus Configuration
- PromQL Fundamentals
- PromQL Advanced Queries
- Alerting Rules and Alertmanager
- Grafana: Installation and Setup
- Building Grafana Dashboards
- Dashboard Patterns: RED and USE Methods
- Monitoring Applications (Node.js, Python, Go)
- Service Discovery
- Best Practices
- Frequently Asked Questions
1. What is Prometheus? Architecture Overview
Prometheus is an open-source monitoring toolkit originally built at SoundCloud and now a graduated CNCF project. It stores all data as time series — streams of timestamped values identified by a metric name and key-value label pairs.
- Pull-based collection — scrapes metrics from HTTP endpoints (/metrics) at configured intervals
- Time-series database (TSDB) — purpose-built storage optimized for append-heavy workloads
- PromQL — powerful query language for slicing, aggregating, and transforming metrics
- Alertmanager — handles alert deduplication, grouping, routing to Slack/PagerDuty/email
- Service discovery — finds targets via Kubernetes, Consul, DNS, or file-based configs
- Exporters — adapters exposing metrics from third-party systems (databases, hardware)
Four Metric Types
# Counter: cumulative, only goes up (resets on restart)
http_requests_total{method="GET", status="200"} 14832
# Gauge: current value, goes up and down
node_memory_available_bytes 4294967296
# Histogram: counts observations in configurable buckets
http_request_duration_seconds_bucket{le="0.1"} 9823
http_request_duration_seconds_bucket{le="0.5"} 12340
http_request_duration_seconds_bucket{le="1.0"} 12890
http_request_duration_seconds_count 12900
http_request_duration_seconds_sum 3542.8
# Summary: calculates quantiles client-side (histograms preferred, since
# summaries cannot be meaningfully aggregated across instances)
rpc_duration_seconds{quantile="0.95"} 0.38
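The scrape output above is plain text in the Prometheus exposition format. As a rough illustration of how simple that format is (real applications should use an official client library rather than hand-rolling this), a minimal renderer for one labeled sample might look like:

```python
# Minimal sketch of the Prometheus text exposition format (illustrative only;
# use an official client library in real applications).
def render_metric(name, labels, value):
    """Render one sample as a Prometheus exposition-format line."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = render_metric("http_requests_total", {"method": "GET", "status": "200"}, 14832)
print(line)  # http_requests_total{method="GET",status="200"} 14832
```

This is exactly what Prometheus parses on every scrape: a metric name, a sorted set of label pairs, and a value.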
2. Installing Prometheus
Docker (Quickest Start)
services:
prometheus:
image: prom/prometheus:v2.51.0
ports: ["9090:9090"]
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
restart: unless-stopped
volumes:
prometheus-data:
Binary Installation
# Download and extract
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xvf prometheus-2.51.0.linux-amd64.tar.gz
cd prometheus-2.51.0.linux-amd64 && ./prometheus --config.file=prometheus.yml
# Systemd service for production
sudo tee /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
After=network-online.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now prometheus
Kubernetes (Helm)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set grafana.adminPassword=your-secure-password
3. Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_timeout: 10s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- 'alert-rules.yml'
- 'recording-rules.yml'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
- job_name: 'my-api'
metrics_path: /metrics
static_configs:
- targets: ['api:3000']
labels:
environment: production
team: backend
- job_name: 'web-servers'
static_configs:
- targets: ['web-1:8080', 'web-2:8080', 'web-3:8080']
labels:
cluster: us-east
- job_name: 'secure-app'
bearer_token_file: /etc/prometheus/token
tls_config:
ca_file: /etc/prometheus/ca.pem
static_configs:
- targets: ['secure-app:443']
Reload config without restart:
kill -HUP $(pidof prometheus)
# Or: curl -X POST http://localhost:9090/-/reload (requires --web.enable-lifecycle)
4. PromQL Fundamentals
Selectors and Matchers
# Instant vector: current value
http_requests_total
http_requests_total{method="GET"} # exact match
http_requests_total{status=~"5.."} # regex match
http_requests_total{handler!="/health"} # not equal
http_requests_total{method=~"GET|POST"} # regex OR
# Range vector: values over a time window
http_requests_total{method="GET"}[5m] # last 5 minutes
node_cpu_seconds_total[1h] # last 1 hour
Operators
# Arithmetic
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Comparison (filters results)
http_requests_total > 1000
node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
Aggregations
sum(rate(http_requests_total[5m])) # total request rate
sum by (method) (rate(http_requests_total[5m])) # rate grouped by method
sum by (status) (rate(http_requests_total[5m])) # rate grouped by status
avg(node_cpu_seconds_total{mode="idle"}) # average CPU idle
max by (instance) (process_resident_memory_bytes) # max memory per instance
count(up == 1) # how many targets are up
topk(5, rate(http_requests_total[5m])) # top 5 busiest endpoints
bottomk(3, node_filesystem_avail_bytes) # 3 fullest filesystems
5. PromQL Advanced Queries
rate(), irate(), increase()
# rate(): per-second average rate over a range (smooth, recommended)
rate(http_requests_total[5m])
# irate(): instant rate using last two data points (spiky)
irate(http_requests_total[5m])
# RULE: always use rate() on counters before aggregating
# WRONG: sum(http_requests_total)
# RIGHT: sum(rate(http_requests_total[5m]))
# increase(): total increase over a range (for counters)
increase(http_requests_total[1h]) # total requests in last hour
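Conceptually, rate() takes the counter's increase over the window, corrects for counter resets, and divides by the window length. A simplified pure-Python sketch over raw (timestamp, value) samples (real rate() also extrapolates to the window boundaries, which this omits):

```python
def simple_rate(samples):
    """Per-second rate from (timestamp, counter_value) samples, with basic
    counter-reset correction. Real rate() additionally extrapolates to the
    edges of the range window."""
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter reset to ~0,
        # so the post-reset value itself is the increase since the reset.
        increase += v1 - v0 if v1 >= v0 else v1
    return increase / (samples[-1][0] - samples[0][0])

# 5 samples, 15s apart, +30 requests per scrape -> 2 requests/second
print(simple_rate([(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]))  # 2.0
```

The reset correction is why rate() must be applied to the raw counter, before any sum() that would mask individual restarts.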
histogram_quantile()
# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 99th percentile grouped by endpoint
histogram_quantile(0.99,
sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Median (50th percentile)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
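Under the hood, histogram_quantile() finds the bucket containing the target rank and interpolates linearly inside it. A simplified pure-Python sketch of that calculation, ignoring edge cases Prometheus handles (NaN values, non-monotonic buckets, empty histograms):

```python
import math

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count), ending with +Inf.
    Linearly interpolates within the bucket containing the q-th rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in +Inf: return highest finite bound
            # interpolate between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Using the example histogram from section 1 (12900 total observations):
p95 = histogram_quantile(0.95, [(0.1, 9823), (0.5, 12340), (1.0, 12890), (math.inf, 12900)])
print(round(p95, 3))  # 0.486
```

This also shows why bucket boundaries matter: the result is only as precise as the bucket the quantile lands in, so choose buckets around your SLO thresholds.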
Recording Rules
# recording-rules.yml - precompute expensive queries
groups:
- name: api_metrics
interval: 15s
rules:
- record: job:http_requests:rate5m
expr: sum by (job, handler) (rate(http_requests_total[5m]))
- record: job:http_duration_seconds:p95
expr: histogram_quantile(0.95, sum by (job, handler, le) (rate(http_request_duration_seconds_bucket[5m])))
- record: job:http_errors:ratio5m
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))
6. Alerting Rules and Alertmanager
Alert Rules
# alert-rules.yml
groups:
- name: application
rules:
- alert: HighErrorRate
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.job }}"
description: "Error rate is {{ $value | humanizePercentage }} for 5 min."
- alert: HighLatency
expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "High p95 latency on {{ $labels.job }}"
- name: infrastructure
rules:
- alert: HostHighCpuUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.instance }}"
- alert: HostLowDiskSpace
expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
for: 5m
labels:
severity: critical
annotations:
summary: "Disk below 10% on {{ $labels.instance }}"
- alert: TargetDown
expr: up == 0
for: 3m
labels:
severity: critical
annotations:
summary: "Target {{ $labels.instance }} is down"
Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
receiver: 'slack-default'
group_by: ['alertname', 'job']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match: { severity: critical }
receiver: 'pagerduty-critical'
- match: { severity: warning }
receiver: 'slack-warnings'
receivers:
- name: 'slack-default'
slack_configs:
- channel: '#alerts'
title: '{{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'slack-warnings'
slack_configs:
- channel: '#alerts-warnings'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
inhibit_rules:
- source_match: { severity: critical }
target_match: { severity: warning }
equal: ['alertname', 'instance']
7. Grafana: Installation and Setup
Grafana is the visualization layer — it connects to Prometheus and dozens of other data sources to create interactive dashboards.
Full Docker Compose Monitoring Stack
services:
prometheus:
image: prom/prometheus:v2.51.0
ports: ["9090:9090"]
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
- prom-data:/prometheus
command: ['--config.file=/etc/prometheus/prometheus.yml', '--storage.tsdb.retention.time=30d', '--web.enable-lifecycle']
restart: unless-stopped
grafana:
image: grafana/grafana:10.4.0
ports: ["3000:3000"]
environment:
GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
depends_on: [prometheus]
restart: unless-stopped
alertmanager:
image: prom/alertmanager:v0.27.0
ports: ["9093:9093"]
volumes: [./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro]
restart: unless-stopped
node-exporter:
image: prom/node-exporter:v1.7.0
ports: ["9100:9100"]
volumes: ['/proc:/host/proc:ro', '/sys:/host/sys:ro', '/:/rootfs:ro']
command: ['--path.procfs=/host/proc', '--path.sysfs=/host/sys', '--path.rootfs=/rootfs']
restart: unless-stopped
volumes:
prom-data:
grafana-data:
Auto-Provision the Prometheus Datasource
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
8. Building Grafana Dashboards
Panel types and when to use them:
- Time series — line/area charts for metrics over time (request rate, latency, CPU)
- Stat — single big number (uptime, total requests today)
- Gauge — dial showing value against a range (disk usage %, CPU %)
- Table — tabular breakdowns (top endpoints by latency)
- Heatmap — histogram distribution over time (request duration buckets)
Template Variables
# Variable "instance" - Type: Query
# Query: label_values(up{job="my-api"}, instance)
# Populates a dropdown with all instances
# Variable "handler" - Type: Query
# Query: label_values(http_requests_total{instance="$instance"}, handler)
# Filters based on selected instance
# Use in panel queries:
rate(http_requests_total{instance="$instance", handler="$handler"}[5m])
# Multi-value (select multiple):
rate(http_requests_total{instance=~"$instance"}[5m])
Dashboard Provisioning
# grafana/provisioning/dashboards/provider.yml
apiVersion: 1
providers:
- name: 'default'
folder: 'Provisioned'
type: file
options:
path: /etc/grafana/provisioning/dashboards/json
9. Dashboard Patterns: RED and USE Methods
RED Method (for Services)
Rate, Errors, Duration — the three golden signals for request-driven services:
# RATE: requests per second
sum(rate(http_requests_total[5m]))
# ERRORS: error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# DURATION: p50, p95, p99 latency
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
USE Method (for Resources)
Utilization, Saturation, Errors — for CPU, memory, disk, and network:
# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk Utilization
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Network throughput (bytes/sec)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# CPU Saturation (load vs CPU count)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
10. Monitoring Applications (Node.js, Python, Go)
Node.js (prom-client)
const express = require('express');
const client = require('prom-client'); // npm install prom-client
client.collectDefaultMetrics({ prefix: 'app_' });
const httpDuration = new client.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests',
labelNames: ['method', 'handler', 'status'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpTotal = new client.Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'handler', 'status']
});
const app = express();
app.use((req, res, next) => {
const end = httpDuration.startTimer();
res.on('finish', () => {
const labels = { method: req.method, handler: req.route?.path || req.path, status: res.statusCode };
end(labels);
httpTotal.inc(labels);
});
next();
});
app.get('/metrics', async (req, res) => {
res.set('Content-Type', client.register.contentType);
res.end(await client.register.metrics());
});
Python (prometheus_client)
from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time
app = Flask(__name__)
REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'handler', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'Request duration',
['method', 'handler'], buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])
@app.before_request
def start_timer():
request._start_time = time.time()
@app.after_request
def record_metrics(response):
duration = time.time() - request._start_time
handler = request.endpoint or 'unknown'
REQUEST_DURATION.labels(request.method, handler).observe(duration)
REQUEST_COUNT.labels(request.method, handler, response.status_code).inc()
return response
@app.route('/metrics')
def metrics():
return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
Go (client_golang)
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total", Help: "Total HTTP requests",
	}, []string{"method", "handler", "status"})
	httpDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "http_request_duration_seconds", Help: "Request duration",
		Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 2, 5},
	}, []string{"method", "handler"})
)

func init() { prometheus.MustRegister(httpTotal, httpDuration) }

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) { r.status = code; r.ResponseWriter.WriteHeader(code) }

// instrumentHandler wraps a handler to record request count and duration.
func instrumentHandler(name string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)
		httpDuration.WithLabelValues(r.Method, name).Observe(time.Since(start).Seconds())
		httpTotal.WithLabelValues(r.Method, name, strconv.Itoa(rec.status)).Inc()
	}
}

func usersHandler(w http.ResponseWriter, r *http.Request) { w.Write([]byte(`[]`)) }

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/api/users", instrumentHandler("/api/users", usersHandler))
	http.ListenAndServe(":8080", nil)
}
11. Service Discovery
Kubernetes Pod Discovery
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Annotate your Kubernetes pods:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
File-Based Discovery
# prometheus.yml
scrape_configs:
- job_name: 'file-targets'
file_sd_configs:
- files: ['/etc/prometheus/targets/*.json']
refresh_interval: 30s
# /etc/prometheus/targets/web-servers.json
[
{
"targets": ["web-1:8080", "web-2:8080"],
"labels": { "env": "production", "team": "frontend" }
}
]
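Target files are usually generated by scripts or configuration management rather than edited by hand; Prometheus picks up changes on its refresh interval without a reload. A hypothetical generator (the host list and output filename here are placeholders) could be as simple as:

```python
import json

def write_targets(path, hosts, labels):
    """Write a file_sd-compatible target group; Prometheus re-reads it
    automatically on the configured refresh_interval."""
    groups = [{"targets": hosts, "labels": labels}]
    with open(path, "w") as f:
        json.dump(groups, f, indent=2)

write_targets(
    "web-servers.json",  # would live under /etc/prometheus/targets/ per the config above
    ["web-1:8080", "web-2:8080"],
    {"env": "production", "team": "frontend"},
)
```

Writing to a temporary file and renaming it into place is a good habit here, so Prometheus never reads a half-written file.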
Consul Service Discovery
scrape_configs:
- job_name: 'consul-services'
consul_sd_configs:
- server: 'consul:8500'
services: []
relabel_configs:
- source_labels: [__meta_consul_tags]
regex: .*,prometheus,.*
action: keep
12. Best Practices
Labeling and Cardinality
# GOOD: low-cardinality labels that enable aggregation
http_requests_total{method="GET", status="200", handler="/api/users"}
# BAD: high-cardinality labels explode storage
http_requests_total{user_id="12345", request_id="abc-def-ghi"}
# Every unique label combination = a new time series!
# Rule: if a label has >100 unique values, reconsider
# BAD: user_id, email, IP, request_id
# GOOD: method, status, handler, environment, region
# Check cardinality:
prometheus_tsdb_head_series # total series count
topk(10, count by (__name__) ({__name__=~".+"})) # series per metric
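Because every unique label combination becomes its own time series, cardinality multiplies across labels. A back-of-the-envelope sketch makes the explosion concrete (the label values below are illustrative):

```python
# Rough upper bound: cardinality is the product of each label's unique value count.
def estimate_series(label_values):
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total

good = estimate_series({
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "301", "404", "500"],
    "handler": [f"/api/{i}" for i in range(50)],
})
print(good)               # 800 series -- manageable
print(good * 100_000)     # adding a user_id label with 100k values: 80,000,000 series
```

In practice not every combination occurs, but this product is the ceiling you are signing up for, and a single high-cardinality label dominates it.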
Retention and Storage
# Set retention
--storage.tsdb.retention.time=30d # time-based (default: 15d)
--storage.tsdb.retention.size=50GB # size-based cap
# Estimate: ~1-2 bytes per sample
# 1000 series * 15s interval * 30 days = ~200-350 MB
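That estimate is simple arithmetic and worth sanity-checking for your own numbers (assuming roughly 1.5 bytes per sample after compression, the middle of the 1-2 byte range):

```python
def estimate_storage_mb(series, scrape_interval_s, retention_days, bytes_per_sample=1.5):
    """Rough TSDB sizing: samples retained * bytes per compressed sample."""
    samples = series * (retention_days * 86400 / scrape_interval_s)
    return samples * bytes_per_sample / 1024**2

print(round(estimate_storage_mb(1000, 15, 30)))  # 247 MB, within the 200-350 MB range
```

The same formula shows why cardinality dominates sizing: doubling the series count doubles storage, while halving the scrape interval does the same.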
# Long-term storage options (remote write):
# Thanos, Cortex, Mimir, VictoriaMetrics
High Availability
# Run two identical Prometheus instances scraping the same targets
# They independently collect the same data
# Alertmanager handles deduplication of alerts from both
# For queries, use Thanos Querier to deduplicate and merge results
# Alertmanager HA: run a cluster
alertmanager --cluster.peer=am-1:9094 --cluster.peer=am-2:9094
Naming Conventions
# Follow Prometheus naming best practices:
# - snake_case, include unit suffix (_seconds, _bytes, _total)
# - Counters end with _total, use base units
# GOOD # BAD
http_request_duration_seconds # httpRequestDuration (camelCase)
http_requests_total # request_count (no _total)
node_memory_available_bytes # memory_megabytes (not base unit)
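A small lint check along these lines can catch convention violations in code review. The rules encoded here are a simplified subset of the official guidelines (legitimate names like node_load1 would need exceptions):

```python
import re

METRIC_RE = re.compile(r"^[a-z][a-z0-9_]*$")
BASE_UNIT_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio", "_info")

def lint_metric_name(name):
    """Return a list of likely naming problems for a metric name."""
    problems = []
    if not METRIC_RE.match(name):
        problems.append("not snake_case")
    if not name.endswith(BASE_UNIT_SUFFIXES):
        problems.append("missing a base-unit or _total suffix")
    return problems

print(lint_metric_name("http_request_duration_seconds"))  # []
print(lint_metric_name("httpRequestDuration"))            # both checks fail
```

Running this against your registry at startup (or in CI) keeps new metrics consistent before they ever reach a dashboard.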
Frequently Asked Questions
What is the difference between Prometheus and Grafana?
Prometheus is a time-series database and monitoring system that collects and stores metrics by scraping HTTP endpoints. It includes PromQL for querying and built-in alerting. Grafana is a visualization platform that connects to data sources like Prometheus to create dashboards, charts, and alerts with a rich UI. Prometheus handles data collection, storage, and alerting logic, while Grafana provides the visual layer for exploring and displaying that data. Most monitoring stacks use both together.
How does Prometheus collect metrics from applications?
Prometheus uses a pull-based model. Your application exposes an HTTP endpoint (typically /metrics) that returns metrics in Prometheus exposition format. Prometheus periodically scrapes this endpoint at a configured interval (e.g., every 15 seconds). The pull model means Prometheus controls the scrape rate, can detect when targets are down (failed scrapes), and applications do not need to know about the monitoring infrastructure. For short-lived jobs, Prometheus provides a Pushgateway that accepts pushed metrics.
What are the four Prometheus metric types?
Counter (cumulative value that only increases, e.g., total HTTP requests), Gauge (value that goes up and down, e.g., memory usage), Histogram (samples observations in configurable buckets for request durations), and Summary (calculates quantiles client-side). Histograms are generally preferred over summaries because they can be aggregated across instances.
How do I set up alerting with Prometheus and Alertmanager?
Define alerting rules in Prometheus with PromQL conditions and a for duration (how long the condition must be true before firing). Configure Alertmanager to receive these alerts and route them to Slack, PagerDuty, email, or webhooks. Alertmanager handles deduplication, grouping, silencing, and inhibition. Define rules in a separate YAML file referenced from prometheus.yml, and point Prometheus to the Alertmanager endpoint.
Can Prometheus monitor Kubernetes clusters?
Yes, Prometheus is the standard monitoring solution for Kubernetes. It supports native Kubernetes service discovery, automatically finding and scraping pods, services, and nodes using annotations. Deploy with the kube-prometheus-stack Helm chart for Prometheus, Grafana, Alertmanager, and pre-configured dashboards. Any pod with the annotation prometheus.io/scrape: "true" is automatically discovered and scraped.
Conclusion
Prometheus and Grafana give you the observability foundation every production system needs. Start with the basics: install Prometheus, add a node exporter for host metrics, instrument your application with a client library, and build a simple RED dashboard in Grafana. Layer in alerting rules, recording rules, and service discovery as your infrastructure grows.
The key to effective monitoring is starting simple and iterating. A single dashboard with request rate, error rate, and p95 latency tells you more about your application's health than a hundred unread metrics. Measure what matters, alert on what is actionable, and let Grafana make it visible.
Learn More
- Docker Compose: The Complete Guide — deploy your monitoring stack with Compose
- Kubernetes: The Complete Guide — orchestrate containers at scale
- GitHub Actions CI/CD Guide — automate builds, tests, and deployments
- Redis: The Complete Guide — caching, pub/sub, and data structures