Prometheus & Grafana: The Complete Monitoring Guide for 2026

Published February 12, 2026 · 30 min read

You cannot fix what you cannot measure. Prometheus and Grafana form the backbone of modern infrastructure monitoring — Prometheus collects and stores metrics as time-series data, while Grafana turns those numbers into actionable dashboards and alerts. Together they give you full visibility into your applications, servers, databases, and Kubernetes clusters.

This guide covers both tools from installation through production deployment, including PromQL queries, alerting rules, dashboard patterns, client libraries for Node.js, Python, and Go, and battle-tested best practices.


Table of Contents

  1. What is Prometheus? Architecture Overview
  2. Installing Prometheus
  3. Prometheus Configuration
  4. PromQL Fundamentals
  5. PromQL Advanced Queries
  6. Alerting Rules and Alertmanager
  7. Grafana: Installation and Setup
  8. Building Grafana Dashboards
  9. Dashboard Patterns: RED and USE Methods
  10. Monitoring Applications (Node.js, Python, Go)
  11. Service Discovery
  12. Best Practices
  13. Frequently Asked Questions

1. What is Prometheus? Architecture Overview

Prometheus is an open-source monitoring toolkit originally built at SoundCloud and now a graduated CNCF project. It stores all data as time series — streams of timestamped values identified by a metric name and key-value label pairs.

Four Metric Types

# Counter: cumulative, only goes up (resets on restart)
http_requests_total{method="GET", status="200"} 14832

# Gauge: current value, goes up and down
node_memory_MemAvailable_bytes 4294967296

# Histogram: counts observations in configurable buckets
http_request_duration_seconds_bucket{le="0.1"} 9823
http_request_duration_seconds_bucket{le="0.5"} 12340
http_request_duration_seconds_bucket{le="1.0"} 12890
http_request_duration_seconds_bucket{le="+Inf"} 12900
http_request_duration_seconds_count 12900
http_request_duration_seconds_sum 3542.8

# Summary: calculates quantiles client-side (histograms are preferred
# because they can be aggregated across instances)
rpc_duration_seconds{quantile="0.95"} 0.42

2. Installing Prometheus

Docker (Quickest Start)

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped
volumes:
  prometheus-data:

Binary Installation

# Download and extract
wget https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz
tar xvf prometheus-2.51.0.linux-amd64.tar.gz
cd prometheus-2.51.0.linux-amd64 && ./prometheus --config.file=prometheus.yml

# Systemd service for production
sudo tee /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
After=network-online.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now prometheus

Kubernetes (Helm)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=your-secure-password

3. Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert-rules.yml'
  - 'recording-rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'my-api'
    metrics_path: /metrics
    static_configs:
      - targets: ['api:3000']
        labels:
          environment: production
          team: backend

  - job_name: 'web-servers'
    static_configs:
      - targets: ['web-1:8080', 'web-2:8080', 'web-3:8080']
        labels:
          cluster: us-east

  - job_name: 'secure-app'
    bearer_token_file: /etc/prometheus/token
    tls_config:
      ca_file: /etc/prometheus/ca.pem
    static_configs:
      - targets: ['secure-app:443']

Validate the config, then reload without restarting:

promtool check config /etc/prometheus/prometheus.yml   # catches syntax and rule errors
kill -HUP $(pidof prometheus)
# Or: curl -X POST http://localhost:9090/-/reload  (requires --web.enable-lifecycle)

4. PromQL Fundamentals

Selectors and Matchers

# Instant vector: current value
http_requests_total
http_requests_total{method="GET"}              # exact match
http_requests_total{status=~"5.."}             # regex match
http_requests_total{handler!="/health"}        # not equal
http_requests_total{method=~"GET|POST"}        # regex OR

# Range vector: values over a time window
http_requests_total{method="GET"}[5m]          # last 5 minutes
node_cpu_seconds_total[1h]                     # last 1 hour

Operators

# Arithmetic
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Comparison (filters results)
http_requests_total > 1000
node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1

Aggregations

sum(rate(http_requests_total[5m]))                          # total request rate
sum by (method) (rate(http_requests_total[5m]))             # rate grouped by method
sum by (status) (rate(http_requests_total[5m]))             # rate grouped by status
avg(node_cpu_seconds_total{mode="idle"})                    # average CPU idle
max by (instance) (process_resident_memory_bytes)           # max memory per instance
count(up == 1)                                              # how many targets are up
topk(5, rate(http_requests_total[5m]))                      # top 5 busiest endpoints
bottomk(3, node_filesystem_avail_bytes)                     # 3 fullest filesystems

5. PromQL Advanced Queries

rate(), irate(), increase()

# rate(): per-second average rate over a range (smooth, recommended)
rate(http_requests_total[5m])

# irate(): instant rate using last two data points (spiky)
irate(http_requests_total[5m])

# RULE: always use rate() on counters before aggregating
# WRONG: sum(http_requests_total)
# RIGHT: sum(rate(http_requests_total[5m]))

# increase(): total increase over a range (for counters)
increase(http_requests_total[1h])              # total requests in last hour

histogram_quantile()

# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile grouped by endpoint
histogram_quantile(0.99,
  sum by (handler, le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Median (50th percentile)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

Recording Rules

# recording-rules.yml - precompute expensive queries
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, handler) (rate(http_requests_total[5m]))
      - record: job:http_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, handler, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:http_errors:ratio5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))

6. Alerting Rules and Alertmanager

Alert Rules

# alert-rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 min."

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.job }}"

  - name: infrastructure
    rules:
      - alert: HostHighCpuUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"

      - alert: HostLowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk below 10% on {{ $labels.instance }}"

      - alert: TargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} is down"

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  receiver: 'slack-default'
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="critical"]
      receiver: 'pagerduty-critical'
    - matchers: [severity="warning"]
      receiver: 'slack-warnings'

receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: ['alertname', 'instance']

7. Grafana: Installation and Setup

Grafana is the visualization layer — it connects to Prometheus and dozens of other data sources to create interactive dashboards.

Full Docker Compose Monitoring Stack

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prom-data:/prometheus
    command: ['--config.file=/etc/prometheus/prometheus.yml', '--storage.tsdb.retention.time=30d', '--web.enable-lifecycle']
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on: [prometheus]
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports: ["9093:9093"]
    volumes: ['./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro']
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    ports: ["9100:9100"]
    volumes: ['/proc:/host/proc:ro', '/sys:/host/sys:ro', '/:/rootfs:ro']
    command: ['--path.procfs=/host/proc', '--path.sysfs=/host/sys', '--path.rootfs=/rootfs']
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:

Auto-Provision the Prometheus Datasource

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

8. Building Grafana Dashboards

Panel types and when to use them: Time series for rates and latency trends, Stat for single current values (uptime, requests per second), Gauge for bounded percentages (CPU, disk), Table for per-instance breakdowns, and Heatmap for latency distributions built from histogram buckets.

Template Variables

# Variable "instance" - Type: Query
# Query: label_values(up{job="my-api"}, instance)
# Populates a dropdown with all instances

# Variable "handler" - Type: Query
# Query: label_values(http_requests_total{instance="$instance"}, handler)
# Filters based on selected instance

# Use in panel queries:
rate(http_requests_total{instance="$instance", handler="$handler"}[5m])

# Multi-value (select multiple):
rate(http_requests_total{instance=~"$instance"}[5m])

Dashboard Provisioning

# grafana/provisioning/dashboards/provider.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: 'Provisioned'
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards/json

9. Dashboard Patterns: RED and USE Methods

RED Method (for Services)

Rate, Errors, and Duration: the three signals that matter most for request-driven services:

# RATE: requests per second
sum(rate(http_requests_total[5m]))

# ERRORS: error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# DURATION: p50, p95, p99 latency
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

USE Method (for Resources)

Utilization, Saturation, Errors — for CPU, memory, disk, and network:

# CPU Utilization
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Utilization
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk Utilization
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Network throughput (bytes/sec)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

# CPU Saturation (load vs CPU count)
node_load1 / count without (cpu) (node_cpu_seconds_total{mode="idle"})

10. Monitoring Applications (Node.js, Python, Go)

Node.js (prom-client)

const express = require('express');
const client = require('prom-client');  // npm install prom-client

client.collectDefaultMetrics({ prefix: 'app_' });

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests',
  labelNames: ['method', 'handler', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

const httpTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'handler', 'status']
});

const app = express();

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    const labels = { method: req.method, handler: req.route?.path || req.path, status: res.statusCode };
    end(labels);
    httpTotal.inc(labels);
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

Python (prometheus_client)

from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'handler', 'status'])
REQUEST_DURATION = Histogram('http_request_duration_seconds', 'Request duration',
    ['method', 'handler'], buckets=[0.01, 0.05, 0.1, 0.5, 1, 2, 5])

@app.before_request
def start_timer():
    request._start_time = time.time()

@app.after_request
def record_metrics(response):
    duration = time.time() - request._start_time
    handler = request.endpoint or 'unknown'
    REQUEST_DURATION.labels(request.method, handler).observe(duration)
    REQUEST_COUNT.labels(request.method, handler, response.status_code).inc()
    return response

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

Go (client_golang)

package main

import (
    "net/http"
    "strconv"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total", Help: "Total HTTP requests",
    }, []string{"method", "handler", "status"})

    httpDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name: "http_request_duration_seconds", Help: "Request duration",
        Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 2, 5},
    }, []string{"method", "handler"})
)

func init() { prometheus.MustRegister(httpTotal, httpDuration) }

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// instrumentHandler records request count and duration for a handler.
func instrumentHandler(name string, h http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        timer := prometheus.NewTimer(httpDuration.WithLabelValues(r.Method, name))
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        h(rec, r)
        timer.ObserveDuration()
        httpTotal.WithLabelValues(r.Method, name, strconv.Itoa(rec.status)).Inc()
    }
}

func usersHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte(`[]`))
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/api/users", instrumentHandler("/api/users", usersHandler))
    http.ListenAndServe(":8080", nil)
}

11. Service Discovery

Kubernetes Pod Discovery

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

# Annotate your Kubernetes pods:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

File-Based Discovery

# prometheus.yml
scrape_configs:
  - job_name: 'file-targets'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
        refresh_interval: 30s

# /etc/prometheus/targets/web-servers.json
[
  {
    "targets": ["web-1:8080", "web-2:8080"],
    "labels": { "env": "production", "team": "frontend" }
  }
]

Consul Service Discovery

scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep

12. Best Practices

Labeling and Cardinality

# GOOD: low-cardinality labels that enable aggregation
http_requests_total{method="GET", status="200", handler="/api/users"}

# BAD: high-cardinality labels explode storage
http_requests_total{user_id="12345", request_id="abc-def-ghi"}
# Every unique label combination = a new time series!

# Rule: if a label has >100 unique values, reconsider
# BAD:  user_id, email, IP, request_id
# GOOD: method, status, handler, environment, region

# Check cardinality:
prometheus_tsdb_head_series                                 # total series count
topk(10, count by (__name__) ({__name__=~".+"}))           # series per metric

Retention and Storage

# Set retention
--storage.tsdb.retention.time=30d     # time-based (default: 15d)
--storage.tsdb.retention.size=50GB    # size-based cap

# Estimate: ~1-2 bytes per sample
# 1000 series * 15s interval * 30 days = ~200-350 MB

# Long-term storage options (remote write):
# Thanos, Cortex, Mimir, VictoriaMetrics
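The estimate above can be sanity-checked with a quick script (a sketch; the 1-2 bytes per sample figure is Prometheus's typical post-compression range):

```python
# Rough TSDB storage estimate: samples = series * (seconds retained / scrape interval)
series = 1000
scrape_interval_s = 15
days = 30

samples = series * (days * 86_400 // scrape_interval_s)
low_mb = samples * 1 / 1e6   # ~1 byte/sample (well-compressed)
high_mb = samples * 2 / 1e6  # ~2 bytes/sample (conservative)
print(f"{samples:,} samples ~= {low_mb:.0f}-{high_mb:.0f} MB")
```

Run the same numbers against your own series count (prometheus_tsdb_head_series) to size disks before setting retention.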

High Availability

# Run two identical Prometheus instances scraping the same targets
# They independently collect the same data
# Alertmanager handles deduplication of alerts from both

# For queries, use Thanos Querier to deduplicate and merge results
# Alertmanager HA: run a cluster
alertmanager --cluster.peer=am-1:9094 --cluster.peer=am-2:9094
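A minimal sketch of the replica setup: both instances get identical scrape configs plus one distinguishing external label (the label name replica is a convention Thanos Querier deduplicates on, not a requirement):

```yaml
# prometheus-1.yml (prometheus-2.yml is identical except replica: prometheus-2)
global:
  scrape_interval: 15s
  external_labels:
    cluster: prod
    replica: prometheus-1   # differs per instance; dropped at query time for dedup
```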

Naming Conventions

# Follow Prometheus naming best practices:
# - snake_case, include unit suffix (_seconds, _bytes, _total)
# - Counters end with _total, use base units

# GOOD                                    # BAD
http_request_duration_seconds              # httpRequestDuration (camelCase)
http_requests_total                        # request_count (no _total)
node_memory_available_bytes                # memory_megabytes (not base unit)

13. Frequently Asked Questions

What is the difference between Prometheus and Grafana?

Prometheus is a time-series database and monitoring system that collects and stores metrics by scraping HTTP endpoints. It includes PromQL for querying and built-in alerting. Grafana is a visualization platform that connects to data sources like Prometheus to create dashboards, charts, and alerts with a rich UI. Prometheus handles data collection, storage, and alerting logic, while Grafana provides the visual layer for exploring and displaying that data. Most monitoring stacks use both together.

How does Prometheus collect metrics from applications?

Prometheus uses a pull-based model. Your application exposes an HTTP endpoint (typically /metrics) that returns metrics in Prometheus exposition format. Prometheus periodically scrapes this endpoint at a configured interval (e.g., every 15 seconds). The pull model means Prometheus controls the scrape rate, can detect when targets are down (failed scrapes), and applications do not need to know about the monitoring infrastructure. For short-lived jobs, Prometheus provides a Pushgateway that accepts pushed metrics.
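The exposition format is plain text and easy to inspect by hand. A minimal sketch with prometheus_client (the metric name demo_requests_total is illustrative):

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()  # isolated registry so the demo is self-contained
requests = Counter('demo_requests_total', 'Requests served',
                   ['method'], registry=registry)
requests.labels(method='GET').inc(3)

# This is the text Prometheus receives when it scrapes /metrics
print(generate_latest(registry).decode())
```

The output contains HELP and TYPE comment lines followed by one sample per label combination, e.g. demo_requests_total{method="GET"} 3.0.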

What are the four Prometheus metric types?

Counter (cumulative value that only increases, e.g., total HTTP requests), Gauge (value that goes up and down, e.g., memory usage), Histogram (samples observations in configurable buckets for request durations), and Summary (calculates quantiles client-side). Histograms are generally preferred over summaries because they can be aggregated across instances.

How do I set up alerting with Prometheus and Alertmanager?

Define alerting rules in Prometheus with PromQL conditions and a for duration (how long the condition must be true before firing). Configure Alertmanager to receive these alerts and route them to Slack, PagerDuty, email, or webhooks. Alertmanager handles deduplication, grouping, silencing, and inhibition. Define rules in a separate YAML file referenced from prometheus.yml, and point Prometheus to the Alertmanager endpoint.

Can Prometheus monitor Kubernetes clusters?

Yes, Prometheus is the standard monitoring solution for Kubernetes. It supports native Kubernetes service discovery, automatically finding and scraping pods, services, and nodes using annotations. Deploy with the kube-prometheus-stack Helm chart for Prometheus, Grafana, Alertmanager, and pre-configured dashboards. Any pod with the annotation prometheus.io/scrape: "true" is automatically discovered and scraped.

Conclusion

Prometheus and Grafana give you the observability foundation every production system needs. Start with the basics: install Prometheus, add a node exporter for host metrics, instrument your application with a client library, and build a simple RED dashboard in Grafana. Layer in alerting rules, recording rules, and service discovery as your infrastructure grows.

The key to effective monitoring is starting simple and iterating. A single dashboard with request rate, error rate, and p95 latency tells you more about your application's health than a hundred unread metrics. Measure what matters, alert on what is actionable, and let Grafana make it visible.

Related Resources

Docker Compose: The Complete Guide
Services, networks, volumes, and production deployment
Kubernetes: The Complete Guide
Container orchestration, deployments, services, and scaling
GitHub Actions CI/CD Guide
Automate builds, tests, and deployments with workflows
Redis: The Complete Guide
Caching, pub/sub, streams, and data structures