Monitoring Applications with Grafana

Mayur Dabhi
May 23, 2026

Production applications fail in unexpected ways. CPU spikes silently, memory leaks gradually, API latency creeps up — and by the time users complain, the damage is already done. Grafana changes this dynamic by giving you real-time, queryable visibility into every layer of your stack through beautiful dashboards that turn raw numbers into actionable insights.

Grafana has become the de facto standard for open-source observability. Whether you're running a small side project on a single server or managing hundreds of microservices across multiple Kubernetes clusters, the same tool works — and works elegantly. In this guide, we'll go from zero to a fully functioning monitoring stack with Prometheus, Node Exporter, and Grafana.

Why Grafana?

Grafana has over 20 million users across 800,000+ organizations including Netflix, PayPal, eBay, and Bloomberg. It supports 150+ data source plugins (Prometheus, InfluxDB, Elasticsearch, MySQL, PostgreSQL, Loki, and more), making it one of the most versatile monitoring frontends available today.

What is Grafana?

Grafana is an open-source analytics and monitoring platform that specializes in visualizing time-series data. Originally created by Torkel Ödegaard in 2013 as a fork of Kibana, it has since evolved into an independent, full-featured observability suite with its own alert engine, data source ecosystem, and plugin marketplace.

Grafana itself doesn't collect or store metrics — it's a visualization layer. You point it at your data sources, write queries, and Grafana renders the results as panels arranged on dashboards. This separation of concerns is what makes it so powerful: swap Prometheus for InfluxDB or add both simultaneously, and Grafana handles it seamlessly.

[Figure: The Modern Observability Stack. Components: App Server (/metrics endpoint), Node Exporter (system metrics), MySQL Exporter (DB metrics), Prometheus (scrapes and stores time-series data), Grafana (dashboards and alerts), Alertmanager (Slack / PagerDuty). Arrow labels: scrape, query, fire.]

Grafana + Prometheus monitoring architecture

Setting Up the Monitoring Stack

The fastest way to get a full monitoring stack running locally is Docker Compose. The following configuration wires up Grafana, Prometheus, and Node Exporter (for host system metrics) in under two minutes.

Step 1: Create docker-compose.yml

Define the three core services — Grafana, Prometheus, and Node Exporter — with persistent volumes so data survives container restarts.

docker-compose.yml
version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=mysecretpassword
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'

  node_exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  grafana-data:
  prometheus-data:

Step 2: Create prometheus.yml

Tell Prometheus what to scrape and how often. Each job_name becomes a label you can filter on in queries.

prometheus.yml
global:
  scrape_interval: 15s        # How often to scrape targets
  evaluation_interval: 15s    # How often to evaluate alert rules
  scrape_timeout: 10s

scrape_configs:
  # Prometheus monitors itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Host system metrics via Node Exporter
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node_exporter:9100']

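  # Your application (must expose /metrics)
  # On Linux, host.docker.internal resolves only if the prometheus service in
  # docker-compose.yml sets extra_hosts: ["host.docker.internal:host-gateway"]
  - job_name: 'myapp'
    static_configs:
      - targets: ['host.docker.internal:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s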

Step 3: Start the stack

Run docker compose up -d and open Grafana at http://localhost:3000. Log in with the credentials you set in the environment variables.

Step 4: Add Prometheus as a data source

In Grafana: Connections → Data sources → Add new data source → Prometheus. Set the URL to http://prometheus:9090 and click Save & Test.

Installing on a Linux Server

For production on Ubuntu/Debian, install apt-transport-https and add the Grafana APT repository first, then run apt-get update && apt-get install -y grafana. Grafana runs as a systemd service (grafana-server) on port 3000. Prometheus binaries are available at prometheus.io/download, so no package manager is needed: just extract and run.

Exposing Metrics from Your Application

Prometheus uses a pull model: it scrapes an HTTP endpoint (usually /metrics) on your app at regular intervals. Your app must expose metrics in the Prometheus text format. Client libraries handle this automatically.
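
To make the text format concrete, here is a minimal sketch of what a client library produces for a single counter. It uses no dependencies; the helper names renderCounter and escapeLabelValue are hypothetical, and a real library such as prom-client also handles other metric types, timestamps, and validation.

```javascript
// Escape the three characters the text format requires inside label values.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Illustrative only: render one counter in the Prometheus text exposition format.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(labelText ? `${name}{${labelText}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

// Made-up sample values for illustration.
const output = renderCounter('myapp_http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', status: '200' }, value: 1027 },
  { labels: { method: 'POST', status: '500' }, value: 3 },
]);
console.log(output);
```

Each scrape returns the full current state in this shape; Prometheus attaches the timestamp and the job and instance labels on its side.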

Node.js with prom-client

metrics.js (Node.js)
const client = require('prom-client');
const express = require('express');

// Collect default Node.js metrics (heap, GC, event loop lag)
client.collectDefaultMetrics({ prefix: 'myapp_' });

// Custom counter: track every HTTP request
const httpRequestsTotal = new client.Counter({
  name: 'myapp_http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status', 'route']
});

// Histogram: track request duration with buckets
const httpRequestDuration = new client.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

// Gauge: current active connections
const activeConnections = new client.Gauge({
  name: 'myapp_active_connections',
  help: 'Number of active connections'
});

const app = express();

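// Middleware: record metrics for every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();
  // 'close' fires for completed and aborted responses alike, so the gauge is
  // always decremented; 'finish' alone is skipped when a client disconnects early.
  res.once('close', () => {
    // Prefer the route pattern (e.g. /users/:id) over the raw path: raw paths
    // create a new time series for every distinct URL.
    const route = req.route?.path || req.path;
    const labels = { method: req.method, route, status: res.statusCode };
    httpRequestsTotal.inc(labels);
    end(labels);
    activeConnections.dec();
  });
  next();
});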

// Prometheus scrapes this endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(8080);

Laravel with promphp/prometheus_client_php

app/Http/Controllers/MetricsController.php
<?php

namespace App\Http\Controllers;

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;

class MetricsController extends Controller
{
    public function __invoke(CollectorRegistry $registry)
    {
        $renderer = new RenderTextFormat();
        $result = $renderer->render($registry->getMetricFamilySamples());

        return response($result, 200)
            ->header('Content-Type', RenderTextFormat::MIME_TYPE);
    }
}

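// Separate snippet: in a middleware or service provider (where $registry,
// $request, and $response are in scope), increment counters.
// Namespace 'myapp' + name 'http_requests_total' yields myapp_http_requests_total.
$counter = $registry->getOrRegisterCounter(
    'myapp',
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'status']
);
$counter->incBy(1, [$request->method(), (string) $response->status()]);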

Building Your First Dashboard

With Prometheus connected, you're ready to build dashboards. Navigate to Dashboards → New dashboard → Add visualization. Select your Prometheus data source and start writing PromQL queries.
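
Behind every panel, Grafana sends the query to the Prometheus HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint and flattens a response into rows. The function names are hypothetical and the sample response is made up, though it follows the shape Prometheus returns for an instant vector.

```javascript
// Build the URL for an instant query against Prometheus's /api/v1/query endpoint.
function buildInstantQueryUrl(prometheusUrl, promql) {
  const url = new URL('/api/v1/query', prometheusUrl);
  url.searchParams.set('query', promql);
  return url.toString();
}

// Flatten an instant-vector response into { labels, timestamp, value } rows.
function parseInstantVector(response) {
  if (response.status !== 'success' || response.data.resultType !== 'vector') {
    throw new Error('expected a successful instant-vector response');
  }
  return response.data.result.map(({ metric, value: [timestamp, raw] }) => ({
    labels: metric,
    timestamp,
    value: Number(raw), // Prometheus encodes sample values as strings
  }));
}

// Made-up sample response for illustration.
const sampleResponse = {
  status: 'success',
  data: {
    resultType: 'vector',
    result: [
      {
        metric: { __name__: 'up', job: 'node_exporter', instance: 'node_exporter:9100' },
        value: [1716450000, '1'],
      },
    ],
  },
};

const queryUrl = buildInstantQueryUrl('http://localhost:9090', 'up{job="node_exporter"}');
const rows = parseInstantVector(sampleResponse);
console.log(queryUrl);
console.log(rows);
```

Pasting the same PromQL into the Prometheus UI on port 9090 is a quick way to debug a panel that shows no data.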

Essential PromQL Queries

PromQL (Prometheus Query Language) is the heart of Grafana + Prometheus monitoring. It's a functional query language designed specifically for time-series data. Here are the queries every developer should know:

CPU utilization percentage per instance (idle inverted):

# Overall CPU usage %
100 - (avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)

# Per-core breakdown
100 - (avg by (instance, cpu) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)

# User vs system time
rate(node_cpu_seconds_total{mode="user"}[5m]) * 100
rate(node_cpu_seconds_total{mode="system"}[5m]) * 100

Memory usage and availability:

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Absolute memory used (GB)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1e9

# Swap usage
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
  / node_memory_SwapTotal_bytes * 100

HTTP request rates and error rates:

# Requests per second
rate(myapp_http_requests_total[5m])

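# HTTP error rate (5xx) as a percentage of all requests
# (sum both sides first; otherwise each 5xx series is divided only by itself)
sum(rate(myapp_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(myapp_http_requests_total[5m])) * 100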

# Requests per second by route
sum by (route) (
  rate(myapp_http_requests_total[5m])
)
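
To see why rate() is applied to counters, here is a simplified sketch of the idea: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. The function name simpleRate and the sample data are made up, and unlike Prometheus this version does not extrapolate to the edges of the window.

```javascript
// Simplified rate(): per-second increase over a window of counter samples.
// A drop in value is treated as a counter reset (for example, a process restart).
function simpleRate(samples) {
  if (samples.length < 2) return null; // need at least two points
  const seconds = samples[samples.length - 1].t - samples[0].t;
  if (seconds <= 0) return null;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // After a reset the counter restarts from 0, so the new value is the increase.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  return increase / seconds;
}

// Samples 15 seconds apart; the counter resets between t=30 and t=45.
const samples = [
  { t: 0, value: 100 },
  { t: 15, value: 130 },
  { t: 30, value: 160 },
  { t: 45, value: 30 },
  { t: 60, value: 60 },
];
console.log(simpleRate(samples)); // 2 requests per second
```

Graphing the raw counter would show a sawtooth that drops at every restart; the rate stays steady at 2 requests per second.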

Percentile latency using histograms:

# 50th percentile (median) latency
histogram_quantile(0.50,
  sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)
)

# 95th percentile latency
histogram_quantile(0.95,
  sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le)
)

# 99th percentile — catch worst-case outliers
histogram_quantile(0.99,
  sum(rate(myapp_http_request_duration_seconds_bucket[5m])) by (le, route)
)
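
The percentile queries above are estimates, not exact values. A rough sketch of the interpolation behind histogram_quantile, assuming cumulative bucket counts sorted by upper bound and a quantile between 0 and 1 (exclusive of 0); the function name and bucket data are made up for illustration:

```javascript
// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le, count }] sorted by le, ending with le = Infinity (+Inf).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const idx = buckets.findIndex((b) => b.count >= rank);
  if (buckets[idx].le === Infinity) {
    // No upper bound to interpolate toward: fall back to the highest finite bound.
    return buckets[idx - 1].le;
  }
  const lower = idx === 0 ? 0 : buckets[idx - 1].le;
  const prevCount = idx === 0 ? 0 : buckets[idx - 1].count;
  const inBucket = buckets[idx].count - prevCount;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[idx].le - lower) * ((rank - prevCount) / inBucket);
}

// 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1s, 1 slower than 1s.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];
const p50 = histogramQuantile(0.5, buckets);
const p95 = histogramQuantile(0.95, buckets);
const p999 = histogramQuantile(0.999, buckets);
console.log({ p50, p95, p999 });
```

Because the result is interpolated inside a bucket, accuracy depends on bucket boundaries: place them close to the latencies you care about, such as your SLO threshold.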

Using Dashboard Variables

Variables make dashboards reusable. Instead of hard-coding an instance name into every query, you define a variable like $instance that populates from Prometheus label values, then reference it in all queries.

Variable Configuration (Query type)
# Variable name: instance
# Data source: Prometheus
# Query: label_values(node_cpu_seconds_total, instance)
# This populates a dropdown with all scraped instances

# Then use it in panels:
100 - (avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle", instance="$instance"}[5m])
) * 100)

# Multi-value variable (select multiple instances):
rate(myapp_http_requests_total{instance=~"$instance"}[5m])
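
The multi-value case needs the =~ regex matcher because the selected values are joined into one alternation before the query is sent. The sketch below is a hypothetical stand-in for that substitution, not Grafana's actual code; it assumes values are regex-escaped with a doubled backslash (PromQL string literals need the backslash itself escaped) and joined with |.

```javascript
// Escape regex metacharacters; the doubled backslash survives PromQL string parsing.
function escapeForPromRegex(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\\\$&');
}

// Hypothetical stand-in for dashboard variable substitution.
function interpolate(query, name, selected) {
  const replacement = selected.length === 1
    ? selected[0]
    : `(${selected.map(escapeForPromRegex).join('|')})`;
  // split/join avoids the special meaning of "$" in String.prototype.replace.
  return query.split(`$${name}`).join(replacement);
}

const template = 'rate(myapp_http_requests_total{instance=~"$instance"}[5m])';
const single = interpolate(template, 'instance', ['app-1:8080']);
const multi = interpolate(template, 'instance', ['app-1:8080', 'app-2:8080']);
console.log(single);
console.log(multi);
```

With an exact matcher (instance="$instance"), the joined value would be compared literally and match nothing, which is why multi-value variables go with =~.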

Configuring Alerts

Grafana's built-in alert engine (Grafana Alerting, available since Grafana 8) lets you define alert rules directly in the UI without a separate tool. Alerts fire when a query crosses a threshold for a sustained duration.

Alert Fatigue is Real

Alerting on every small blip destroys trust in your monitoring system. Only alert on conditions that require human action. Use a for duration (e.g., 5 minutes) to avoid alerts firing on transient spikes. Distinguish between pages (wake someone up at 3am) and tickets (fix during business hours).
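
The effect of a "for" duration can be sketched as a small state machine: a breach first puts the alert into a pending state, and it fires only once the breach has lasted the full duration. This is a simplified model with a hypothetical function name and made-up CPU values, not Grafana's evaluation engine.

```javascript
// Returns the alert state after each evaluation: 'normal', 'pending', or 'firing'.
function evaluateStates(values, threshold, evalIntervalSec, forSec) {
  let breachedFor = null; // seconds the condition has held; null while not breaching
  return values.map((value) => {
    if (value <= threshold) {
      breachedFor = null; // any recovery resets the timer
      return 'normal';
    }
    breachedFor = breachedFor === null ? 0 : breachedFor + evalIntervalSec;
    return breachedFor >= forSec ? 'firing' : 'pending';
  });
}

// CPU % evaluated once a minute, threshold 85, "for" duration 5 minutes.
const cpu = [70, 90, 92, 60, 88, 91, 93, 95, 90, 89];
const states = evaluateStates(cpu, 85, 60, 300);
console.log(states);
```

The two-minute spike at the start never leaves the pending state; only the sustained breach at the end reaches firing.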

Creating an Alert Rule

Alert Rule: High CPU Usage
# Navigate to: Alerting → Alert rules → New alert rule

# Step 1: Define query (A)
# Expression:
100 - (avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)

# Step 2: Set condition (B, C)
# Reduce (B): Last of A
# Threshold (C): B IS ABOVE 85
# (a Classic condition collapses all instances into one alert and drops per-instance labels)

# Step 3: Set evaluation
# Evaluate every: 1m
# For: 5m   ← must stay above threshold for 5 min to fire

# Step 4: Set labels for routing
severity: warning
team: infrastructure

# Alert message template:
# CPU on {{ $labels.instance }} is {{ printf "%.1f" $values.B.Value }}%
# (above the 85% threshold for 5 minutes)

Setting Up Contact Points

Slack Contact Point Configuration
# Alerting → Contact points → Add contact point

# Type: Slack
# Webhook URL: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
# Channel: #alerts-production

# Message template (optional, override default):
{{ define "slack.message" -}}
{{ if eq .Status "firing" }}🔴{{ else }}✅{{ end }}
*{{ .CommonLabels.alertname }}* — {{ .Status | toUpper }}
{{ range .Alerts }}
• *Instance:* {{ .Labels.instance }}
• *Severity:* {{ .Labels.severity }}
• *Value:* {{ .Annotations.value }}
{{ end }}
{{- end }}

After creating contact points, configure Notification policies to route alerts based on labels. For example, route severity=critical alerts to PagerDuty and severity=warning alerts to Slack.
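
Label-based routing boils down to matching an alert's labels against an ordered list of policies. A minimal sketch, assuming first-match-wins and a default contact point; the policy list, contact point names, and routeAlert function are hypothetical, and real notification policies also support nesting, regex matchers, and grouping.

```javascript
// Hypothetical policy list, checked in order; the first match wins.
const policies = [
  { match: { severity: 'critical' }, contactPoint: 'pagerduty' },
  { match: { severity: 'warning' }, contactPoint: 'slack' },
];
const defaultContactPoint = 'email';

// Return the contact point for an alert based on its labels.
function routeAlert(labels) {
  const policy = policies.find(({ match }) =>
    Object.entries(match).every(([key, value]) => labels[key] === value));
  return policy ? policy.contactPoint : defaultContactPoint;
}

console.log(routeAlert({ severity: 'critical', team: 'infrastructure' }));
console.log(routeAlert({ severity: 'warning', team: 'infrastructure' }));
console.log(routeAlert({ team: 'infrastructure' }));
```

This is why the severity and team labels set on the alert rule matter: they are the only information the routing step sees.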

Grafana Panel Types Reference

Choosing the right panel type is as important as writing the right query. Here's when to use each visualization:

| Panel Type | Best For | Example Metric |
|---|---|---|
| Time Series | Trending metrics over time, spotting spikes and patterns | CPU %, requests/sec, latency |
| Gauge | Single current value vs a min/max range with color thresholds | Disk usage %, memory used |
| Stat | Single highlighted value with optional sparkline | Uptime, total requests, error count |
| Bar Chart | Comparing values across categories or time intervals | Requests by route, errors by service |
| Table | Detailed tabular data, sortable with multiple columns | Top 10 slowest endpoints, alert list |
| Heatmap | Distribution over time (latency percentiles, request buckets) | Response time distribution |
| Pie / Donut | Proportional breakdown of a total | Traffic by country, errors by type |
| Logs | Log stream from Loki with highlighting and filtering | Application logs, error logs |

Production Monitoring Best Practices

A monitoring system is only as good as the questions it helps you answer quickly during an incident. The following frameworks help you decide what to measure and how to organize it.

The Four Golden Signals (Google SRE)

Google's Site Reliability Engineering book defines four signals that matter most for any user-facing service: latency, traffic, errors, and saturation.

Dashboard Organization Strategy

[Figure: dashboard hierarchy. Level 1: Service Overview (Golden Signals, all services on one screen). Level 2: API Service (routes, errors, latency); Database (queries, connections, locks); Infrastructure (CPU, memory, disk, network). Level 3: per-endpoint drill-down; slow query analysis; host-level details.]

Three-tier dashboard hierarchy: overview → service → detail

Production Monitoring Checklist

  • Use the RED method for services: Rate (requests/sec), Errors (error rate), Duration (latency) — a concise alternative to the Four Golden Signals for microservices
  • Always track p99, not just averages: Averages mask the 1% of users having a terrible experience
  • Set retention appropriately: 15 days for high-resolution data, use recording rules to downsample for longer-term trends
  • Use annotations for deployments: Mark every release so you can instantly correlate latency changes with code changes
  • Provision dashboards as code: Store dashboard JSON in Git and provision via Grafana's provisioning directory — avoids dashboard drift
  • Test your alerts: Regularly verify that contact points work with test notifications; an alert that silently fails is worse than no alert
  • Keep dashboards focused: One row per concern, max 12 panels per dashboard — dense dashboards slow down incident response
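
The p99 point in the checklist is easy to demonstrate with numbers. A small sketch with made-up latencies, using a nearest-rank percentile (one of several common definitions):

```javascript
// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile, computed on a sorted copy of the input.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up data: 980 requests at 50 ms and 20 requests at 3000 ms.
const latencies = [...Array(980).fill(50), ...Array(20).fill(3000)];

console.log(mean(latencies));           // 109
console.log(percentile(latencies, 50)); // 50
console.log(percentile(latencies, 99)); // 3000
```

The average of 109 ms looks healthy, and the median looks even better, yet 2% of requests take three seconds. Only the high percentile shows it.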

Recording Rules for Performance

Complex PromQL queries over large time ranges are expensive. Recording rules pre-compute and store the results as new metrics, making dashboards load faster and reducing Prometheus CPU load. Prometheus loads them from the files listed under rule_files in prometheus.yml.

recording_rules.yml
groups:
  - name: myapp_recording_rules
    interval: 30s
    rules:
      # Pre-compute HTTP request rate to speed up dashboards
      - record: job:myapp_http_requests_total:rate5m
        expr: sum by (job) (rate(myapp_http_requests_total[5m]))

      # Pre-compute error rate percentage
      - record: job:myapp_http_error_rate:rate5m
        expr: |
          sum by (job) (rate(myapp_http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(myapp_http_requests_total[5m])) * 100

      # Pre-compute p95 latency
      - record: job:myapp_request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, job) (
              rate(myapp_http_request_duration_seconds_bucket[5m])
            )
          )

Useful Grafana HTTP API Commands

| Task | API Command |
|---|---|
| List all dashboards | GET /api/search?type=dash-db |
| Export dashboard JSON | GET /api/dashboards/uid/:uid |
| Import dashboard | POST /api/dashboards/db |
| List alert rules | GET /api/v1/provisioning/alert-rules |
| Silence an alert | POST /api/alertmanager/grafana/api/v2/silences |
| Create annotation | POST /api/annotations |
| Test data source | GET /api/datasources/proxy/:id/api/v1/query |
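
These endpoints can be scripted, for example to mark a deployment from a CI job. The sketch below only builds the requests and does not send them. It assumes a service account token passed as a Bearer token; the helper name grafanaRequest, the placeholder token, and the annotation text are all hypothetical.

```javascript
// Hypothetical helper: build an authenticated request for the Grafana HTTP API.
function grafanaRequest(baseUrl, token, method, path, body) {
  return {
    url: new URL(path, baseUrl).toString(),
    options: {
      method,
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json',
      },
      body: body === undefined ? undefined : JSON.stringify(body),
    },
  };
}

const base = 'http://localhost:3000';
const token = 'YOUR_SERVICE_ACCOUNT_TOKEN'; // placeholder, not a real token

const listDashboards = grafanaRequest(base, token, 'GET', '/api/search?type=dash-db');
const markDeploy = grafanaRequest(base, token, 'POST', '/api/annotations', {
  time: Date.now(), // epoch milliseconds
  tags: ['deployment'],
  text: 'Deployed myapp (example annotation)',
});

console.log(listDashboards.url);
console.log(markDeploy.options.body);
// To send one (Node 18+): await fetch(markDeploy.url, markDeploy.options)
```

Keeping the token in an environment variable or secret store, rather than in the script, avoids leaking it through version control.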

Conclusion: From Blind to Observant

A well-configured Grafana stack transforms how you operate software. Instead of reactive fire-fighting, you gain proactive visibility — you see the CPU trending toward saturation 20 minutes before it becomes a user-facing problem. You correlate a latency spike with a deployment that happened 30 minutes ago. You know which API endpoint is responsible for 60% of your database load.

The path from a blank Grafana instance to a production-grade monitoring system has clear steps: instrument your application with a Prometheus client library, configure scrape targets, write PromQL queries that answer the questions that matter (latency, error rate, saturation), and build dashboards that surface problems at a glance.

Key Takeaways

  • Grafana is a visualization layer, not a data store — it queries Prometheus, databases, and other sources in real time
  • Prometheus uses a pull model — your app must expose a /metrics endpoint; client libraries handle the format
  • PromQL's rate() function is essential for counters — raw counter values are meaningless without it
  • Histogram metrics enable percentile queries — always use histograms for latency, never summaries, to allow aggregation across instances
  • Alert on symptoms, not causes — "users are getting errors" is more actionable than "CPU is at 70%"
  • Provision dashboards as code — Grafana's provisioning system lets you store dashboards in Git and deploy them automatically
  • The four golden signals — Latency, Traffic, Errors, Saturation — are the foundation of any production dashboard
"You can't manage what you can't measure. But with Grafana, you can measure everything — and finally understand what you're managing."
Mayur Dabhi

Full Stack Developer with 5+ years of experience building scalable web applications with Laravel, React, and Node.js.