Safe vs. Risky: Node Exporter Docker Commands Compared

Here are the two commands being compared:

docker run -d \
  --name=node-exporter \
  --restart=always \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro" \
  --log-driver json-file \
  --log-opt max-size=10m \
  --log-opt max-file=45 \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

docker run -d   \
--name=node-exporter \   
--restart=always  \
--net="host"  \
--pid="host"  \ 
-v "/:/host:ro"   quay.io/prometheus/node-exporter:latest   --path.rootfs=/host

Both commands are designed to spin up the Prometheus Node Exporter to monitor your host machine’s hardware and operating system. They look nearly identical, but the first command is production-ready and the second has a hidden trap that can eventually fill the disk and crash your server.

Here is the exact breakdown of the two major differences between them.

Difference 1: Log Rotation (The Crucial Difference)

The primary difference lies in these three lines present only in the first command:

Bash

--log-driver json-file \
--log-opt max-size=10m \
--log-opt max-file=45 \

The First Command (Safe & Controlled)

This explicitly configures Docker’s logging mechanisms. It tells Docker:

  • Treat container logs as standard JSON files.
  • max-size=10m: Once Node Exporter’s log file reaches 10 megabytes, roll it over and start a new one.
  • max-file=45: Keep at most 45 log files in total (the active file plus rotated ones); when a rotation would exceed that, the oldest file is deleted.

This sets a strict upper bound on how much space Node Exporter’s logs can take up on your host: $45 \times 10\text{ MB} = 450\text{ MB}$ maximum.
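
The arithmetic behind that cap can be checked with a small sketch. The function name max_log_disk_mb is hypothetical, and the sketch assumes max-file counts every log file Docker keeps for the container (the active one included) and ignores the small overshoot that can happen just before a rotation.

```python
def max_log_disk_mb(max_size_mb: int, max_file: int) -> int:
    """Approximate upper bound on json-file log usage for one container, in MB."""
    return max_size_mb * max_file


# The first command: --log-opt max-size=10m --log-opt max-file=45
print(max_log_disk_mb(10, 45))  # 450
```

The same function shows why a smaller max-file (for example 3) is often enough for a quiet exporter: the bound drops to 30 MB.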

The Second Command (The Hidden Danger)

Because it omits these lines, it falls back to whatever logging configuration the Docker daemon has. Out of the box, that is the json-file driver with no size limit and no rotation.

If Node Exporter encounters a continuous warning or error (for example, if a specific disk partition becomes unreadable), it will spam logs rapidly. Without a cap, that single log file will grow to 10 GB, 50 GB, or more, until it completely consumes your host’s hard drive and locks up the server.

Difference 2: Readability and Syntax

The second difference comes down to formatting and execution safety.

  • First Command: Uses the backslash (\) at the end of every line. This tells the Linux terminal, “This command isn’t finished yet, continue reading on the next line.” It makes the script highly readable and easy to modify.
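  • Second Command: The line continuations are fragile. A backslash only continues a command when it is the very last character on its line; in the second block some backslashes are followed by stray spaces, and the volume flag, image name, and --path.rootfs are crammed onto a single line. If you copy and paste that block exactly as typed into a standard Linux terminal, the shell can end the command early, fail with a missing or invalid image name, and then treat the remaining lines as separate, unrecognized commands.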
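
One way a pasted multi-line command breaks is a backslash that is not the last character on its line. The following is a minimal sketch of a check for that; the function name broken_continuations and the sample string are hypothetical and only illustrate the idea.

```python
def broken_continuations(script: str) -> list[int]:
    """Return 1-based line numbers where a backslash is followed by trailing whitespace."""
    bad = []
    for number, line in enumerate(script.split("\n"), start=1):
        stripped = line.rstrip(" \t")
        # A continuation backslash must be the very last character on the line.
        if stripped.endswith("\\") and stripped != line:
            bad.append(number)
    return bad


# Line 2 has three spaces after its backslash, so the shell would not continue it.
sample = "docker run -d \\\n  --name=node-exporter \\   \n  --restart=always \\\n  image"
print(broken_continuations(sample))  # [2]
```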

Summary Matrix

| Feature | Command 1 (With Log Options) | Command 2 (Without Log Options) |
| --- | --- | --- |
| Maximum Disk Used by Logs | Capped at 450 MB | Unbounded (will grow until disk is full) |
| Log Files Auto-Cleaned? | Yes, old logs are purged automatically. | No, requires manual intervention or a global Docker daemon config. |
| Terminal Copy-Paste Ready | Yes (Clean line breaks via \). | No (Will fail unless flattened into a single line). |

Which one should you use?

Always use the first command. When you are managing 20+ production servers, you cannot afford to have a single chatty container quietly fill up a local hard drive. The first command ensures predictable storage safety.

Building a Grafana Dashboard for Multi-Host Metrics

This setup uses a centralized monitoring architecture: one central Grafana and Prometheus server pulls metrics from 20 lightweight cAdvisor agents running across your network.

Here is the cleanest way to architect and deploy this.

Centralized Architecture Overview

[ Central Monitor Server ]               [ 20x Remote Docker Hosts ]
┌────────────────────────┐               ┌─────────────────────────┐
│        Grafana         │               │ Docker Host 01          │
│           ▲            │               │  └─ cAdvisor (Port 8080)│
│           │ (Queries)  │               └─────────────────────────┘
│       Prometheus       │◄─────────────┐             ▲
└────────────────────────┘  (Scrapes    │             │
                            over HTTP)  ├─────────────┤
                                        │             ▼
                                        │┌─────────────────────────┐
                                        ││ Docker Host 20          │
                                        └┤  └─ cAdvisor (Port 8080)│
                                         └─────────────────────────┘

  • The Agents (Remote Hosts): Every one of your 20 servers runs just a single, lightweight cAdvisor container. They expose their metrics on port 8080.
  • The Core (Central Host): One designated management server runs Prometheus and Grafana. Prometheus is configured to reach out over your network to all 20 servers to grab their metrics.

Step 1: Deploy cAdvisor on all 20 Remote Hosts

You need to spin up cAdvisor on every Docker server. If you use an automation tool like Ansible, this is a one-click playbook. Otherwise, run this docker run command on each machine:

Bash

docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
--restart=always \
gcr.io/cadvisor/cadvisor:latest

Security Note: By publishing 8080:8080, cAdvisor metrics are public to anyone who can reach that IP. Ensure your internal firewall/security groups only allow traffic to port 8080 from your Central Prometheus Server IP.

Step 2: Configure the Central Prometheus Server

On your central monitoring server, create a prometheus.yml file. Instead of pointing to localhost, you will list your 20 servers using Prometheus targets and labels. Labels are crucial here because they allow you to filter your Grafana dashboard by specific servers.

YAML

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'docker-swarm-nodes'
    static_configs:
      - targets: ['192.168.1.50:8080']
        labels:
          instance: 'prod-web-01'
          env: 'production'
      - targets: ['192.168.1.51:8080']
        labels:
          instance: 'prod-db-01'
          env: 'production'
      - targets: ['192.168.1.70:8080']
        labels:
          instance: 'stage-app-01'
          env: 'staging'
      # ... add the remaining 17 servers here

Start Prometheus and Grafana on this central machine using a simplified version of the Docker Compose template from the single-host quick setup (remove the local cAdvisor service from it).
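
Writing 20 near-identical target entries by hand in Step 2 is error-prone. As a minimal sketch, the static_configs section could be rendered from a small inventory list. The function name render_static_configs and the inventory structure are hypothetical; the indentation assumes the block is pasted under a job_name entry like the one in Step 2.

```python
def render_static_configs(hosts: list[dict], port: int = 8080) -> str:
    """Render a static_configs YAML fragment with one labeled target per host."""
    lines = ["    static_configs:"]
    for host in hosts:
        lines.append(f"      - targets: ['{host['ip']}:{port}']")
        lines.append("        labels:")
        lines.append(f"          instance: '{host['name']}'")
        lines.append(f"          env: '{host['env']}'")
    return "\n".join(lines)


# Hypothetical inventory; replace with your own 20 servers.
inventory = [
    {"ip": "192.168.1.50", "name": "prod-web-01", "env": "production"},
    {"ip": "192.168.1.51", "name": "prod-db-01", "env": "production"},
]
print(render_static_configs(inventory))
```

Keeping the inventory in one place also makes it easier to reuse the same list for firewall rules or an automation tool later.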

Step 3: Multi-Host Grafana Dashboard

Once Prometheus is scraping all 20 hosts, you need a Grafana dashboard that lets you switch between them smoothly.

  1. Go to Grafana -> Dashboards -> Import.
  2. Use Dashboard ID: 14282 or 10619 (both are heavily optimized for multi-host setups).
  3. These dashboards automatically generate a dropdown menu at the top left labeled “Instance” or “Host” based on the labels you defined in your prometheus.yml.

You can now view your entire fleet’s aggregated resource consumption, or drill down into a specific container running on prod-db-01.

Pro-Tips for Managing 20+ Hosts

  • Node Exporter: cAdvisor only monitors container metrics. If you want to monitor the host Linux OS itself (host disk space, total RAM usage, bare-metal CPU temperature), deploy Prometheus Node Exporter alongside cAdvisor on all 20 nodes, exposing it on port 9100.
  • Scale Warning: Scraping 20 hosts every 15 seconds will generate a decent chunk of data. Ensure your Central Prometheus server has a reasonable storage retention policy set (e.g., --storage.tsdb.retention.time=15d to keep data for 15 days) so it doesn’t quietly fill up the server’s hard drive.

Monitoring Docker with Grafana and cAdvisor

Combining Grafana and cAdvisor (Container Advisor) is the standard open-source recipe for monitoring Docker and Kubernetes container metrics (like CPU, memory, network, and disk usage).

Because cAdvisor only keeps a tiny buffer of real-time data in memory, you need something to scrape and store that data (almost always Prometheus, sometimes with a collector such as Grafana Alloy feeding a Prometheus-compatible database) before Grafana can visualize it.

Here is a breakdown of how the architecture works, how to set it up, and how to get a dashboard running.

The Monitoring Pipeline

  1. cAdvisor: Sits on the host machine, hooks into the Linux kernel cgroups, and collects resource usage from all running containers. It exposes these raw numbers at a /metrics endpoint.
  2. Prometheus: Periodically “scrapes” (pulls) the data from cAdvisor’s /metrics endpoint and stores it as historical time-series data.
  3. Grafana: Queries Prometheus using PromQL and plots the data onto clean, interactive dashboards.

Quick Setup: Docker Compose Example

The easiest way to spin up cAdvisor, Prometheus, and Grafana all at once to monitor your local Docker containers is by using a docker-compose.yml file.

YAML

version: '3.8'
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    restart: unless-stopped

The Prometheus Config (prometheus.yml)

To tell Prometheus to scrape your cAdvisor container, create a prometheus.yml file in the same directory:

YAML

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

Run docker compose up -d, and your basic infrastructure is live!

Visualizing with Grafana Dashboards

Instead of building your container monitoring dashboards from scratch, you can import highly optimized community templates.

Recommended Pre-built Dashboards

  1. Log into your Grafana instance (usually http://localhost:3000; default credentials are admin / admin).
  2. Add Prometheus as your data source (Connections > Data sources > Add data source).
  3. Go to Dashboards > New > Import.
  4. Paste one of these popular community Dashboard IDs:
    • 19908 (cAdvisor Docker Insights – clean, official-feeling modern dashboard)
    • 14282 (Cadvisor Exporter – great for focused container-by-container metrics)
    • 19792 (Advanced cAdvisor dashboard with support for Docker Compose projects)

Key cAdvisor Metrics to Know

When creating your own panels or alerts, look out for these fundamental metric names:

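  • CPU Usage: container_cpu_usage_seconds_total (a counter, usually wrapped in rate() to get CPU cores in use, e.g. sum(rate(container_cpu_usage_seconds_total[5m])) by (name); multiply by 100 for a percentage of one core)
  • Memory Usage: container_memory_usage_bytes (total memory charged to the container, page cache included; container_memory_working_set_bytes is a better signal of real memory pressure)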
  • Network Traffic: container_network_receive_bytes_total and container_network_transmit_bytes_total
  • Disk I/O: container_fs_reads_bytes_total and container_fs_writes_bytes_total
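
Most of these metrics are counters, so the useful number is how fast they grow. The sketch below shows the basic idea behind rate() on a counter such as container_cpu_usage_seconds_total. It is deliberately simplified: the function name simple_rate and the sample values are hypothetical, and unlike PromQL's rate() it does not handle counter resets or extrapolate to the window edges.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter between the first and last sample.

    samples: (unix_timestamp, counter_value) pairs, oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# CPU seconds consumed by one container, sampled over a 5-minute window.
cpu_seconds = [(0, 100.0), (150, 130.0), (300, 175.0)]
cores_used = simple_rate(cpu_seconds)
print(cores_used)        # 0.25 (a quarter of one core on average)
print(cores_used * 100)  # 25.0 (percent of one core)
```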

Understanding Log Retention for Prometheus and Docker

In my setup, “retention” actually applies to three different things: Prometheus metrics (the graphs), Ubuntu system logs (the text files), and Docker container logs.

By default, they have very different lifespans.


1. Prometheus Metrics (The Graphs)

If you didn’t specify a retention time when you ran your Prometheus Docker container, it uses the default.

  • Default Duration: 15 days.
  • What happens after? Prometheus uses a “First-In-First-Out” system. Once data hits day 16, the data from day 1 is deleted to make room.
  • How to change it: If you want 30 days of history to show your Director month-over-month trends, add this flag to your docker run command for Prometheus: --storage.tsdb.retention.time=30d

2. Ubuntu System Logs (/var/log)

This is handled by a service called logrotate. It manages things like your mail.log, syslog, and auth.log.

  • Default Duration: Usually 4 weeks (28 days).
  • How it works: It keeps 4 “rotated” files. Every Sunday, it compresses the current log and deletes the oldest one.
  • How to check your specific settings:

Bash

cat /etc/logrotate.d/rsyslog

Look for the number next to rotate. If it says 4, and the interval is weekly, you have 28 days.

3. Docker Container Logs

This is the danger zone. By default, Docker container logs (like the ones for cadvisor or node-exporter) have no limit.

If a container starts throwing thousands of errors, the log file will grow until it fills your entire hard drive. Since we are doing a Pilot Group, you should verify your Docker logging driver.

The “Safe” way to run your containers:

Add these flags to your docker run commands to ensure you only keep 3 files of 10MB each:

Bash

--log-opt max-size=10m --log-opt max-file=3
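
The same limits can also be set once for the whole Docker daemon instead of per container. As a minimal sketch, this prints a candidate /etc/docker/daemon.json using the log-driver and log-opts keys; the function name daemon_log_config is hypothetical. Note the assumptions: daemon-level defaults only apply to containers created after Docker is restarted, so existing containers keep their old settings until they are recreated, and if the file already exists these keys should be merged into it rather than overwriting it.

```python
import json


def daemon_log_config(max_size: str = "10m", max_file: int = 3) -> str:
    """Build daemon.json content that caps json-file logs for new containers."""
    config = {
        "log-driver": "json-file",
        # daemon.json expects log-opts values as strings, including numbers.
        "log-opts": {"max-size": max_size, "max-file": str(max_file)},
    }
    return json.dumps(config, indent=2)


print(daemon_log_config())
```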

Summary Table

| Data Type | Default Retention | Controlled By |
| --- | --- | --- |
| Prometheus Data | 15 Days | --storage.tsdb.retention.time |
| System Logs | ~28 Days | /etc/logrotate.conf |
| Docker Logs | Unlimited (Until disk is full) | Docker Log Driver |

Recommendation for your 20 Servers

For an Executive Director’s report, 15 days is usually too short. Most admins prefer 90 days for Prometheus so they can compare “This Quarter vs. Last Quarter.”

To check how much disk space your current Prometheus data is taking:

Bash

du -sh /var/lib/docker/volumes/<prometheus_volume_name>/_data

To update your Prometheus retention to 90 days, you need to restart the container with a specific flag. Since you are likely running this via a docker run command, we will stop the old one and start the new one with the updated storage policy.

1. Update Prometheus to 90-Day Retention

Run these commands on your Central Monitoring Server:

Bash

# 1. Stop and remove the existing Prometheus container
docker stop prometheus
docker rm prometheus
# 2. Start it again with the 90-day retention flag
docker run -d \
--name=prometheus \
--restart=always \
--publish=9090:9090 \
-v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v prometheus-data:/prometheus \
prom/prometheus:latest \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/prometheus \
--storage.tsdb.retention.time=90d

2. Why 90 Days? (The Strategy)

Setting it to 90 days is the “sweet spot” for several reasons:

  • Quarterly Reporting: You can show your Executive Director performance trends across a full 3-month business quarter.
  • Capacity Planning: 90 days of data gives linear-trend forecasts (the predict_linear function we discussed) a longer history to fit, which generally makes the projections more stable.
  • Storage Management: Prometheus is very efficient. For 20 servers, 90 days of metrics will likely consume on the order of 5GB to 10GB of disk space, though the real figure scales with how many time series each host exposes.
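
A rough way to sanity-check the storage figure is the sizing rule from the Prometheus storage documentation: disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with about 1 to 2 bytes per sample after compression. The sketch below applies that rule; the function name estimate_tsdb_bytes and the inputs (500 series per host, 1.5 bytes per sample) are assumptions for illustration, not measurements from this setup.

```python
def estimate_tsdb_bytes(
    hosts: int,
    series_per_host: int,
    scrape_interval_s: int,
    retention_days: int,
    bytes_per_sample: float = 1.5,
) -> float:
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    total_series = hosts * series_per_host
    return total_series * retention_seconds * bytes_per_sample / scrape_interval_s


# Assumed inputs: 20 hosts, 500 series each, 15s scrape interval, 90 days.
estimate = estimate_tsdb_bytes(20, 500, 15, 90)
print(f"{estimate / 1e9:.1f} GB")  # 7.8 GB
```

The estimate grows linearly with series count, so a host that exposes several thousand series (busy Docker hosts often do) pushes the total well beyond this figure. Comparing the result with the du output above is the quickest reality check.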

3. Verify the Change

Once the container is back up, you can verify that the new setting is active:

  1. Open your browser to http://<CENTRAL_IP>:9090 and open the Status menu.
  2. Select Command-Line Flags (served at /flags).
  3. Confirm that --storage.tsdb.retention.time=90d is listed there.

4. Pro-Tip: Disk Space Alert

Since you are now keeping 6 times more data than the default, you should add a simple alert in Grafana to monitor the Central Server’s own disk space.

The “Golden Rule” of Monitoring: The monitoring server must be the most stable server in the fleet. If its disk fills up because of long retention, you lose visibility into all 20 other servers.

Next Steps for your Project

Now that your data is safe for the long term:

  1. Check the logs: docker logs prometheus to ensure there are no “permission denied” errors on the data folder.
  2. Snapshot Check: Since you are modifying the central server, it wouldn’t hurt to take a quick vSphere snapshot of this one too, just like you did for the pilot nodes.

Set Up Node Exporter for Centralized Monitoring

1. Run this on each of the 20 Linux servers

I will provide an Ansible playbook in the next post (when you have multiple servers, automation is key).

docker run -d \
--name=node-exporter \
--restart=always \
--net="host" \
--pid="host" \
-v "/:/host:ro,rslave" \
quay.io/prometheus/node-exporter:latest \
--path.rootfs=/host

I would use :ro,rslave instead of only :ro, because the official Docker example for node_exporter bind-mounts the host root this way so that the container correctly sees host mount points, including ones mounted after it starts. Node Exporter is meant to monitor the host system, not just the container. (GitHub)

Check one server:

curl http://localhost:9100/metrics

From central Prometheus server:

curl http://SERVER_IP:9100/metrics
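
The curl output is plain text in the Prometheus exposition format, so a quick check that an exporter is returning what you expect can be scripted. This is a minimal sketch that works on text you have already fetched (for example, saved curl output); the function name metric_names and the sample text are hypothetical.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


sample = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""
names = metric_names(sample)
print("node_load1" in names)  # True
```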

2. Open firewall only from Prometheus server

On each Linux host, allow port 9100 only from your central Prometheus server:

sudo ufw allow from PROMETHEUS_SERVER_IP to any port 9100 proto tcp

Do not expose 9100 publicly.


3. Central Prometheus config

On your central monitoring server, Prometheus scrapes all 20 Node Exporters.

prometheus.yml:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "linux_servers"
    static_configs:
      - targets:
          - "10.0.1.11:9100"
          - "10.0.1.12:9100"
          - "10.0.1.13:9100"
          - "10.0.1.14:9100"
          - "10.0.1.15:9100"
          # add all 20 servers here

Prometheus uses scrape_configs and targets to pull metrics from exporters. (Prometheus)

Restart Prometheus:

docker restart prometheus

4. Add Prometheus to Grafana

In Grafana:

Connections → Data sources → Prometheus
URL: http://PROMETHEUS_SERVER_IP:9090
Save & Test

Then import dashboard:

Dashboard ID: 1860

That is the popular Node Exporter Full dashboard.


Final architecture

20 Linux Servers
↓ node-exporter :9100
Central Prometheus
↓
Grafana Dashboard

Important: Node Exporter does not send data to Grafana directly.
It exposes metrics, Prometheus pulls them, and Grafana visualizes Prometheus data.

cAdvisor: Your Guide to Container Monitoring

cAdvisor Explained

What is cAdvisor?

cAdvisor (Container Advisor) is an open-source tool by Google that collects, aggregates, and exports resource usage and performance metrics from running containers. It gives you deep visibility into what every container on your host is doing.

┌─────────────────────────────────────────────────────────────┐
│                         LINUX HOST                          │
│                                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│  │Container │   │Container │   │Container │   │Container │  │
│  │  nginx   │   │   api    │   │ postgres │   │  redis   │  │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘  │
│       │              │              │              │        │
│       └──────────────┴──────────────┴──────────────┘        │
│                             │                               │
│                   ┌─────────▼─────────┐                     │
│                   │     cAdvisor      │                     │
│                   │                   │                     │
│                   │ reads cgroups     │                     │
│                   │ reads /proc       │                     │
│                   │ reads /sys        │                     │
│                   │ reads Docker API  │                     │
│                   └─────────┬─────────┘                     │
│                             │ exposes                       │
│                   ┌─────────▼─────────┐                     │
│                   │   :8080/metrics   │                     │
│                   │ (Prometheus fmt)  │                     │
│                   └───────────────────┘                     │
└─────────────────────────────────────────────────────────────┘

How cAdvisor Works

Container Runtime (Docker / containerd)
                  │ Docker API / containerd API
┌─────────────────────────────────────┐
│              cAdvisor               │
│                                     │
│  ┌─────────────────────────────┐    │
│  │  Container Discovery        │    │
│  │  polls Docker API every 1s  │    │
│  │  detects start/stop         │    │
│  └──────────────┬──────────────┘    │
│                 │                   │
│  ┌──────────────▼──────────────┐    │
│  │  Metrics Collection         │    │
│  │  /sys/fs/cgroup (limits)    │    │
│  │  /proc/<pid>/ (usage)       │    │
│  │  /sys/class/net/ (network)  │    │
│  └──────────────┬──────────────┘    │
│                 │                   │
│  ┌──────────────▼──────────────┐    │
│  │  In-memory Storage          │    │
│  │  keeps ~2 min of history    │    │
│  └──────────────┬──────────────┘    │
│                 │                   │
│  ┌──────────────▼──────────────┐    │
│  │  Export Endpoints           │    │
│  │  /metrics (Prometheus)      │    │
│  │  /api/v1.3 (REST API)       │    │
│  │  /containers (Web UI)       │    │
└──┴─────────────────────────────┴────┘

Deploy cAdvisor

Standalone Docker

# docker-compose.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    # Required volume mounts: read host filesystem
    volumes:
      - /:/rootfs:ro                                    # root filesystem
      - /var/run:/var/run:ro                            # Docker socket dir
      - /var/run/docker.sock:/var/run/docker.sock:ro    # Docker API
      - /sys:/sys:ro                                    # kernel/cgroups info
      - /var/lib/docker:/var/lib/docker:ro              # Docker data dir
      - /dev/disk:/dev/disk:ro                          # disk info
    # Required for accessing kernel metrics
    privileged: true
    devices:
      - /dev/kmsg                                       # kernel message buffer
    # Performance tuning
    command:
      - '--housekeeping_interval=10s'                   # collect every 10s
      - '--max_housekeeping_interval=15s'
      - '--event_storage_event_limit=default=0'
      - '--event_storage_age_limit=default=0'
      # Note: "disk" and "diskIO" in this list turn off the container_fs_* metrics;
      # remove them from the list if you want the disk queries and panels shown later.
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory,cpu_topology,resctrl'
      - '--docker_only=true'                            # only Docker containers
      - '--store_container_labels=false'

Kubernetes DaemonSet

# cadvisor runs on every node as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: cadvisor
          image: gcr.io/cadvisor/cadvisor:v0.47.2
          ports:
            - containerPort: 8080
              name: http
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: var-run
              mountPath: /var/run
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: docker
              mountPath: /var/lib/docker
              readOnly: true
            - name: dev-disk
              mountPath: /dev/disk
              readOnly: true
          securityContext:
            privileged: true
          resources:
            requests:
              memory: 200Mi
              cpu: 150m
            limits:
              memory: 400Mi
              cpu: 300m
      volumes:
        - name: rootfs
          hostPath:
            path: /
        - name: var-run
          hostPath:
            path: /var/run
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/docker
        - name: dev-disk
          hostPath:
            path: /dev/disk

cAdvisor Web UI

Access at http://localhost:8080:

http://localhost:8080/containers/ → all containers overview
http://localhost:8080/docker/ → Docker-specific view
http://localhost:8080/metrics → Prometheus metrics endpoint

Container detail page shows:
├── Isolation (CPU/memory limits set)
├── Usage (real-time CPU/memory charts)
├── Processes (running inside container)
└── Subcontainers (if applicable)

Key Metrics Exposed

cAdvisor exposes hundreds of metrics — here are the most important:

CPU Metrics
# ── Total CPU usage (all cores) ──────────────────────────────
# CPU seconds used — rate gives usage per second
container_cpu_usage_seconds_total{
name="api",
cpu="total"
}
# CPU usage % (actual percentage of one core)
rate(container_cpu_usage_seconds_total{
name="api"
}[5m]) * 100
# CPU throttled time — how long container was throttled
container_cpu_cfs_throttled_seconds_total
# CPU throttle periods — how often throttled
container_cpu_cfs_throttled_periods_total
# CPU limit (from docker run --cpus)
container_spec_cpu_quota # microseconds
container_spec_cpu_period # period in microseconds
# CPU limit in cores
container_spec_cpu_quota / container_spec_cpu_period
# CPU usage % relative to limit
# (sum without (cpu) drops the cpu label so both sides of the division match)
sum without (cpu) (rate(container_cpu_usage_seconds_total{name="api"}[5m]))
/ (container_spec_cpu_quota{name="api"}
/ container_spec_cpu_period{name="api"})
* 100
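
To make the quota and period arithmetic concrete, here is a minimal worked example. It assumes a container started with docker run --cpus=1.5, which Docker translates to a CFS quota of 150000 microseconds per 100000-microsecond period; the function names and the 0.75-core usage figure are hypothetical.

```python
def cpu_limit_cores(quota_us: int, period_us: int) -> float:
    """CPU limit in cores: container_spec_cpu_quota / container_spec_cpu_period."""
    return quota_us / period_us


def cpu_percent_of_limit(cores_used: float, quota_us: int, period_us: int) -> float:
    """Usage as a percentage of the limit; cores_used is what rate() returns."""
    return cores_used / cpu_limit_cores(quota_us, period_us) * 100


print(cpu_limit_cores(150_000, 100_000))             # 1.5
print(cpu_percent_of_limit(0.75, 150_000, 100_000))  # 50.0
```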

Memory Metrics

# ── Memory usage ─────────────────────────────────────────────
# Current memory usage (includes cache)
container_memory_usage_bytes{name="api"}
# Working set memory (excludes reclaimable cache)
# — best metric for actual memory pressure
container_memory_working_set_bytes{name="api"}
# RSS memory (resident set size — actual RAM used by app)
container_memory_rss{name="api"}
# Page cache (filesystem cache — reclaimable)
container_memory_cache{name="api"}
# Memory limit set on container
container_spec_memory_limit_bytes{name="api"}
# Memory usage % relative to limit
container_memory_working_set_bytes{name="api"}
/ container_spec_memory_limit_bytes{name="api"}
* 100
# Memory page faults (minor — no disk I/O)
container_memory_failures_total{
name="api",
type="pgfault",
scope="container"
}
# Memory page faults (major — requires disk read)
container_memory_failures_total{
name="api",
type="pgmajfault",
scope="container"
}

Network Metrics

# ── Network I/O ──────────────────────────────────────────────
# Bytes received per second
rate(container_network_receive_bytes_total{
name="api"
}[5m])
# Bytes transmitted per second
rate(container_network_transmit_bytes_total{
name="api"
}[5m])
# Packets received per second
rate(container_network_receive_packets_total{
name="api"
}[5m])
# Packets transmitted per second
rate(container_network_transmit_packets_total{
name="api"
}[5m])
# Receive errors
rate(container_network_receive_errors_total{
name="api"
}[5m])
# Transmit errors
rate(container_network_transmit_errors_total{
name="api"
}[5m])
# Dropped packets received
rate(container_network_receive_packets_dropped_total{
name="api"
}[5m])

Disk / Filesystem Metrics

# ── Disk I/O ─────────────────────────────────────────────────
# Bytes read from disk per second
rate(container_fs_reads_bytes_total{
name="api"
}[5m])
# Bytes written to disk per second
rate(container_fs_writes_bytes_total{
name="api"
}[5m])
# Read operations per second (IOPS)
rate(container_fs_reads_total{
name="api"
}[5m])
# Write operations per second (IOPS)
rate(container_fs_writes_total{
name="api"
}[5m])
# Filesystem space used by container
container_fs_usage_bytes{
name="api"
}
# Filesystem space limit
container_fs_limit_bytes{
name="api"
}

Container Lifecycle Metrics

# ── Container state ──────────────────────────────────────────
# Container start time (unix timestamp)
container_start_time_seconds{name="api"}
# Container uptime in seconds
time() - container_start_time_seconds{name="api"}
# Last time container was seen alive
container_last_seen{name="api"}
# Detect container restarts (changes in start time)
changes(container_start_time_seconds{name="api"}[1h])
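
The restart-detection query relies on changes(), which counts how many times a series' value differs from the previous sample within the window. A minimal sketch of that counting logic, with a hypothetical function name and made-up start timestamps:

```python
def changes(values: list[float]) -> int:
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


# container_start_time_seconds sampled over an hour: the start time moved twice,
# so the container was restarted twice in that window.
start_times = [1700000000, 1700000000, 1700000900, 1700000900, 1700001500]
print(changes(start_times))  # 2
```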

Important Metric Labels

cAdvisor adds rich labels to every metric:

container_cpu_usage_seconds_total{
id="/docker/abc123", # container ID path
image="nginx:latest", # image name
name="my-nginx", # container name
container_label_com_docker_compose_project="myapp",
container_label_com_docker_compose_service="nginx",
container_label_com_docker_compose_version="2.0",
cpu="total"
}

| Label | Value example | Use |
| --- | --- | --- |
| name | my-nginx | Filter by container name |
| image | nginx:latest | Filter by image |
| id | /docker/abc123 | Unique container ID |
| container_label_* | compose project/service | Filter by compose labels |
| interface | eth0 | Network interface |
| device | /dev/sda | Disk device |

Prometheus Scrape Config for cAdvisor

# prometheus.yml
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    # Drop metrics we don't need (reduce cardinality)
    metric_relabel_configs:
      # Drop pause containers (k8s infrastructure)
      - source_labels: [image]
        regex: 'k8s.gcr.io/pause.*'
        action: drop
      # Drop empty container names
      - source_labels: [name]
        regex: ''
        action: drop
      # Drop high-cardinality metrics not needed
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total'
        action: drop
      # Keep only Docker containers (not system cgroups)
      - source_labels: [container_label_com_docker_compose_service]
        regex: '.+'
        action: keep
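
It helps to see which series survive these rules before deploying them. The sketch below simulates the drop and keep actions for three of the rules above (the __name__ rule is left out for brevity). It relies on two Prometheus behaviors: relabel regexes are anchored at both ends, and a missing label is treated as an empty string. The function name survives and the sample label sets are hypothetical.

```python
import re

RULES = [
    {"source": "image", "regex": r"k8s.gcr.io/pause.*", "action": "drop"},
    {"source": "name", "regex": r"", "action": "drop"},
    {"source": "container_label_com_docker_compose_service", "regex": r".+", "action": "keep"},
]


def survives(labels: dict[str, str], rules: list[dict] = RULES) -> bool:
    """Return True if a series with these labels passes every rule in order."""
    for rule in rules:
        value = labels.get(rule["source"], "")  # missing label reads as ""
        matched = re.fullmatch(rule["regex"], value) is not None  # anchored match
        if rule["action"] == "drop" and matched:
            return False
        if rule["action"] == "keep" and not matched:
            return False
    return True


compose_api = {"name": "api", "image": "myapp/api:1.0",
               "container_label_com_docker_compose_service": "api"}
plain_container = {"name": "node-exporter", "image": "quay.io/prometheus/node-exporter:latest"}
print(survives(compose_api))      # True
print(survives(plain_container))  # False
```

The second result is worth noticing: the final keep rule discards every container that was not started by Docker Compose, and it discards everything if cAdvisor is not exporting the compose labels at all.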

Useful PromQL Queries

# ── Top Consumers ────────────────────────────────────────────
# Top 5 containers by CPU usage
topk(5,
rate(container_cpu_usage_seconds_total{
name!="", image!=""
}[5m]) * 100
)
# Top 5 containers by memory (working set)
topk(5,
container_memory_working_set_bytes{
name!="", image!=""
}
)
# Top 5 containers by network receive
topk(5,
rate(container_network_receive_bytes_total{
name!="", image!=""
}[5m])
)
# Top 5 containers by disk writes
topk(5,
rate(container_fs_writes_bytes_total{
name!="", image!=""
}[5m])
)
# ── Health Checks ────────────────────────────────────────────
# Containers using more than 80% of memory limit
container_memory_working_set_bytes{name!=""}
/ container_spec_memory_limit_bytes{name!=""} > 0.8
# Containers being CPU throttled
rate(container_cpu_cfs_throttled_seconds_total{
name!=""
}[5m]) > 0
# Throttle % (how much CPU time is throttled)
rate(container_cpu_cfs_throttled_periods_total{
name!=""
}[5m])
/ rate(container_cpu_cfs_periods_total{
name!=""
}[5m])
* 100
# Containers that restarted in last hour
changes(container_start_time_seconds{
name!="", image!=""
}[1h]) > 0
# ── Resource Efficiency ──────────────────────────────────────
# CPU limit utilization per container
# (sum without (cpu) drops the cpu label so both sides of the division match)
sum without (cpu) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))
/ (container_spec_cpu_quota{name!=""}
/ container_spec_cpu_period{name!=""})
* 100
# Memory limit utilization per container
container_memory_working_set_bytes{name!=""}
/ container_spec_memory_limit_bytes{name!=""}
* 100
# Containers with no resource limits set
container_spec_memory_limit_bytes == 0

cAdvisor Grafana Dashboard

Import dashboard ID 14282 or build panels manually:

Docker Overview Dashboard
├── Row 1: Summary Stats
│ ├── Total containers running (stat)
│ ├── Total CPU usage % (gauge)
│ ├── Total memory usage (gauge)
│ └── Total network I/O (stat)
├── Row 2: CPU
│ ├── CPU usage by container (time series, stacked)
│ ├── CPU throttling % by container (time series)
│ └── CPU limit utilization (bar gauge)
├── Row 3: Memory
│ ├── Memory usage by container (time series, stacked)
│ ├── Memory working set by container (time series)
│ ├── Memory limit utilization % (bar gauge)
│ └── OOM events (stat)
├── Row 4: Network
│ ├── Network received by container (time series)
│ ├── Network transmitted by container (time series)
│ ├── Network errors (time series)
│ └── Dropped packets (time series)
└── Row 5: Disk
├── Disk read bytes by container (time series)
├── Disk write bytes by container (time series)
├── Disk IOPS (time series)
└── Container filesystem usage (bar gauge)

Alert Rules for cAdvisor

# prometheus/rules/cadvisor_alerts.yml
groups:
  - name: cadvisor
    rules:
      # Container down
      - alert: ContainerDown
        expr: |
          time() - container_last_seen{
            name!="",
            image!=""
          } > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{ $labels.name }}"
          description: "Container has not been seen for 60 seconds"
      # High CPU throttling
      - alert: ContainerCPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{name!=""}[5m])
          / rate(container_cpu_cfs_periods_total{name!=""}[5m])
          * 100 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU throttling: {{ $labels.name }}"
          description: "{{ $value | printf \"%.0f\" }}% of CPU time is throttled"
      # High memory usage
      - alert: ContainerMemoryHigh
        expr: |
          container_memory_working_set_bytes{name!=""}
          / container_spec_memory_limit_bytes{name!=""}
          * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory: {{ $labels.name }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}% of limit"
      # Container OOM risk
      - alert: ContainerOOMRisk
        expr: |
          container_memory_working_set_bytes{name!=""}
          / container_spec_memory_limit_bytes{name!=""}
          * 100 > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "OOM risk: {{ $labels.name }}"
          description: "Memory at {{ $value | printf \"%.1f\" }}%, OOM kill imminent"
      # Container restarting
      - alert: ContainerRestarting
        expr: |
          changes(container_start_time_seconds{
            name!="", image!=""
          }[30m]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container restarting: {{ $labels.name }}"
          description: "Restarted {{ $value }} times in last 30 minutes"
      # No CPU limit set
      - alert: ContainerNoCPULimit
        expr: |
          container_spec_cpu_quota{name!="", image!=""} == -1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No CPU limit: {{ $labels.name }}"
          description: "Container has no CPU limit and can consume all host CPU"
      # No memory limit set
      - alert: ContainerNoMemoryLimit
        expr: |
          container_spec_memory_limit_bytes{
            name!="", image!=""
          } == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No memory limit: {{ $labels.name }}"
          description: "Container has no memory limit, which is an OOM kill risk to the host"
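
The for: clause in each rule means the alert stays pending until the expression has been continuously true for that long; one sample below the threshold resets the clock. The sketch below illustrates that behavior for a rule like ContainerMemoryHigh (threshold 85, for: 5m). The function name first_firing_time and the evaluation data are hypothetical, and it is a simplification of what Prometheus does at each rule evaluation.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would fire, or None.

    samples: (timestamp_seconds, value) pairs, one per rule evaluation, oldest first.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: pending
            if timestamp - pending_since >= for_seconds:
                return timestamp           # held long enough: firing
        else:
            pending_since = None           # condition cleared: reset the clock
    return None


# Memory % of limit, evaluated every 60 seconds. The dip at t=180 resets the timer.
evaluations = [(0, 80), (60, 90), (120, 91), (180, 70), (240, 88),
               (300, 92), (360, 93), (420, 95), (480, 96), (540, 97)]
print(first_firing_time(evaluations, 85, 300))  # 540
```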

cAdvisor vs Node Exporter

They are complementary — not alternatives:

| | Node Exporter | cAdvisor |
| --- | --- | --- |
| Scope | Host / OS level | Container level |
| CPU metrics | Per core, per mode | Per container |
| Memory | Host RAM breakdown | Per container + limits |
| Network | Per NIC, host-level | Per container |
| Disk | Per device, per mount | Per container writes |
| Processes | Host process count | Container processes |
| Limits | N/A | CPU/memory limits & usage |
| Best for | Is the server healthy? | Which container is the problem? |

Debugging workflow:

Node Exporter → "Host CPU is 95%"
cAdvisor → "api container using 80% of host CPU"
App metrics → "api processing 10k req/s, 50ms p99"
Root cause found

cAdvisor Limitations

| Limitation | Workaround |
| --- | --- |
| Only ~2 min in-memory history | Use Prometheus for long-term storage |
| High metric cardinality with many containers | Drop unused metrics via relabeling |
| No application-level metrics | Add app-specific exporters |
| No log collection | Use Loki + Promtail alongside |
| No alerting | Use Prometheus Alertmanager |
| Resource overhead on busy hosts | Tune --housekeeping_interval |
| No cross-host aggregation | Prometheus federation or Thanos |

Performance Tuning

# Reduce cAdvisor overhead on busy hosts
command:
  # Increase the housekeeping (collection) interval (default 1s)
  - '--housekeeping_interval=10s'
  # Disable metrics you don't need
  - '--disable_metrics=percpu,sched,tcp,udp,hugetlb,referenced_memory,cpu_topology,resctrl'
  # Only monitor Docker (not all cgroups)
  - '--docker_only=true'
  # Don't store container labels (reduce cardinality)
  - '--store_container_labels=false'
  # Allowlist only needed labels (takes effect only with store_container_labels=false)
  - '--whitelisted_container_labels=com.docker.compose.service,com.docker.compose.project'
  # Shorten the in-memory history (default 2m)
  - '--storage_duration=1m'

cAdvisor is the standard tool for container-level observability — it answers the question “what is this specific container doing?” and is the foundation of container monitoring in both Docker and Kubernetes environments. Paired with Node Exporter for host metrics and Prometheus for storage, it gives you complete visibility from hardware up to individual container processes.

Monitor Linux and Docker with Grafana & Prometheus

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                         LINUX SERVER                         │
│                                                              │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────────┐ │
│  │ Node Exporter │  │   cAdvisor    │  │   Docker Engine   │ │
│  │               │  │               │  │ (metrics endpoint,│ │
│  │ CPU/RAM/Disk  │  │   Container   │  │     optional)     │ │
│  │ Network/FS    │  │    metrics    │  │                   │ │
│  └───────┬───────┘  └───────┬───────┘  └─────────┬─────────┘ │
│          │                  │                    │           │
│          └──────────────────┼────────────────────┘           │
│                             │ scrape                         │
│                     ┌───────▼───────┐                        │
│                     │  Prometheus   │                        │
│                     │               │                        │
│                     │    stores     │                        │
│                     │    metrics    │                        │
│                     └───────┬───────┘                        │
│                             │ query                          │
│                     ┌───────▼───────┐                        │
│                     │    Grafana    │                        │
│                     │               │                        │
│                     │  dashboards   │                        │
│                     │    alerts     │                        │
│                     └───────────────┘                        │
└──────────────────────────────────────────────────────────────┘

Project Structure

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── linux_alerts.yml
│       └── docker_alerts.yml
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml
│   │   └── dashboards/
│   │       └── dashboard.yml
│   └── dashboards/
│       ├── linux-server.json
│       └── docker.json
└── alertmanager/
    └── alertmanager.yml
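
If you prefer to create this layout with a script rather than by hand, a small Python sketch like the following can do it. The `scaffold` helper is hypothetical; it only creates empty placeholder files that the later steps fill in, and the demo runs in a temporary directory so nothing is written to your working directory.

```python
import tempfile
from pathlib import Path

# Files from the project structure above, relative to the project root.
FILES = [
    "docker-compose.yml",
    "prometheus/prometheus.yml",
    "prometheus/rules/linux_alerts.yml",
    "prometheus/rules/docker_alerts.yml",
    "grafana/provisioning/datasources/prometheus.yml",
    "grafana/provisioning/dashboards/dashboard.yml",
    "grafana/dashboards/linux-server.json",
    "grafana/dashboards/docker.json",
    "alertmanager/alertmanager.yml",
]


def scaffold(root: Path) -> list:
    """Create missing directories and empty files; return the files created."""
    created = []
    for rel in FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
            created.append(path)
    return created


# Demo in a throwaway directory; use scaffold(Path("monitoring")) for real.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(Path(tmp) / "monitoring")
    created_count = len(created)
    second_run_count = len(scaffold(Path(tmp) / "monitoring"))

print(created_count, second_run_count)
```

Running it a second time creates nothing, so it is safe to re-run after you have started editing the files.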

Step 1 — Docker Compose Stack

# docker-compose.yml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}

services:
  # ── Prometheus ───────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.49.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'   # keep 30 days
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'              # hot reload config
      - '--web.enable-admin-api'              # allows deleting data; drop if unused
    extra_hosts:
      # Lets Prometheus reach services on the host network (needed on Linux)
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  # ── Grafana ──────────────────────────────────────────────
  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=SecurePass123   # placeholder, change it
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_DOMAIN=grafana.yourdomain.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.gmail.com:587
      - GF_SMTP_USER=alerts@yourdomain.com
      - GF_SMTP_PASSWORD=your-smtp-password
      - GF_SMTP_FROM_ADDRESS=alerts@yourdomain.com
    networks:
      - monitoring
    depends_on:
      - prometheus

  # ── Node Exporter (Linux metrics) ────────────────────────
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    # No "ports:" or "networks:" here: with network_mode: host the exporter
    # listens directly on the host's port 9100.
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--collector.systemd'     # systemd unit metrics (needs access to the host D-Bus socket)
      - '--collector.processes'   # process metrics
    pid: host              # see host processes
    network_mode: host     # see host network stats
    cap_add:
      - SYS_TIME

  # ── cAdvisor (Docker container metrics) ──────────────────
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring

  # ── Alertmanager ─────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://alertmanager.yourdomain.com'
    networks:
      - monitoring
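
A service that uses network_mode: host shares the host's network stack, so port mappings and Compose networks do not apply to it. The following Python sketch flags that combination in an already-parsed Compose file. The `services` mapping is a hypothetical stand-in for what a YAML parser would return, and `host_network_conflicts` is an illustrative helper, not part of any Docker tooling.

```python
def host_network_conflicts(services: dict) -> list:
    """Return human-readable warnings for settings that clash with host networking."""
    problems = []
    for name, svc in services.items():
        if svc.get("network_mode") != "host":
            continue
        if svc.get("ports"):
            problems.append(f"{name}: 'ports' does not apply with network_mode: host")
        if svc.get("networks"):
            problems.append(f"{name}: 'networks' cannot be combined with network_mode: host")
    return problems


# Hypothetical parsed "services" section.
services = {
    "node-exporter": {"network_mode": "host", "ports": ["9100:9100"]},
    "cadvisor": {"ports": ["8080:8080"], "networks": ["monitoring"]},
}

warnings = host_network_conflicts(services)
for line in warnings:
    print(line)
```

A host-networked exporter is also not resolvable by its service name from containers on a bridge network, which matters when you write the Prometheus scrape targets.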

Step 2 — Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s       # collect metrics every 15s
  evaluation_interval: 15s   # evaluate rules every 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alertmanager connection
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load alert rules
rule_files:
  - /etc/prometheus/rules/linux_alerts.yml
  - /etc/prometheus/rules/docker_alerts.yml

scrape_configs:
  # ── Prometheus self-monitoring ────────────────────────────
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics

  # ── Linux Server (Node Exporter) ─────────────────────────
  # node-exporter runs with network_mode: host, so it is reached through the
  # host gateway rather than by service name. On Linux this requires
  # host.docker.internal to resolve inside the Prometheus container
  # (extra_hosts: "host.docker.internal:host-gateway").
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']
        labels:
          server: 'linux-prod-01'
          env: 'production'

  # Multiple servers
  - job_name: 'linux-servers'
    static_configs:
      - targets:
          - '10.0.1.10:9100'
          - '10.0.1.11:9100'
          - '10.0.1.12:9100'
        labels:
          env: 'production'
    relabel_configs:
      # Strip the port so "instance" is just the host address
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '$1'

  # ── Docker Containers (cAdvisor) ─────────────────────────
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          server: 'linux-prod-01'
    metric_relabel_configs:
      # Drop high-cardinality metrics we don't need
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total'
        action: drop
      # Keep only series from Compose-managed containers
      # (this also drops host-level and non-Compose container series)
      - source_labels: [container_label_com_docker_compose_service]
        regex: '.+'
        action: keep

  # ── Docker Engine metrics (optional) ─────────────────────
  # Requires "metrics-addr" to be set in the Docker daemon configuration
  - job_name: 'docker-engine'
    static_configs:
      - targets: ['host.docker.internal:9323']

  # ── Grafana self-monitoring ───────────────────────────────
  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']
    metrics_path: /metrics
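
To see what the relabel rule in the linux-servers job does, here is a rough Python equivalent. It is a sketch under two assumptions: Prometheus anchors relabel regexes at both ends (modelled here with `re.fullmatch`), and Python writes the capture group as `\1` where Prometheus writes `$1`. The `relabel_instance` name is made up for this example.

```python
import re


def relabel_instance(address: str,
                     regex: str = r"([^:]+):.*",
                     replacement: str = r"\1") -> str:
    """Mimic a 'replace' relabel action from __address__ to instance."""
    match = re.fullmatch(regex, address)
    if match is None:
        # No match: the rule leaves the label alone, and Prometheus then
        # falls back to the full address for "instance".
        return address
    return match.expand(replacement)


for target in ["10.0.1.10:9100", "10.0.1.11:9100", "web-01"]:
    print(target, "->", relabel_instance(target))
```

The practical effect is that dashboards and alerts show `10.0.1.10` instead of `10.0.1.10:9100`.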

Step 3 — Alert Rules

# prometheus/rules/linux_alerts.yml
groups:
  - name: linux.server
    interval: 30s
    rules:
      # ── CPU Alerts ───────────────────────────────────────────
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% (threshold: 85%)"

      - alert: CriticalCPUUsage
        expr: |
          100 - (avg by(instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) * 100) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      # ── Memory Alerts ─────────────────────────────────────────
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      - alert: CriticalMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      # ── Disk Alerts ───────────────────────────────────────────
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} /
           node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"

      - alert: DiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} /
           node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Only {{ $value | printf \"%.1f\" }}% disk space remaining"

      - alert: DiskWillFillIn24h
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600
          ) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk will fill in 24h on {{ $labels.instance }}"

      # ── Network Alerts ────────────────────────────────────────
      - alert: HighNetworkErrors
        expr: |
          rate(node_network_receive_errs_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network errors on {{ $labels.instance }}"
          description: "{{ $value | printf \"%.0f\" }} errors/sec on {{ $labels.device }}"

      # ── System Alerts ─────────────────────────────────────────
      - alert: ServerDown
        # Covers both node exporter jobs defined in prometheus.yml
        expr: up{job=~"node-exporter|linux-servers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server DOWN: {{ $labels.instance }}"
          description: "Node exporter is not reachable"

      - alert: HighLoadAverage
        expr: |
          node_load15 / count by(instance)(
            node_cpu_seconds_total{mode="idle"}
          ) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High load average on {{ $labels.instance }}"
          description: "15min load average is {{ $value | printf \"%.2f\" }} per core"

      - alert: SystemdServiceFailed
        expr: |
          node_systemd_unit_state{state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Systemd service failed on {{ $labels.instance }}"
          description: "Service {{ $labels.name }} is in failed state"

      - alert: ClockSkewDetected
        expr: |
          abs(node_timex_offset_seconds) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Clock skew on {{ $labels.instance }}"
          description: "Clock offset is {{ $value }}s"

# prometheus/rules/docker_alerts.yml
groups:
  - name: docker.containers
    rules:
      # ── Container Status Alerts ───────────────────────────────
      # absent() cannot tell you which container vanished, so compare the
      # last-seen timestamp with the current time instead.
      - alert: ContainerDown
        expr: |
          time() - container_last_seen{
            name!="",
            name!~".*_tmp.*"
          } > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container down: {{ $labels.name }}"

      - alert: ContainerRestarting
        expr: |
          changes(container_start_time_seconds{
            name!="", image!=""
          }[30m]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container restarting: {{ $labels.name }}"

      # ── Container CPU Alerts ──────────────────────────────────
      - alert: ContainerHighCPU
        expr: |
          sum by(name, instance) (
            rate(container_cpu_usage_seconds_total{
              name!="",
              image!=""
            }[5m])
          ) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU in container {{ $labels.name }}"
          description: "Container CPU is {{ $value | printf \"%.1f\" }}% of one core"

      # ── Container Memory Alerts ───────────────────────────────
      # The "> 0" filter skips containers without a memory limit, which
      # would otherwise divide by zero and fire constantly.
      - alert: ContainerHighMemory
        expr: |
          (container_memory_usage_bytes{name!="", image!=""} /
           (container_spec_memory_limit_bytes{name!="", image!=""} > 0) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory in container {{ $labels.name }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}% of limit"

      # container_oom_events_total is a counter, so alert on recent increases
      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_events_total{name!=""}[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container OOM killed: {{ $labels.name }}"

      # ── Container Disk Alerts ─────────────────────────────────
      - alert: ContainerHighDiskWrite
        expr: |
          rate(container_fs_writes_bytes_total{
            name!="",
            image!=""
          }[5m]) > 50000000  # 50MB/s
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk writes in container {{ $labels.name }}"
          description: "Writing {{ $value | humanize }}B/s"
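
Every rule above carries a `for:` clause, which delays firing until the expression has been true continuously for that long. The Python sketch below models that behaviour for a single series. It is a simplification for illustration (real Prometheus tracks this per label set at each evaluation interval), and `alert_states` is a made-up helper name.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_true) evaluations to inactive/pending/firing."""
    states = []
    active_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # Any gap resets the timer, so the alert must "earn" firing again.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# One evaluation per minute; the condition drops out once at t=120.
evaluations = [(0, True), (60, True), (120, False),
               (180, True), (240, True), (300, True)]
states = alert_states(evaluations, for_seconds=120)
print(states)
```

This is why a short spike never pages anyone with `for: 5m`, while `for: 0m` fires on the first evaluation where the expression is true.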

Step 4 — Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'

# Route tree (child routes are evaluated top to bottom)
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s        # wait before sending first alert
  group_interval: 5m     # wait between alert groups
  repeat_interval: 4h    # resend if still firing
  receiver: 'slack-default'
  routes:
    # Server down → immediate page (listed first so it wins over the
    # generic critical route and is not paged twice)
    - match:
        alertname: ServerDown
      receiver: 'pagerduty-critical'
      group_wait: 0s     # no delay for server down
    # Critical → PagerDuty, then keep matching so Slack gets it too
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h
      continue: true     # evaluate the next sibling route as well
    - match:
        severity: critical
      receiver: 'slack-default'
    # Warning → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 4h

receivers:
  # Default Slack channel (uses Alertmanager's built-in templates)
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        send_resolved: true

  # Warnings channel
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        title: '⚠️ {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Instance:* {{ .Labels.instance }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

  # PagerDuty for critical
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        severity: critical

  # Email notifications (not referenced by any route yet)
  - name: 'email-alerts'
    email_configs:
      - to: 'devops-team@yourdomain.com'
        headers:
          Subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        html: |
          <h2>{{ .GroupLabels.alertname }}</h2>
          {{ range .Alerts }}
          <p><b>Instance:</b> {{ .Labels.instance }}</p>
          <p><b>Description:</b> {{ .Annotations.description }}</p>
          {{ end }}
        send_resolved: true

inhibit_rules:
  # If server is down, suppress all other alerts for that server
  - source_match:
      alertname: ServerDown
    target_match_re:
      alertname: '.+'
    equal: ['instance']
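
The `continue` flag is easy to misread: it does not mean "also send to the default receiver", it means "keep checking the sibling routes that follow". The Python sketch below models first-match routing over a flat list of child routes. The route table is a hypothetical example rather than a copy of the configuration, and nested routes, regex matchers, and grouping are left out.

```python
def receivers_for(labels: dict, routes: list, default: str) -> list:
    """Return the receivers an alert with these labels would be sent to."""
    matched = []
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins unless the route says to continue
    # Alerts that match no child route fall back to the root receiver.
    return matched or [default]


# Hypothetical child routes, evaluated top to bottom.
routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "critical"}, "receiver": "slack-default"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]

critical = receivers_for({"alertname": "HighCPUUsage", "severity": "critical"}, routes, "slack-default")
warning = receivers_for({"severity": "warning"}, routes, "slack-default")
unlabeled = receivers_for({"severity": "info"}, routes, "slack-default")
print(critical, warning, unlabeled)
```

Without a second route that matches critical alerts, `continue: true` on the first one would have no further effect, and Slack would never see them.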

Step 5 — Grafana Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"     # match the Prometheus scrape_interval
      queryTimeout: "60s"
      httpMethod: POST

# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: 'Server Monitoring'
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      # Leave this off when "folder" is set; the two options conflict
      foldersFromFilesStructure: false

Step 6 — Key PromQL Queries

# ── CPU ────────────────────────────────────────────────────
# CPU usage % per server
100 - (avg by(instance) (
rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)
# CPU by mode (user, system, iowait, steal)
avg by(instance, mode) (
rate(node_cpu_seconds_total{mode!="idle"}[5m])
) * 100
# Top 5 CPU-consuming containers
topk(5,
rate(container_cpu_usage_seconds_total{
name!="", image!=""
}[5m]) * 100
)
# ── Memory ─────────────────────────────────────────────────
# Memory usage %
(1 - (node_memory_MemAvailable_bytes /
node_memory_MemTotal_bytes)) * 100
# Used memory in bytes (total minus free, buffers, and cache)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
- node_memory_Buffers_bytes - node_memory_Cached_bytes
# Container memory usage vs limit
container_memory_usage_bytes{name!=""}
/ container_spec_memory_limit_bytes{name!=""} * 100
# ── Disk ───────────────────────────────────────────────────
# Disk usage % per mount
(1 - node_filesystem_avail_bytes{fstype!="tmpfs"} /
node_filesystem_size_bytes{fstype!="tmpfs"}) * 100
# Disk I/O read/write bytes per second
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Disk utilization: fraction of time the device was busy (0 to 1)
rate(node_disk_io_time_seconds_total[5m])
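# Predicted free space on / one hour from now, in GiB
# (a negative result means the disk is projected to fill within the hour)
predict_linear(
node_filesystem_avail_bytes{mountpoint="/"}[6h],
3600
) / 1024 / 1024 / 1024 # convert bytes to GiB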
# ── Network ────────────────────────────────────────────────
# Network bandwidth in/out per interface
rate(node_network_receive_bytes_total{
device!~"lo|docker.*|veth.*"
}[5m]) * 8 # convert to bits
rate(node_network_transmit_bytes_total{
device!~"lo|docker.*|veth.*"
}[5m]) * 8
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# ── Docker / Containers ────────────────────────────────────
# Running containers count
count(container_last_seen{name!="", image!=""})
# Container CPU usage %
rate(container_cpu_usage_seconds_total{
name!="", image!=""
}[5m]) * 100
# Container network traffic
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])
# Container restart count
changes(container_start_time_seconds{name!=""}[1h])
# ── System ─────────────────────────────────────────────────
# System load per CPU core
node_load1 / count by(instance)(
node_cpu_seconds_total{mode="idle"}
)
# Open file descriptors
node_filefd_allocated / node_filefd_maximum * 100
# System uptime in days
(time() - node_boot_time_seconds) / 86400
# Processes currently runnable, and processes blocked waiting on I/O
node_procs_running
node_procs_blocked
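
predict_linear, used in the disk queries above, fits a straight line through the samples in the range and extrapolates it. The Python sketch below shows the idea with an ordinary least-squares fit. It is a simplification: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time, and the sample values are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extrapolated forward."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / variance_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Invented data: free space (GiB) sampled hourly, shrinking 1 GiB per hour.
samples = [(hour * 3600, 10.0 - hour) for hour in range(7)]

in_one_hour = predict_linear(samples, 3600)
in_24_hours = predict_linear(samples, 24 * 3600)
print(round(in_one_hour, 3), round(in_24_hours, 3))
```

With 4 GiB left and 1 GiB consumed per hour, the 24-hour projection is well below zero, which is the condition the DiskWillFillIn24h rule checks.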

Step 7 — Deploy and Manage

# Start the monitoring stack
docker-compose up -d
# Check all services running
docker-compose ps
# NAME            STATUS   PORTS
# prometheus      Up       0.0.0.0:9090->9090/tcp
# grafana         Up       0.0.0.0:3000->3000/tcp
# node-exporter   Up       (host network, no port mapping shown)
# cadvisor        Up       0.0.0.0:8080->8080/tcp
# alertmanager    Up       0.0.0.0:9093->9093/tcp
# Check Prometheus targets (all should be UP)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Reload Prometheus config (without restart)
curl -X POST http://localhost:9090/-/reload
# Check Prometheus rules loaded
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# View Grafana logs (options go before the service name)
docker-compose logs -f grafana
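# Check alertmanager config is valid (check-config is an amtool command,
# so override the image's default alertmanager entrypoint)
docker run --rm \
  -v "$(pwd)/alertmanager:/config" \
  --entrypoint amtool \
  prom/alertmanager:v0.26.0 \
  check-config /config/alertmanager.yml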
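# Backup Prometheus data. Compose prefixes volume names with the project
# name ("monitoring" if that is the directory name); confirm with `docker volume ls`.
docker run --rm \
  -v monitoring_prometheus_data:/data:ro \
  -v "$(pwd)/backup:/backup" \
  alpine tar czf /backup/prometheus-$(date +%Y%m%d).tar.gz /data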
# Update stack
docker-compose pull
docker-compose up -d
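
If you would rather check target health from a script than with curl and jq, the sketch below does the same thing in Python using only the standard library. It assumes Prometheus is reachable at http://localhost:9090 as in this stack; `unhealthy_targets` and `fetch_targets` are illustrative helper names, and the sample payload is a trimmed, invented response in the shape of the /api/v1/targets API.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list:
    """Return (job, instance, last_error) for every active target that is not up."""
    results = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            results.append((labels.get("job", ""),
                            labels.get("instance", ""),
                            target.get("lastError", "")))
    return results


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets API response (requires a running Prometheus)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


# Invented sample response so the example runs without a server.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "prometheus", "instance": "localhost:9090"},
         "health": "up", "lastError": ""},
        {"labels": {"job": "cadvisor", "instance": "cadvisor:8080"},
         "health": "down", "lastError": "connection refused"},
    ]},
}

down = unhealthy_targets(sample)
print(down)
```

Against a live stack, call `unhealthy_targets(fetch_targets())`; an empty list means every target is being scraped successfully.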

Step 8 — Install Node Exporter on Bare Metal

If Prometheus runs separately from the monitored server:

# Download and install node exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --web.listen-address=:9100

[Install]
WantedBy=multi-user.target
EOF
# Create user and start service
sudo useradd -rs /bin/false node_exporter
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
# Verify
curl http://localhost:9100/metrics | head -20

Grafana Dashboard Setup

# Access Grafana
open http://localhost:3000
# Login: admin / SecurePass123
# Import community dashboards via ID:
# Node Exporter Full: ID 1860
# Docker monitoring: ID 893
# cAdvisor: ID 14282
# Prometheus stats: ID 3662
# Import via CLI: download the dashboard JSON from grafana.com,
# then post it to Grafana's import API
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o node-exporter-full.json
jq '{dashboard: ., overwrite: false, folderId: 0,
     inputs: [{name: "DS_PROMETHEUS", type: "datasource",
               pluginId: "prometheus", value: "Prometheus"}]}' \
  node-exporter-full.json |
curl -X POST -u admin:SecurePass123 \
  -H 'Content-Type: application/json' \
  -d @- http://localhost:3000/api/dashboards/import

Useful Grafana Panel Examples

Linux Server Dashboard panels:
├── CPU Usage gauge (0-100%, threshold at 80/95)
├── Memory Usage gauge (0-100%, threshold at 80/95)
├── Disk Usage per mount (bar gauge)
├── CPU Usage over time (time series, stacked by mode)
├── Memory breakdown (time series, stacked)
├── Network bandwidth (time series, in/out)
├── Disk I/O (time series, read/write)
├── System Load (time series, 1/5/15 min)
├── Top processes (table)
└── Uptime stat (stat panel)
Docker Dashboard panels:
├── Running containers (stat)
├── Container CPU top 10 (bar chart)
├── Container Memory top 10 (bar chart)
├── Container restarts (table with alert)
├── Network I/O per container (time series)
├── Container disk I/O (time series)
└── Container logs (logs panel via Loki)

Security Hardening

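# Add basic auth to Prometheus with an nginx reverse proxy in front.
# Remove the "9090:9090" mapping from the prometheus service first,
# otherwise both containers try to publish host port 9090.
nginx:
  image: nginx:alpine
  ports:
    - "9090:80"
  volumes:
    - ./nginx/prometheus.conf:/etc/nginx/conf.d/default.conf:ro
    - ./nginx/.htpasswd:/etc/nginx/.htpasswd:ro
  networks:
    - monitoring

# nginx/prometheus.conf
server {
    listen 80;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus:9090;
    }
}

# Generate password file
htpasswd -c ./nginx/.htpasswd admin

# Restrict firewall: only allow internal access
ufw allow from 10.0.0.0/8 to any port 9090
ufw allow from 10.0.0.0/8 to any port 9100
ufw allow from 10.0.0.0/8 to any port 8080
ufw allow 3000/tcp   # Grafana is the only public service
# Note: ports published by Docker are opened through Docker's own iptables
# rules and are not filtered by ufw. Bind internal services to a private
# address (for example "10.0.1.10:9090:9090") if they must not be public.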

Quick Access URLs

Grafana: http://localhost:3000 admin/SecurePass123
Prometheus: http://localhost:9090
Alertmanager: http://localhost:9093
Node Exporter: http://localhost:9100/metrics
cAdvisor: http://localhost:8080

This gives you a production-ready monitoring stack — Linux server metrics via Node Exporter, Docker container metrics via cAdvisor, intelligent alerting via Alertmanager, and beautiful dashboards via Grafana, all wired together with Prometheus as the metrics backbone.

Docker Image Optimization: Best Practices & Tips

Here’s the practical best-practice checklist for building Docker images today:

Use a small, trusted base image and pin versions instead of relying on latest. Docker recommends choosing the right base image, keeping it small, and pinning base image versions for better security and repeatability. (Docker Documentation)

Use multi-stage builds so build tools never end up in the final runtime image. This is one of Docker’s main recommendations for producing smaller, cleaner, more secure images. (Docker Documentation)

Keep the build context small with a .dockerignore file. Excluding node_modules, .git, test artifacts, local env files, and temp files speeds builds and reduces accidental leakage into the image. Docker explicitly recommends using .dockerignore. (Docker Documentation)

Design your Dockerfile to maximize cache reuse. Copy dependency files first, install dependencies, then copy the rest of the app. Since Docker images are layer-based, ordering instructions well can make rebuilds much faster. (Docker Documentation)

Do not install unnecessary packages. Keep the image focused on one service, and remove build-only tools from the final stage. Docker also recommends creating ephemeral containers and decoupling applications where possible. (Docker Documentation)

Run the app as a non-root user whenever possible. Docker’s learning materials call out that a production-ready Dockerfile should improve security by running as non-root. (Docker Documentation)

Rebuild images regularly and use fresh base layers, especially for security patches. Docker recommends rebuilding often and using flags like --pull and, when needed, --no-cache for clean rebuilds. Also build and test images in CI. (Docker Documentation)

A solid production pattern looks like this:

# syntax=docker/dockerfile:1
FROM node:22-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:22-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package*.json ./
USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]

A matching .dockerignore should usually include:

node_modules
npm-debug.log
.git
.gitignore
Dockerfile*
docker-compose*
.env
coverage
dist
tmp

For most teams, the simplest rule set is:

  1. Small pinned base image
  2. Multi-stage build
  3. .dockerignore
  4. Cache-friendly Dockerfile order
  5. Non-root runtime
  6. Rebuild in CI and scan often (Docker Documentation)

Here is a production-ready Docker image pattern you can reuse for most apps.

Good Dockerfile pattern

# syntax=docker/dockerfile:1
# 1) Install production-only dependencies in a separate stage
FROM node:22-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
# 2) Build the app
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# 3) Runtime image
FROM node:22-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Create/use non-root runtime
USER node
# Copy only what is needed at runtime
COPY --chown=node:node --from=deps /app/node_modules ./node_modules
COPY --chown=node:node --from=build /app/dist ./dist
COPY --chown=node:node package*.json ./
EXPOSE 3000
CMD ["node", "dist/server.js"]

Matching .dockerignore

node_modules
npm-debug.log
.git
.gitignore
Dockerfile*
docker-compose*
.env
.env.*
coverage
dist
tmp
.vscode
.idea

Why this is a strong default

Docker’s current guidance recommends:

  • multi-stage builds to keep the final image smaller and cleaner (Docker Documentation)
  • using a .dockerignore file to keep the build context small and avoid sending unnecessary files to the builder (Docker Documentation)
  • structuring the Dockerfile for better cache reuse, like copying dependency manifests before app source (Docker Documentation)
  • running the app as a non-root user in production images (Docker Documentation)
  • avoiding secrets in ARG or ENV; Docker recommends using secret mounts instead because build args and env vars can be exposed in image metadata or the final image (Docker Documentation)

Even better build command

docker build --pull -t myapp:latest .

--pull helps refresh the base image layers so you don’t keep building on stale images, which aligns with Docker’s recommendation to rebuild often and keep base layers fresh. (Docker Documentation)

7 rules to follow every time

  1. Pin the base image
FROM node:22.14-alpine
  2. Do not use latest in production
  3. Copy dependency files first
COPY package*.json ./
RUN npm ci
COPY . .
  4. Only copy runtime artifacts into the final stage
  5. Run as non-root
  6. Keep secrets out of the Dockerfile
  7. Keep one main responsibility per container when possible (Docker Documentation)

Common mistakes

Bad:

COPY . .
RUN npm install

Better:

COPY package*.json ./
RUN npm ci
COPY . .

Bad:

FROM node:latest

Better:

FROM node:22-alpine

Bad:

ENV API_KEY=secret123

Better: pass secrets at runtime or use Docker build secrets. (Docker Documentation)
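
The three mistakes above are mechanical enough to catch in CI. The Python sketch below is a deliberately small, hypothetical checker rather than a replacement for a real Dockerfile linter: it only knows about unpinned base images, `COPY . .` before an npm install, and secret-looking ENV or ARG names.

```python
import re

FROM_RE = re.compile(r"FROM\s+(?:--\S+\s+)*(\S+)(?:\s+AS\s+(\S+))?", re.IGNORECASE)
COPY_ALL_RE = re.compile(r"COPY\s+(?:--\S+\s+)*\.\s+\.$", re.IGNORECASE)
INSTALL_RE = re.compile(r"RUN\s+npm\s+(?:ci|install)\b", re.IGNORECASE)
SECRET_RE = re.compile(
    r"(?:ENV|ARG)\s+\w*(?:KEY|SECRET|TOKEN|PASSWORD)\w*(?:[=\s]|$)", re.IGNORECASE)


def lint_dockerfile(text: str) -> list:
    """Return a list of warnings for a few common Dockerfile mistakes."""
    problems = []
    stage_names = set()
    copied_everything = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        from_match = FROM_RE.match(line)
        if from_match:
            image, alias = from_match.group(1), from_match.group(2)
            copied_everything = False  # each stage starts with a clean slate
            if image not in stage_names and image != "scratch":
                name = image.rsplit("/", 1)[-1]  # ignore registry host:port
                if "@" not in image and (":" not in name or name.endswith(":latest")):
                    problems.append(f"unpinned base image: {image}")
            if alias:
                stage_names.add(alias)
        elif COPY_ALL_RE.match(line):
            copied_everything = True
        elif INSTALL_RE.match(line) and copied_everything:
            problems.append("dependencies installed after 'COPY . .' (poor cache reuse)")
        elif SECRET_RE.match(line):
            problems.append(f"possible secret baked into the image: {line.split('=')[0]}")
    return problems


BAD = "FROM node:latest\nCOPY . .\nRUN npm install\nENV API_KEY=secret123\n"
GOOD = ("FROM node:22-alpine AS build\nCOPY package*.json ./\n"
        "RUN npm ci\nCOPY . .\nFROM build\n")

bad_report = lint_dockerfile(BAD)
good_report = lint_dockerfile(GOOD)
print(bad_report)
print(good_report)
```

The bad example triggers all three checks, while the cache-friendly, pinned version passes cleanly.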

If your app does not need Node at runtime

For frontend apps like React/Vite/Angular/Vue, it is often better to build in Node and serve with Nginx in the final stage, which Docker’s current framework guides also demonstrate for modern frontend apps. (Docker Documentation)

Best-practice summary

Use:

  • small pinned base image
  • multi-stage build
  • .dockerignore
  • cache-friendly layer order
  • non-root runtime
  • no secrets in ARG or ENV
  • regular rebuilds with fresh base layers (Docker Documentation)