Safe vs. Risky: Node Exporter Docker Commands Compared

May 26, 2026 | techhadoop | docker, monitoring

What is the difference between these two commands?

docker run -d \
--name=node-exporter \
--restart=always \
--net="host" \
--pid="host" \
-v "/:/host:ro" \
--log-driver json-file \
--log-opt max-size=10m \
--log-opt max-file=45 \
quay.io/prometheus/node-exporter:latest \
--path.rootfs=/host

docker run -d \
--name=node-exporter \
--restart=always \
--net="host" \
--pid="host" \
-v "/:/host:ro" quay.io/prometheus/node-exporter:latest --path.rootfs=/host

Both commands are designed to spin up the Prometheus Node Exporter to monitor your host machine’s physical hardware. They look nearly identical, but the first command is production-ready, while the second has a hidden trap that can eventually fill the disk and crash your server.

Here is the exact breakdown of the two major differences between them.

Difference 1: Log Rotation (The Crucial Difference)

The primary difference lies in these three lines present only in the first command:

Bash

			
--log-driver json-file \
--log-opt max-size=10m \
--log-opt max-file=45 \

The First Command (Safe & Controlled)

This explicitly configures Docker’s logging mechanisms. It tells Docker:

Treat container logs as standard JSON files.
max-size=10m: Once Node Exporter’s log file reaches 10 Megabytes, roll it over and start a new one.
max-file=45: Keep at most 45 log files in total (the active file plus the rotated ones); the oldest is deleted when the limit is reached.

This sets a strict upper bound on how much space Node Exporter’s logs can take up on your host: $45 \times 10\text{ MB} = 450\text{ MB}$ maximum.
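
The arithmetic above can be checked with a small Python sketch. It is only an illustration: the helper names are hypothetical, and treating the "m" suffix as 1024² bytes is an assumption about how Docker parses the size.

```python
# Sketch: worst-case disk usage of Docker's json-file log rotation.
# Assumption: size suffixes are binary (k = 1024, m = 1024**2, g = 1024**3).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a Docker-style size such as '10m' into bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # plain number of bytes


def max_log_bytes(max_size: str, max_file: int) -> int:
    """Upper bound: every one of the max_file files is completely full."""
    return parse_size(max_size) * max_file


if __name__ == "__main__":
    cap = max_log_bytes("10m", 45)
    print(f"Worst case for one container: {cap / 1024**2:.0f} MiB")
```

The same function shows how quickly the cap shrinks with a smaller max-file value, which is useful when many containers share one disk.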

The Second Command (The Hidden Danger)

Because it omits these lines, it uses your system’s default Docker logging behavior. By default, Docker does not limit log file sizes.

If Node Exporter encounters a continuous warning or error (for example, if a specific disk partition becomes unreadable), it will spam logs rapidly. Without a cap, that single log file will grow to 10 GB, 50 GB, or more, until it completely consumes your host’s hard drive and locks up the server.

Difference 2: Readability and Syntax

The second difference comes down to formatting and readability.

First Command: Puts every option on its own line, each ending with a backslash (\). The backslash tells the shell, “This command isn’t finished yet, continue reading on the next line.” This makes the script highly readable and easy to modify.
Second Command: Uses the same continuation backslashes, but crams the volume mount, the image name, and the --path.rootfs argument onto one final line. It still runs, but it is harder to scan and edit. What does break a copy and paste is typographic conversion: if a blog or word processor has turned -- into a long dash or straight quotes into curly quotes, Docker will reject the arguments.

Summary Matrix

Feature	Command 1 (With Log Options)	Command 2 (Without Log Options)
Maximum Disk Used by Logs	Capped at 450 MB	Infinite (Will grow until disk is full)
Log Files Auto-Cleaned?	Yes, old logs are purged automatically.	No, requires manual intervention or a global Docker daemon config.
Readability	One option per line, continued with `\`.	Volume, image, and argument share one line; valid, but harder to edit.

Which one should you use?

Always use the first command. When you are managing 20+ production servers, you cannot afford to have a single chatty container quietly fill up a local hard drive. The first command keeps log storage predictable.

Building a Grafana Dashboard for Multi-Host Metrics

May 26, 2026 | techhadoop | docker, linux, monitoring, Prometheus

This post describes a centralized monitoring architecture: one central Grafana and Prometheus server that pulls metrics from 20 lightweight cAdvisor agents running across your network.

Here is the cleanest way to architect and deploy this.

Centralized Architecture Overview

			
[ Central Monitor Server ]               [ 20x Remote Docker Hosts ]
┌────────────────────────┐               ┌─────────────────────────┐
│        Grafana         │               │  Docker Host 01         │
│           ▲            │               │  └─ cAdvisor (Port 8080)│
│           │ (Queries)  │               └─────────────────────────┘
│       Prometheus       │◄─────────────┐             ▲
└────────────────────────┘  (Scrapes    │             │
                             over HTTP) ├─────────────┤
                                        │             ▼
                                        │┌─────────────────────────┐
                                        ││  Docker Host 20         │
                                        └┤  └─ cAdvisor (Port 8080)│
                                         └─────────────────────────┘

		

The Agents (Remote Hosts): Every one of your 20 servers runs just a single, lightweight cAdvisor container. They expose their metrics on port 8080.
The Core (Central Host): One designated management server runs Prometheus and Grafana. Prometheus is configured to reach out over your network to all 20 servers to grab their metrics.

Step 1: Deploy cAdvisor on all 20 Remote Hosts

You need to spin up cAdvisor on every Docker server. If you use an automation tool like Ansible, this is a one-click playbook. Otherwise, run this docker run command on each machine:

Bash

			
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  --privileged \
  --device=/dev/kmsg \
  --restart=always \
  gcr.io/cadvisor/cadvisor:latest

		

Security Note: By publishing 8080:8080, cAdvisor metrics are public to anyone who can reach that IP. Ensure your internal firewall/security groups only allow traffic to port 8080 from your Central Prometheus Server IP.

Step 2: Configure the Central Prometheus Server

On your central monitoring server, create a prometheus.yml file. Instead of pointing to localhost, you will list your 20 servers using Prometheus targets and labels. Labels are crucial here because they allow you to filter your Grafana dashboard by specific servers.

YAML

			
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'docker-swarm-nodes'
    static_configs:
      - targets: ['192.168.1.50:8080']
        labels:
          instance: 'prod-web-01'
          env: 'production'
      - targets: ['192.168.1.51:8080']
        labels:
          instance: 'prod-db-01'
          env: 'production'
      - targets: ['192.168.1.70:8080']
        labels:
          instance: 'stage-app-01'
          env: 'staging'
          
      # ... add the remaining 17 servers here

		

Start Prometheus and Grafana on this central machine using a simplified version of the Docker Compose template from earlier (removing the local cAdvisor block from it).
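
Typing 20 target entries by hand is error-prone, so the static_configs list from Step 2 could be generated from a small inventory. The Python sketch below is one possible approach, not part of the original setup; the HOSTS list and the render_static_configs helper are hypothetical and reuse the three example servers shown above.

```python
# Hypothetical inventory: (address, instance label, env label).
HOSTS = [
    ("192.168.1.50", "prod-web-01", "production"),
    ("192.168.1.51", "prod-db-01", "production"),
    ("192.168.1.70", "stage-app-01", "staging"),
]


def render_static_configs(hosts, port=8080, indent="      "):
    """Return the YAML lines that belong under `static_configs:`."""
    lines = []
    for address, instance, env in hosts:
        lines.append(f"{indent}- targets: ['{address}:{port}']")
        lines.append(f"{indent}  labels:")
        lines.append(f"{indent}    instance: '{instance}'")
        lines.append(f"{indent}    env: '{env}'")
    return "\n".join(lines)


if __name__ == "__main__":
    # Paste the output below the job's `static_configs:` key.
    print(render_static_configs(HOSTS))
```

Extending HOSTS to all 20 servers keeps the addresses and labels in one place, and the same list can later feed an Ansible inventory.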

Step 3: Multi-Host Grafana Dashboard

Once Prometheus is scraping all 20 hosts, you need a Grafana dashboard that lets you switch between them smoothly.

Go to Grafana -> Dashboards -> Import.
Use Dashboard ID: 14282 or 10619 (both are heavily optimized for multi-host setups).
These dashboards automatically generate a dropdown menu at the top left labeled “Instance” or “Host” based on the labels you defined in your prometheus.yml.

You can now view your entire fleet’s aggregated resource consumption, or drill down into a specific container running on prod-db-01.

Pro-Tips for Managing 20+ Hosts

Node Exporter: cAdvisor only monitors container metrics. If you want to monitor the host Linux OS itself (host disk space, total RAM usage, bare-metal CPU temperature), deploy Prometheus Node Exporter alongside cAdvisor on all 20 nodes, exposing it on port 9100.
Scale Warning: Scraping 20 hosts every 15 seconds will generate a decent chunk of data. Ensure your Central Prometheus server has a reasonable storage retention policy set (e.g., --storage.tsdb.retention.time=15d to keep data for 15 days) so it doesn’t quietly fill up the server’s hard drive.
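
To put a rough number on the scale warning, the Prometheus storage documentation suggests estimating disk space as retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The Python sketch below applies that rule; the figure of 500 series per host is purely an assumption for illustration, so measure your own fleet before trusting the result.

```python
def estimate_tsdb_bytes(hosts, series_per_host, scrape_interval_s,
                        retention_days, bytes_per_sample=1.5):
    """Rough Prometheus disk estimate: retention * ingest rate * sample size."""
    retention_seconds = retention_days * 24 * 60 * 60
    total_samples = hosts * series_per_host * retention_seconds / scrape_interval_s
    return total_samples * bytes_per_sample


if __name__ == "__main__":
    # Assumed values: 20 hosts, 500 series each, 15 s scrape interval, 15 days.
    size = estimate_tsdb_bytes(hosts=20, series_per_host=500,
                               scrape_interval_s=15, retention_days=15)
    print(f"Estimated TSDB size: {size / 1024**3:.2f} GiB")
```

The estimate scales linearly with every input, so doubling the retention or halving the scrape interval doubles the disk requirement.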

Monitoring Docker with Grafana and cAdvisor

May 26, 2026 | May 27, 2026 | techhadoop | docker, linux, monitoring

Combining Grafana and cAdvisor (Container Advisor) is the standard open-source recipe for monitoring Docker and Kubernetes container metrics (like CPU, memory, network, and disk usage).

Because cAdvisor only keeps a tiny buffer of real-time data in memory, you need a time-series database (almost always Prometheus, optionally fed by a collector such as Grafana Alloy) to scrape that data and make it available to Grafana for visualization.

Here is a breakdown of how the architecture works, how to set it up, and how to get a dashboard running.

The Monitoring Pipeline

cAdvisor: Sits on the host machine, hooks into the Linux kernel cgroups, and collects resource usage from all running containers. It exposes these raw numbers at a /metrics endpoint.
Prometheus: Periodically “scrapes” (pulls) the data from cAdvisor’s /metrics endpoint and stores it as historical time-series data.
Grafana: Queries Prometheus using PromQL and plots the data onto clean, interactive dashboards.
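
The last hop of this pipeline can be sketched in Python. Grafana talks to Prometheus over its HTTP API, and an instant query goes to the /api/v1/query endpoint. The snippet below builds such a request; the base URL and the job label are placeholders, and run_query is only a minimal illustration with no error handling.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_query(base_url: str, promql: str):
    """Send the query and return the list of result series (needs a live server)."""
    with urlopen(instant_query_url(base_url, promql), timeout=10) as response:
        payload = json.load(response)
    return payload["data"]["result"]


if __name__ == "__main__":
    # Placeholder address and label value; adjust to your own setup.
    print(instant_query_url("http://localhost:9090", 'up{job="cadvisor"}'))
```

Only instant_query_url runs without a server; call run_query once Prometheus is up to see the same data Grafana plots.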

Quick Setup: Docker Compose Example

The easiest way to spin up cAdvisor, Prometheus, and Grafana all at once to monitor your local Docker containers is by using a docker-compose.yml file.

YAML

			
version: '3.8'
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    restart: unless-stopped

		

The Prometheus Config (`prometheus.yml`)

To tell Prometheus to scrape your cAdvisor container, create a prometheus.yml file in the same directory:

YAML

			
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

		

Run docker compose up -d, and your basic infrastructure is live!

Visualizing with Grafana Dashboards

Instead of building your container monitoring dashboards from scratch, you can import highly optimized community templates.

Recommended Pre-built Dashboards

Log into your Grafana instance (usually http://localhost:3000 — default credentials are admin / admin).
Add Prometheus as your data source (Connections > Data sources > Add data source).
Go to Dashboards > New > Import.
Paste one of these popular community Dashboard IDs:
- 19908 (cAdvisor Docker Insights – clean, official-feeling modern dashboard)
- 14282 (Cadvisor Exporter – great for focused container-by-container metrics)
- 19792 (Advanced cAdvisor dashboard with support for Docker Compose projects)

Key cAdvisor Metrics to Know

When creating your own panels or alerts, look out for these fundamental metric names:

CPU Usage: container_cpu_usage_seconds_total (a counter, usually wrapped in rate(); for example, sum(rate(container_cpu_usage_seconds_total[5m])) by (name) * 100 gives CPU usage as a percentage of one core)
Memory Usage: container_memory_usage_bytes (total memory in use, including page cache; container_memory_working_set_bytes is closer to real memory pressure)
Network Traffic: container_network_receive_bytes_total and container_network_transmit_bytes_total
Disk I/O: container_fs_reads_bytes_total and container_fs_writes_bytes_total
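
Most of these metrics are counters that only ever increase, so the interesting number is how fast they grow. As a simplified Python illustration of what rate() computes for the CPU counter (the real function also handles counter resets and extrapolates to the window edges, which this sketch ignores), with made-up sample values:

```python
def per_second_rate(first, last):
    """Average per-second increase between two (unix_timestamp, counter) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    # Made-up samples: the container used 150 CPU-seconds over a 300 s window.
    cores = per_second_rate((0, 120.0), (300, 270.0))
    print(f"{cores} cores, i.e. {cores * 100:.0f}% of one core")
```

A result of 0.5 means the container kept half a core busy on average over the window; multiplying by 100 turns it into the percentage shown on most dashboards.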

Understanding Log Retention for Prometheus and Docker

May 8, 2026 | techhadoop | docker, linux, monitoring

In my setup, “retention” actually applies to two different things: Prometheus metrics (the graphs) and Ubuntu system logs (the text files).

By default, they have very different lifespans.

1. Prometheus Metrics (The Graphs)

If you didn’t specify a retention time when you ran your Prometheus Docker container, it uses the default.

Default Duration: 15 days.
What happens after? Prometheus uses a “First-In-First-Out” system. Once data hits day 16, the data from day 1 is deleted to make room.
How to change it: If you want 30 days of history to show your Director month-over-month trends, you need to add this flag to your docker run command for Prometheus: --storage.tsdb.retention.time=30d

2. Ubuntu System Logs (`/var/log`)

This is handled by a service called logrotate. It manages things like your mail.log, syslog, and auth.log.

Default Duration: Usually 4 weeks (28 days).
How it works: It keeps 4 “rotated” files. Every Sunday, it compresses the current log and deletes the oldest one.
How to check your specific settings: run cat /etc/logrotate.d/rsyslog and look for the number next to rotate. If it says 4 and the interval is weekly, you have 28 days.
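
The rotate × interval arithmetic can be automated with a few lines of Python. This is a rough sketch: the SAMPLE text is a hypothetical logrotate stanza rather than the contents of any particular server, and only the daily, weekly, and monthly keywords are recognized.

```python
import re

# Approximate length of each logrotate interval in days.
PERIOD_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}


def retention_days(config_text: str) -> int:
    """Multiply the `rotate N` count by the rotation interval."""
    rotate = int(re.search(r"^\s*rotate\s+(\d+)", config_text, re.M).group(1))
    period = next(p for p in PERIOD_DAYS
                  if re.search(rf"^\s*{p}\b", config_text, re.M))
    return rotate * PERIOD_DAYS[period]


# Hypothetical stanza in the style of /etc/logrotate.d/rsyslog.
SAMPLE = """
/var/log/syslog
{
    rotate 4
    weekly
    missingok
    compress
}
"""

if __name__ == "__main__":
    print(retention_days(SAMPLE), "days of system logs are kept")
```

On a real host the text would come from reading the file itself, for example with open("/etc/logrotate.d/rsyslog").read().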

3. Docker Container Logs

This is the danger zone. By default, Docker container logs (like the ones for cadvisor or node-exporter) have no limit.

If a container starts throwing thousands of errors, the log file will grow until it fills your entire hard drive. Since we are doing a Pilot Group, you should verify your Docker logging driver.

The “Safe” way to run your containers:

Add these flags to your docker run commands to ensure you only keep 3 files of 10MB each:

Bash

--log-opt max-size=10m --log-opt max-file=3
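
The same limits can also be set once for the whole host in /etc/docker/daemon.json, which saves repeating the flags on every docker run. The Python sketch below only prints a candidate file; the helper name is hypothetical. Docker expects the log-opts values as strings, and the setting applies to containers created after the daemon is restarted, not to existing ones.

```python
import json


def daemon_log_config(max_size="10m", max_file=3):
    """Build a daemon.json fragment that caps json-file logs for new containers."""
    return {
        "log-driver": "json-file",
        # log-opts values must be strings, even for numbers.
        "log-opts": {"max-size": max_size, "max-file": str(max_file)},
    }


if __name__ == "__main__":
    # Review the output, then merge it into /etc/docker/daemon.json by hand.
    print(json.dumps(daemon_log_config(), indent=2))
```

Printing rather than writing the file is deliberate: an existing daemon.json may already contain other settings that should not be overwritten.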

Summary Table

Data Type	Default Retention	Controlled By
Prometheus Data	15 Days	`--storage.tsdb.retention.time`
System Logs	~28 Days	`/etc/logrotate.conf`
Docker Logs	Unlimited (Until disk is full)	Docker Log Driver

Recommendation for your 20 Servers

For an Executive Director’s report, 15 days is usually too short. Most admins prefer 90 days for Prometheus so they can compare “This Quarter vs. Last Quarter.”

To check how much disk space your current Prometheus data is taking:

Bash

du -sh /var/lib/docker/volumes/<prometheus_volume_name>/_data

To update your Prometheus retention to 90 days, you need to restart the container with a specific flag. Since you are likely running this via a docker run command, we will stop the old one and start the new one with the updated storage policy.

1. Update Prometheus to 90-Day Retention

Run these commands on your Central Monitoring Server:

Bash

			
# 1. Stop and remove the existing Prometheus container
docker stop prometheus
docker rm prometheus
# 2. Start it again with the 90-day retention flag
docker run -d \
  --name=prometheus \
  --restart=always \
  --publish=9090:9090 \
  -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=90d

		

2. Why 90 Days? (The Strategy)

Setting it to 90 days is the “sweet spot” for several reasons:

Quarterly Reporting: You can show your Executive Director performance trends across a full 3-month business quarter.
Capacity Planning: 90 days of data gives trend forecasts (the predict_linear function we discussed, which fits a straight line to past samples) a longer history to work with, which generally makes long-range estimates more reliable.
Storage Management: Prometheus is very efficient. For 20 servers, 90 days of metrics will likely only consume about 5GB to 10GB of disk space.
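
For readers curious about what a linear forecast does with that history, here is a small Python sketch of the idea behind predict_linear: fit a least-squares line through the samples and extend it into the future. It illustrates the technique only and is not Prometheus's exact implementation; the function name and the sample values are made up.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) pairs and extrapolate."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have different timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    target_t = samples[-1][0] + seconds_ahead
    return mean_v + slope * (target_t - mean_t)


if __name__ == "__main__":
    # Made-up disk usage in GB, sampled once a minute.
    history = [(0, 10.0), (60, 20.0), (120, 30.0)]
    print(predict_linear_sketch(history, seconds_ahead=60))
```

With only a few days of data, one noisy spike can tilt the line; a longer window smooths that out, which is the practical argument for keeping more history.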

3. Verify the Change

Once the container is back up, you can verify that the new setting is active:

Open your browser to http://<CENTRAL_IP>:9090/flags.
This is the Command-Line Flags page, also reachable from the Status menu.
Confirm that --storage.tsdb.retention.time=90d is listed there.

4. Pro-Tip: Disk Space Alert

Since you are now keeping 6 times more data than the default, you should add a simple alert in Grafana to monitor the Central Server’s own disk space.

The “Golden Rule” of Monitoring: The monitoring server must be the most stable server in the fleet. If its disk fills up because of long retention, you lose visibility into all 20 other servers.

Next Steps for your Project

Now that your data is safe for the long term:

Check the logs: docker logs prometheus to ensure there are no “permission denied” errors on the data folder.
Snapshot Check: Since you are modifying the central server, it wouldn’t hurt to take a quick vSphere snapshot of this one too, just like you did for the pilot nodes.

Setup Node Exporter for Centralized Monitoring

May 7, 2026 | May 28, 2026 | techhadoop | docker, linux

1. Run this on each of the 20 Linux servers

I will provide an Ansible playbook in the next post (when you have multiple servers, automation is key).

			
docker run -d \
  --name=node-exporter \
  --restart=always \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

		

I would use :ro,rslave instead of only :ro, because the official Docker example for node_exporter bind-mounts the host root with rslave propagation so the container can correctly see host mount points. Node Exporter is meant to monitor the host system, not just the container. (GitHub)

Check one server:

curl http://localhost:9100/metrics

From central Prometheus server:

curl http://SERVER_IP:9100/metrics
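
The output of those curl commands is plain text in the Prometheus exposition format: comment lines starting with #, then one sample per line. As a rough Python sketch of how that text can be read (the SAMPLE values are invented, and optional trailing timestamps, which the format allows, are not handled):

```python
def parse_metrics(text):
    """Map 'name{labels}' to a float for every sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Invented excerpt in the style of Node Exporter output.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 5.0e+09
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, "=", value)
```

This is handy for a quick sanity check on a single host before Prometheus is wired up; for anything beyond that, let Prometheus do the parsing.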

2. Open firewall only from Prometheus server

On each Linux host, allow port 9100 only from your central Prometheus server:

sudo ufw allow from PROMETHEUS_SERVER_IP to any port 9100 proto tcp

Do not expose 9100 publicly.

3. Central Prometheus config

On your central monitoring server, Prometheus scrapes all 20 Node Exporters.

prometheus.yml:

			
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "linux_servers"
    static_configs:
      - targets:
          - "10.0.1.11:9100"
          - "10.0.1.12:9100"
          - "10.0.1.13:9100"
          - "10.0.1.14:9100"
          - "10.0.1.15:9100"
          # add all 20 servers here

		

Prometheus uses scrape_configs and targets to pull metrics from exporters. (Prometheus)

Restart Prometheus:

docker restart prometheus

4. Add Prometheus to Grafana

In Grafana:

			
Connections → Data sources → Prometheus
URL: http://PROMETHEUS_SERVER_IP:9090
Save & Test

Then import dashboard:

Dashboard ID: 1860

That is the popular Node Exporter Full dashboard.

Final architecture

			
20 Linux Servers
   ↓ node-exporter :9100
Central Prometheus
   ↓
Grafana Dashboard

		

Important: Node Exporter does not send data to Grafana directly.
It exposes metrics, Prometheus pulls them, and Grafana visualizes Prometheus data.

cAdvisor: Your Guide to Container Monitoring

May 7, 2026 | techhadoop | docker, linux, Uncategorized

cAdvisor Explained

What is cAdvisor?

cAdvisor (Container Advisor) is an open-source tool by Google that collects, aggregates, and exports resource usage and performance metrics from running containers. It gives you deep visibility into what every container on your host is doing.

			
┌─────────────────────────────────────────────────────────────┐
│                      LINUX HOST                             │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │Container │  │Container │  │Container │  │Container │   │
│  │  nginx   │  │  api     │  │ postgres │  │  redis   │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘   │
│       │              │              │              │         │
│       └──────────────┴──────────────┴──────────────┘        │
│                              │                              │
│                    ┌─────────▼─────────┐                    │
│                    │    cAdvisor        │                    │
│                    │                   │                    │
│                    │ reads cgroups     │                    │
│                    │ reads /proc       │                    │
│                    │ reads /sys        │                    │
│                    │ reads Docker API  │                    │
│                    └─────────┬─────────┘                    │
│                              │ exposes                      │
│                    ┌─────────▼─────────┐                    │
│                    │  :8080/metrics    │                    │
│                    │  (Prometheus fmt) │                    │
│                    └───────────────────┘                    │
└─────────────────────────────────────────────────────────────┘

		

How cAdvisor Works

			
Container Runtime (Docker / containerd)
          │
          │ Docker API / containerd API
          ▼
┌─────────────────────────────────────┐
│           cAdvisor                  │
│                                     │
│  ┌─────────────────────────────┐    │
│  │    Container Discovery      │    │
│  │  polls Docker API every 1s  │    │
│  │  detects start/stop         │    │
│  └──────────────┬──────────────┘    │
│                 │                   │
│  ┌──────────────▼──────────────┐    │
│  │    Metrics Collection       │    │
│  │  /sys/fs/cgroup   (limits)  │    │
│  │  /proc/<pid>/     (usage)   │    │
│  │  /sys/class/net/  (network) │    │
│  └──────────────┬──────────────┘    │
│                 │                   │
│  ┌──────────────▼──────────────┐    │
│  │    In-memory Storage        │    │
│  │  keeps ~2 min of history    │    │
│  └──────────────┬──────────────┘    │
│                 │                   │
│  ┌──────────────▼──────────────┐    │
│  │    Export Endpoints         │    │
│  │  /metrics  (Prometheus)     │    │
│  │  /api/v1.3 (REST API)       │    │
│  │  /containers (Web UI)       │    │
└──┴─────────────────────────────┴────┘

		

Deploy cAdvisor

Standalone Docker

			
# docker-compose.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    # Required volume mounts: read host filesystem
    volumes:
      - /:/rootfs:ro                          # root filesystem
      - /var/run:/var/run:ro                  # Docker socket dir
      - /var/run/docker.sock:/var/run/docker.sock:ro  # Docker API
      - /sys:/sys:ro                          # kernel/cgroups info
      - /var/lib/docker:/var/lib/docker:ro    # Docker data dir
      - /dev/disk:/dev/disk:ro                # disk info
    # Required for accessing kernel metrics
    privileged: true
    devices:
      - /dev/kmsg                             # kernel message buffer
    # Performance tuning
    command:
      - '--housekeeping_interval=10s'         # collect every 10s
      - '--max_housekeeping_interval=15s'
      - '--event_storage_event_limit=default=0'
      - '--event_storage_age_limit=default=0'
      # disk and diskIO are disabled below; remove them if you need container_fs_* metrics
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory,cpu_topology,resctrl'
      - '--docker_only=true'                  # only Docker containers
      - '--store_container_labels=false'      # do not export container_label_* labels

		

Kubernetes DaemonSet

			
# cadvisor runs on every node as a DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: cadvisor
        image: gcr.io/cadvisor/cadvisor:v0.47.2
        ports:
        - containerPort: 8080
          name: http
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
          readOnly: true
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: dev-disk
          mountPath: /dev/disk
          readOnly: true
        securityContext:
          privileged: true
        resources:
          requests:
            memory: 200Mi
            cpu: 150m
          limits:
            memory: 400Mi
            cpu: 300m
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
      - name: dev-disk
        hostPath:
          path: /dev/disk

		

cAdvisor Web UI

Access at http://localhost:8080:

			
http://localhost:8080/containers/   → all containers overview
http://localhost:8080/docker/       → Docker-specific view
http://localhost:8080/metrics       → Prometheus metrics endpoint
Container detail page shows:
├── Isolation (CPU/memory limits set)
├── Usage (real-time CPU/memory charts)
├── Processes (running inside container)
└── Subcontainers (if applicable)

		

Key Metrics Exposed

cAdvisor exposes hundreds of metrics — here are the most important:

CPU Metrics

			
# ── Total CPU usage (all cores) ──────────────────────────────
# CPU seconds used — rate gives usage per second
container_cpu_usage_seconds_total{
  name="api",
  cpu="total"
}
# CPU usage % (actual percentage of one core)
rate(container_cpu_usage_seconds_total{
  name="api"
}[5m]) * 100
# CPU throttled time — how long container was throttled
container_cpu_cfs_throttled_seconds_total
# CPU throttle periods — how often throttled
container_cpu_cfs_throttled_periods_total
# CPU limit (from docker run --cpus)
container_spec_cpu_quota       # microseconds
container_spec_cpu_period      # period in microseconds
# CPU limit in cores
container_spec_cpu_quota / container_spec_cpu_period
# CPU usage % relative to limit
rate(container_cpu_usage_seconds_total{name="api"}[5m])
/ (container_spec_cpu_quota{name="api"}
   / container_spec_cpu_period{name="api"})
* 100

		

Memory Metrics

			
# ── Memory usage ─────────────────────────────────────────────
# Current memory usage (includes cache)
container_memory_usage_bytes{name="api"}
# Working set memory (excludes reclaimable cache)
# — best metric for actual memory pressure
container_memory_working_set_bytes{name="api"}
# RSS memory (resident set size — actual RAM used by app)
container_memory_rss{name="api"}
# Page cache (filesystem cache — reclaimable)
container_memory_cache{name="api"}
# Memory limit set on container
container_spec_memory_limit_bytes{name="api"}
# Memory usage % relative to limit
container_memory_working_set_bytes{name="api"}
/ container_spec_memory_limit_bytes{name="api"}
* 100
# Memory page faults (minor, no disk I/O)
container_memory_failures_total{
  name="api",
  failure_type="pgfault",
  scope="container"
}
# Memory page faults (major, requires disk read)
container_memory_failures_total{
  name="api",
  failure_type="pgmajfault",
  scope="container"
}

		

Network Metrics

			
# ── Network I/O ──────────────────────────────────────────────
# Bytes received per second
rate(container_network_receive_bytes_total{
  name="api"
}[5m])
# Bytes transmitted per second
rate(container_network_transmit_bytes_total{
  name="api"
}[5m])
# Packets received per second
rate(container_network_receive_packets_total{
  name="api"
}[5m])
# Packets transmitted per second
rate(container_network_transmit_packets_total{
  name="api"
}[5m])
# Receive errors
rate(container_network_receive_errors_total{
  name="api"
}[5m])
# Transmit errors
rate(container_network_transmit_errors_total{
  name="api"
}[5m])
# Dropped packets received
rate(container_network_receive_packets_dropped_total{
  name="api"
}[5m])

		

Disk / Filesystem Metrics

			
# ── Disk I/O ─────────────────────────────────────────────────
# Bytes read from disk per second
rate(container_fs_reads_bytes_total{
  name="api"
}[5m])
# Bytes written to disk per second
rate(container_fs_writes_bytes_total{
  name="api"
}[5m])
# Read operations per second (IOPS)
rate(container_fs_reads_total{
  name="api"
}[5m])
# Write operations per second (IOPS)
rate(container_fs_writes_total{
  name="api"
}[5m])
# Filesystem space used by container
container_fs_usage_bytes{
  name="api"
}
# Filesystem space limit
container_fs_limit_bytes{
  name="api"
}

		

Container Lifecycle Metrics

			
# ── Container state ──────────────────────────────────────────
# Container start time (unix timestamp)
container_start_time_seconds{name="api"}
# Container uptime in seconds
time() - container_start_time_seconds{name="api"}
# Last time container was seen alive
container_last_seen{name="api"}
# Detect container restarts (changes in start time)
changes(container_start_time_seconds{name="api"}[1h])

		

Important Metric Labels

cAdvisor adds rich labels to every metric:

			
container_cpu_usage_seconds_total{
  id="/docker/abc123",           # container ID path
  image="nginx:latest",          # image name
  name="my-nginx",               # container name
  container_label_com_docker_compose_project="myapp",
  container_label_com_docker_compose_service="nginx",
  container_label_com_docker_compose_version="2.0",
  cpu="total"
}

		

Label	Value example	Use
`name`	`my-nginx`	Filter by container name
`image`	`nginx:latest`	Filter by image
`id`	`/docker/abc123`	Unique container ID
`container_label_*`	compose project/service	Filter by compose labels
`interface`	`eth0`	Network interface
`device`	`/dev/sda`	Disk device

Prometheus Scrape Config for cAdvisor

			
# prometheus.yml
scrape_configs:
- job_name: 'cadvisor'
  static_configs:
  - targets: ['cadvisor:8080']
  # Drop metrics we don't need (reduce cardinality)
  metric_relabel_configs:
  # Drop pause containers (k8s infrastructure)
  - source_labels: [image]
    regex: 'k8s.gcr.io/pause.*'
    action: drop
  # Drop empty container names
  - source_labels: [name]
    regex: ''
    action: drop
  # Drop high-cardinality metrics not needed
  - source_labels: [__name__]
    regex: 'container_tasks_state|container_memory_failures_total'
    action: drop
  # Keep only Docker containers (not system cgroups)
  - source_labels: [container_label_com_docker_compose_service]
    regex: '.+'
    action: keep
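
To see which series survive the four rules above, their effect can be simulated in Python. This is a simplified model, not Prometheus code: it relies on two documented behaviours (relabel regexes are fully anchored, and a missing label counts as an empty string) and uses Python's re module in place of Prometheus's RE2 engine. The example label sets are made up.

```python
import re

# The four metric_relabel_configs rules, in order.
RULES = [
    {"source": "image", "regex": r"k8s.gcr.io/pause.*", "action": "drop"},
    {"source": "name", "regex": r"", "action": "drop"},
    {"source": "__name__",
     "regex": r"container_tasks_state|container_memory_failures_total",
     "action": "drop"},
    {"source": "container_label_com_docker_compose_service",
     "regex": r".+", "action": "keep"},
]


def keep_series(labels, rules=RULES):
    """Return True if a series with these labels passes every rule."""
    for rule in rules:
        value = labels.get(rule["source"], "")  # missing label -> empty string
        matched = re.fullmatch(rule["regex"], value) is not None
        if rule["action"] == "drop" and matched:
            return False
        if rule["action"] == "keep" and not matched:
            return False
    return True


if __name__ == "__main__":
    compose_api = {
        "__name__": "container_cpu_usage_seconds_total",
        "name": "api",
        "image": "myapp/api:1.0",
        "container_label_com_docker_compose_service": "api",
    }
    standalone = {
        "__name__": "container_cpu_usage_seconds_total",
        "name": "standalone",
        "image": "nginx:latest",
    }
    print("compose container kept:", keep_series(compose_api))
    print("plain docker run container kept:", keep_series(standalone))
```

The second case is worth noticing: the final keep rule discards every container that was not started by Docker Compose, so leave it out if you also run containers with plain docker run.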

		

Useful PromQL Queries

			
# ── Top Consumers ────────────────────────────────────────────
# Top 5 containers by CPU usage
topk(5,
  rate(container_cpu_usage_seconds_total{
    name!="", image!=""
  }[5m]) * 100
)
# Top 5 containers by memory (working set)
topk(5,
  container_memory_working_set_bytes{
    name!="", image!=""
  }
)
# Top 5 containers by network receive
topk(5,
  rate(container_network_receive_bytes_total{
    name!="", image!=""
  }[5m])
)
# Top 5 containers by disk writes
topk(5,
  rate(container_fs_writes_bytes_total{
    name!="", image!=""
  }[5m])
)
# ── Health Checks ────────────────────────────────────────────
# Containers using more than 80% of memory limit
container_memory_working_set_bytes{name!=""}
/ container_spec_memory_limit_bytes{name!=""} > 0.8
# Containers being CPU throttled
rate(container_cpu_cfs_throttled_seconds_total{
  name!=""
}[5m]) > 0
# Throttle % (how much CPU time is throttled)
rate(container_cpu_cfs_throttled_periods_total{
  name!=""
}[5m])
/ rate(container_cpu_cfs_periods_total{
  name!=""
}[5m])
* 100
# Containers that restarted in last hour
changes(container_start_time_seconds{
  name!="", image!=""
}[1h]) > 0
# ── Resource Efficiency ──────────────────────────────────────
# CPU limit utilization per container
rate(container_cpu_usage_seconds_total{name!=""}[5m])
/ (container_spec_cpu_quota{name!=""}
   / container_spec_cpu_period{name!=""})
* 100
# Memory limit utilization per container
container_memory_working_set_bytes{name!=""}
/ container_spec_memory_limit_bytes{name!=""}
* 100
# Containers with no resource limits set
container_spec_memory_limit_bytes == 0

		

cAdvisor Grafana Dashboard

Import dashboard ID 14282 or build panels manually:

			
Docker Overview Dashboard
├── Row 1: Summary Stats
│   ├── Total containers running (stat)
│   ├── Total CPU usage % (gauge)
│   ├── Total memory usage (gauge)
│   └── Total network I/O (stat)
│
├── Row 2: CPU
│   ├── CPU usage by container (time series, stacked)
│   ├── CPU throttling % by container (time series)
│   └── CPU limit utilization (bar gauge)
│
├── Row 3: Memory
│   ├── Memory usage by container (time series, stacked)
│   ├── Memory working set by container (time series)
│   ├── Memory limit utilization % (bar gauge)
│   └── OOM events (stat)
│
├── Row 4: Network
│   ├── Network received by container (time series)
│   ├── Network transmitted by container (time series)
│   ├── Network errors (time series)
│   └── Dropped packets (time series)
│
└── Row 5: Disk
    ├── Disk read bytes by container (time series)
    ├── Disk write bytes by container (time series)
    ├── Disk IOPS (time series)
    └── Container filesystem usage (bar gauge)

		

Alert Rules for cAdvisor

			
# prometheus/rules/cadvisor_alerts.yml
groups:
- name: cadvisor
  rules:
  # Container down
  - alert: ContainerDown
    expr: |
      time() - container_last_seen{
        name!="",
        image!=""
      } > 60
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{ $labels.name }}"
      description: "Container has not been seen for 60 seconds"
  # High CPU throttling
  - alert: ContainerCPUThrottling
    expr: |
      rate(container_cpu_cfs_throttled_periods_total{name!=""}[5m])
      / rate(container_cpu_cfs_periods_total{name!=""}[5m])
      * 100 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU throttling: {{ $labels.name }}"
      description: "{{ $value | printf \"%.0f\" }}% of CPU time is throttled"
  # High memory usage
  - alert: ContainerMemoryHigh
    expr: |
      container_memory_working_set_bytes{name!=""}
      / (container_spec_memory_limit_bytes{name!=""} > 0)
      * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory: {{ $labels.name }}"
      description: "Memory usage is {{ $value | printf \"%.1f\" }}% of limit"
  # Container OOM risk
  - alert: ContainerOOMRisk
    expr: |
      container_memory_working_set_bytes{name!=""}
      / (container_spec_memory_limit_bytes{name!=""} > 0)
      * 100 > 95
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "OOM risk: {{ $labels.name }}"
      description: "Memory at {{ $value | printf \"%.1f\" }}% — OOM kill imminent"
  # Container restarting
  - alert: ContainerRestarting
    expr: |
      changes(container_start_time_seconds{
        name!="", image!=""
      }[30m]) > 3
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "Container restarting: {{ $labels.name }}"
      description: "Restarted {{ $value }} times in last 30 minutes"
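  # No CPU limit set
  # (cAdvisor typically omits container_spec_cpu_quota when no quota is
  # configured, so this tests for the missing series rather than a value of -1)
  - alert: ContainerNoCPULimit
    expr: |
      container_spec_cpu_period{name!="", image!=""}
      unless on(name) container_spec_cpu_quota{name!="", image!=""}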
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No CPU limit: {{ $labels.name }}"
      description: "Container has no CPU limit — can consume all host CPU"
  # No memory limit set
  - alert: ContainerNoMemoryLimit
    expr: |
      container_spec_memory_limit_bytes{
        name!="", image!=""
      } == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No memory limit: {{ $labels.name }}"
      description: "Container has no memory limit — OOM kill risk to host"

		
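The two memory alerts differ only in their threshold, and both are meaningful only when a limit exists. As a rough Python sketch of that decision (the function name and return values are hypothetical, and the `for:` hold time is deliberately ignored):

```python
def memory_alert_level(working_set_bytes: int, limit_bytes: int) -> str:
    """Classify a container using the 85% / 95% thresholds from the rules above."""
    if limit_bytes <= 0:
        # cAdvisor reports a limit of 0 when no memory limit is configured,
        # so a percentage cannot be computed for this container.
        return "no-limit"
    percent = working_set_bytes / limit_bytes * 100
    if percent > 95:
        return "critical"   # corresponds to ContainerOOMRisk
    if percent > 85:
        return "warning"    # corresponds to ContainerMemoryHigh
    return "ok"


MIB = 1024 * 1024
for used, limit in [(200 * MIB, 512 * MIB), (460 * MIB, 512 * MIB),
                    (500 * MIB, 512 * MIB), (500 * MIB, 0)]:
    print(used // MIB, limit // MIB, memory_alert_level(used, limit))
```

Treating "no limit" as its own state keeps unlimited containers from being mixed in with containers that are genuinely close to their limit.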

cAdvisor vs Node Exporter

They are complementary — not alternatives:

	Node Exporter	cAdvisor
Scope	Host / OS level	Container level
CPU metrics	Per core, per mode	Per container
Memory	Host RAM breakdown	Per container + limits
Network	Per NIC, host-level	Per container
Disk	Per device, per mount	Per container writes
Processes	Host process count	Container processes
Limits	N/A	CPU/memory limits & usage
Best for	Is the server healthy?	Which container is the problem?

			
Debugging workflow:
                                    
Node Exporter → "Host CPU is 95%"
                      ↓
cAdvisor → "api container using 80% of host CPU"
                      ↓
App metrics → "api processing 10k req/s, 50ms p99"
                      ↓
Root cause found

		

cAdvisor Limitations

Limitation	Workaround
Only ~2 min in-memory history	Use Prometheus for long-term storage
High metric cardinality with many containers	Drop unused metrics via relabeling
No application-level metrics	Add app-specific exporters
No log collection	Use Loki + Promtail alongside
No alerting	Use Prometheus Alertmanager
Resource overhead on busy hosts	Tune `--housekeeping_interval`
No cross-host aggregation	Prometheus federation or Thanos
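
The relabeling workaround in the table amounts to matching each metric name against a regular expression and discarding the matches. A minimal Python sketch of a `drop` rule follows; it assumes Python's `re` behaves like Prometheus's RE2 for a simple alternation like this one, and the last metric name in the sample list is made up to show the anchoring.

```python
import re


def apply_drop(metric_names, pattern):
    """Return the names that survive a `drop` rule.

    Prometheus anchors relabel regexes at both ends, which is modelled
    here with re.fullmatch instead of re.search.
    """
    compiled = re.compile(pattern)
    return [name for name in metric_names if not compiled.fullmatch(name)]


names = [
    "container_tasks_state",
    "container_memory_failures_total",
    "container_cpu_usage_seconds_total",
    "container_tasks_state_extra",  # hypothetical name, kept because of anchoring
]
kept = apply_drop(names, r"container_tasks_state|container_memory_failures_total")
print(kept)
```

Because the pattern is anchored, a rule written for one metric does not silently remove others that merely share its prefix.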

Performance Tuning

			
# Reduce cAdvisor overhead on busy hosts
command:
  # Increase collection interval (default 1s)
  - '--housekeeping_interval=10s'
  # Disable metrics you don't need
  - '--disable_metrics=percpu,sched,tcp,udp,hugetlb,referenced_memory,cpu_topology,resctrl'
  # Only monitor Docker (not all cgroups)
  - '--docker_only=true'
  # Don't store container labels (reduce cardinality)
  - '--store_container_labels=false'
  # Allowlist only needed labels (the cAdvisor flag is spelled "whitelisted")
  - '--whitelisted_container_labels=com.docker.compose.service,com.docker.compose.project'
  # Reduce in-memory storage (default 2m)
  - '--storage_duration=1m'

		

cAdvisor is the standard tool for container-level observability — it answers the question “what is this specific container doing?” and is the foundation of container monitoring in both Docker and Kubernetes environments. Paired with Node Exporter for host metrics and Prometheus for storage, it gives you complete visibility from hardware up to individual container processes.

Monitor Linux and Docker with Grafana & Prometheus

May 6, 2026 techhadoop docker, linux

Monitor Linux Server and Docker with Grafana and Prometheus

Architecture Overview

			
┌─────────────────────────────────────────────────────────────┐
│                     LINUX SERVER                             │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │ Node Exporter│  │cAdvisor      │  │ Docker Engine    │   │
│  │              │  │              │  │ (metrics endpoint│   │
│  │ CPU/RAM/Disk │  │ Container    │  │  optional)       │   │
│  │ Network/FS   │  │ metrics      │  │                  │   │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────────┘   │
│         │                 │                  │               │
│         └─────────────────┴──────────────────┘               │
│                           │ scrape                           │
│                    ┌──────▼───────┐                          │
│                    │  Prometheus  │                          │
│                    │              │                          │
│                    │  stores      │                          │
│                    │  metrics     │                          │
│                    └──────┬───────┘                          │
│                           │ query                           │
│                    ┌──────▼───────┐                          │
│                    │   Grafana    │                          │
│                    │              │                          │
│                    │  dashboards  │                          │
│                    │  alerts      │                          │
│                    └──────────────┘                          │
└─────────────────────────────────────────────────────────────┘

		

Project Structure

			
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── linux_alerts.yml
│       └── docker_alerts.yml
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml
│   │   └── dashboards/
│   │       └── dashboard.yml
│   └── dashboards/
│       ├── linux-server.json
│       └── docker.json
└── alertmanager/
    └── alertmanager.yml

		

Step 1 — Docker Compose Stack

			
# docker-compose.yml
version: '3.8'
networks:
  monitoring:
    driver: bridge
volumes:
  prometheus_data: {}
  grafana_data: {}
services:
  # ── Prometheus ───────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.49.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'    # keep 30 days
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'               # hot reload config
      - '--web.enable-admin-api'
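    networks:
      - monitoring
    extra_hosts:
      # Lets Prometheus resolve host.docker.internal on Linux, which the
      # scrape targets running on the Docker host rely on
      - "host.docker.internal:host-gateway"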
  # ── Grafana ──────────────────────────────────────────────
  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=SecurePass123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_DOMAIN=grafana.yourdomain.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.gmail.com:587
      - GF_SMTP_USER=alerts@yourdomain.com
      - GF_SMTP_PASSWORD=your-smtp-password
      - GF_SMTP_FROM_ADDRESS=alerts@yourdomain.com
    networks:
      - monitoring
    depends_on:
      - prometheus
  # ── Node Exporter (Linux metrics) ────────────────────────
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    # No "ports:" mapping: with network_mode: host (below), the exporter
    # binds port 9100 directly on the host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--collector.systemd'          # systemd service metrics
      - '--collector.processes'        # process metrics
    pid: host                          # see host processes
    network_mode: host                 # see host network stats
    cap_add:
      - SYS_TIME
  # ── cAdvisor (Docker container metrics) ──────────────────
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring
  # ── Alertmanager ─────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://alertmanager.yourdomain.com'
    networks:
      - monitoring

		

Step 2 — Prometheus Configuration

			
# prometheus/prometheus.yml
global:
  scrape_interval: 15s          # collect metrics every 15s
  evaluation_interval: 15s      # evaluate rules every 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'production'
    environment: 'prod'
# Alertmanager connection
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
# Load alert rules
rule_files:
  - /etc/prometheus/rules/linux_alerts.yml
  - /etc/prometheus/rules/docker_alerts.yml
scrape_configs:
  # ── Prometheus self-monitoring ────────────────────────────
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
    metrics_path: /metrics
  # ── Linux Server (Node Exporter) ─────────────────────────
  - job_name: 'node-exporter'
    static_configs:
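    # node-exporter uses network_mode: host, so its Compose service name does
    # not resolve; scrape it through the Docker host instead (on Linux this
    # needs host.docker.internal mapped to host-gateway via extra_hosts)
    - targets: ['host.docker.internal:9100']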
      labels:
        server: 'linux-prod-01'
        env: 'production'
  # Multiple servers
  - job_name: 'linux-servers'
    static_configs:
    - targets:
      - '10.0.1.10:9100'
      - '10.0.1.11:9100'
      - '10.0.1.12:9100'
      labels:
        env: 'production'
    relabel_configs:
    - source_labels: [__address__]
      target_label: instance
      regex: '([^:]+):.*'
      replacement: '$1'
  # ── Docker Containers (cAdvisor) ─────────────────────────
  - job_name: 'cadvisor'
    static_configs:
    - targets: ['cadvisor:8080']
      labels:
        server: 'linux-prod-01'
    metric_relabel_configs:
    # Drop high-cardinality metrics we don't need
    - source_labels: [__name__]
      regex: 'container_tasks_state|container_memory_failures_total'
      action: drop
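    # Keep only series from containers started by Docker Compose
    # (every series without this label is dropped, including non-Compose containers)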
    - source_labels: [container_label_com_docker_compose_service]
      regex: '.+'
      action: keep
  # ── Docker Engine metrics (optional) ─────────────────────
  - job_name: 'docker-engine'
    static_configs:
    - targets: ['host.docker.internal:9323']
  # ── Grafana self-monitoring ───────────────────────────────
  - job_name: 'grafana'
    static_configs:
    - targets: ['grafana:3000']
    metrics_path: /metrics

		
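The `relabel_configs` entry under `linux-servers` strips the port from `__address__` so that `instance` shows only the host. A small Python sketch of that `replace` action follows, as an approximation only: Prometheus writes capture groups as `$1` while Python uses `\1`, and the function name is hypothetical.

```python
import re


def relabel_instance(address: str,
                     regex: str = r"([^:]+):.*",
                     replacement: str = r"\1") -> str:
    """Apply a `replace` relabel rule to an address and return the new label value."""
    match = re.fullmatch(regex, address)  # relabel regexes are fully anchored
    if match is None:
        # When the regex does not match, the rule is skipped and the
        # label keeps whatever value it already had.
        return address
    return match.expand(replacement)


for target in ["10.0.1.10:9100", "10.0.1.11:9100", "no-port-here"]:
    print(target, "->", relabel_instance(target))
```

Dropping the port keeps dashboards readable, but it also means two exporters on the same host would share one `instance` value, so the rule fits jobs with exactly one target per host.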

Step 3 — Alert Rules

			
# prometheus/rules/linux_alerts.yml
groups:
- name: linux.server
  interval: 30s
  rules:
  # ── CPU Alerts ───────────────────────────────────────────
  - alert: HighCPUUsage
    expr: |
      100 - (avg by(instance) (
        rate(node_cpu_seconds_total{mode="idle"}[5m])
      ) * 100) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value | printf \"%.1f\" }}% (threshold: 85%)"
  - alert: CriticalCPUUsage
    expr: |
      100 - (avg by(instance) (
        rate(node_cpu_seconds_total{mode="idle"}[5m])
      ) * 100) > 95
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Critical CPU usage on {{ $labels.instance }}"
      description: "CPU usage is {{ $value | printf \"%.1f\" }}%"
  # ── Memory Alerts ─────────────────────────────────────────
  - alert: HighMemoryUsage
    expr: |
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is {{ $value | printf \"%.1f\" }}%"
  - alert: CriticalMemoryUsage
    expr: |
      (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Critical memory usage on {{ $labels.instance }}"
      description: "Memory usage is {{ $value | printf \"%.1f\" }}%"
  # ── Disk Alerts ───────────────────────────────────────────
  - alert: DiskSpaceLow
    expr: |
      (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} /
       node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Disk {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free"
  - alert: DiskSpaceCritical
    expr: |
      (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} /
       node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Critical disk space on {{ $labels.instance }}"
      description: "Only {{ $value | printf \"%.1f\" }}% disk space remaining"
  - alert: DiskWillFillIn24h
    expr: |
      predict_linear(
        node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600
      ) < 0
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Disk will fill in 24h on {{ $labels.instance }}"
  # ── Network Alerts ────────────────────────────────────────
  - alert: HighNetworkErrors
    expr: |
      rate(node_network_receive_errs_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High network errors on {{ $labels.instance }}"
      description: "{{ $value | printf \"%.0f\" }} errors/sec on {{ $labels.device }}"
  # ── System Alerts ─────────────────────────────────────────
  - alert: ServerDown
    expr: up{job="node-exporter"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Server DOWN: {{ $labels.instance }}"
      description: "Node exporter is not reachable"
  - alert: HighLoadAverage
    expr: |
      node_load15 / on(instance) count by(instance)(
        node_cpu_seconds_total{mode="idle"}
      ) > 0.9
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High load average on {{ $labels.instance }}"
      description: "15min load average is {{ $value | printf \"%.2f\" }} per core"
  - alert: SystemdServiceFailed
    expr: |
      node_systemd_unit_state{state="failed"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Systemd service failed on {{ $labels.instance }}"
      description: "Service {{ $labels.name }} is in failed state"
  - alert: ClockSkewDetected
    expr: |
      abs(node_timex_offset_seconds) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Clock skew on {{ $labels.instance }}"
      description: "Clock offset is {{ $value }}s"

		
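The DiskWillFillIn24h rule relies on `predict_linear`, which fits a straight line through the samples in the range and extrapolates it. The Python sketch below shows the idea with an ordinary least-squares fit. It is a simplification: it extrapolates from the last sample rather than from the rule evaluation time, and the sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line through (timestamp, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


GIB = 1024 ** 3
# Invented data: one sample per hour for 6 hours, losing 1 GiB of free space per hour.
samples = [(hour * 3600, (20 - hour) * GIB) for hour in range(7)]
predicted = predict_linear(samples, 24 * 3600)
print(f"predicted free space in 24h: {predicted / GIB:.1f} GiB")
print("alert condition met:", predicted < 0)
```

A negative prediction means the trend line crosses zero free bytes within the horizon, which is the condition the rule tests with `< 0`.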

			
# prometheus/rules/docker_alerts.yml
groups:
- name: docker.containers
  rules:
  # ── Container Status Alerts ───────────────────────────────
  - alert: ContainerDown
    expr: |
      time() - container_last_seen{
        name!="",
        name!~".*_tmp.*"
      } > 60
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Container down: {{ $labels.name }}"
  - alert: ContainerRestarting
    expr: |
      changes(container_start_time_seconds{name!=""}[10m]) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container restarting: {{ $labels.name }}"
  # ── Container CPU Alerts ──────────────────────────────────
  - alert: ContainerHighCPU
    expr: |
      (rate(container_cpu_usage_seconds_total{
        name!="",
        image!=""
      }[5m]) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU in container {{ $labels.name }}"
      description: "Container CPU is {{ $value | printf \"%.1f\" }}%"
  # ── Container Memory Alerts ───────────────────────────────
  - alert: ContainerHighMemory
    expr: |
      (container_memory_usage_bytes{name!="", image!=""} /
       (container_spec_memory_limit_bytes{name!="", image!=""} > 0) * 100) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory in container {{ $labels.name }}"
      description: "Memory usage is {{ $value | printf \"%.1f\" }}% of limit"
  - alert: ContainerOOMKilled
    expr: |
      increase(container_oom_events_total{name!=""}[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container OOM killed: {{ $labels.name }}"
  # ── Container Disk Alerts ─────────────────────────────────
  - alert: ContainerHighDiskWrite
    expr: |
      rate(container_fs_writes_bytes_total{
        name!="",
        image!=""
      }[5m]) > 50000000    # 50MB/s
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High disk writes in container {{ $labels.name }}"
      description: "Writing {{ $value | humanize }}B/s"

		
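Restart detection with `changes()` on a start-time metric works because the start timestamp stays constant while a container runs and jumps each time it is restarted. A minimal Python sketch of that counting logic, using invented scrape values:

```python
def changes(values):
    """Count how many times consecutive samples differ, like PromQL's changes()."""
    return sum(1 for previous, current in zip(values, values[1:]) if current != previous)


# Invented scrapes of a start-time gauge: the container restarted three times.
start_times = [1000, 1000, 1450, 1450, 1900, 2300]
restarts = changes(start_times)
print("restarts in window:", restarts)
print("alert condition (> 2):", restarts > 2)
```

Restarts that happen between two scrapes and leave the same value behind are invisible to this approach, so a short scrape interval gives a more accurate count.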

Step 4 — Alertmanager Configuration

			
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
# Route tree
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s          # wait before sending first alert
  group_interval: 5m       # wait between alert groups
  repeat_interval: 4h      # resend if still firing
  receiver: 'slack-default'
  routes:
  # Critical → PagerDuty, with a shorter group_wait
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    group_wait: 10s
    repeat_interval: 1h
    continue: true         # keep evaluating the sibling routes below
  # Warning → Slack only
  - match:
      severity: warning
    receiver: 'slack-warnings'
    repeat_interval: 4h
  # Server down → immediate notification
  - match:
      alertname: ServerDown
    receiver: 'pagerduty-critical'
    group_wait: 0s         # no delay for server down
receivers:
# Default Slack channel
- name: 'slack-default'
  slack_configs:
  - channel: '#alerts'
    title: '{{ template "slack.default.title" . }}'
    text: '{{ template "slack.default.text" . }}'
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    send_resolved: true
# Warnings channel
- name: 'slack-warnings'
  slack_configs:
  - channel: '#alerts-warning'
    title: '⚠️ {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      *Instance:* {{ .Labels.instance }}
      *Description:* {{ .Annotations.description }}
      {{ end }}
    send_resolved: true
# PagerDuty for critical
- name: 'pagerduty-critical'
  pagerduty_configs:
  - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
    description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
    severity: critical
# Email notifications
- name: 'email-alerts'
  email_configs:
  - to: 'devops-team@yourdomain.com'
    subject: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
    html: |
      <h2>{{ .GroupLabels.alertname }}</h2>
      {{ range .Alerts }}
      <p><b>Instance:</b> {{ .Labels.instance }}</p>
      <p><b>Description:</b> {{ .Annotations.description }}</p>
      {{ end }}
    send_resolved: true
inhibit_rules:
# If server is down, suppress all other alerts for that server
- source_match:
    alertname: ServerDown
  target_match_re:
    alertname: '.+'
  equal: ['instance']

		
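The route tree is easier to reason about once the matching order is spelled out: child routes are tried top to bottom, the first match wins unless `continue` is set, and the top-level receiver is used only when no child matches. The Python sketch below models just that part, as a simplification with equality matchers only, no nested routes and no grouping. The data structure is an assumption made for illustration.

```python
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings", "continue": False},
    {"match": {"alertname": "ServerDown"}, "receiver": "pagerduty-critical", "continue": False},
]
DEFAULT_RECEIVER = "slack-default"


def receivers_for(labels):
    """Return the receivers an alert with these labels would be routed to."""
    matched = []
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break  # stop at the first match unless the route says to continue
    return matched or [DEFAULT_RECEIVER]


for alert in [
    {"alertname": "HighCPUUsage", "severity": "warning"},
    {"alertname": "CriticalCPUUsage", "severity": "critical"},
    {"alertname": "ServerDown", "severity": "critical"},
    {"alertname": "Info", "severity": "info"},
]:
    print(alert["alertname"], "->", receivers_for(alert))
```

In this model a critical alert never reaches the default receiver, because `continue` only moves on to later sibling routes. It does not fall back to the parent.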

Step 5 — Grafana Provisioning

			
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus:9090
  isDefault: true
  editable: false
  jsonData:
    timeInterval: "15s"
    queryTimeout: "60s"
    httpMethod: POST

		

			
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
- name: 'default'
  orgId: 1
  folder: 'Server Monitoring'
  type: file
  disableDeletion: false
  editable: true
  updateIntervalSeconds: 30
  options:
    path: /var/lib/grafana/dashboards
    foldersFromFilesStructure: true

		

Step 6 — Key PromQL Queries

			
# ── CPU ────────────────────────────────────────────────────
# CPU usage % per server
100 - (avg by(instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)
# CPU by mode (user, system, iowait, steal)
avg by(instance, mode) (
  rate(node_cpu_seconds_total{mode!="idle"}[5m])
) * 100
# Top 5 CPU-consuming containers
topk(5,
  rate(container_cpu_usage_seconds_total{
    name!="", image!=""
  }[5m]) * 100
)
# ── Memory ─────────────────────────────────────────────────
# Memory usage %
(1 - (node_memory_MemAvailable_bytes /
      node_memory_MemTotal_bytes)) * 100
# Used memory (total minus free, buffers and cache)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
  - node_memory_Buffers_bytes - node_memory_Cached_bytes
# Container memory usage vs limit
container_memory_usage_bytes{name!=""}
  / container_spec_memory_limit_bytes{name!=""} * 100
# ── Disk ───────────────────────────────────────────────────
# Disk usage % per mount
(1 - node_filesystem_avail_bytes{fstype!="tmpfs"} /
     node_filesystem_size_bytes{fstype!="tmpfs"}) * 100
# Disk I/O read/write bytes per second
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Disk busy time (fraction of each second the device spent doing I/O)
rate(node_disk_io_time_seconds_total[5m])
# Predicted free space on / one hour from now (based on the 6h trend)
predict_linear(
  node_filesystem_avail_bytes{mountpoint="/"}[6h],
  3600
) / 1024 / 1024 / 1024   # convert to GiB
# ── Network ────────────────────────────────────────────────
# Network bandwidth in/out per interface
rate(node_network_receive_bytes_total{
  device!~"lo|docker.*|veth.*"
}[5m]) * 8    # convert to bits
rate(node_network_transmit_bytes_total{
  device!~"lo|docker.*|veth.*"
}[5m]) * 8
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# ── Docker / Containers ────────────────────────────────────
# Running containers count
count(container_last_seen{name!="", image!=""})
# Container CPU usage %
rate(container_cpu_usage_seconds_total{
  name!="", image!=""
}[5m]) * 100
# Container network traffic
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])
# Container restart count
changes(container_start_time_seconds{name!=""}[1h])
# ── System ─────────────────────────────────────────────────
# System load per CPU core
node_load1 / on(instance) count by(instance)(
  node_cpu_seconds_total{mode="idle"}
)
# Open file descriptors
node_filefd_allocated / node_filefd_maximum * 100
# System uptime in days
(time() - node_boot_time_seconds) / 86400
# Number of processes
node_procs_running
node_procs_blocked

		
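The CPU usage query can look backwards at first because it starts from idle time. As a short Python sketch of the same calculation (the counter values are invented and the function name is hypothetical): take the per-core idle rate, average it across cores, and subtract from 100.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Mirror 100 - avg(rate(idle)) * 100 for one server.

    idle_start / idle_end hold one idle counter value (in seconds) per core,
    sampled window_seconds apart.
    """
    idle_rates = [(end - start) / window_seconds
                  for start, end in zip(idle_start, idle_end)]
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100 - average_idle * 100


# Invented two-core sample over a 5-minute window:
# core 0 was idle for 60 s, core 1 for 30 s.
usage = cpu_usage_percent([1000.0, 2000.0], [1060.0, 2030.0], 300)
print(f"CPU usage: {usage:.1f}%")
```

Averaging across cores is what keeps the result between 0 and 100 regardless of how many cores the server has.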

Step 7 — Deploy and Manage

			
# Start the monitoring stack
docker-compose up -d
# Check all services running
docker-compose ps
# NAME            STATUS          PORTS
# prometheus      Up              0.0.0.0:9090->9090/tcp
# grafana         Up              0.0.0.0:3000->3000/tcp
# node-exporter   Up              0.0.0.0:9100->9100/tcp
# cadvisor        Up              0.0.0.0:8080->8080/tcp
# alertmanager    Up              0.0.0.0:9093->9093/tcp
# Check Prometheus targets (all should be UP)
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# Reload Prometheus config (without restart)
curl -X POST http://localhost:9090/-/reload
# Check Prometheus rules loaded
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# View Grafana logs
docker-compose logs grafana -f
# Check alertmanager config is valid (amtool ships in the same image)
docker run --rm \
  -v "$(pwd)/alertmanager:/config" \
  --entrypoint amtool \
  prom/alertmanager:v0.26.0 \
  check-config /config/alertmanager.yml
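# Backup Prometheus data
# (Compose prefixes volume names with the project name, "monitoring" here;
#  confirm the exact name with: docker volume ls)
docker run --rm \
  -v monitoring_prometheus_data:/data:ro \
  -v "$(pwd)/backup:/backup" \
  alpine tar czf "/backup/prometheus-$(date +%Y%m%d).tar.gz" /data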
# Update stack
docker-compose pull
docker-compose up -d

		
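If `jq` is not available, the same target check can be done in a few lines of Python. The sketch below works on a trimmed, made-up response that follows the `/api/v1/targets` shape used by the `jq` filter above (`data.activeTargets[].labels.job` and `.health`); the helper names are hypothetical.

```python
import json

# Trimmed, invented response body for illustration.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "prometheus", "instance": "localhost:9090"}, "health": "up"},
      {"labels": {"job": "cadvisor", "instance": "cadvisor:8080"}, "health": "down"},
      {"labels": {"job": "grafana", "instance": "grafana:3000"}, "health": "up"}
    ]
  }
}
"""


def summarize_targets(payload):
    """Reduce each active target to its job name and health, like the jq filter."""
    return [{"job": target["labels"]["job"], "health": target["health"]}
            for target in payload["data"]["activeTargets"]]


def unhealthy_jobs(payload):
    """Return the sorted job names that have at least one target not 'up'."""
    return sorted({entry["job"] for entry in summarize_targets(payload)
                   if entry["health"] != "up"})


payload = json.loads(SAMPLE_RESPONSE)
for entry in summarize_targets(payload):
    print(entry)
print("unhealthy:", unhealthy_jobs(payload))
```

To run it against a live server, the sample string would be replaced by the body returned from `http://localhost:9090/api/v1/targets`.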

Step 8 — Install Node Exporter on Bare Metal

If Prometheus runs separately from the monitored server:

			
# Download and install node exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --web.listen-address=:9100
[Install]
WantedBy=multi-user.target
EOF
# Create user and start service
sudo useradd -rs /bin/false node_exporter
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
# Verify
curl http://localhost:9100/metrics | head -20

		

Grafana Dashboard Setup

			
# Access Grafana
open http://localhost:3000
# Login: admin / SecurePass123
# Import community dashboards via ID:
# Node Exporter Full:     ID 1860
# Docker monitoring:      ID 893
# cAdvisor:               ID 14282
# Prometheus stats:       ID 3662
# Import via CLI: download the dashboard JSON from grafana.com into the
# provisioned dashboards folder (Grafana picks it up on its next scan)
curl -fsSL \
  https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o ./grafana/dashboards/node-exporter-full.json

		

Useful Grafana Panel Examples

			
Linux Server Dashboard panels:
├── CPU Usage gauge          (0-100%, threshold at 80/95)
├── Memory Usage gauge       (0-100%, threshold at 80/95)
├── Disk Usage per mount     (bar gauge)
├── CPU Usage over time      (time series, stacked by mode)
├── Memory breakdown         (time series, stacked)
├── Network bandwidth        (time series, in/out)
├── Disk I/O                 (time series, read/write)
├── System Load              (time series, 1/5/15 min)
├── Top processes            (table)
└── Uptime stat              (stat panel)
Docker Dashboard panels:
├── Running containers       (stat)
├── Container CPU top 10     (bar chart)
├── Container Memory top 10  (bar chart)
├── Container restarts       (table with alert)
├── Network I/O per container(time series)
├── Container disk I/O       (time series)
└── Container logs           (logs panel via Loki)

		

Security Hardening

			
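# Add basic auth to Prometheus
# Use nginx reverse proxy in front (remove the "9090:9090" mapping from the
# prometheus service so that only nginx is published on port 9090)
nginx:
  image: nginx:alpine
  ports:
    - "9090:80"
  volumes:
    - ./nginx/prometheus.conf:/etc/nginx/conf.d/default.conf:ro
    - ./nginx/.htpasswd:/etc/nginx/.htpasswd:ro
  networks:
    - monitoring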

		

			
# nginx/prometheus.conf
server {
  listen 80;
  location / {
    auth_basic "Prometheus";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://prometheus:9090;
  }
}

		

			
# Generate password file
htpasswd -c ./nginx/.htpasswd admin
# Restrict firewall — only allow internal access
ufw allow from 10.0.0.0/8 to any port 9090
ufw allow from 10.0.0.0/8 to any port 9100
ufw allow from 10.0.0.0/8 to any port 8080
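ufw allow 3000/tcp                         # Grafana only public
# Note: ports published by Docker are handled in Docker's own iptables chains
# and can bypass ufw, so verify the result from another host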

		

Quick Access URLs

			
Grafana:       http://localhost:3000      admin/SecurePass123
Prometheus:    http://localhost:9090
Alertmanager:  http://localhost:9093
Node Exporter: http://localhost:9100/metrics
cAdvisor:      http://localhost:8080

		

This gives you a production-ready monitoring stack — Linux server metrics via Node Exporter, Docker container metrics via cAdvisor, intelligent alerting via Alertmanager, and beautiful dashboards via Grafana, all wired together with Prometheus as the metrics backbone.

Docker Image Optimization: Best Practices & Tips

April 18, 2026 techhadoop docker ai, artificial-intelligence, cloud, docker, technology

Here’s the practical best-practice checklist for building Docker images today:

Use a small, trusted base image and pin versions instead of relying on latest. Docker recommends choosing the right base image, keeping it small, and pinning base image versions for better security and repeatability. (Docker Documentation)

Use multi-stage builds so build tools never end up in the final runtime image. This is one of Docker’s main recommendations for producing smaller, cleaner, more secure images. (Docker Documentation)

Keep the build context small with a .dockerignore file. Excluding node_modules, .git, test artifacts, local env files, and temp files speeds builds and reduces accidental leakage into the image. Docker explicitly recommends using .dockerignore. (Docker Documentation)

Design your Dockerfile to maximize cache reuse. Copy dependency files first, install dependencies, then copy the rest of the app. Since Docker images are layer-based, ordering instructions well can make rebuilds much faster. (Docker Documentation)

Do not install unnecessary packages. Keep the image focused on one service, and remove build-only tools from the final stage. Docker also recommends creating ephemeral containers and decoupling applications where possible. (Docker Documentation)

Run the app as a non-root user whenever possible. Docker’s learning materials call out that a production-ready Dockerfile should improve security by running as non-root. (Docker Documentation)

Rebuild images regularly and use fresh base layers, especially for security patches. Docker recommends rebuilding often and using flags like --pull and, when needed, --no-cache for clean rebuilds. Also build and test images in CI. (Docker Documentation)

A solid production pattern looks like this:

			
# syntax=docker/dockerfile:1

# Stage 1: install production dependencies only
FROM node:22-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Stage 2: install all dependencies and build the app
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: runtime image containing only what the app needs to run
FROM node:22-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY package*.json ./
USER node
EXPOSE 3000
CMD ["node", "dist/server.js"]

		

A matching .dockerignore should usually include:

			
node_modules
npm-debug.log
.git
.gitignore
Dockerfile*
docker-compose*
.env
coverage
dist
tmp

		

For most teams, the simplest rule set is:

Small pinned base image
Multi-stage build
.dockerignore
Cache-friendly Dockerfile order
Non-root runtime
Rebuild in CI and scan often (Docker Documentation)

Here is a production-ready Docker image pattern you can reuse for most Node.js apps.

Good Dockerfile pattern

			
# syntax=docker/dockerfile:1
# 1) Install production dependencies only, in a separate stage
FROM node:22-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
# 2) Build the app
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# 3) Runtime image
FROM node:22-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Run as the non-root "node" user that ships with the official Node image
USER node
# Copy only what is needed at runtime
COPY --chown=node:node --from=deps /app/node_modules ./node_modules
COPY --chown=node:node --from=build /app/dist ./dist
COPY --chown=node:node package*.json ./
EXPOSE 3000
CMD ["node", "dist/server.js"]

		

Matching `.dockerignore`

			
node_modules
npm-debug.log
.git
.gitignore
Dockerfile*
docker-compose*
.env
.env.*
coverage
dist
tmp
.vscode
.idea

		

Why this is a strong default

Docker’s current guidance recommends:

multi-stage builds to keep the final image smaller and cleaner (Docker Documentation)
using a .dockerignore file to keep the build context small and avoid sending unnecessary files to the builder (Docker Documentation)
structuring the Dockerfile for better cache reuse, like copying dependency manifests before app source (Docker Documentation)
running the app as a non-root user in production images (Docker Documentation)
avoiding secrets in ARG or ENV; Docker recommends using secret mounts instead because build args and env vars can be exposed in image metadata or the final image (Docker Documentation)
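
The .dockerignore point in the list above lends itself to a quick automated check. The following is a minimal shell sketch, not an official Docker tool: the function name missing_ignores and the list of required entries (node_modules, .git, .env) are assumptions chosen for a typical Node.js project, so adjust them to your own.

```shell
#!/bin/sh
# Hypothetical helper: print each required pattern that is absent from a
# .dockerignore file. The required list is an assumption for illustration,
# not a set mandated by Docker.
missing_ignores() {
  file=${1:-.dockerignore}
  for pattern in node_modules .git .env; do
    # -x matches the whole line, -F treats the pattern as a literal string,
    # so ".git" is not satisfied by a ".gitignore" entry.
    if ! grep -qxF -- "$pattern" "$file" 2>/dev/null; then
      printf '%s\n' "$pattern"
    fi
  done
}
```

One way to use it is to call missing_ignores .dockerignore in CI and fail the job whenever it prints anything, which catches a forgotten .env entry before the build context is sent to the builder.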

Even better build command

docker build --pull -t myapp:latest .

--pull helps refresh the base image layers so you don’t keep building on stale images, which aligns with Docker’s recommendation to rebuild often and keep base layers fresh. (Docker Documentation)
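
A small wrapper can make the fresh-base-layer habit repeatable. This sketch only prints the command it would run, so it can be reviewed before execution. The function name build_cmd, the CLEAN variable, and the choice to tag with an explicit version are assumptions for illustration; --pull, --no-cache, and -t are standard docker build options.

```shell
#!/bin/sh
# Hypothetical helper: print a docker build command for image NAME at VERSION.
# Set CLEAN=1 to also bypass the layer cache with --no-cache.
build_cmd() {
  name=$1
  version=$2
  flags="--pull"
  if [ "${CLEAN:-0}" = "1" ]; then
    flags="$flags --no-cache"
  fi
  printf 'docker build %s -t %s:%s .\n' "$flags" "$name" "$version"
}
```

For example, build_cmd myapp 1.4.2 prints docker build --pull -t myapp:1.4.2 . and the output can be piped to sh once it looks right.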

7 rules to follow every time

1. Pin the base image

FROM node:22.14-alpine

2. Do not use latest in production
3. Copy dependency files first

COPY package*.json ./
RUN npm ci
COPY . .

4. Only copy runtime artifacts into the final stage
5. Run as non-root
6. Keep secrets out of the Dockerfile
7. Keep one main responsibility per container when possible (Docker Documentation)
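
The rules about pinning and avoiding latest can be enforced with a small check on image references. This is a sketch under stated assumptions: is_pinned is a hypothetical helper that treats any explicit tag other than latest, or any sha256 digest, as pinned. It does not verify that the tag actually exists in a registry.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if an image reference is pinned (explicit
# non-"latest" tag or a digest), 1 otherwise.
is_pinned() {
  ref=$1
  case $ref in
    *@sha256:*) return 0 ;;  # digest-pinned reference
  esac
  # Drop the registry/namespace part, which may itself contain ":" (a port).
  name=${ref##*/}
  case $name in
    *:latest) return 1 ;;
    *:*)      return 0 ;;
    *)        return 1 ;;    # no tag: Docker falls back to "latest"
  esac
}
```

With this definition, is_pinned node:22.14-alpine succeeds, while is_pinned node:latest and is_pinned node both fail.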

Common mistakes

Bad:

			
COPY . .
RUN npm install

Better:

			
COPY package*.json ./
RUN npm ci
COPY . .

Bad:

FROM node:latest

Better:

FROM node:22-alpine

Bad:

ENV API_KEY=secret123

Better: pass secrets at runtime or use Docker build secrets. (Docker Documentation)
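
A rough lint can catch the ENV API_KEY pattern before it lands in an image. The sketch below is a heuristic, not a Docker feature: find_secret_vars is a hypothetical function, and the keyword list (KEY, TOKEN, SECRET, PASSWORD) is a guess at common naming that will produce both false positives and misses.

```shell
#!/bin/sh
# Hypothetical helper: list ENV/ARG instructions whose variable name looks
# like a credential. Output is "line-number:instruction". The exit status
# follows grep: 0 when at least one match is found, non-zero otherwise.
find_secret_vars() {
  grep -nEi '^[[:space:]]*(ENV|ARG)[[:space:]]+[A-Za-z0-9_]*(KEY|TOKEN|SECRET|PASSWORD)' "$1"
}
```

Run against a Dockerfile containing ENV API_KEY=secret123, it reports that line and leaves ENV NODE_ENV=production alone. Anything it flags is a candidate for runtime configuration or Docker build secrets (docker build --secret together with RUN --mount=type=secret).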

If your app does not need Node at runtime

For frontend apps like React/Vite/Angular/Vue, it is often better to build in Node and serve with Nginx in the final stage, which Docker’s current framework guides also demonstrate for modern frontend apps. (Docker Documentation)
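
As an illustration of that build-in-Node, serve-with-Nginx layout, here is a small scaffold that writes a two-stage Dockerfile. It is a sketch under assumptions rather than Docker's own template: write_frontend_dockerfile is a hypothetical function, the build output is assumed to land in /app/dist (the Vite default; other frameworks use different directories), and the nginx:1.27-alpine tag is only an example of a pinned version.

```shell
#!/bin/sh
# Hypothetical scaffold: write a two-stage Dockerfile that builds static
# assets with Node and serves them with Nginx.
# Assumptions: "npm run build" emits files to /app/dist, and the image tags
# below suit your project.
write_frontend_dockerfile() {
  out=${1:-Dockerfile}
  cat > "$out" <<'EOF'
# syntax=docker/dockerfile:1
# 1) Build static assets with Node
FROM node:22-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# 2) Serve them with Nginx; no Node runtime in the final image
FROM nginx:1.27-alpine
COPY --from=build /app/dist /usr/share/nginx/html
EXPOSE 80
EOF
}
```

Calling write_frontend_dockerfile Dockerfile.frontend produces a file you can build with docker build -f Dockerfile.frontend. The sketch does not address the non-root rule; that needs separate handling for the Nginx stage.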

Best-practice summary

Use:

small pinned base image
multi-stage build
.dockerignore
cache-friendly layer order
non-root runtime
no secrets in ARG or ENV
regular rebuilds with fresh base layers (Docker Documentation)
