Monitor Linux Servers with Node Exporter Full

Node Exporter Full Dashboard Explained

What is Node Exporter Full?

Node Exporter Full is the most popular Grafana dashboard (ID: 1860) for Linux server monitoring. It provides a comprehensive view of every hardware and OS metric collected by Node Exporter — over 30 panels covering CPU, memory, disk, network, and system metrics.


Dashboard Overview

┌─────────────────────────────────────────────────────────────┐
│           NODE EXPORTER FULL — DASHBOARD LAYOUT             │
├─────────────────────────────────────────────────────────────┤
│  [Server selector]  [Time range]  [Refresh interval]        │
├──────────┬──────────┬──────────┬──────────┬─────────────────┤
│  Uptime  │CPU Cores │   RAM    │   SWAP   │     Root FS     │
│  (stat)  │  (stat)  │  (stat)  │  (stat)  │     (stat)      │
├──────────┴──────────┴──────────┴──────────┴─────────────────┤
│                   CPU USAGE (time series)                   │
├──────────────────────────────┬──────────────────────────────┤
│      CPU Basic (gauge)       │    CPU Busy (time series)    │
├──────────────────────────────┼──────────────────────────────┤
│     Memory Basic (gauge)     │  Memory Usage (time series)  │
├──────────────────────────────┴──────────────────────────────┤
│                    DISK I/O (time series)                   │
├─────────────────────────────────────────────────────────────┤
│                NETWORK TRAFFIC (time series)                │
├──────────────────────────────┬──────────────────────────────┤
│    Disk Space (bar gauge)    │ Network Errors (time series) │
└──────────────────────────────┴──────────────────────────────┘

Section 1 — Quick Stats Row (Top)

The top row shows current snapshot values at a glance:

┌──────────┬──────────┬──────────┬──────────┬──────────────┐
│  Uptime  │CPU Cores │   RAM    │   SWAP   │   Root FS    │
│ 45 days  │    8     │ 31.2 GB  │  2.0 GB  │    234 GB    │
└──────────┴──────────┴──────────┴──────────┴──────────────┘
Uptime
# How long the server has been running
(time() - node_boot_time_seconds{instance="$node", job="$job"})

Shows days, hours, minutes — quick health check. Server rebooted unexpectedly? Uptime drops.
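
A related check (not a stock panel, just a sketch): flag a host that rebooted recently by comparing uptime against a threshold, e.g. 10 minutes:

# Hypothetical alert condition: fires while the server has been up under 10 minutes
(time() - node_boot_time_seconds{instance="$node", job="$job"}) < 600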

CPU Cores
# Total logical CPU count
count(
  count by(cpu) (
    node_cpu_seconds_total{instance="$node", job="$job"}
  )
)

Total RAM
# Physical RAM in bytes
node_memory_MemTotal_bytes{instance="$node", job="$job"}

SWAP Total
# Total swap space
node_memory_SwapTotal_bytes{instance="$node", job="$job"}

High swap usage = memory pressure — app may be swapping pages to disk.
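
To put a number on it (a sketch, not one of the stock panels), swap usage as a percentage:

# Swap usage % (only meaningful when SwapTotal > 0)
(1 - node_memory_SwapFree_bytes{instance="$node", job="$job"}
   / node_memory_SwapTotal_bytes{instance="$node", job="$job"}) * 100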

Root Filesystem
# Total size of root partition
node_filesystem_size_bytes{instance="$node", job="$job", mountpoint="/", fstype!="rootfs"}

Section 2 — CPU Panels

CPU Basic (Gauge)
┌─────────────────────────────┐
│          CPU Basic          │
│                             │
│             67%             │
│      ████████████░░░░░      │
│      0%   [67%]   100%      │
│   Green < 50  Yellow < 80   │
│          Red > 80           │
└─────────────────────────────┘
# Current overall CPU busy %
(1 - avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}[$__rate_interval])
)) * 100

Thresholds typically set at:

  • 🟢 Green: 0-50%
  • 🟡 Yellow: 50-80%
  • 🔴 Red: 80-100%
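
A threshold on the gauge only colors the panel; to get paged you would alert on the same expression. A minimal sketch matching the red zone (the 5m window is an assumption, tune to taste):

# Alert-style condition: overall CPU busy above 80%
(1 - avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}[5m])
)) * 100 > 80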

CPU Usage Time Series

The most detailed CPU panel — shows how CPU time is spent, broken down by mode:

100% ┤ ██ steal
     │ ██ iowait
 80% ┤ ██ irq/softirq
     │ ██ system
 60% ┤ ████████ user
 40% ┤ ████████████████
     │ ████████████████████ idle
  0% ┤────────────────────────────────────
       12:00         13:00         14:00
# CPU user time (app code)
avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="user"}[$__rate_interval])
) * 100

# CPU system time (kernel)
avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="system"}[$__rate_interval])
) * 100

# CPU iowait (waiting for disk I/O)
avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="iowait"}[$__rate_interval])
) * 100

# CPU steal (hypervisor stealing CPU from VM)
avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="steal"}[$__rate_interval])
) * 100

# CPU softirq (network interrupts, timers)
avg by(instance) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode="softirq"}[$__rate_interval])
) * 100
Understanding CPU Modes

Mode    | Meaning                      | High value means
--------|------------------------------|---------------------------------
user    | App/process code running     | High app activity — normal
system  | Kernel code running          | Many syscalls, context switches
iowait  | CPU idle waiting for I/O     | Disk bottleneck
steal   | Hypervisor stealing CPU      | VM is being throttled by host
irq     | Hardware interrupt handling  | Network/disk interrupt storm
softirq | Software interrupt handling  | High network traffic
idle    | CPU doing nothing            | Plenty of headroom
nice    | Low-priority user processes  | Background tasks running
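
Not a stock panel, but a handy ad-hoc query: rank the non-idle modes to see where CPU time is actually going:

# Top 3 CPU modes by share of CPU time (excluding idle)
topk(3, avg by(mode) (
  rate(node_cpu_seconds_total{instance="$node", job="$job", mode!="idle"}[$__rate_interval])
) * 100)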

CPU Busy by Core

Shows individual core utilization — detects uneven load distribution:

# Per-core CPU usage
(1 - rate(node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}[$__rate_interval])) * 100

Core 0: ████████████████     78%
Core 1: ████                 22%
Core 2: ████████████████████ 95% ← hot core
Core 3: ██                   10%

One hot core = single-threaded bottleneck. All cores high = genuinely compute-bound.
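
To pull out the hot core directly, a sketch using the same per-core expression:

# Busiest core right now
topk(1, (1 - rate(node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}[$__rate_interval])) * 100)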


System Load Average
┌────────────────────────────────────────┐
│            System Load / CPU           │
│                                        │
│ 1.2 ┤        ╭──╮                      │
│ 0.8 ┤     ╭──╯  ╰──╮      1min load    │
│ 0.4 ┤ ────╯        ╰──    5min load    │
│ 0.0 ┤              ──     15min load   │
└────────────────────────────────────────┘
# Load average normalized per CPU core
# (on(instance) is needed because node_loadN also carries a job label,
#  while the count on the right side only keeps instance)
node_load1{instance="$node", job="$job"}
/ on(instance)
count by(instance) (
  node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}
)

node_load5{instance="$node", job="$job"} / on(instance) count by(instance) (...)

node_load15{instance="$node", job="$job"} / on(instance) count by(instance) (...)

Interpreting load average:

< 1.0 per core = plenty of headroom
= 1.0 per core = fully utilized, no queue
> 1.0 per core = processes waiting for CPU
> 2.0 per core = severe overload
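
As an alert condition (a sketch; the dashboard itself does not alert), the "severe overload" case would be:

# 5-minute load more than 2× the core count
node_load5{instance="$node", job="$job"}
/ on(instance)
count by(instance) (
  node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}
) > 2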

Context Switches and Interrupts
# Context switches per second
rate(node_context_switches_total{instance="$node", job="$job"}[$__rate_interval])

# Hardware interrupts per second
rate(node_intr_total{instance="$node", job="$job"}[$__rate_interval])

High context switches = many processes competing for CPU, or lots of I/O-bound processes sleeping and waking.
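
Raw numbers vary wildly with core count, so a per-core normalization (a sketch, not a stock panel) makes hosts comparable:

# Context switches per second per core
# (scalar() works here because $node selects a single instance)
rate(node_context_switches_total{instance="$node", job="$job"}[$__rate_interval])
/ scalar(count(count by(cpu) (node_cpu_seconds_total{instance="$node", job="$job"})))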


Section 3 — Memory Panels

Memory Basic (Gauge)
┌──────────────────────────────┐
│         Memory Basic         │
│                              │
│             82%              │
│      ████████████████░░      │
│      0%   [82%]   100%       │
└──────────────────────────────┘
# RAM usage %
(1 - (
  node_memory_MemAvailable_bytes{instance="$node", job="$job"}
  / node_memory_MemTotal_bytes{instance="$node", job="$job"}
)) * 100

Note: this uses MemAvailable, not MemFree — available includes reclaimable cache, which is a better measure of the memory that is actually free.
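
The gap between the two is the reclaimable portion; a quick query to see it (a sketch):

# Memory counted as available but not as free (reclaimable cache/buffers)
node_memory_MemAvailable_bytes{instance="$node", job="$job"}
- node_memory_MemFree_bytes{instance="$node", job="$job"}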


Memory Usage Breakdown (Time Series)
32GB ┤ ████████   Used (apps)
     │ ████       Buffers
     │ ██████████ Cached (filesystem cache)
     │ ████       Free
   0 ┤─────────────────────────────────────
# Actually used by apps (excludes cache/buffers)
node_memory_MemTotal_bytes{instance="$node", job="$job"}
  - node_memory_MemFree_bytes{instance="$node", job="$job"}
  - node_memory_Buffers_bytes{instance="$node", job="$job"}
  - node_memory_Cached_bytes{instance="$node", job="$job"}

# Filesystem cache (reclaimable)
node_memory_Cached_bytes{instance="$node", job="$job"}

# Buffer cache (reclaimable)
node_memory_Buffers_bytes{instance="$node", job="$job"}

# Truly free
node_memory_MemFree_bytes{instance="$node", job="$job"}

# SWAP used
node_memory_SwapTotal_bytes{instance="$node", job="$job"}
  - node_memory_SwapFree_bytes{instance="$node", job="$job"}
Understanding Memory Types

Memory Type | Description             | Concern level
------------|-------------------------|--------------------------------
Used        | Active app memory       | 🔴 High if > 80% of total
Cached      | Linux filesystem cache  | 🟢 Normal — reclaimable
Buffers     | Disk write buffers      | 🟢 Normal — reclaimable
Free        | Completely unused       | 🟢 Low is OK if cache is high
Available   | Free + reclaimable      | ✅ Best indicator of real free
SWAP Used   | Memory paged to disk    | 🔴 Any non-zero is a warning

SWAP Activity
# Swap pages swapped in per second (bad — reading from disk)
rate(node_vmstat_pswpin{instance="$node", job="$job"}[$__rate_interval])

# Swap pages swapped out per second (bad — writing to disk)
rate(node_vmstat_pswpout{instance="$node", job="$job"}[$__rate_interval])

Any sustained swap activity means the system is memory-constrained — application pages are being moved to and from disk, causing severe performance degradation.
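
A simple alert-style condition on it (a sketch; tune the window and threshold):

# Any sustained swap traffic, in or out
rate(node_vmstat_pswpin{instance="$node", job="$job"}[$__rate_interval])
+ rate(node_vmstat_pswpout{instance="$node", job="$job"}[$__rate_interval]) > 0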


Memory Pages
# Page faults per second
rate(node_vmstat_pgfault{instance="$node", job="$job"}[$__rate_interval])

# Major page faults (require disk I/O — worse)
rate(node_vmstat_pgmajfault{instance="$node", job="$job"}[$__rate_interval])
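
The ratio of the two tells you what fraction of faults actually hit the disk (a sketch, not a stock panel):

# Share of page faults that are major (disk-backed)
rate(node_vmstat_pgmajfault{instance="$node", job="$job"}[$__rate_interval])
/ rate(node_vmstat_pgfault{instance="$node", job="$job"}[$__rate_interval])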

Section 4 — Disk Panels

Disk Space Used (Bar Gauge)

Shows all mounted filesystems and their usage:

/         ████████████████░░░░  78% (234 GB / 300 GB)
/data     ████████░░░░░░░░░░░░  42% (420 GB / 1 TB)
/var/log  ████████████████████  96% ← critical!
/boot     ████░░░░░░░░░░░░░░░░  18% (180 MB / 1 GB)
# Disk usage % per mountpoint
(1 - node_filesystem_avail_bytes{instance="$node", job="$job", fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}
   / node_filesystem_size_bytes{instance="$node", job="$job", fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}
) * 100
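
Current usage tells you where you are; predict_linear tells you where you are headed. A common companion query (a sketch, with an assumed 6h sample window and 24h horizon):

# Filesystems projected to fill within 24 hours, based on the last 6 hours
predict_linear(node_filesystem_avail_bytes{instance="$node", job="$job",
  fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"}[6h], 24 * 3600) < 0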

Disk I/O Time Series
200MB/s ┤             ╭──╮
        │  reads      │  │  writes
100MB/s ┤──────╮   ───╯  ╰──────╮
        │      │                │
      0 ┤──────╯                ╰───
           12:00        13:00
# Disk read throughput (bytes/sec)
rate(node_disk_read_bytes_total{instance="$node", job="$job", device=~"$disk"}[$__rate_interval])

# Disk write throughput (bytes/sec)
rate(node_disk_written_bytes_total{instance="$node", job="$job", device=~"$disk"}[$__rate_interval])

# Read IOPS (operations per second)
rate(node_disk_reads_completed_total{instance="$node", job="$job", device=~"$disk"}[$__rate_interval])

# Write IOPS
rate(node_disk_writes_completed_total{instance="$node", job="$job", device=~"$disk"}[$__rate_interval])

Disk I/O Utilization (Saturation)
# % of time the disk was busy (saturation)
rate(node_disk_io_time_seconds_total{instance="$node", job="$job", device=~"$disk"}[$__rate_interval]) * 100

Disk utilization interpretation:
0-40%   = disk has plenty of headroom
40-80%  = moderately busy
80-100% = disk is saturated — I/O bottleneck

Note that this metric tops out around 100% per device, so a pegged disk can hide a growing request queue; the weighted I/O time query below exposes it.
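
The weighted I/O time is the usual companion metric (a sketch): its rate approximates the average number of I/Os in flight, so values well above 1 mean requests are stacking up behind a busy disk.

# Approximate average I/O queue depth
rate(node_disk_io_time_weighted_seconds_total{instance="$node", job="$job", device=~"$disk"}[$__rate_interval])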

Disk I/O Wait Time
# Average read wait time (milliseconds)
rate(node_disk_read_time_seconds_total{instance="$node", job="$job"}[$__rate_interval])
/ rate(node_disk_reads_completed_total{instance="$node", job="$job"}[$__rate_interval])
* 1000

# Average write wait time (milliseconds)
rate(node_disk_write_time_seconds_total{instance="$node", job="$job"}[$__rate_interval])
/ rate(node_disk_writes_completed_total{instance="$node", job="$job"}[$__rate_interval])
* 1000
Latency   | Disk type           | Concern
----------|---------------------|--------------
< 1ms     | NVMe SSD            | ✅ Excellent
1-5ms     | SSD                 | ✅ Good
5-20ms    | SSD under load      | 🟡 Acceptable
20-100ms  | HDD or slow SSD     | 🔴 Poor
> 100ms   | Severely overloaded | 🔴 Critical

Section 5 — Network Panels

Network Traffic (Time Series)
 1 GB/s ┤              ╭──╮
        │  ▲ received  │  │
500MB/s ┤──╮────────╮──╯  ╰──╮
        │  │ ▼ sent │        │
      0 ┤──╯        ╰────────╯
           12:00        13:00
# Network received bytes/sec (per interface)
rate(node_network_receive_bytes_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

# Network transmitted bytes/sec
rate(node_network_transmit_bytes_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

# Convert to bits (×8) for bandwidth comparison
rate(node_network_receive_bytes_total{...}[$__rate_interval]) * 8
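
With the link speed you can turn this into a utilization percentage (a sketch; node_network_speed_bytes comes from the netclass collector and may be missing on virtual interfaces):

# Receive bandwidth as % of link capacity
rate(node_network_receive_bytes_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])
/ node_network_speed_bytes{instance="$node", job="$job", device=~"$nic"} * 100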

Network Errors and Drops
# Receive errors per second
rate(node_network_receive_errs_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

# Transmit errors per second
rate(node_network_transmit_errs_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

# Receive packet drops (buffer overflow)
rate(node_network_receive_drop_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

# Transmit packet drops
rate(node_network_transmit_drop_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

Any non-zero errors or drops indicate:

  • Receive drops — NIC buffer overflow, CPU can’t process packets fast enough
  • Transmit errors — bad cable, network congestion, NIC issue
  • Errors — hardware problem, duplex mismatch
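
Since the healthy baseline is zero, the alert condition is simple (a sketch, shown for receive drops; the other three counters work the same way):

# Any receive drops at all
rate(node_network_receive_drop_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval]) > 0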

Network Packets
# Packets received per second
rate(node_network_receive_packets_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

# Packets transmitted per second
rate(node_network_transmit_packets_total{instance="$node", job="$job", device=~"$nic"}[$__rate_interval])

Section 6 — System Panels

Open File Descriptors
# Current open file descriptors
node_filefd_allocated{instance="$node", job="$job"}

# System limit
node_filefd_maximum{instance="$node", job="$job"}

# Usage percentage
node_filefd_allocated{instance="$node", job="$job"}
/ node_filefd_maximum{instance="$node", job="$job"}
* 100

Running out of file descriptors = apps fail to open files or sockets. Common with high-connection services such as web servers and databases.


Processes

# Currently running processes (on CPU)
node_procs_running{instance="$node", job="$job"}

# Processes in uninterruptible sleep (D state — waiting for I/O)
node_procs_blocked{instance="$node", job="$job"}

# Running + blocked combined (not the total process count — most processes are sleeping)
node_procs_running{instance="$node", job="$job"} + node_procs_blocked{instance="$node", job="$job"}

High blocked processes = disk I/O bottleneck — processes stuck waiting for disk.


Systemd Failed Services

# Count of failed systemd services
# (requires the systemd collector, which is off by default: --collector.systemd)
count by(instance) (
  node_systemd_unit_state{instance="$node", job="$job", state="failed"} == 1
)

# Which services failed
node_systemd_unit_state{instance="$node", job="$job", state="failed"} == 1

Dashboard Variables (Dropdowns)

The dashboard uses template variables so you can switch between servers:

Variable: $node
Query: label_values(node_uname_info, instance)
→ Dropdown shows all connected servers

Variable: $job
Query: label_values(node_uname_info{instance="$node"}, job)
→ Filters by job name

Variable: $disk
Query: label_values(node_disk_io_time_seconds_total{instance="$node"}, device)
→ Dropdown of all disks (sda, sdb, nvme0n1, etc.)

Variable: $nic
Query: label_values(node_network_info{instance="$node"}, device)
→ Dropdown of all NICs (eth0, ens3, etc.), excluding lo, docker, veth, br
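
The exclusion is done with a regex inside the variable query's selector; one plausible form (a sketch, the exact regex in the published dashboard may differ):

Variable: $nic
Query: label_values(node_network_info{instance="$node",
       device!~"lo|docker.*|veth.*|br.*"}, device)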

Reading the Dashboard — What to Look For

Scenario 1: High CPU, low I/O wait
──────────────────────────────────
user%   ████████████ 85%  → App is CPU-bound
system  ██           10%
iowait  ░             2%
→ Scale up CPU or optimize app code

Scenario 2: High iowait, low user CPU
─────────────────────────────────────
user%   ███          20%
iowait  ████████████ 70%  → Disk is the bottleneck
→ Check the disk latency panel
→ Upgrade to SSD or optimize DB queries

Scenario 3: High memory, swap activity
──────────────────────────────────────
RAM used   ███████████████ 95%
Swap used  ████████        40%     → OOM risk
Swap I/O   ██              active  → Severe performance hit
→ Add RAM or reduce app memory usage

Scenario 4: Network drops increasing
────────────────────────────────────
RX drops  ████ increasing  → NIC buffer overflow
→ Tune net.core.rmem_max
→ Check if the CPU can keep up with IRQs

Scenario 5: Load > 1.0 per core, low CPU%
─────────────────────────────────────────
Load/core  1.8   → Processes queued
CPU user%  30%   → Not CPU-bound
iowait%    60%   → I/O queue is the bottleneck
→ Disk or network I/O is causing the queue

Import the Dashboard

# Method 1: Via Grafana UI
# Go to Dashboards → Import → Enter ID: 1860 → Load
# Select Prometheus datasource → Import

# Method 2: Via API
# Note: the "dashboard" field must contain the full dashboard JSON
# (e.g. downloaded from grafana.com as in Method 3), not just {"id": null}
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": {"id": null},
    "inputs": [{
      "name": "DS_PROMETHEUS",
      "pluginId": "prometheus",
      "type": "datasource",
      "value": "Prometheus"
    }],
    "overwrite": true
  }' \
  http://admin:password@localhost:3000/api/dashboards/import

# Method 3: Provision via file (GitOps)
# Download the dashboard JSON
curl https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o grafana/dashboards/node-exporter-full.json
# Grafana picks it up automatically via the provisioning config

Node Exporter Full is the single most useful starting point for Linux server monitoring — it gives you complete visibility into every layer of system performance from a single dashboard, with enough detail to diagnose almost any server issue without SSH-ing into the box.
