Node Exporter Full Dashboard Explained
What is Node Exporter Full?
Node Exporter Full is the most popular Grafana dashboard (ID: 1860) for Linux server monitoring. It provides a comprehensive view of every hardware and OS metric collected by Node Exporter — over 30 panels covering CPU, memory, disk, network, and system metrics.
Dashboard Overview
```
┌────────────────────────────────────────────────────────────┐
│            NODE EXPORTER FULL — DASHBOARD LAYOUT            │
├────────────────────────────────────────────────────────────┤
│ [Server selector]  [Time range]  [Refresh interval]         │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│  Uptime  │CPU Cores │   RAM    │   SWAP   │    Root FS     │
│  (stat)  │  (stat)  │  (stat)  │  (stat)  │     (stat)     │
├──────────┴──────────┴──────────┴──────────┴────────────────┤
│                   CPU USAGE (time series)                   │
├─────────────────────────────┬──────────────────────────────┤
│      CPU Basic (gauge)      │    CPU Busy (time series)    │
├─────────────────────────────┼──────────────────────────────┤
│    Memory Basic (gauge)     │  Memory Usage (time series)  │
├─────────────────────────────┴──────────────────────────────┤
│                   DISK I/O (time series)                    │
├────────────────────────────────────────────────────────────┤
│                NETWORK TRAFFIC (time series)                │
├─────────────────────────────┬──────────────────────────────┤
│   Disk Space (bar gauge)    │ Network Errors (time series) │
└─────────────────────────────┴──────────────────────────────┘
```
Section 1 — Quick Stats Row (Top)
The top row shows current snapshot values at a glance:
```
┌──────────┬──────────┬──────────┬──────────┬──────────────┐
│  Uptime  │CPU Cores │   RAM    │   SWAP   │   Root FS    │
│ 45 days  │    8     │ 31.2 GB  │  2.0 GB  │    234 GB    │
└──────────┴──────────┴──────────┴──────────┴──────────────┘
```
Uptime
```
# How long the server has been running
(time() - node_boot_time_seconds{instance="$node", job="$job"})
```
Shows days, hours, minutes — quick health check. Server rebooted unexpectedly? Uptime drops.
CPU Cores
```
# Total logical CPU count
count(
  count by(cpu) (
    node_cpu_seconds_total{instance="$node", job="$job"}
  )
)
```
Total RAM
```
# Physical RAM in bytes
node_memory_MemTotal_bytes{instance="$node", job="$job"}
```
SWAP Total
```
# Total swap space
node_memory_SwapTotal_bytes{instance="$node", job="$job"}
```
High swap usage = memory pressure — app may be swapping pages to disk.
Root Filesystem
```
# Total size of root partition
node_filesystem_size_bytes{
  instance="$node",
  job="$job",
  mountpoint="/",
  fstype!="rootfs"
}
```
Section 2 — CPU Panels
CPU Basic (Gauge)
```
┌─────────────────────────────┐
│          CPU Basic          │
│                             │
│             67%             │
│      ████████████░░░░░      │
│   0%      [67%]       100%  │
│  Green < 50   Yellow < 80   │
│  Red > 80                   │
└─────────────────────────────┘
```
```
# Current overall CPU busy %
(1 - avg by(instance) (
  rate(node_cpu_seconds_total{
    instance="$node", job="$job", mode="idle"
  }[$__rate_interval])
)) * 100
```
Thresholds typically set at:
- 🟢 Green: 0-50%
- 🟡 Yellow: 50-80%
- 🔴 Red: 80-100%
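The same busy-CPU expression can also back a Prometheus alert. A minimal sketch, assuming alerting happens on the Prometheus side rather than in Grafana (so the $node/$job template variables are dropped); the 80% threshold and 5-minute window are illustrative values, not taken from the dashboard:

```
# Illustrative alert expression: sustained CPU busy above 80% per instance
(1 - avg by(instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
)) * 100 > 80
```

In an alerting rule you would normally pair this with a `for:` duration (for example 10m) so short spikes do not page anyone.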
CPU Usage Time Series
The most detailed CPU panel — shows how CPU time is being spent broken down by mode:
```
100% ┤                                    ┐
     │  ██ steal                          │
 80% ┤  ██ iowait                         │
     │  ██ irq/softirq                    │
 60% ┤  ██ system                         │
     │  ████████ user                     │
 40% ┤  ████████████████                  │
     │  ████████████████████ idle         │
  0% ┤────────────────────────────────────┘
       12:00          13:00          14:00
```
```
# CPU user time (app code)
avg by(instance) (
  rate(node_cpu_seconds_total{
    instance="$node", job="$job", mode="user"
  }[$__rate_interval])
) * 100

# CPU system time (kernel)
avg by(instance) (
  rate(node_cpu_seconds_total{
    instance="$node", job="$job", mode="system"
  }[$__rate_interval])
) * 100

# CPU iowait (waiting for disk I/O)
avg by(instance) (
  rate(node_cpu_seconds_total{
    instance="$node", job="$job", mode="iowait"
  }[$__rate_interval])
) * 100

# CPU steal (hypervisor stealing CPU from VM)
avg by(instance) (
  rate(node_cpu_seconds_total{
    instance="$node", job="$job", mode="steal"
  }[$__rate_interval])
) * 100

# CPU softirq (network interrupts, timers)
avg by(instance) (
  rate(node_cpu_seconds_total{
    instance="$node", job="$job", mode="softirq"
  }[$__rate_interval])
) * 100
```
Understanding CPU Modes
| Mode | Meaning | High value means |
|---|---|---|
| user | App/process code running | High app activity — normal |
| system | Kernel code running | Many syscalls, context switches |
| iowait | CPU idle waiting for I/O | Disk bottleneck |
| steal | Hypervisor stealing CPU | VM is being throttled by host |
| irq | Hardware interrupt handling | Network/disk interrupt storm |
| softirq | Software interrupt handling | High network traffic |
| idle | CPU doing nothing | Plenty of headroom |
| nice | Low-priority user processes | Background tasks running |
CPU Busy by Core
Shows individual core utilization — detects uneven load distribution:
```
# Per-core CPU usage
(1 - rate(node_cpu_seconds_total{
  instance="$node", job="$job", mode="idle"
}[$__rate_interval])) * 100
```
```
Core 0: ████████████████     78%
Core 1: ████                 22%
Core 2: ████████████████████ 95%  ← hot core
Core 3: ██                   10%
```
One hot core = single-threaded bottleneck. All cores high = genuinely compute-bound.
System Load Average
```
┌────────────────────────────────────────┐
│           System Load / CPU            │
│                                        │
│ 1.2 ┤        ╭──╮                      │
│ 0.8 ┤   ╭──╯    ╰──╮      1min load    │
│ 0.4 ┤───╯           ╰──   5min load    │
│ 0.0 ┤                 ──  15min load   │
└────────────────────────────────────────┘
```
```
# Load average normalized per CPU core
# (on(instance) is needed because the core count on the right-hand
#  side only carries the instance label)
node_load1{instance="$node", job="$job"}
/ on(instance) count by(instance) (
    node_cpu_seconds_total{instance="$node", job="$job", mode="idle"}
  )

node_load5{instance="$node", job="$job"}
/ on(instance) count by(instance) (...)

node_load15{instance="$node", job="$job"}
/ on(instance) count by(instance) (...)
```
Interpreting load average:
```
< 1.0 per core  = plenty of headroom
= 1.0 per core  = fully utilized, no queue
> 1.0 per core  = processes waiting for CPU
> 2.0 per core  = severe overload
```
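As a worked example: an 8-core server with node_load1 = 9.6 normalizes to 9.6 / 8 = 1.2 per core, so on average processes are queuing for CPU. A sketch of an alert on the normalized value, with an illustrative threshold of 1.5 (on(instance) group_left is needed because the core count only carries the instance label):

```
# Illustrative: 5-minute load exceeds 1.5x the number of cores
node_load5
/ on(instance) group_left
  count by(instance) (node_cpu_seconds_total{mode="idle"})
> 1.5
```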
Context Switches and Interrupts
```
# Context switches per second
rate(node_context_switches_total{
  instance="$node", job="$job"
}[$__rate_interval])

# Hardware interrupts per second
rate(node_intr_total{
  instance="$node", job="$job"
}[$__rate_interval])
```
High context switches = many processes competing for CPU, or lots of I/O-bound processes sleeping and waking.
Section 3 — Memory Panels
Memory Basic (Gauge)
```
┌──────────────────────────────┐
│         Memory Basic         │
│                              │
│             82%              │
│      ████████████████░░      │
│   0%      [82%]       100%   │
└──────────────────────────────┘
```
```
# RAM usage %
(1 - (
  node_memory_MemAvailable_bytes{instance="$node", job="$job"}
  /
  node_memory_MemTotal_bytes{instance="$node", job="$job"}
)) * 100
```
Note: this uses MemAvailable, not MemFree — MemAvailable includes reclaimable cache and buffers, which makes it a much better measure of the memory actually available to applications.
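To see the difference, compare a naive MemFree-based usage percentage with the MemAvailable-based one used by the panel; on a server with a warm filesystem cache the first looks alarmingly high while the second stays reasonable. A comparison sketch, not a dashboard panel:

```
# "Used" % if cache counts as used (misleadingly high on cache-heavy servers)
(1 - node_memory_MemFree_bytes{instance="$node", job="$job"}
   / node_memory_MemTotal_bytes{instance="$node", job="$job"}) * 100

# "Used" % if reclaimable memory counts as free (what the panel shows)
(1 - node_memory_MemAvailable_bytes{instance="$node", job="$job"}
   / node_memory_MemTotal_bytes{instance="$node", job="$job"}) * 100
```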
Memory Usage Breakdown (Time Series)
```
32GB ┤ ████████   Used (apps)
     │ ████       Buffers
     │ ██████████ Cached (filesystem cache)
     │ ████       Free
   0 ┤─────────────────────────────────────
```
```
# Actually used by apps (exclude cache/buffers)
node_memory_MemTotal_bytes{instance="$node", job="$job"}
- node_memory_MemFree_bytes{instance="$node", job="$job"}
- node_memory_Buffers_bytes{instance="$node", job="$job"}
- node_memory_Cached_bytes{instance="$node", job="$job"}

# Filesystem cache (reclaimable)
node_memory_Cached_bytes{instance="$node", job="$job"}

# Buffer cache (reclaimable)
node_memory_Buffers_bytes{instance="$node", job="$job"}

# Truly free
node_memory_MemFree_bytes{instance="$node", job="$job"}

# SWAP used
node_memory_SwapTotal_bytes{instance="$node", job="$job"}
- node_memory_SwapFree_bytes{instance="$node", job="$job"}
```
Understanding Memory Types
| Memory Type | Description | Concern level |
|---|---|---|
| Used | Active app memory | 🔴 High if > 80% of total |
| Cached | Linux filesystem cache | 🟢 Normal — reclaimable |
| Buffers | Disk write buffers | 🟢 Normal — reclaimable |
| Free | Completely unused | 🟢 Low is OK if cache is high |
| Available | Free + reclaimable | ✅ Best indicator of real free |
| SWAP Used | Memory paged to disk | 🔴 Any non-zero is a warning |
SWAP Activity
```
# Swap pages swapped in per second (bad — reading from disk)
rate(node_vmstat_pswpin{instance="$node", job="$job"}[$__rate_interval])

# Swap pages swapped out per second (bad — writing to disk)
rate(node_vmstat_pswpout{instance="$node", job="$job"}[$__rate_interval])
```
Sustained swap activity means the system is memory constrained — pages are being moved between RAM and disk, which causes severe performance degradation.
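A minimal alerting sketch on swap churn; the 100 pages/second threshold and 5-minute window are illustrative and should be tuned to the workload:

```
# Illustrative: pages are being swapped out to disk faster than ~100/s
rate(node_vmstat_pswpout[5m]) > 100
```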
Memory Pages
```
# Page faults per second
rate(node_vmstat_pgfault{
  instance="$node", job="$job"
}[$__rate_interval])

# Major page faults (require disk I/O — worse)
rate(node_vmstat_pgmajfault{
  instance="$node", job="$job"
}[$__rate_interval])
```
Section 4 — Disk Panels
Disk Space Used (Bar Gauge)
Shows all mounted filesystems and their usage:
```
/         ████████████████░░░░ 78%  (234 GB / 300 GB)
/data     ████████░░░░░░░░░░░░ 42%  (420 GB / 1 TB)
/var/log  ████████████████████ 96%  ← critical!
/boot     ████░░░░░░░░░░░░░░░░ 18%  (180 MB / 1 GB)
```
```
# Disk usage % per mountpoint
(1 - node_filesystem_avail_bytes{
       instance="$node", job="$job",
       fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"
     }
   / node_filesystem_size_bytes{
       instance="$node", job="$job",
       fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"
     }
) * 100
```
Disk I/O Time Series
```
200MB/s ┤             ╭──╮              ─ reads
        │             │  │              ─ writes
100MB/s ┤──────╮   ──╯    ╰──────╮
        │      │                 │
      0 ┤──────╯                 ╰───
          12:00             13:00
```
```
# Disk read throughput (bytes/sec)
rate(node_disk_read_bytes_total{
  instance="$node", job="$job", device=~"$disk"
}[$__rate_interval])

# Disk write throughput (bytes/sec)
rate(node_disk_written_bytes_total{
  instance="$node", job="$job", device=~"$disk"
}[$__rate_interval])

# Read IOPS (operations per second)
rate(node_disk_reads_completed_total{
  instance="$node", job="$job", device=~"$disk"
}[$__rate_interval])

# Write IOPS
rate(node_disk_writes_completed_total{
  instance="$node", job="$job", device=~"$disk"
}[$__rate_interval])
```
Disk I/O Utilization (Saturation)
```
# % of time disk was busy (saturation)
rate(node_disk_io_time_seconds_total{
  instance="$node", job="$job", device=~"$disk"
}[$__rate_interval]) * 100
```
Disk utilization interpretation:

```
0-40%   = disk has plenty of headroom
40-80%  = moderately busy
80-100% = disk is saturated — I/O bottleneck
Pinned at 100% = requests are queuing behind the device
                 (this metric cannot exceed 100%; check the wait-time panel below)
```
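A hedged alert sketch on the same saturation metric; the 90% threshold and 5-minute window are illustrative:

```
# Illustrative: a disk has been busy more than 90% of the time
rate(node_disk_io_time_seconds_total[5m]) * 100 > 90
```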
Disk I/O Wait Time
```
# Average read wait time (milliseconds)
rate(node_disk_read_time_seconds_total{instance="$node", job="$job"}[$__rate_interval])
/ rate(node_disk_reads_completed_total{instance="$node", job="$job"}[$__rate_interval])
* 1000

# Average write wait time (milliseconds)
rate(node_disk_write_time_seconds_total{instance="$node", job="$job"}[$__rate_interval])
/ rate(node_disk_writes_completed_total{instance="$node", job="$job"}[$__rate_interval])
* 1000
```
| Latency | Disk type | Concern |
|---|---|---|
| < 1ms | NVMe SSD | ✅ Excellent |
| 1-5ms | SSD | ✅ Good |
| 5-20ms | SSD under load | 🟡 Acceptable |
| 20-100ms | HDD or slow SSD | 🔴 Poor |
| > 100ms | Severely overloaded | 🔴 Critical |
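Since these latency expressions divide by the operation rate, they return nothing useful when a disk is idle; an alerting sketch that guards the denominator, with an illustrative 100 ms threshold:

```
# Illustrative: average read latency above 100 ms
# (the "> 0" filter drops idle disks instead of dividing by zero)
(
  rate(node_disk_read_time_seconds_total[5m])
  / (rate(node_disk_reads_completed_total[5m]) > 0)
) * 1000 > 100
```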
Section 5 — Network Panels
Network Traffic (Time Series)
```
 1 GB/s ┤               ╭──╮            ▲ received
        │               │  │
500MB/s ┤──╮────────╮──╯    ╰──╮        ▼ sent
        │  │        │          │
      0 ┤──╯        ╰──────────╯
           12:00            13:00
```
```
# Network received bytes/sec (per interface)
rate(node_network_receive_bytes_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])

# Network transmitted bytes/sec
rate(node_network_transmit_bytes_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])

# Convert to bits (×8) for bandwidth comparison
rate(node_network_receive_bytes_total{...}[$__rate_interval]) * 8
```
Network Errors and Drops
```
# Receive errors per second
rate(node_network_receive_errs_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])

# Transmit errors per second
rate(node_network_transmit_errs_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])

# Receive packet drops (buffer overflow)
rate(node_network_receive_drop_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])

# Transmit packet drops
rate(node_network_transmit_drop_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])
```
Any non-zero errors or drops indicate:
- Receive drops — NIC buffer overflow, CPU can’t process packets fast enough
- Transmit errors — bad cable, network congestion, NIC issue
- Errors — hardware problem, duplex mismatch
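An alerting sketch for sustained drops; the 1 packet/second threshold is illustrative, and the device exclusion regex is an assumption that mirrors the dashboard's NIC filter:

```
# Illustrative: sustained receive drops on non-virtual interfaces
rate(node_network_receive_drop_total{device!~"lo|veth.*|docker.*|br.*"}[5m]) > 1
```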
Network Packets
```
# Packets received per second
rate(node_network_receive_packets_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])

# Packets transmitted per second
rate(node_network_transmit_packets_total{
  instance="$node", job="$job", device=~"$nic"
}[$__rate_interval])
```
Section 6 — System Panels
Open File Descriptors
```
# Current open file descriptors
node_filefd_allocated{instance="$node", job="$job"}

# System limit
node_filefd_maximum{instance="$node", job="$job"}

# Usage percentage
node_filefd_allocated{instance="$node", job="$job"}
/ node_filefd_maximum{instance="$node", job="$job"}
* 100
```
Running out of file descriptors means apps fail to open files or sockets. This is common with high-connection services such as web servers and databases.
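A sketch of an early-warning alert before the limit is actually hit; the 80% threshold is illustrative:

```
# Illustrative: more than 80% of the file descriptor limit is in use
node_filefd_allocated / node_filefd_maximum * 100 > 80
```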
Processes
```
# Currently running processes (on CPU)
node_procs_running{instance="$node", job="$job"}

# Processes in uninterruptible sleep (D state — waiting for I/O)
node_procs_blocked{instance="$node", job="$job"}

# Running + blocked processes combined
node_procs_running{instance="$node", job="$job"}
+ node_procs_blocked{instance="$node", job="$job"}
```
High blocked processes = disk I/O bottleneck — processes stuck waiting for disk.
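A sketch of an alert that catches a pile-up of D-state processes; the threshold of 5 is illustrative, since a handful of briefly blocked processes is normal:

```
# Illustrative: several processes stuck in uninterruptible sleep
node_procs_blocked > 5
```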
Systemd Failed Services
```
# Count of failed systemd services
count by(instance) (
  node_systemd_unit_state{
    instance="$node", job="$job", state="failed"
  } == 1
)

# Which services failed
node_systemd_unit_state{
  instance="$node", job="$job", state="failed"
} == 1
```

These metrics only exist if Node Exporter is started with the systemd collector enabled (--collector.systemd); it is off by default.
Dashboard Variables (Dropdowns)
The dashboard uses template variables so you can switch between servers:
```
Variable: $node
Query:    label_values(node_uname_info, instance)
→ Dropdown shows all connected servers

Variable: $job
Query:    label_values(node_uname_info{instance="$node"}, job)
→ Filters by job name

Variable: $disk
Query:    label_values(node_disk_io_time_seconds_total{instance="$node"}, device)
→ Dropdown of all disks (sda, sdb, nvme0n1, etc.)

Variable: $nic
Query:    label_values(node_network_info{instance="$node"}, device)
→ Dropdown of all NICs (eth0, ens3, etc.), excluding lo, docker, veth, br
```
Reading the Dashboard — What to Look For
```
Scenario 1: High CPU, low I/O wait
──────────────────────────────────
user%    ████████████ 85%   → App is CPU-bound
system   ██           10%
iowait   ░             2%
→ Scale up CPU or optimize app code

Scenario 2: High iowait, low user CPU
─────────────────────────────────────
user%    ███          20%
iowait   ████████████ 70%   → Disk is bottleneck
→ Check disk latency panel
→ Upgrade to SSD or optimize DB queries

Scenario 3: High memory, swap activity
──────────────────────────────────────
RAM used  ███████████████ 95%
Swap used ████████        40%     → OOM risk
Swap I/O  ██              active  → Severe performance hit
→ Add RAM or reduce app memory usage

Scenario 4: Network drops increasing
────────────────────────────────────
RX drops  ████ increasing  → NIC buffer overflow
→ Tune net.core.rmem_max
→ Check if CPU can keep up with IRQs

Scenario 5: Load > 1.0 per core, low CPU%
─────────────────────────────────────────
Load/core  1.8   → Processes queued
CPU user%  30%   → Not CPU-bound
iowait%    60%   → I/O queue is the bottleneck
→ Disk or network I/O is causing the queue
```
Import the Dashboard
```
# Method 1 — Via Grafana UI
# Go to Dashboards → Import → Enter ID: 1860 → Load
# Select Prometheus datasource → Import

# Method 2 — Via API
# ("dashboard" should contain the full dashboard JSON, e.g. the file
#  downloaded in Method 3; {"id": null} alone imports an empty dashboard)
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": {"id": null},
    "inputs": [{
      "name": "DS_PROMETHEUS",
      "pluginId": "prometheus",
      "type": "datasource",
      "value": "Prometheus"
    }],
    "overwrite": true
  }' \
  http://admin:password@localhost:3000/api/dashboards/import

# Method 3 — Provision via file (GitOps)
# Download dashboard JSON
curl https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o grafana/dashboards/node-exporter-full.json
# Grafana picks it up automatically via provisioning config
```
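For Method 3, the provisioning config is a small YAML file that points Grafana at the directory holding the downloaded JSON. A minimal sketch, assuming the dashboard JSON lives in /var/lib/grafana/dashboards and the file sits under /etc/grafana/provisioning/dashboards/ (both paths are examples; adjust to your install):

```
# /etc/grafana/provisioning/dashboards/node-exporter.yaml  (example path)
apiVersion: 1
providers:
  - name: node-exporter-dashboards       # arbitrary provider name
    type: file
    updateIntervalSeconds: 30            # how often Grafana rescans the folder
    options:
      path: /var/lib/grafana/dashboards  # folder containing node-exporter-full.json
```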
Node Exporter Full is the single most useful starting point for Linux server monitoring — it gives you complete visibility into every layer of system performance from a single dashboard, with enough detail to diagnose almost any server issue without SSH-ing into the box.