What you will achieve
Interpret load average and track down CPU, I/O, or lock contention causing sluggish servers — without rebooting blindly.
1) Read load in context
uptime
nproc
top -b -n1 | head -20
Load of 8 on a 4-core box means contention; on 16 cores it may be fine.
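One way to put the load figure in context is to divide it by the core count. The sketch below assumes Linux (/proc/loadavg) and uses a hypothetical helper name, load_per_core. Values well above 1.00 suggest queued work, though on Linux the load average also counts tasks in uninterruptible sleep, so a high value is not always CPU demand.

```shell
# Hypothetical helper: normalise a load average by the number of cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live reading: 1-minute load from /proc/loadavg, core count from nproc.
if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    echo "load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

With the numbers from the text, load_per_core 8 4 prints 2.00 and load_per_core 8 16 prints 0.50.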
2) CPU vs I/O wait
sudo apt install sysstat
iostat -xz 1 5
pidstat -d 1 5
High %iowait in the iostat avg-cpu line (shown as wa in top) → disk bottleneck. High %user or %system → check ps aux --sort=-%cpu | head.
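If sysstat cannot be installed during an incident, the same iowait share can be approximated from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); iowait_pct and sample_iowait are hypothetical helper names.

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal; guest time is already counted inside user/nice.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
        }
        NR == 1 { t0 = total; w0 = $6 }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * ($6 - w0) / dt
            else print "0.0"
        }'
}

# Take two live samples N seconds apart (default 1) and report iowait %.
sample_iowait() {
    s1=$(grep '^cpu ' /proc/stat)
    sleep "${1:-1}"
    s2=$(grep '^cpu ' /proc/stat)
    iowait_pct "$s1" "$s2"
}
```

Run sample_iowait 5 for a five-second window. A persistently high figure points toward storage rather than CPU.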
3) Specific offenders
systemctl list-units --state=running
sudo iotop -o
4) Mitigate
- Restart runaway service after identifying root cause in logs.
- Add swap or fix memory leak if OOM killer is thrashing.
- Schedule heavy cron jobs off-peak.
Verify
uptime
iostat -c 1 3
5) Zombie processes
ps aux | awk '$8 ~ /Z/ {print}'
Zombies indicate a parent that is not reaping its children. A zombie is already dead and cannot be killed; restart or fix the parent service instead.
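Since the fix is applied to the parent, it helps to list the parent PIDs directly. The sketch below assumes procps ps; zombie_parents and show_zombie_parents are hypothetical helper names.

```shell
# Hypothetical helper: read "PID PPID STAT COMMAND" rows on stdin and print
# the unique parent PIDs of zombie processes, in numeric order.
zombie_parents() {
    awk '$3 ~ /^Z/ { print $2 }' | sort -un
}

# Show the PID and command name of every process that owns a zombie.
show_zombie_parents() {
    ps -eo pid=,ppid=,stat=,comm= | zombie_parents | while read -r ppid; do
        ps -o pid=,comm= -p "$ppid" || true
    done
}
```

Call show_zombie_parents and check the listed services in their logs before restarting them.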
6) Memory pressure
free -h
vmstat 1 5
High swap churn (non-zero si/so columns in vmstat) with low available RAM: add memory or fix the leak. A negative oom_score_adj makes the OOM killer less likely to pick a critical daemon.
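The "free" column alone can mislead because page cache is reclaimable. A rough sketch of a more useful figure, assuming a kernel that exposes MemAvailable in /proc/meminfo; mem_available_pct is a hypothetical helper name.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}

# Live reading on Linux.
if [ -r /proc/meminfo ]; then
    echo "available memory: $(mem_available_pct < /proc/meminfo)%"
fi
```

A low percentage together with sustained si/so activity in vmstat is the pattern described above.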
7) IRQ saturation
cat /proc/interrupts
mpstat -I SUM 1 3
10 Gbit NIC on single queue can spike softirq — consider RPS/RFS or better NIC drivers.
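To see whether one CPU is taking most of a device's interrupts, the per-CPU columns of /proc/interrupts can be summed for matching rows. This is a sketch that assumes the usual layout (a header row of CPU names, then one row per IRQ); irq_per_cpu is a hypothetical helper and eth0 is a placeholder interface name.

```shell
# Hypothetical helper: read /proc/interrupts-style text on stdin and print the
# per-CPU interrupt totals for rows matching the given pattern.
irq_per_cpu() {
    awk -v pat="$1" '
        NR == 1 { ncpu = NF; next }            # header row: CPU0 CPU1 ...
        $0 ~ pat { for (i = 1; i <= ncpu; i++) sum[i] += $(i + 1) }
        END { for (i = 1; i <= ncpu; i++) printf "CPU%d %d\n", i - 1, sum[i] }'
}

# Live usage (replace eth0 with the real interface or driver name):
#   irq_per_cpu eth0 < /proc/interrupts
```

If nearly all counts land on one CPU, the queue layout or IRQ affinity is worth checking before tuning RPS/RFS.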
When load is acceptable
Batch jobs intentionally peg CPU — load 32 on 32 cores during ffmpeg transcode is fine. Context matters more than absolute numbers.
8) Transparent huge pages and databases
PostgreSQL and MongoDB docs often recommend disabling THP — check vendor tuning guides if DB is the CPU hog under load.
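Before changing anything, it is worth confirming the current THP mode. The kernel reports it in /sys/kernel/mm/transparent_hugepage/enabled with the active value in square brackets; thp_active below is a hypothetical helper that extracts it.

```shell
# Hypothetical helper: print the bracketed (active) value from text such as
# "always [madvise] never" read on stdin.
thp_active() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Live reading on Linux.
if [ -r /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo "THP mode: $(thp_active < /sys/kernel/mm/transparent_hugepage/enabled)"
fi
```

Compare the result against the vendor tuning guide for the database version in use.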
Prerequisites
SSH access during slowness, sysstat and htop installed, baseline knowledge of normal load for this host role. Change window if restart required.
Save evidence before kill -9
ps aux --sort=-%cpu | head -20 > /tmp/top-cpu.txt
sudo perf record -a -g -o /tmp/perf.data -- sleep 5
Supports post-mortem after killing runaway process.
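The snapshot can be bundled into one timestamped directory so nothing is overwritten between incidents. A minimal sketch, assuming procps ps and a writable /tmp; save_evidence and the file names are hypothetical.

```shell
# Hypothetical helper: write a small evidence bundle into a timestamped
# directory under the given base (default /tmp) and print the directory path.
save_evidence() {
    dir="${1:-/tmp}/evidence-$(date +%Y%m%dT%H%M%S)"
    mkdir -p "$dir" || return 1
    ps aux --sort=-%cpu | head -20 > "$dir/top-cpu.txt"
    ps aux --sort=-%mem | head -20 > "$dir/top-mem.txt"
    cp /proc/loadavg "$dir/loadavg.txt" 2>/dev/null || true
    echo "$dir"
}
```

Run save_evidence first, then kill the process; the printed path can be attached to the incident ticket.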
Blocked processes (D state)
ps aux | awk '$8 ~ /^D/'
Uninterruptible sleep usually means I/O wait on NFS or a dying disk. Killing the process does not work; fix the storage.
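Grouping D-state processes by the kernel function they are waiting in can hint at which storage path is stuck. The sketch below assumes procps ps with the wchan column; blocked_by_wchan is a hypothetical helper name.

```shell
# Hypothetical helper: read "PID STAT WCHAN COMMAND" rows on stdin and print a
# count of D-state processes per wait channel, most frequent first.
blocked_by_wchan() {
    awk '$2 ~ /^D/ { count[$3]++ }
         END { for (w in count) print count[w], w }' | sort -rn
}

# Live usage (the header row is ignored because STAT does not start with D):
#   ps -eo pid,stat,wchan:32,comm | blocked_by_wchan
```

Many processes stuck in the same NFS or block-layer function usually point at one mount or device.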
cgroup v2 pressure
cat /proc/pressure/cpu
cat /proc/pressure/io
PSI metrics (Linux 4.20+ with PSI enabled) quantify resource pressure better than load average alone. Integrate them with monitoring before load spikes become outages.
Kubernetes nodes showing high load may be under kubelet or eviction pressure. Check kubectl top nodes separately from host uptime.
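Each PSI file holds a "some" line and usually a "full" line with avg10, avg60, avg300, and total fields. For feeding a single number to monitoring, a small parser is enough; psi_value below is a hypothetical helper, not part of any standard tool.

```shell
# Hypothetical helper: read a /proc/pressure/* file on stdin and print one
# field, e.g. "psi_value some avg10".
psi_value() {
    awk -v kind="$1" -v key="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }'
}

# Live reading, only where the kernel exposes PSI.
if [ -r /proc/pressure/io ]; then
    echo "io some avg10: $(psi_value some avg10 < /proc/pressure/io)"
fi
```

The avg10 value is the percentage of the last 10 seconds in which at least one task was stalled on that resource.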
Nice and ionice batch jobs
nice -n 19 ionice -c3 batch.sh
Lowers the batch job's impact during business hours without cancelling it. Pair with cgroup CPUWeight on the systemd service for finer control.
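For cron entries and ad hoc runs, the prefix can be wrapped so it is applied consistently. This is only a sketch: run_low_priority and the DRY_RUN variable are hypothetical, and ./batch.sh stands in for the real job.

```shell
# Hypothetical wrapper: run a command at the lowest CPU priority and in the
# idle I/O class. With DRY_RUN=1 it only prints the command it would run.
run_low_priority() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "nice -n 19 ionice -c3 $*"
    else
        nice -n 19 ionice -c3 "$@"
    fi
}
```

For example, DRY_RUN=1 run_low_priority ./batch.sh prints the full command line for review before the job is scheduled.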