What you will achieve
Live server monitoring with htop for processes and iostat for disk throughput — quick triage before Prometheus/Grafana exists.
1) Install tools
sudo apt install htop sysstat
# Fedora: sudo dnf install htop sysstat
2) htop usage
htop
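F6 picks the sort column (CPU%, MEM%), F4 filters by name, F9 sends a signal to the selected process (careful: the default is SIGTERM). Press H to toggle display of user threads.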
3) iostat baseline
iostat -xz 1 5
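-x prints extended statistics and -z hides idle devices; the first report is the average since boot. Watch %util: values near 100% suggest the disk is saturated, although on SSDs and RAID arrays that serve requests in parallel the figure can overstate load. High await values (r_await and w_await on newer sysstat) mean requests are queuing and latency is suffering.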
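To pick out busy devices without reading every row, the %util check can be scripted. A minimal sketch, assuming a hypothetical helper named flag_saturated and a 90% cutoff chosen only for illustration. It finds the %util column by its header name, since column order differs between sysstat versions.

```shell
# Hypothetical helper: print "device %util" for devices at or above a threshold.
# Reads `iostat -x` output on stdin.
flag_saturated() {
    threshold="${1:-90}"   # assumed cutoff, tune against your own baseline
    awk -v limit="$threshold" '
        NF == 0   { col = 0; next }          # blank line ends a device table
        /^Device/ {                          # header row: locate %util
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit { print $1, $col }
    '
}

# Example (not run here):
#   iostat -xz 1 5 | flag_saturated 90
```

The helper only filters text, so it also works on a saved capture: flag_saturated 80 < /tmp/iostat-incident.log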
4) Combine with history
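sar -u 1 5 # CPU utilization, five 1-second samples
sar -d 1 5 # per-device I/O, five 1-second samples
With an interval and count, sar samples live. Run sar -u or sar -d without them to read today's recorded history once collection is enabled (step 7).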
Verify
Correlate high load with specific PIDs in htop and block devices in iostat — same time window.
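Lining up two captures is easier when every line carries a wall-clock timestamp. A minimal sketch, assuming a hypothetical helper named stamp; the log paths in the comments are examples only.

```shell
# Hypothetical helper: prefix each line on stdin with the current time (HH:MM:SS).
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$line"
    done
}

# Example (not run here), both captures covering the same 60 seconds:
#   iostat -xz 1 60 | stamp > /tmp/iostat-window.log &
#   top -b -d 1 -n 60 | stamp > /tmp/top-window.log
#   wait
```

If a tool buffers its output when writing to a pipe, the timestamps will lag behind the real sample time; prefixing the command with stdbuf -oL may help.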
5) Batch snapshots for logs
top -b -d 5 -n 3 > /tmp/top-snapshot.txt
htop has no batch mode, so use top -b for text snapshots: here three iterations, five seconds apart.
6) iotop permissions
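sudo apt install iotop
sudo iotop -ao
iotop needs root privileges, hence sudo. -o shows only processes actually doing I/O and -a accumulates totals since start, which quickly exposes a log spammer or runaway database.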
7) Persistent sysstat history
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat
sar -q # load average history
Baseline before incident
Capture normal iostat and htop during peak hours — without baseline you cannot tell if 40% disk util is normal or crisis.
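A baseline is most useful as a number you can compare against. A minimal sketch, assuming a hypothetical helper named mean_util that averages %util for one device over a saved iostat -x capture; the file path and device name in the comments are examples.

```shell
# Hypothetical helper: mean %util for one device across an `iostat -x` capture on stdin.
mean_util() {
    awk -v dev="$1" '
        NF == 0   { col = 0; next }          # blank line ends a device table
        /^Device/ {                          # header row: locate %util
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && $1 == dev { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
    '
}

# Example (not run here):
#   iostat -xz 1 60 > /tmp/iostat-baseline.log    # during a normal peak hour
#   mean_util sda < /tmp/iostat-baseline.log
```

Run the same helper on an incident capture and compare the two figures.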
8) atop for historical replay
sudo apt install atop
sudo systemctl enable --now atop
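Once the atop service has been logging for a while, a past time window can be replayed from its daily raw file. A minimal sketch, assuming the Debian/Ubuntu default log location /var/log/atop/atop_YYYYMMDD and a hypothetical helper named atop_replay_cmd that only prints the command; the accepted time format for -b and -e can differ between atop versions, so check man atop.

```shell
# Hypothetical helper: print the atop replay command for a date and time window.
# Assumes daily raw logs live in /var/log/atop/atop_YYYYMMDD (distro default, may differ).
atop_replay_cmd() {
    day="$1"; begin="$2"; end="$3"
    printf 'atop -r /var/log/atop/atop_%s -b %s -e %s\n' "$day" "$begin" "$end"
}

# Example (not run here):
#   atop_replay_cmd "$(date +%Y%m%d)" 14:00 14:30
```

Inside the replay, t steps forward one sample and T steps back.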
Prerequisites
SSH shell access. htop and sysstat packages. Baseline metrics documented. Optional: regular sar collection enabled. Know server role (web, DB, batch) to interpret CPU vs IO bottlenecks.
tmux for long captures
tmux new -s watch
htop
Detach with Ctrl-b d and reattach with tmux attach -t watch during long incidents. The session and its output survive an SSH drop.
glances alternative
sudo apt install glances
glances
Single TUI combining CPU, disk, and net: a quicker overview than switching between htop and iostat manually.
Recording during incident
script -c 'iostat -xz 1 300' /tmp/iostat-incident.log
Captures five minutes for the post-mortem. Attach the log to the ticket together with a text snapshot of process state, the plain-text equivalent of an htop screenshot.
node_exporter migration path
Manual htop/iostat triage eventually graduates to Prometheus node_exporter, which collects the same metrics automatically. Until then, sar history on sysstat-enabled hosts provides the data for post-incident graphs.
dstat combined view
sudo apt install dstat
dstat -cdngy 5
Single line of CPU, disk, and net stats every 5s; paste into the incident channel during an outage bridge call.
vmstat run queue
vmstat 1 5
Column 'b' counts processes blocked in uninterruptible sleep, usually waiting on I/O. When it rises in the same interval as iowait in iostat, that points to an I/O bottleneck rather than a CPU one.
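The 'b' column is easy to miss while scrolling, so a short summary can help. A minimal sketch, assuming a hypothetical helper named blocked_samples that reads vmstat output on stdin and locates the b and wa columns by header name.

```shell
# Hypothetical helper: summarize how many vmstat samples had blocked processes,
# plus the average of the "wa" (I/O wait) column over the same samples.
blocked_samples() {
    awk '
        $1 == "r" && $2 == "b" {             # column header row
            for (i = 1; i <= NF; i++) {
                if ($i == "b")  b = i
                if ($i == "wa") wa = i
            }
            next
        }
        b && $1 ~ /^[0-9]+$/ {               # data rows start with a number
            total++
            if ($b > 0) blocked++
            if (wa) wasum += $wa
        }
        END { printf "blocked=%d/%d avg_wa=%.1f\n", blocked, total, (total ? wasum / total : 0) }
    '
}

# Example (not run here):
#   vmstat 1 5 | blocked_samples
```

A high blocked count alongside a high avg_wa supports the I/O reading; a high blocked count with low wa deserves a closer look before drawing conclusions.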
pidstat context switches
pidstat -w 1 5
High cswch/s (voluntary context switches) on one PID can indicate lock contention, which calls for a different fix than CPU or I/O saturation.
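On a busy host pidstat prints many rows per interval, so sorting by cswch/s brings the outliers to the top. A minimal sketch, assuming a hypothetical helper named top_cswch; it lists the highest per-interval samples (not per-PID averages) and finds the PID and cswch/s columns by header name, since the timestamp format changes the field count between locales.

```shell
# Hypothetical helper: show the N highest cswch/s samples from `pidstat -w` output on stdin.
# Output columns: cswch/s, PID, command.
top_cswch() {
    awk '
        /cswch\/s/ {                         # header row: remember column positions
            for (i = 1; i <= NF; i++) {
                if ($i == "PID")     pid = i
                if ($i == "cswch/s") cs = i
            }
            next
        }
        $1 == "Average:" { next }            # skip the closing summary rows
        cs && NF > cs { print $cs, $pid, $NF }
    ' | sort -rn | head -n "${1:-5}"
}

# Example (not run here):
#   pidstat -w 1 5 | top_cswch 5
```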