name: server-health-audit
category: devops
description: Safe server environment audit and optimization. Checks services, cleans stale configs, removes dead cron jobs, frees disk space. Never breaks running services.
Server Health Audit — Safe Optimization Workflow
Trigger
When asked to check/optimize the server, or before making significant changes to a production environment.
Phase 1: Reconnaissance (Read-Only)
# System basics
uname -a
uptime
free -h
df -h
systemctl list-units --state=failed 2>/dev/null
# Running services
systemctl list-units --type=service --state=running --no-pager
# Cron jobs
crontab -l 2>/dev/null
ls /etc/cron.d/ 2>/dev/null
# Docker state
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Size}}"
docker system df
# Process memory (top 10)
ps aux --sort=-%mem | head -11
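The read-only checks above can be captured in a timestamped file so that Phase 4 has a "before" state to compare against. This is a minimal sketch: the `snapshot` helper and the output directory are hypothetical names, not part of the original workflow.

```shell
# Hypothetical helper: run read-only commands and append their output to one file.
# The function name and output directory are illustrative assumptions.
snapshot() {
  local outdir="$1"; shift
  local outfile="$outdir/audit-$(date +%F-%H%M%S).txt"
  mkdir -p "$outdir" || return 1
  local cmd
  for cmd in "$@"; do
    {
      printf '### %s\n' "$cmd"
      # A failing command is recorded instead of aborting the audit.
      sh -c "$cmd" 2>&1 || printf '(command failed or unavailable)\n'
      printf '\n'
    } >> "$outfile"
  done
  # Print the path of the snapshot file so callers can reference it later.
  printf '%s\n' "$outfile"
}
```

A possible call would be `snapshot /root/audit "uname -a" "uptime" "free -h" "df -h"`. Only pass commands that do not modify the system, since this phase is read-only.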
Phase 2: Identify Issues
Look for:
1. Failed/restart-looping services — systemctl list-units --state=failed, check journalctl -u <service> --no-pager -n 20
2. Stale cron jobs — paths that no longer exist: crontab -l | grep -v "^#" | awk '{print $NF}' | while read f; do [ -e "$f" ] || echo "MISSING: $f"; done
3. Dead Docker images — docker images | grep '<none>'
4. Stale config entries — check OpenClaw config for disabled plugins not cleaned from entries/installs/allow lists
5. Temp/backups — ls -lh /tmp/ /var/tmp/ 2>/dev/null | head -20
6. Conflicting services — nginx vs Caddy on same ports
7. Large directories — du -sh /root/* /var/www/* /opt/* 2>/dev/null | sort -rh | head -20
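The stale cron check in item 2 only looks at the last field of each line, which is often a redirect target or a flag and not the script itself. One possible variant checks every absolute path on each active line. It is a sketch under stated assumptions: the function name `find_missing_cron_paths` is hypothetical, and paths are assumed to be unquoted and free of spaces.

```shell
# Reads crontab text on stdin and prints "MISSING: <path>" for every
# absolute-path token that does not exist on disk.
# Assumption: paths contain no spaces and are not quoted.
find_missing_cron_paths() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' |  # drop comments and blank lines
  tr -s '[:space:]' '\n' |                           # one token per line
  grep '^/' |                                        # keep tokens that look like absolute paths
  sort -u |
  while IFS= read -r path; do
    [ -e "$path" ] || printf 'MISSING: %s\n' "$path"
  done
}
```

Usage would be `crontab -l 2>/dev/null | find_missing_cron_paths`. Log files that a job creates on its first run can show up as false positives, so treat the output as candidates to review, not as a delete list.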
Phase 3: Safe Optimizations
MUST follow these rules:
- ✅ Safe: disable failed services, clean dead cron entries, remove stale config references, delete old backups in /tmp
- ❌ Never: restart running services, modify active configs without confirmation, delete data directories, change ports
- ⚠️ Always: back up configs before changing them (cp config.json config.json.bak.$(date +%F))
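The backup rule can be wrapped in a small helper so that a same-day backup is never silently overwritten. This is a sketch only: `backup_config` is a hypothetical name, and the `.bak.<date>` suffix follows the `cp` example in the rule.

```shell
# Hypothetical helper: copy a config file to <file>.bak.<YYYY-MM-DD> before editing it.
# Refuses to overwrite a backup that already exists for today.
backup_config() {
  local src="$1"
  local dest="$src.bak.$(date +%F)"
  if [ ! -f "$src" ]; then
    printf 'no such file: %s\n' "$src" >&2
    return 1
  fi
  if [ -e "$dest" ]; then
    printf 'backup already exists: %s\n' "$dest" >&2
    return 1
  fi
  # -p preserves mode and timestamps so the backup can be restored as-is.
  cp -p -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

For example, `backup_config openclaw.json` would print the backup path on success and return non-zero if the copy was not made, which lets a script stop before editing.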
Specific actions:
# Disable restart-looping service
systemctl disable --now <service>
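# Clean dead cron jobs (back up first, then pipe a filtered crontab back in; don't use -r)
crontab -l > crontab.bak.$(date +%F)
# <dead-entry> is a fixed string that identifies the line to drop
crontab -l | grep -vF '<dead-entry>' | crontab -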
# Remove stale OpenClaw plugin entries from openclaw.json
# Edit: entries, installs, allow lists
# Clean Docker dangling images
docker image prune -f
# Clean temp backups older than 7 days
find /tmp -name 'backup-*' -mtime +7 -delete
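Because `-delete` is irreversible, it can help to preview the matches first. The sketch below is an assumption-laden variant, not the original command: `list_old_files` is a hypothetical name, and it restricts the search to regular files directly inside the directory, which is narrower than the `find` line above.

```shell
# Hypothetical dry-run helper: list regular files in a directory whose names
# match a pattern and whose modification time is more than N days old.
# Nothing is deleted. -maxdepth is supported by GNU and BSD find.
list_old_files() {
  local dir="$1" pattern="$2" days="$3"
  find "$dir" -maxdepth 1 -type f -name "$pattern" -mtime +"$days" -print
}
```

Running `list_old_files /tmp 'backup-*' 7` and reviewing the output before the `-delete` step gives a record of what will be removed, which can also go into the Phase 4 report.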
Phase 4: Report
Always output:
- Current system state (CPU/mem/disk/uptime)
- Running services table
- What was fixed (with before/after if applicable)
- What was NOT touched and why
- Open issues requiring user confirmation
Pitfalls
- OpenClaw config hot-reload: OpenClaw detects config changes but waits until the current task completes before restarting. If the restart fails, config backups are in /etc/lighthouse/openclaw/bak/
- Don't kill Docker containers: OpenWebUI and OpenClaw Gateway run in Docker. Always use systemctl, not kill.
- Cron: use crontab -, not crontab -r. -r deletes the current user's entire crontab without asking for confirmation. Always pipe replacement content.
- Memory is tight on 3-4GB servers: OpenWebUI (~659MB) + OpenClaw (~654MB) + Hermes (~272MB) = ~1.6GB baseline. Leave at least 500MB free.
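The memory arithmetic in the last pitfall can be checked with a few lines of shell. This is a sketch with hypothetical helper names (`sum_mb`, `has_headroom`, `available_mb`). The per-service figures are the approximate values quoted above, and `available_mb` assumes a Linux host that exposes `MemAvailable` in /proc/meminfo.

```shell
# Sum per-service memory figures (in MB) passed as arguments.
sum_mb() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  printf '%s\n' "$total"
}

# Succeeds if available memory (MB, first argument) is at least the
# required headroom (MB, second argument).
has_headroom() {
  [ "$1" -ge "$2" ]
}

# Available memory in MB, read from /proc/meminfo (Linux only; value is in kB).
available_mb() {
  awk '/^MemAvailable:/ {print int($2 / 1024)}' /proc/meminfo
}
```

`sum_mb 659 654 272` gives 1585, which matches the ~1.6GB baseline. A guard such as `has_headroom "$(available_mb)" 500 || echo "less than 500MB free"` could run before any step that starts an additional service.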