name: server-health-audit
category: devops
description: Safe server environment audit and optimization. Checks services, cleans stale configs, removes dead cron jobs, frees disk space. Never breaks running services.

Server Health Audit — Safe Optimization Workflow

Trigger

When asked to check/optimize the server, or before making significant changes to a production environment.

Phase 1: Reconnaissance (Read-Only)

# System basics
uname -a
uptime
free -h
df -h
systemctl list-units --state=failed 2>/dev/null

# Running services
systemctl list-units --type=service --state=running --no-pager

# Cron jobs
crontab -l 2>/dev/null
ls /etc/cron.d/ 2>/dev/null

# Docker state
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Size}}"
docker system df

# Process memory (top 10)
ps aux --sort=-%mem | head -11
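
The read-only checks above can be wrapped so that a missing tool (for example docker on a host without it) is noted instead of aborting the audit. A minimal sketch; run_check is a hypothetical helper name, and the report path in the usage comment is an assumption:

```shell
# Hypothetical helper: run one read-only command and label its output.
# A command that is not installed is reported as skipped, not treated as fatal.
run_check() {
  local label="$1"
  shift
  printf '== %s ==\n' "$label"
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'SKIPPED: %s not installed\n\n' "$1"
    return 0
  fi
  # Show a non-zero exit status, but keep the audit going.
  "$@" 2>&1 || printf '(exit status %d)\n' "$?"
  printf '\n'
}

# Example use (report path is an assumption):
#   { run_check "Uptime" uptime; run_check "Disk" df -h; } > "/tmp/audit-$(date +%F).txt"
```

Keeping every Phase 1 command behind one wrapper also makes it easy to confirm that nothing in this phase writes to the system.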

Phase 2: Identify Issues

Look for:
1. Failed or restart-looping services: systemctl list-units --state=failed, then journalctl -u <service> --no-pager -n 20 for each one.
2. Stale cron jobs (paths that no longer exist): crontab -l | grep -v "^#" | awk '{print $NF}' | while read f; do [ -e "$f" ] || echo "MISSING: $f"; done. This checks only the last field of each entry, so treat hits as candidates to verify by hand, not as proof.
3. Dead Docker images: docker images | grep '<none>'
4. Stale config entries: check the OpenClaw config for disabled plugins that were not cleaned from the entries/installs/allow lists.
5. Temp files and backups: ls -lh /tmp/ /var/tmp/ 2>/dev/null | head -20
6. Conflicting services: for example nginx and Caddy bound to the same ports.
7. Large directories: du -sh /root/* /var/www/* /opt/* 2>/dev/null | sort -rh | head -20
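
The stale cron check in item 2 looks at the last field of each line, which is often an argument or a redirect instead of the script itself. A possible alternative, sketched under the assumption that entries use the standard five-field or @keyword schedule format; find_missing_cron_targets is a hypothetical name:

```shell
# Hypothetical helper: read crontab text on stdin and report entries whose
# executable is an absolute path that no longer exists.
# Commands resolved through PATH (e.g. "echo") are not checked.
find_missing_cron_targets() {
  local f1 f2 f3 f4 f5 cmd exe
  while read -r f1 f2 f3 f4 f5 cmd || [ -n "$f1" ]; do
    case "$f1" in
      ''|'#'*) continue ;;                 # blank line or comment
      *=*)     continue ;;                 # environment line such as PATH=/usr/bin
      @*)      exe=$f2 ;;                  # @daily-style entry: command is field 2
      *)       exe=${cmd%%[[:space:]]*} ;; # five schedule fields, then the command
    esac
    case "$exe" in
      /*) [ -e "$exe" ] || printf 'MISSING: %s\n' "$exe" ;;
    esac
  done
}

# Example use:
#   crontab -l 2>/dev/null | find_missing_cron_targets
```

This is still read-only. It only reports candidates; removing an entry belongs to Phase 3.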

Phase 3: Safe Optimizations

MUST follow these rules:
- ✅ Safe: disable failed services, clean dead cron entries, remove stale config references, delete old backups in /tmp
- ❌ Never: restart running services, modify active configs without confirmation, delete data directories, change ports
- ⚠️ Always: back up configs before changing them (cp config.json config.json.bak.$(date +%F))
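
The "always back up" rule can be made harder to forget by putting it in a small function. A minimal sketch, assuming a dated suffix as in the cp example above; backup_config is a hypothetical name, and a second run on the same day overwrites that day's backup:

```shell
# Hypothetical helper: copy a config file to <file>.bak.<YYYY-MM-DD> before
# editing it, and print the backup path. Fails if the source does not exist.
backup_config() {
  local src="$1" dest
  if [ ! -f "$src" ]; then
    printf 'backup_config: %s not found\n' "$src" >&2
    return 1
  fi
  dest="$src.bak.$(date +%F)"
  # -p keeps mode and timestamps so the backup can be restored as-is.
  cp -p -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Example use:
#   backup_config openclaw.json && "${EDITOR:-vi}" openclaw.json
```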

Specific actions:

# Disable restart-looping service
systemctl disable --now <service>

# Clean dead cron jobs: back up, filter out the dead entry, reinstall the rest (don't use -r)
crontab -l > ~/crontab.bak.$(date +%F)
crontab -l | grep -vF '<dead-entry>' | crontab -

# Remove stale OpenClaw plugin entries from openclaw.json
# Edit: entries, installs, allow lists

# Clean Docker dangling images
docker image prune -f

# Clean temp backups older than 7 days
find /tmp -name 'backup-*' -mtime +7 -delete
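
Because -delete is irreversible, it can help to list what would be removed before removing it. A minimal sketch, assuming GNU find; clean_old_backups is a hypothetical name, and unlike the command above it is limited to regular files directly inside the given directory (-maxdepth 1 -type f):

```shell
# Hypothetical helper: preview old backups by default, delete only when the
# second argument is "delete". Prints every matching path in both modes.
clean_old_backups() {
  local dir="$1" mode="${2:-preview}"
  if [ "$mode" = "delete" ]; then
    find "$dir" -maxdepth 1 -type f -name 'backup-*' -mtime +7 -print -delete
  else
    find "$dir" -maxdepth 1 -type f -name 'backup-*' -mtime +7 -print
  fi
}

# Example use:
#   clean_old_backups /tmp          # preview only
#   clean_old_backups /tmp delete   # remove what the preview listed
```

The preview output doubles as the "before" half of the before/after comparison in the report.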

Phase 4: Report

Always output:
- Current system state (CPU/mem/disk/uptime)
- Running services table
- What was fixed (with before/after if applicable)
- What was NOT touched and why
- Open issues requiring user confirmation

Pitfalls

- OpenClaw config hot-reload: OpenClaw detects config changes but waits until the current task completes before restarting. If the restart fails, config backups are in /etc/lighthouse/openclaw/bak/
- Don't kill Docker containers: OpenWebUI and OpenClaw Gateway run in Docker. Always use systemctl, not kill.
- Cron: use crontab -, not crontab -r: -r deletes the current user's entire crontab without asking for confirmation (system files such as /etc/crontab and /etc/cron.d/ are not affected). Always pipe the full replacement content into crontab -.
- Memory is tight on 3-4GB servers: OpenWebUI (~659MB) + OpenClaw (~654MB) + Hermes (~272MB) = ~1.6GB baseline. Leave at least 500MB free.
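
The 500MB headroom rule can be checked before and after any change. A minimal sketch, assuming a Linux host that exposes MemAvailable in /proc/meminfo; check_mem_headroom is a hypothetical name, and the threshold default simply mirrors the figure above:

```shell
# Hypothetical helper: print available memory in MB and return non-zero when
# it is below the minimum (default 500). The meminfo path is a parameter so
# the function can be pointed at a saved copy.
check_mem_headroom() {
  local min_mb="${1:-500}" meminfo="${2:-/proc/meminfo}" avail_kb avail_mb
  avail_kb=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
  if [ -z "$avail_kb" ]; then
    printf 'MemAvailable not found in %s\n' "$meminfo" >&2
    return 2
  fi
  avail_mb=$((avail_kb / 1024))
  printf 'available: %d MB (minimum %d MB)\n' "$avail_mb" "$min_mb"
  [ "$avail_mb" -ge "$min_mb" ]
}

# Example use:
#   check_mem_headroom || echo "WARNING: below 500MB headroom, do not start anything new"
```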