
Incident runbook

What to check, in order, when something breaks. Each section is intentionally short: print it out and keep it where the on-call engineer can find it at 02:00. Severities below match what you’d use in a Jira / Linear ticket.

0. Before anything — confirm the failure

  • Reproduce from a second device / browser / IP. ~10% of reported “outages” are a single laptop’s DNS or VPN.
  • Check status.cloudnx.in for platform-side incidents.
  • Check your audit logs for the last action — many incidents start with “I changed X then it stopped working”.
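
The "single laptop’s DNS" case above can be ruled out quickly by comparing the local resolver’s answer with a public one. This is a minimal sketch, assuming dig is installed (dnsutils / bind-utils) and using 1.1.1.1 purely as an example public resolver; the function names are hypothetical. Hostnames behind a CDN or round-robin DNS can legitimately return different addresses, so treat a mismatch as a hint, not proof.

```shell
#!/usr/bin/env bash
# Compare what this machine resolves with what a public resolver returns.

# Collapse a list of answers (one per line, on stdin) into a sorted,
# de-duplicated, space-separated string so two answers can be compared.
normalise_answers() {
  sort -u | paste -sd ' ' -
}

# compare_dns HOST: print "match" or "MISMATCH" with both answer sets.
compare_dns() {
  local host="$1" local_ans public_ans
  local_ans="$(dig +short "$host" | normalise_answers)"
  public_ans="$(dig +short "$host" @1.1.1.1 | normalise_answers)"
  if [ "$local_ans" = "$public_ans" ]; then
    echo "match: $local_ans"
  else
    echo "MISMATCH local=[$local_ans] public=[$public_ans]"
  fi
}
```

Run compare_dns with the hostname your users hit. An empty local answer next to a populated public one points at the reporter’s resolver or VPN rather than at the platform.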

P1 — VM unreachable / production down

Symptom: users report 502 / connection refused / timeout.

  1. Is the VM running? Portal → Instances. Status should be running. If it is stopped or error, click Start. If it has been provisioning for > 5 min, ping support.
  2. Browser SSH from the portal (instance detail → Open SSH). If the browser SSH works, the VM is alive and the problem is application-level: continue with steps 3 and 4.
  3. Disk full? df -h. If / is ≥ 95%, the OS won’t accept writes: truncate or rotate logs, delete /var/cache/apt/archives/*.deb, and shrink the journal with sudo journalctl --vacuum-size=500M.
  4. Service crashed? sudo systemctl status <your-service> shows the current state; sudo journalctl -u <your-service> --since "30 minutes ago" tells you why. Restart with sudo systemctl restart <your-service>.
  5. Roll back to a snapshot if a recent deploy broke things. Instance detail → Snapshots → Restore. Takes ~30 seconds. Data written since the snapshot is lost — copy /var/log off first.
  6. Escalate to [email protected] with subject prefix [P1] and your account UUID + instance ID. Postpaid customers: use the on-call phone number on your contract.
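
Steps 3 and 4 above can be bundled into one script to run from the SSH session. The following is a rough sketch, not a CloudNx tool: the function names are made up for illustration, the 95% threshold is the one from step 3, and reading the journal may require sudo on your image.

```shell
#!/usr/bin/env bash
# P1 triage sketch: disk usage on / plus the state of one systemd service.

# Read `df -P` output on stdin and print the Capacity column of the first
# filesystem as a bare integer (e.g. 97).
disk_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when PCT is at or above THRESHOLD (default 95).
disk_is_full() {
  local pct="$1" threshold="${2:-95}"
  [ "$pct" -ge "$threshold" ]
}

# triage SERVICE: report disk state, then service state with recent log lines.
triage() {
  local service="$1" pct
  pct="$(df -P / | disk_pct)"
  if disk_is_full "$pct"; then
    echo "DISK: / is at ${pct}%, free space before restarting anything"
  else
    echo "DISK: / is at ${pct}%, ok"
  fi
  if systemctl is-active --quiet "$service"; then
    echo "SERVICE: $service is active"
  else
    echo "SERVICE: $service is not active, recent log lines:"
    journalctl -u "$service" --since "30 minutes ago" --no-pager | tail -n 20
  fi
}
```

Source the file on the VM and call triage with your unit name. The script only reports; restarting the service or rolling back to a snapshot stays a manual decision.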

P2 — Slow database

Symptom: queries that normally take 50 ms now take 5 seconds.

  1. Top by query. SELECT * FROM pg_stat_activity WHERE state <> 'idle' ORDER BY query_start; lists the oldest active queries first. One long-running query usually blocks everything else.
  2. Locks. SELECT * FROM pg_locks WHERE NOT granted; — anything not granted means contention. Find the blocking PID and decide whether to SELECT pg_terminate_backend(<pid>).
  3. I/O. iostat -xz 1 on the VM. %util at 100% on the data device means you’ve outgrown the volume: resize up, or move the data dir to a bigger NVMe pool.
  4. Vacuum. SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10; If dead tuples are in the millions and the last autovacuum is hours old, run VACUUM ANALYZE <table>;.
  5. Missing index. Run EXPLAIN ANALYZE on your slow query (note that ANALYZE actually executes it). Seq scans on tables > 10k rows are a smell.
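
For steps 1 and 2, finding the blocking PID by reading pg_locks by hand is tedious. One possible shortcut, sketched below, uses pg_blocking_pids(), which is available from PostgreSQL 9.6 onwards. The wrapper function names are hypothetical, and connection settings are assumed to come from the standard PGHOST / PGUSER / PGDATABASE environment variables.

```shell
#!/usr/bin/env bash
# Slow-database sketch: who is running long, and who is blocking whom.

# SQL for the ten oldest non-idle statements, excluding this session.
long_running_sql() {
  cat <<'SQL'
SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle' AND pid <> pg_backend_pid()
ORDER BY query_start
LIMIT 10;
SQL
}

# SQL listing each waiting session next to the PIDs that block it.
blocked_sessions_sql() {
  cat <<'SQL'
SELECT a.pid                   AS blocked_pid,
       pg_blocking_pids(a.pid) AS blocking_pids,
       now() - a.query_start   AS waiting_for,
       left(a.query, 80)       AS query
FROM pg_stat_activity AS a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0
ORDER BY a.query_start;
SQL
}

# Run both reports; -X skips ~/.psqlrc so output is predictable.
db_triage() {
  long_running_sql     | psql -X
  blocked_sessions_sql | psql -X
}
```

db_triage only reads system views. Whether to call pg_terminate_backend on a blocking PID remains a judgment call, as in step 2.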

P2 — Disk filling up

  1. sudo du -h --max-depth=1 / | sort -h | tail -10 — top 10 largest dirs.
  2. Common culprits: /var/log (rotate / truncate), /var/lib/docker (docker system prune -af), /tmp (just delete), ~/.cache on the app user.
  3. If you can’t shrink, grow: /portal/volumes → Resize. Online, no downtime. Then resize2fs /dev/vdX inside the VM.
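
resize2fs in step 3 only applies to ext2/3/4 filesystems. The sketch below picks the grow command from the filesystem type that findmnt reports; it prints the command instead of running it so you can review it first. The function names are illustrative. It also assumes the filesystem sits directly on the device: if it lives on a partition (for example /dev/vda1), the partition has to be grown first, typically with growpart.

```shell
#!/usr/bin/env bash
# After resizing the volume in the portal: choose the matching grow command.

# grow_command FSTYPE DEVICE MOUNTPOINT: print the command that grows the
# filesystem, or fail with a message on stderr for types not handled here.
grow_command() {
  local fstype="$1" device="$2" mountpoint="$3"
  case "$fstype" in
    ext2|ext3|ext4) echo "resize2fs $device" ;;
    xfs)            echo "xfs_growfs $mountpoint" ;;
    *)              echo "unsupported filesystem: $fstype" >&2; return 1 ;;
  esac
}

# plan_grow [MOUNTPOINT]: look up type and source device, then print the command.
plan_grow() {
  local mountpoint="${1:-/}" fstype device
  fstype="$(findmnt -n -o FSTYPE "$mountpoint")"
  device="$(findmnt -n -o SOURCE "$mountpoint")"
  grow_command "$fstype" "$device" "$mountpoint"
}
```

Call plan_grow with the mount point of the resized volume, check the printed command, then run it with sudo.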

P1 — Suspected security event

Symptom: unfamiliar SSH login, unexpected process running, outbound traffic to suspicious IPs.

  1. Don’t shut down. Snapshot the VM first (instance detail → Snapshots → Take snapshot) for forensic analysis.
  2. Isolate. Firewall → set ingress to your IP only. Leaves the attacker locked out without destroying evidence.
  3. Rotate everything. SSH keys (/portal/ssh-keys), API keys (/portal/api-keys), database passwords, KMS keys (cloudnx kms rotate), any embedded secrets in container images.
  4. Notify. Email [email protected] and your DPO. We’ll coordinate audit-log export and confirm whether the breach touched any CloudNx-managed service.
  5. Post-mortem. Write it within 48 hours, even if the incident was small. Habits matter more than the specific event.
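
A snapshot preserves the disk but not volatile state such as logged-in sessions, running processes and open sockets. The sketch below captures those to a timestamped directory once the snapshot from step 1 exists. The directory layout, file names and function names are illustrative choices, not a CloudNx convention, and the script writes to the same disk, so copy the directory off the VM afterwards.

```shell
#!/usr/bin/env bash
# Evidence-capture sketch for a suspected compromise. Run as root on the VM.

# capture OUTDIR NAME CMD [ARGS...]: run CMD if it is installed and save
# stdout and stderr to OUTDIR/NAME.txt; otherwise record that it was skipped.
capture() {
  local outdir="$1" name="$2"
  shift 2
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$outdir/$name.txt" 2>&1 || true
  else
    echo "skipped: $1 not installed" >"$outdir/$name.txt"
  fi
}

# collect_evidence [OUTDIR]: gather volatile state and print the directory used.
collect_evidence() {
  local outdir="${1:-/root/incident-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$outdir" || return 1
  date -u +%Y-%m-%dT%H:%M:%SZ >"$outdir/collected_at.txt"
  capture "$outdir" logins    last -n 50   # recent logins
  capture "$outdir" sessions  who -a       # current sessions
  capture "$outdir" processes ps auxww     # full process list
  capture "$outdir" sockets   ss -tunap    # TCP/UDP sockets with owning process
  echo "$outdir"
}
```

Run collect_evidence before rotating keys in step 3, so the captured sessions and sockets still reflect what the intruder was using.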

Region-wide outage

CloudNx runs in a single region today (EU West). If the entire region is unreachable, your options are:

  1. Wait — we publish status updates every 15 min on status.cloudnx.in and email all customers within 30 min of detection.
  2. Fail over to your offsite backup (see Backup strategy). Bring up a temporary stack on AWS / GCP from the latest backup. The 5-step CloudNx-to-AWS migration runs in reverse.
  3. For prolonged outages (> 4 hours), we credit prepaid customers automatically per the SLA.

Multi-region active-active will land when the second AX41 is procured. Until then, layer-3 offsite backups (CloudNx S3 + Backblaze B2) are your DR plan.

Escalation matrix

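Severity | Definition                       | Channel                                      | Response SLA
---------|----------------------------------|----------------------------------------------|----------------------------------------
P1       | Production down, users affected  | support@ with [P1] · postpaid: on-call phone | Paid: < 1h · Postpaid: < 30 min
P2       | Degraded, workaround exists      | support@                                     | Paid: < 4h
P3       | Question / non-urgent bug        | support@ or Discord                          | Paid: < 24h business · Free: best-effort
Security | Suspected breach / vulnerability | [email protected]                            | < 1h, 24×7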