A founder- and engineer-ready handbook to stand up a lightweight, repeatable incident response program—roles, severity definitions, triage flow, evidence handling, communications, AI/LLM-specific incidents, tabletop drills, and metrics. Built to be credible in audits without slowing delivery.
Incidents are unavoidable. Chaos is optional. This guide gives you a simple, repeatable incident response program built for startups: clear roles, a 30/60/90-minute triage flow, evidence handling, internal and external communications, AI/LLM-specific incident playbooks, and a quarterly drill cadence. It's designed to satisfy buyer and audit expectations while preserving engineering velocity.
| Response Gap | Business Impact | Risk Level | Financial Impact |
|---|---|---|---|
| No clear incident commander | Extended downtime, chaotic response | High | $50K-$200K per hour of downtime |
| Poor evidence handling | Failed audits, legal liability | Medium | $75K-$300K in legal/compliance costs |
| Inadequate communications | Customer churn, reputation damage | High | $100K-$500K in lost revenue |
| Missing AI incident playbooks | Cost overruns, safety failures | High | $80K-$400K in operational risk |
| No tabletop exercises | Unprepared teams, slow response | Medium | $40K-$150K in productivity loss |
| Poor post-incident learning | Repeated incidents, technical debt | Medium | $60K-$250K in recurring costs |
| Role | Time Commitment | Key Responsibilities | Critical Decisions |
|---|---|---|---|
| Incident Commander (IC) | 100% during incident | Owns decisions and timeline; sets severity; assigns tasks; watches the clock | Severity classification, external comms approval, resource allocation |
| Technical Lead (TL) | 100% during incident | Leads diagnosis, isolation, and remediation; coordinates with service owners | Technical approach, rollback decisions, containment strategy |
| Scribe | 100% during incident | Captures timeline, decisions, commands run; preserves evidence pointers | Evidence collection scope, documentation standards |
| Communications Lead | 50-70% during incident | Prepares stakeholder updates; coordinates status page and customer comms | Message timing, content approval, audience targeting |
| Legal/Privacy Contact | 20-40% during incident | Advises on regulatory notices, data handling, contractual obligations | Legal notification requirements, external messaging approval |
| Security Analyst | 60-80% during incident | Guides containment vs eradication, forensics, log/evidence integrity | Forensic approach, containment strategy, follow-up controls |
| Metric Category | Key Metrics | Target Goals | Measurement Frequency |
|---|---|---|---|
| Response Speed | Time to IC/TL assignment, Containment time | SEV-1: <10min, SEV-2: <30min, Containment <60min | Per incident |
| Evidence Quality | Evidence completeness, Chain of custody integrity | ≥90% checklist completion, 100% custody tracking | Per incident |
| Communication Effectiveness | Customer update timeliness, Internal notification speed | Within promised windows, SEV-1 <15min | Per incident |
| Learning & Improvement | Postmortem action closure, Tabletop exercise frequency | ≥80% closed in 30 days, Quarterly drills | Monthly |
| AI Incident Readiness | Token cost variance, Guardrail effectiveness | <10% variance, 100% eval coverage | Monthly |
| Program Maturity | Runbook coverage, Team training completion | 100% critical scenarios, Annual certification | Quarterly |
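To make the response-speed targets in the table above measurable rather than aspirational, it helps to compute them directly from incident timestamps. The sketch below is a minimal, illustrative example: only the target values come from the table, while the `Incident` fields, constant names, and report format are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Targets taken from the metrics table above; adjust to your own SLOs.
IC_ASSIGNMENT_TARGET = {"SEV-1": timedelta(minutes=10), "SEV-2": timedelta(minutes=30)}
CONTAINMENT_TARGET = timedelta(minutes=60)

@dataclass
class Incident:
    incident_id: str
    severity: str                     # "SEV-1", "SEV-2", "SEV-3"
    detected_at: datetime
    ic_assigned_at: Optional[datetime] = None
    contained_at: Optional[datetime] = None

def response_speed_report(incident: Incident) -> dict:
    """Return per-incident response-speed metrics and whether targets were met."""
    report = {"incident_id": incident.incident_id, "severity": incident.severity}

    if incident.ic_assigned_at:
        time_to_ic = incident.ic_assigned_at - incident.detected_at
        target = IC_ASSIGNMENT_TARGET.get(incident.severity)
        report["time_to_ic_minutes"] = time_to_ic.total_seconds() / 60
        report["ic_target_met"] = target is None or time_to_ic <= target

    if incident.contained_at:
        containment = incident.contained_at - incident.detected_at
        report["containment_minutes"] = containment.total_seconds() / 60
        report["containment_target_met"] = containment <= CONTAINMENT_TARGET

    return report
```

Feeding these per-incident reports into the monthly review makes the postmortem and drill metrics trivial to roll up alongside them.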
Rolling the program out over the first three months:
- Month 1 (Foundation Setup): define roles and responsibilities, establish the severity matrix, set up basic logging and alerting, and create initial runbooks. Deliverables: role definitions complete, severity matrix documented, basic alerting operational.
- Month 2: implement the triage flow, establish evidence handling, create communications templates, and conduct the first tabletop exercise.
- Month 3: refine based on learnings, add AI-specific playbooks, establish metrics, and integrate with compliance.
| Severity | Definition | Initial Response Target | Comms Cadence | Escalation Requirements |
|---|---|---|---|---|
| SEV-1 | Customer-visible security incident or confirmed data exposure; ongoing exploitation | IC within 10 minutes; full team engaged | Internal every 30–60 min; external every 60–120 min | Executive team, Legal, Board if material |
| SEV-2 | High-risk vulnerability actively exploited in limited scope; potential data exposure | IC within 30 minutes; core team within 60 minutes | Internal hourly; external if customer impact | Department heads, Legal if data exposure |
| SEV-3 | Suspicious activity, control degradation, or third-party advisory with potential exposure | IC within 4 hours; investigation owner assigned | Daily internal updates until closure | Team leads, Security owner |
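One way to keep the severity matrix actionable is to encode it as configuration that alerting and paging tooling reads, so the response targets and escalation lists above are enforced rather than remembered under pressure. The snippet below is a sketch under that assumption; the structure, field names, and audience labels are illustrative, not a required format.

```python
from datetime import timedelta

# Severity matrix encoded as data so paging and escalation can be automated.
# Values mirror the severity table above; names and structure are illustrative.
SEVERITY_MATRIX = {
    "SEV-1": {
        "ic_response_target": timedelta(minutes=10),
        "internal_comms_every": timedelta(minutes=30),
        "external_comms_every": timedelta(minutes=60),
        "escalate_to": ["executive_team", "legal", "board_if_material"],
    },
    "SEV-2": {
        "ic_response_target": timedelta(minutes=30),
        "internal_comms_every": timedelta(hours=1),
        "external_comms_every": None,   # external updates only if customer impact
        "escalate_to": ["department_heads", "legal_if_data_exposure"],
    },
    "SEV-3": {
        "ic_response_target": timedelta(hours=4),
        "internal_comms_every": timedelta(days=1),
        "external_comms_every": None,
        "escalate_to": ["team_leads", "security_owner"],
    },
}

def escalation_targets(severity: str) -> list[str]:
    """Who gets paged for a given provisional severity."""
    return SEVERITY_MATRIX[severity]["escalate_to"]
```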
The triage flow for the first 90 minutes:
- First 30 minutes: assign IC, TL, and Scribe; set a provisional severity; snapshot critical logs and metrics; isolate the blast radius.
- Minutes 30–60: block indicators of compromise; rotate exposed credentials; validate containment with logs; decide on external comms.
- Minutes 60–90: patch, roll back, or fix configuration; increase monitoring; confirm the path to recovery; publish updates.
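Because the first 90 minutes are time-boxed, some teams script the checklist so the Scribe can log each step with a timestamp and see which window it landed in. The sketch below assumes the 30/60/90-minute flow above; the step names and class structure are illustrative, not a mandated tool.

```python
from datetime import datetime, timezone

# Triage checklist keyed to the 30/60/90-minute windows described above.
TRIAGE_WINDOWS = [
    (30, ["assign_ic_tl_scribe", "set_provisional_severity",
          "snapshot_logs_and_metrics", "isolate_blast_radius"]),
    (60, ["block_iocs", "rotate_exposed_credentials",
          "validate_with_logs", "decide_external_comms"]),
    (90, ["patch_rollback_or_fix_config", "increase_monitoring",
          "confirm_recovery_path", "publish_updates"]),
]

class TriageLog:
    """Scribe-facing log recording when each triage step is completed."""

    def __init__(self, incident_started_at: datetime):
        # Expects a timezone-aware start time.
        self.started_at = incident_started_at
        self.completed: dict[str, float] = {}  # step -> minutes since start

    def complete(self, step: str) -> None:
        elapsed = (datetime.now(timezone.utc) - self.started_at).total_seconds() / 60
        self.completed[step] = elapsed

    def overdue_steps(self) -> list[str]:
        """Steps whose window has already closed without being completed."""
        elapsed = (datetime.now(timezone.utc) - self.started_at).total_seconds() / 60
        overdue = []
        for deadline_min, steps in TRIAGE_WINDOWS:
            if elapsed > deadline_min:
                overdue.extend(s for s in steps if s not in self.completed)
        return overdue
```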
Evidence handling:
- Snapshot key logs, metrics, relevant database metadata, and configuration states before mutation.
- Designate a single evidence owner. Use append-only storage or write-once buckets with timestamps.
- Collect only what's necessary: auth logs, admin actions, data export logs, infrastructure events.
- Retain evidence per policy (e.g., 12–24 months). Label with incident ID, severity, and classification.
- For AI incidents, capture prompt/response logs, model outputs, guardrail triggers, and token usage patterns.
- Automate evidence collection for common incident types to ensure consistency and speed (a minimal sketch follows this list).
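As a concrete starting point for that automation, here is a minimal sketch of an evidence capture helper: it copies a log or metric snapshot into a dedicated evidence location, names it with the incident ID and a UTC timestamp, and records a SHA-256 digest so later tampering is detectable. The paths, labels, and local-filesystem backend are assumptions; in practice you would point this at a write-once or append-only bucket and apply your retention policy.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Illustrative evidence root; in practice use write-once/append-only storage.
EVIDENCE_ROOT = Path("/var/incident-evidence")

def capture_evidence(incident_id: str, severity: str, source: Path,
                     classification: str = "confidential") -> Path:
    """Copy a log/metric snapshot into the evidence store with labels and a hash."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest_dir = EVIDENCE_ROOT / incident_id
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / f"{timestamp}_{source.name}"
    shutil.copy2(source, dest)  # preserve original file timestamps

    # Record a SHA-256 digest and labels so integrity and classification
    # can be verified during audits or legal review.
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    manifest = {
        "incident_id": incident_id,
        "severity": severity,
        "classification": classification,
        "captured_at_utc": timestamp,
        "source": str(source),
        "sha256": digest,
    }
    dest.with_name(dest.name + ".manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest
```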
| Incident Type | Detection Signals | Containment Actions | Recovery Steps |
|---|---|---|---|
| Prompt Injection/Data Leakage | Guardrail triggers, abnormal output, data pattern alerts | Disable risky tools, scrub prompts, redact logs, rotate tokens | Review prompts, enhance filters, update training data |
| Model/Provider Outage | API errors, timeout spikes, provider status alerts | Failover to backup provider, switch models, degrade gracefully | Post-event vendor review, improve abstraction layer |
| Hallucination/Safety Regression | Eval failures, user reports, quality metrics degradation | Block release, rollback model version, increase safety filters | Add targeted tests, update evaluation criteria |
| Runaway Token Spend | Budget alerts, cost spikes, usage pattern anomalies | Enforce budgets, cut off abusive patterns, implement caching | Optimize prompts, review caching strategy, set tighter limits |
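For the runaway-token-spend playbook in particular, budget enforcement is straightforward to put in code in front of the model call rather than leaving it to after-the-fact billing alerts. The sketch below is provider-agnostic and illustrative: the budget numbers are placeholders, and the commented call site stands in for whatever client library you actually use.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when the hard spend threshold is crossed."""

class TokenBudgetGuard:
    """Tracks token usage against a daily budget and cuts off calls past the hard limit."""

    def __init__(self, daily_token_budget: int, alert_fraction: float = 0.8):
        self.budget = daily_token_budget
        self.alert_threshold = int(daily_token_budget * alert_fraction)
        self.used = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.used += prompt_tokens + completion_tokens
        if self.used >= self.budget:
            # Containment per the playbook above: stop spending and page the on-call.
            raise TokenBudgetExceeded(f"daily budget exhausted: {self.used}/{self.budget}")
        if self.used >= self.alert_threshold:
            print(f"warning: {self.used}/{self.budget} tokens used today")  # wire to real alerting

guard = TokenBudgetGuard(daily_token_budget=2_000_000)
# After each model call, feed the usage reported by your provider into the guard:
# guard.record(prompt_tokens=usage.prompt_tokens, completion_tokens=usage.completion_tokens)
```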
| Cost Category | Small Team ($) | Medium Team ($$) | Large Team ($$$) |
|---|---|---|---|
| Team Training & Certification | $15K-$35K | $35K-$85K | $85K-$200K |
| Tools & Infrastructure | $20K-$50K | $50K-$120K | $120K-$280K |
| Consulting & External Support | $25K-$60K | $60K-$150K | $150K-$350K |
| Tabletop Exercises & Drills | $10K-$25K | $25K-$60K | $60K-$140K |
| Incident Response Retainer | $30K-$70K | $70K-$170K | $170K-$400K |
| Total Budget Range | $100K-$240K | $240K-$585K | $585K-$1.37M |
| Risk Category | Likelihood | Impact | Mitigation Strategy | Owner |
|---|---|---|---|---|
| Role Confusion During Incident | High | High | Clear role definitions, regular training, backup assignments | Incident Commander |
| Evidence Handling Errors | Medium | High | Standardized procedures, automated collection, training | Security Analyst |
| Communication Breakdown | High | Medium | Template library, escalation matrix, regular drills | Communications Lead |
| AI Incident Misclassification | Medium | High | Specialized playbooks, AI-trained responders, vendor coordination | Technical Lead |
| Regulatory Notification Failures | Low | High | Legal playbook integration, notification checklists, expert review | Legal/Privacy Contact |
| Team Burnout | Medium | Medium | Rotation schedules, psychological safety, post-incident support | Engineering Manager |
Common anti-patterns to avoid:
- Mobilizing the full team for minor incidents, which causes fatigue and reduces effectiveness.
- Rushing to fix problems without proper evidence collection, which compromises forensic integrity.
- Relying on individual knowledge rather than documented runbooks and procedures.
- Providing unclear or delayed updates to customers during incidents, which damages trust.
- Failing to capture and act on lessons learned, which leads to repeated incidents.
- Deploying AI capabilities without proper safety controls and incident procedures.