Alerts mean time to close (MTTC) measures how quickly your team resolves incidents, from the time an incident is opened until it's closed. This metric reflects your team's incident response effectiveness and helps identify areas for improvement in your resolution processes.
About this scorecard rule
This alerts mean time to close rule is part of Level 2 (Proactive) in the business uptime maturity model. It evaluates how quickly your team can diagnose and resolve incidents, reflecting the maturity of your incident management processes.
Why this matters: Faster incident resolution reduces customer impact, minimizes business disruption, and indicates effective monitoring and response procedures. Teams that consistently resolve incidents quickly demonstrate operational excellence.
How this rule works
This rule analyzes the time between when an incident is opened and when it's closed, calculating the mean time to close across all incidents in your account. It measures the efficiency of your incident response and resolution processes.
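As a rough illustration of the calculation, the sketch below computes a mean time to close from incident open and close timestamps and compares it against the 30-minute threshold. The record structure is an assumption for the example, not the scorecard's actual data model.

```python
from datetime import datetime, timedelta

# Illustrative incident records; real data would come from your incident
# platform's API. Field names here are assumptions, not a specific schema.
incidents = [
    {"opened_at": datetime(2024, 1, 5, 9, 0),   "closed_at": datetime(2024, 1, 5, 9, 18)},
    {"opened_at": datetime(2024, 1, 6, 14, 2),  "closed_at": datetime(2024, 1, 6, 14, 55)},
    {"opened_at": datetime(2024, 1, 7, 22, 30), "closed_at": datetime(2024, 1, 7, 22, 41)},
]

# Mean time to close = average of (closed_at - opened_at) across all incidents.
durations = [i["closed_at"] - i["opened_at"] for i in incidents]
mttc = sum(durations, timedelta()) / len(durations)

# The rule passes when the mean is 30 minutes or less.
threshold = timedelta(minutes=30)
print(f"MTTC: {mttc}, result: {'Pass' if mttc <= threshold else 'Fail'}")
```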
Understanding your score
- Pass (Green): Average incident resolution time is 30 minutes or less
- Fail (Red): Average incident resolution time exceeds 30 minutes
- Target: Consistent incident resolution within 30 minutes for most alerts
What this means:
- Passing score: Your team has efficient incident response processes and can quickly diagnose and resolve issues
- Failing score: Incidents take too long to resolve, potentially indicating process inefficiencies, complex diagnostics, or inadequate tooling
How to improve incident resolution times
If your score shows slow incident resolution, follow these steps to optimize your incident management process:
1. Analyze current incident patterns
- Identify slow-resolving incidents: Review which types of incidents consistently take longer than 30 minutes
- Examine common causes: Look for patterns in incident types, affected systems, or time of occurrence
- Review resolution steps: Document what actions teams typically take to resolve different incident types
2. Optimize alert quality and context
Improve alert information:
- Add context to alerts: Include relevant metadata, dashboards, and runbook links in alert notifications
- Use descriptive alert names: Make alert titles clearly indicate the problem and affected system
- Include baseline comparisons: Show normal vs. current values to help with quick assessment
Enhance alert routing:
- Send alerts to the right teams: Ensure alerts reach the people who can actually resolve the issue
- Use intelligent routing: Route different alert types to appropriate specialists (database, frontend, infrastructure); a sketch of this kind of routing appears after this list
- Provide escalation paths: Clear procedures for when initial responders can't resolve issues
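One way to act on the two lists above is to attach context and a routing decision at the moment an alert is built. The sketch below is illustrative only: the categories, team names, thresholds, and URLs are placeholders, and in practice this logic usually lives in your alerting platform's configuration rather than application code.

```python
# Hypothetical routing table: alert category -> on-call team. Team names,
# categories, and URLs below are placeholders, not a specific platform's API.
ROUTES = {
    "database": "dba-oncall",
    "frontend": "web-oncall",
    "infrastructure": "sre-oncall",
}
DEFAULT_TEAM = "sre-oncall"      # fallback when no specialist route matches
ESCALATE_AFTER_MINUTES = 15      # escalate if unacknowledged past this point

def build_alert(category: str, title: str, current: float, baseline: float) -> dict:
    """Assemble an alert payload that carries routing and diagnostic context."""
    return {
        "title": title,                              # names the system and the problem
        "team": ROUTES.get(category, DEFAULT_TEAM),  # send it to people who can fix it
        "escalate_after_min": ESCALATE_AFTER_MINUTES,
        "context": {
            "current_value": current,
            "baseline_value": baseline,              # normal vs. current for quick triage
            "runbook": f"https://wiki.example.com/runbooks/{category}",
            "dashboard": f"https://dashboards.example.com/{category}",
        },
    }

alert = build_alert("database", "Checkout DB: p95 query latency above 500 ms", 730.0, 180.0)
print(alert["team"], alert["context"]["runbook"])
```

Keeping routes and escalation defaults in one place makes gaps in coverage easy to spot, and every alert arrives with the baseline comparison and runbook link responders need to start working immediately.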
3. Streamline diagnostic processes
Create effective runbooks:
- Document common issues: Step-by-step resolution procedures for frequent problems
- Include troubleshooting steps: Logical diagnostic flows that reduce investigation time
- Link to relevant tools: Direct access to dashboards, logs, and diagnostic utilities
Improve tooling access:
- Centralize monitoring data: Ensure responders can quickly access all relevant information
- Use unified dashboards: Create incident-specific views that show all relevant metrics
- Automate common checks: Reduce manual diagnostic steps with automated health checks (see the sketch after this list)
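As an example of what automating common checks can look like, this sketch runs a few first-pass reachability checks so responders start with data instead of manual commands. The hosts and ports are placeholders for whatever services your environment actually exposes.

```python
import socket
from datetime import datetime, timezone

def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Basic reachability check for a service endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_health_checks() -> dict:
    """Run common first-pass diagnostics and return a summary for the incident."""
    # Hosts and ports are placeholders; substitute your own services.
    checks = {
        "app_server": check_tcp("app.internal.example.com", 443),
        "database": check_tcp("db.internal.example.com", 5432),
        "cache": check_tcp("cache.internal.example.com", 6379),
    }
    return {"ran_at": datetime.now(timezone.utc).isoformat(), "results": checks}

if __name__ == "__main__":
    summary = run_health_checks()
    failing = [name for name, ok in summary["results"].items() if not ok]
    print("Failing checks:", failing or "none")
```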
4. Enhance team response capabilities
Improve team readiness:
- Cross-train team members: Ensure multiple people can handle different types of incidents
- Document escalation procedures: Clear paths for when issues require additional expertise
- Conduct incident response training: Regular practice sessions for common scenarios
Optimize response workflows:
- Standardize communication: Use consistent channels and formats for incident updates (a sketch of an automated, standardized update follows this list)
- Automate routine responses: Use automation for common resolution steps
- Track resolution progress: Clear visibility into who's working on what and current status
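A standardized update format is also easy to automate. The sketch below posts a consistently formatted incident status message to a chat webhook; the webhook URL and message fields are placeholders rather than any specific platform's API.

```python
import json
import urllib.request

WEBHOOK_URL = "https://chat.example.com/hooks/incident-updates"  # placeholder

def post_status_update(incident_id: str, status: str, owner: str, summary: str) -> None:
    """Send a consistently formatted incident update to the team channel."""
    message = {"text": f"[{incident_id}] status={status} owner={owner} | {summary}"}
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()

# Example usage (placeholder webhook, so the call is left commented out):
# post_status_update("INC-1234", "investigating", "alice", "DB failover in progress")
```

Because every update carries the same fields, anyone following the incident can see who owns it and where it stands without asking.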
Measuring improvement
Track these metrics to verify your incident resolution improvements (a sketch for computing them from exported incident records follows the list):
- Mean time to close (MTTC): Target consistent resolution times under 30 minutes
- Resolution time distribution: Monitor the spread of resolution times to identify outliers
- First-time resolution rate: Percentage of incidents resolved without reopening
- Escalation frequency: How often incidents require additional expertise or resources
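These metrics are straightforward to compute from exported incident records. A minimal sketch, assuming each record carries a close duration plus reopened and escalated flags (the field names are illustrative):

```python
from statistics import mean, quantiles

# Illustrative export of closed incidents; field names are assumptions.
incidents = [
    {"minutes_to_close": 12, "reopened": False, "escalated": False},
    {"minutes_to_close": 28, "reopened": False, "escalated": True},
    {"minutes_to_close": 65, "reopened": True,  "escalated": True},
    {"minutes_to_close": 19, "reopened": False, "escalated": False},
]

times = [i["minutes_to_close"] for i in incidents]
mttc = mean(times)
deciles = quantiles(times, n=10)      # 9 cut points: 10th..90th percentile
p50, p90 = deciles[4], deciles[8]     # median and a view of the long tail

first_time_rate = sum(not i["reopened"] for i in incidents) / len(incidents)
escalation_rate = sum(i["escalated"] for i in incidents) / len(incidents)

print(f"MTTC: {mttc:.1f} min | p50: {p50:.1f} | p90: {p90:.1f}")
print(f"First-time resolution: {first_time_rate:.0%} | escalations: {escalation_rate:.0%}")
```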
Common scenarios and solutions
Complex incidents requiring deep investigation:
- Problem: Some issues inherently require longer diagnostic time
- Solution: Separate complex incidents into their own category and set different SLA expectations, or implement partial resolution acknowledgments
Incidents during off-hours:
- Problem: Resolution times are slower when fewer experts are available
- Solution: Improve on-call procedures, create better escalation paths, or enhance automated diagnostic tools
Repeated similar incidents:
- Problem: Teams spend time re-solving the same types of problems
- Solution: Invest in permanent fixes for recurring issues, create automated resolution scripts, or improve monitoring to catch root causes
Poor alert context:
- Problem: Teams spend too much time understanding what's actually wrong
- Solution: Enhance alert descriptions, include relevant dashboards, and provide direct links to affected systems
Understanding the 30-minute target
The 30-minute target represents a balance between thorough investigation and rapid response:
Why 30 minutes:
- Customer impact: Most customers notice service degradation within this timeframe
- Business impact: Longer incidents typically have disproportionately higher business costs
- Team efficiency: Indicates well-tuned processes and adequate preparation
When to adjust the target:
- Lower target (15-20 minutes): High-availability services with strict SLAs
- Higher target (45-60 minutes): Complex systems requiring deep investigation
- Different targets by severity: Critical incidents need faster resolution than warnings
Advanced optimization strategies
Incident categorization
Categorize by resolution complexity (see the sketch after this list):
- Quick fixes: Simple restart or configuration changes (target: under 10 minutes)
- Standard diagnostics: Typical troubleshooting procedures (target: 15-30 minutes)
- Complex investigations: Deep technical analysis required (target: 45-60 minutes)
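A lightweight way to apply per-category targets is a small configuration that a reporting script can check closed incidents against. The categories and minutes below mirror the list above; adjust them to your own SLAs.

```python
from datetime import timedelta

# Per-category resolution targets, mirroring the categories above (adjust to your SLAs).
TARGETS = {
    "quick_fix": timedelta(minutes=10),
    "standard_diagnostic": timedelta(minutes=30),
    "complex_investigation": timedelta(minutes=60),
}

def within_target(category: str, time_to_close: timedelta) -> bool:
    """Check a closed incident against its category's target (default: 30 minutes)."""
    return time_to_close <= TARGETS.get(category, timedelta(minutes=30))

print(within_target("quick_fix", timedelta(minutes=8)))               # True
print(within_target("complex_investigation", timedelta(minutes=75)))  # False
```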
Automation opportunities
Automate routine responses:
- Self-healing systems: Automatic restart or failover for common issues
- Diagnostic automation: Automatic collection of relevant logs and metrics (see the sketch after this list)
- Communication automation: Automatic status updates for stakeholders
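As a sketch of diagnostic automation, the snippet below gathers a recent log tail for the affected services when an incident opens, so responders start with evidence already in hand. The service names and paths are placeholders, and attaching the output to the incident is left to your own tooling.

```python
from pathlib import Path

# Placeholder log locations; substitute your own services and paths.
LOG_PATHS = {
    "checkout-api": Path("/var/log/checkout-api/app.log"),
    "payments-worker": Path("/var/log/payments-worker/app.log"),
}

def collect_log_tails(services: list[str], lines: int = 200) -> dict[str, str]:
    """Grab the last N log lines per affected service to attach to the incident."""
    tails = {}
    for service in services:
        path = LOG_PATHS.get(service)
        if path and path.exists():
            tail = path.read_text(errors="replace").splitlines()[-lines:]
            tails[service] = "\n".join(tail)
        else:
            tails[service] = "(log file not found)"
    return tails

# In practice this would be triggered by an alert webhook; called directly here.
evidence = collect_log_tails(["checkout-api", "payments-worker"])
print({service: len(tail) for service, tail in evidence.items()})
```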
Process optimization
Implement incident commanders:
- Dedicated coordinators: Assign specific people to manage incident workflow
- Clear communication: Single point of contact for updates and decisions
- Resource allocation: Ensure the right people are working on the right problems
Important considerations
- Balance speed with accuracy: Don't sacrifice proper investigation for faster closure times
- Consider incident severity: Different types of incidents may require different resolution time targets
- Account for business context: Weekend incidents may have different urgency than weekday issues
- Measure meaningful closure: Ensure incidents are actually resolved, not just closed
Next steps
- Immediate action: Analyze your current slowest-resolving incident types and implement quick wins
- Process improvement: Develop standardized incident response procedures and runbooks
- Tool enhancement: Improve alert context and diagnostic tool access
- Team development: Invest in training and cross-functional incident response capabilities
- Advance to Level 3: Once incident response is optimized, focus on service level attainment
For comprehensive guidance on incident management optimization, see our Alert Quality Management implementation guide.