Alerts mean time to close (MTTC) measures how quickly your team resolves incidents, from the time an incident is opened until it's closed. This metric reflects your team's incident response effectiveness and helps identify areas for improvement in your resolution processes.
About this scorecard rule
This alerts mean time to close rule is part of Level 2 (Proactive) in the business uptime maturity model. It evaluates how quickly your team can diagnose and resolve incidents, reflecting the maturity of your incident management processes.
Why this matters: Faster incident resolution reduces customer impact, minimizes business disruption, and indicates effective monitoring and response procedures. Teams that consistently resolve incidents quickly demonstrate operational excellence.
How this rule works
This rule analyzes the time between when an incident is opened and when it's closed, calculating the mean time to close across all incidents in your account. It measures the efficiency of your incident response and resolution processes.
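As a rough illustration of the calculation, the sketch below computes a mean time to close from incident open and close timestamps and compares it against the 30-minute threshold. The record structure is an assumption for the example, not the scorecard's actual data model.

```python
from datetime import datetime, timedelta

# Illustrative incident records; real data would come from your incident
# platform's API. Field names here are assumptions, not a specific schema.
incidents = [
    {"opened_at": datetime(2024, 1, 5, 9, 0),   "closed_at": datetime(2024, 1, 5, 9, 18)},
    {"opened_at": datetime(2024, 1, 6, 14, 2),  "closed_at": datetime(2024, 1, 6, 14, 55)},
    {"opened_at": datetime(2024, 1, 7, 22, 30), "closed_at": datetime(2024, 1, 7, 22, 41)},
]

# Mean time to close = average of (closed_at - opened_at) across all incidents.
durations = [i["closed_at"] - i["opened_at"] for i in incidents]
mttc = sum(durations, timedelta()) / len(durations)

# The rule passes when the mean is 30 minutes or less.
threshold = timedelta(minutes=30)
print(f"MTTC: {mttc}, result: {'Pass' if mttc <= threshold else 'Fail'}")
```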
Understanding your score
- Pass (Green): Average incident resolution time is 30 minutes or less
- Fail (Red): Average incident resolution time exceeds 30 minutes
- Target: Consistent incident resolution within 30 minutes for most alerts
What this means:
- Passing score: Your team has efficient incident response processes and can quickly diagnose and resolve issues
- Failing score: Incidents take too long to resolve, potentially indicating process inefficiencies, complex diagnostics, or inadequate tooling
How to improve incident resolution times
If your score shows slow incident resolution, follow these steps to optimize your incident management process:
1. Analyze current incident patterns
- Identify slow-resolving incidents: Review which types of incidents consistently take longer than 30 minutes
- Examine common causes: Look for patterns in incident types, affected systems, or time of occurrence
- Review resolution steps: Document what actions teams typically take to resolve different incident types
2. Optimize alert quality and context
Improve alert information:
- Add context to alerts: Include relevant metadata, dashboards, and runbook links in alert notifications
- Use descriptive alert names: Make alert titles clearly indicate the problem and affected system
- Include baseline comparisons: Show normal vs. current values to help with quick assessment
Enhance alert routing:
- Send alerts to the right teams: Ensure alerts reach the people who can actually resolve the issue
- Use intelligent routing: Route different alert types to appropriate specialists (database, frontend, infrastructure); a sketch of this kind of routing appears after this list
- Provide escalation paths: Clear procedures for when initial responders can't resolve issues
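One way to act on the two lists above is to attach context and a routing decision at the moment an alert is built. The sketch below is illustrative only: the categories, team names, thresholds, and URLs are placeholders, and in practice this logic usually lives in your alerting platform's configuration rather than application code.

```python
# Hypothetical routing table: alert category -> on-call team. Team names,
# categories, and URLs below are placeholders, not a specific platform's API.
ROUTES = {
    "database": "dba-oncall",
    "frontend": "web-oncall",
    "infrastructure": "sre-oncall",
}
DEFAULT_TEAM = "sre-oncall"      # fallback when no specialist route matches
ESCALATE_AFTER_MINUTES = 15      # escalate if unacknowledged past this point

def build_alert(category: str, title: str, current: float, baseline: float) -> dict:
    """Assemble an alert payload that carries routing and diagnostic context."""
    return {
        "title": title,                              # names the system and the problem
        "team": ROUTES.get(category, DEFAULT_TEAM),  # send it to people who can fix it
        "escalate_after_min": ESCALATE_AFTER_MINUTES,
        "context": {
            "current_value": current,
            "baseline_value": baseline,              # normal vs. current for quick triage
            "runbook": f"https://wiki.example.com/runbooks/{category}",
            "dashboard": f"https://dashboards.example.com/{category}",
        },
    }

alert = build_alert("database", "Checkout DB: p95 query latency above 500 ms", 730.0, 180.0)
print(alert["team"], alert["context"]["runbook"])
```

Keeping routes and escalation defaults in one place makes gaps in coverage easy to spot, and every alert arrives with the baseline comparison and runbook link responders need to start working immediately.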
3. Streamline diagnostic processes
Create effective runbooks:
- Document common issues: Step-by-step resolution procedures for frequent problems
- Include troubleshooting steps: Logical diagnostic flows that reduce investigation time
- Link to relevant tools: Direct access to dashboards, logs, and diagnostic utilities
Improve tooling access:
- Centralize monitoring data: Ensure responders can quickly access all relevant information
- Use unified dashboards: Create incident-specific views that show all relevant metrics
- Automate common checks: Reduce manual diagnostic steps with automated health checks (see the sketch after this list)
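As an example of what automating common checks can look like, this sketch runs a few first-pass reachability checks so responders start with data instead of manual commands. The hosts and ports are placeholders for whatever services your environment actually exposes.

```python
import socket
from datetime import datetime, timezone

def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Basic reachability check for a service endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_health_checks() -> dict:
    """Run common first-pass diagnostics and return a summary for the incident."""
    # Hosts and ports are placeholders; substitute your own services.
    checks = {
        "app_server": check_tcp("app.internal.example.com", 443),
        "database": check_tcp("db.internal.example.com", 5432),
        "cache": check_tcp("cache.internal.example.com", 6379),
    }
    return {"ran_at": datetime.now(timezone.utc).isoformat(), "results": checks}

if __name__ == "__main__":
    summary = run_health_checks()
    failing = [name for name, ok in summary["results"].items() if not ok]
    print("Failing checks:", failing or "none")
```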
4. Enhance team response capabilities
Improve team readiness:
- Cross-train team members: Ensure multiple people can handle different types of incidents
- Document escalation procedures: Clear paths for when issues require additional expertise
- Conduct incident response training: Regular practice sessions for common scenarios
Optimize response workflows:
- Standardize communication: Use consistent channels and formats for incident updates (a sketch of an automated, standardized update follows this list)
- Automate routine responses: Use automation for common resolution steps
- Track resolution progress: Clear visibility into who's working on what and current status
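A standardized update format is also easy to automate. The sketch below posts a consistently formatted incident status message to a chat webhook; the webhook URL and message fields are placeholders rather than any specific platform's API.

```python
import json
import urllib.request

WEBHOOK_URL = "https://chat.example.com/hooks/incident-updates"  # placeholder

def post_status_update(incident_id: str, status: str, owner: str, summary: str) -> None:
    """Send a consistently formatted incident update to the team channel."""
    message = {"text": f"[{incident_id}] status={status} owner={owner} | {summary}"}
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()

# Example usage (placeholder webhook, so the call is left commented out):
# post_status_update("INC-1234", "investigating", "alice", "DB failover in progress")
```

Because every update carries the same fields, anyone following the incident can see who owns it and where it stands without asking.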
Measuring improvement
Track these metrics to verify your incident resolution improvements (a sketch for computing them from exported incident records follows the list):
- Mean time to close (MTTC): Target consistent resolution times under 30 minutes
- Resolution time distribution: Monitor the spread of resolution times to identify outliers
- First-time resolution rate: Percentage of incidents resolved without reopening
- Escalation frequency: How often incidents require additional expertise or resources
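These metrics are straightforward to compute from exported incident records. A minimal sketch, assuming each record carries a close duration plus reopened and escalated flags (the field names are illustrative):

```python
from statistics import mean, quantiles

# Illustrative export of closed incidents; field names are assumptions.
incidents = [
    {"minutes_to_close": 12, "reopened": False, "escalated": False},
    {"minutes_to_close": 28, "reopened": False, "escalated": True},
    {"minutes_to_close": 65, "reopened": True,  "escalated": True},
    {"minutes_to_close": 19, "reopened": False, "escalated": False},
]

times = [i["minutes_to_close"] for i in incidents]
mttc = mean(times)
deciles = quantiles(times, n=10)      # 9 cut points: 10th..90th percentile
p50, p90 = deciles[4], deciles[8]     # median and a view of the long tail

first_time_rate = sum(not i["reopened"] for i in incidents) / len(incidents)
escalation_rate = sum(i["escalated"] for i in incidents) / len(incidents)

print(f"MTTC: {mttc:.1f} min | p50: {p50:.1f} | p90: {p90:.1f}")
print(f"First-time resolution: {first_time_rate:.0%} | escalations: {escalation_rate:.0%}")
```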
Common scenarios and solutions
Complex incidents requiring deep investigation:
- Problem: Some issues inherently require longer diagnostic time
- Solution: Separate complex incidents into their own category and set different SLA expectations, or implement partial resolution acknowledgments
Incidents during off-hours:
- Problem: Resolution times are slower when fewer experts are available
- Solution: Improve on-call procedures, create better escalation paths, or enhance automated diagnostic tools
Repeated similar incidents:
- Problem: Teams spend time re-solving the same types of problems
- Solution: Invest in permanent fixes for recurring issues, create automated resolution scripts, or improve monitoring to catch root causes
Poor alert context:
- Problem: Teams spend too much time understanding what's actually wrong
- Solution: Enhance alert descriptions, include relevant dashboards, and provide direct links to affected systems
Understanding the 30-minute target
The 30-minute target represents a balance between thorough investigation and rapid response:
Why 30 minutes:
- Customer impact: Most customers notice service degradation within this timeframe
- Business impact: Longer incidents typically have disproportionately higher business costs
- Team efficiency: Indicates well-tuned processes and adequate preparation
When to adjust the target:
- Lower target (15-20 minutes): High-availability services with strict SLAs
- Higher target (45-60 minutes): Complex systems requiring deep investigation
- Different targets by severity: Critical incidents need faster resolution than warnings
Advanced optimization strategies
Incident categorization
Categorize by resolution complexity (see the sketch after this list):
- Quick fixes: Simple restart or configuration changes (target: under 10 minutes)
- Standard diagnostics: Typical troubleshooting procedures (target: 15-30 minutes)
- Complex investigations: Deep technical analysis required (target: 45-60 minutes)
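A lightweight way to apply per-category targets is a small configuration that a reporting script can check closed incidents against. The categories and minutes below mirror the list above; adjust them to your own SLAs.

```python
from datetime import timedelta

# Per-category resolution targets, mirroring the categories above (adjust to your SLAs).
TARGETS = {
    "quick_fix": timedelta(minutes=10),
    "standard_diagnostic": timedelta(minutes=30),
    "complex_investigation": timedelta(minutes=60),
}

def within_target(category: str, time_to_close: timedelta) -> bool:
    """Check a closed incident against its category's target (default: 30 minutes)."""
    return time_to_close <= TARGETS.get(category, timedelta(minutes=30))

print(within_target("quick_fix", timedelta(minutes=8)))               # True
print(within_target("complex_investigation", timedelta(minutes=75)))  # False
```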
Automation opportunities
Automate routine responses:
- Self-healing systems: Automatic restart or failover for common issues
- Diagnostic automation: Automatic collection of relevant logs and metrics (see the sketch after this list)
- Communication automation: Automatic status updates for stakeholders
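As a sketch of diagnostic automation, the snippet below gathers a recent log tail for the affected services when an incident opens, so responders start with evidence already in hand. The service names and paths are placeholders, and attaching the output to the incident is left to your own tooling.

```python
from pathlib import Path

# Placeholder log locations; substitute your own services and paths.
LOG_PATHS = {
    "checkout-api": Path("/var/log/checkout-api/app.log"),
    "payments-worker": Path("/var/log/payments-worker/app.log"),
}

def collect_log_tails(services: list[str], lines: int = 200) -> dict[str, str]:
    """Grab the last N log lines per affected service to attach to the incident."""
    tails = {}
    for service in services:
        path = LOG_PATHS.get(service)
        if path and path.exists():
            tail = path.read_text(errors="replace").splitlines()[-lines:]
            tails[service] = "\n".join(tail)
        else:
            tails[service] = "(log file not found)"
    return tails

# In practice this would be triggered by an alert webhook; called directly here.
evidence = collect_log_tails(["checkout-api", "payments-worker"])
print({service: len(tail) for service, tail in evidence.items()})
```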
Process optimization
Implement incident commanders:
- Dedicated coordinators: Assign specific people to manage incident workflow
- Clear communication: Single point of contact for updates and decisions
- Resource allocation: Ensure the right people are working on the right problems
Important considerations
- Balance speed with accuracy: Don't sacrifice proper investigation for faster closure times
- Consider incident severity: Different types of incidents may require different resolution time targets
- Account for business context: Weekend incidents may have different urgency than weekday issues
- Measure meaningful closure: Ensure incidents are actually resolved, not just closed
Next steps
- Immediate action: Analyze your current slowest-resolving incident types and implement quick wins
- Process improvement: Develop standardized incident response procedures and runbooks
- Tool enhancement: Improve alert context and diagnostic tool access
- Team development: Invest in training and cross-functional incident response capabilities
- Advance to Level 3: Once incident response is optimized, focus on service level attainment
For comprehensive guidance on incident management optimization, see our Alert Quality Management implementation guide.