
Level 1 - Service delivery alert coverage scorecard rule

Service delivery alert coverage ensures that your customer-facing applications and services have monitoring alerts in place to detect issues that could impact user experience and business operations.

About this scorecard rule

This service delivery alert coverage rule is part of Level 1 (Reactive) in the business uptime maturity model. It verifies that your applications and services have basic alerting configured to notify you when customer-facing problems occur.

Why this matters: Service delivery issues directly impact customer experience and business revenue. Without proper application alerting, you might only discover problems when customers report them, leading to longer outages and damaged customer relationships.

How this rule works

This rule examines your service delivery entities and checks whether they have alert conditions defined. Specifically, it looks for alerts on:

  • APM-APPLICATION entities: Backend applications and services monitored by APM agents
  • BROWSER-APPLICATION entities: Frontend web applications monitored by browser monitoring
  • MOBILE-APPLICATION entities: Mobile apps monitored by mobile monitoring
  • SYNTH-MONITOR entities: Synthetic monitors that simulate user interactions

The rule fails if any monitored service delivery entity lacks at least one alert condition.
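
One way to see which entities this rule would flag is an entity search query, for example in the NerdGraph entitySearch API or the entity explorer's search bar. The sketch below is illustrative only; it assumes entity search supports filtering on the alertSeverity attribute, where NOT_CONFIGURED marks entities with no alert conditions:

  domain IN ('APM', 'BROWSER', 'MOBILE', 'SYNTH')
    AND reporting = 'true'
    AND alertSeverity = 'NOT_CONFIGURED'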

Understanding your score

  • Pass (Green): All service delivery entities have at least one alert condition defined
  • Fail (Red): One or more service delivery entities lack alert coverage
  • Target: 100% alert coverage across all customer-facing applications and services

What this means:

  • Passing score: Your application monitoring foundation is in place to detect customer-impacting issues
  • Failing score: Some applications or services could fail without alerting your team, potentially impacting customers

How to improve service delivery alert coverage

If your score shows missing service delivery alerts, follow these steps to establish comprehensive coverage:

1. Identify uncovered services

  1. Review the failing entities: Identify which specific applications or services lack alert coverage
  2. Prioritize by customer impact: Focus first on customer-facing applications and revenue-critical services
  3. Assess service criticality: Determine which services require immediate vs. delayed alerting

2. Set up essential service delivery alerts

Configure alerts for these critical metrics based on your entity type:

APM application alerts:

  • Error rate: Alert when error percentage exceeds 5% for 5 minutes
  • Response time: Alert when average response time exceeds acceptable thresholds (e.g., >2 seconds)
  • Throughput: Alert when request volume drops significantly, indicating potential outages
  • Apdex score: Alert when user satisfaction scores fall below acceptable levels (e.g., less than 0.8)
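
As a starting point, NRQL conditions along these lines capture the error-rate and Apdex signals. The app name 'checkout-service' and the 0.5 second Apdex target are placeholders, and the thresholds listed above are applied in the condition settings rather than in the query:

  // Error rate: percentage of transactions that failed
  SELECT percentage(count(*), WHERE error IS true)
  FROM Transaction WHERE appName = 'checkout-service'

  // Apdex: user satisfaction against a 0.5 second response-time target
  SELECT apdex(duration, t: 0.5)
  FROM Transaction WHERE appName = 'checkout-service'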

Browser application alerts:

  • JavaScript errors: Alert when frontend error rates spike
  • Page load time: Alert when page load times exceed user experience thresholds
  • Core Web Vitals: Alert when metrics like Largest Contentful Paint or Cumulative Layout Shift degrade
  • User sessions: Alert when active user sessions drop unexpectedly
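
A sketch of the corresponding NRQL signals, with 'storefront-web' as a placeholder browser app name:

  // JavaScript error volume
  SELECT count(*) FROM JavaScriptError WHERE appName = 'storefront-web'

  // Page load time in seconds
  SELECT average(duration) FROM PageView WHERE appName = 'storefront-web'

  // Core Web Vitals: 75th percentile Largest Contentful Paint
  SELECT percentile(largestContentfulPaint, 75)
  FROM PageViewTiming WHERE appName = 'storefront-web'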

Mobile application alerts:

  • Crash rate: Alert when app crash rates exceed 1-2%
  • Network errors: Alert when network request failures spike
  • App launch time: Alert when app startup times become unacceptable
  • User interactions: Alert when key user actions (login, purchase) fail frequently
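
For mobile, the default event types reported by the mobile agents can drive these signals. 'My Mobile App' is a placeholder name, and a crash-rate ratio can be built from the same events:

  // Crash volume for one app
  SELECT count(*) FROM MobileCrash WHERE appName = 'My Mobile App'

  // Failed network requests
  SELECT count(*) FROM MobileRequestError WHERE appName = 'My Mobile App'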

Synthetic monitor alerts:

  • Monitor failures: Alert immediately when synthetic checks fail
  • Performance degradation: Alert when synthetic transaction times increase significantly
  • Availability: Alert when uptime drops below SLA requirements (e.g., less than 99.9%)
  • Multi-location failures: Alert when the same issue appears across multiple locations
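
Synthetic results land in the SyntheticCheck event type, so failure and multi-location conditions can be sketched roughly as follows ('Checkout flow' is a placeholder monitor name):

  // Failed checks for a single monitor
  SELECT count(*) FROM SyntheticCheck
  WHERE monitorName = 'Checkout flow' AND result = 'FAILED'

  // Failures broken out by monitor and location to spot multi-location incidents
  SELECT count(*) FROM SyntheticCheck
  WHERE result = 'FAILED' FACET monitorName, location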

3. Configure alerts effectively

Set appropriate thresholds:

  • Base thresholds on historical performance data and business requirements
  • Use different thresholds for different environments (production should be more sensitive)
  • Consider user experience impact when setting response time and error rate thresholds
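
To ground thresholds in historical data, queries like these summarize recent behavior per service (a sketch; adjust the time window to your traffic patterns):

  // Response-time distribution over the past week
  SELECT percentile(duration, 50, 95, 99)
  FROM Transaction FACET appName SINCE 1 week AGO

  // Error-rate baseline over the past week
  SELECT percentage(count(*), WHERE error IS true)
  FROM Transaction FACET appName SINCE 1 week AGO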

Choose proper evaluation windows:

  • Use shorter windows (2-5 minutes) for critical user-facing issues
  • Use longer windows (10-15 minutes) for performance trends that take time to develop
  • Avoid windows so short that they trigger on temporary fluctuations

4. Establish incident response procedures

  1. Define notification channels: Set up integrations with Slack, PagerDuty, or email
  2. Assign responsible teams: Ensure alerts reach teams who can diagnose and fix issues
  3. Create escalation paths: Define what happens if alerts aren't acknowledged within SLA timeframes
  4. Test response procedures: Verify that teams can actually respond to and resolve alerted issues

Measuring improvement

Track these metrics to verify your service delivery alert coverage improvements:

  • Coverage percentage: Aim for 100% alert coverage on production applications and services
  • Mean time to detection (MTTD): Measure how quickly alerts identify customer-impacting issues
  • Alert accuracy: Monitor the percentage of alerts that represent genuine problems requiring action
  • Customer impact reduction: Track whether faster detection leads to shorter customer-facing outages

Common scenarios and solutions

Legacy or unused applications:

  • Problem: Old applications still appear in monitoring but no longer serve customers
  • Solution: Remove unused applications from monitoring or tag them as deprecated to exclude from coverage requirements

Development and testing environments:

  • Problem: Non-production applications clutter alert coverage metrics
  • Solution: Use tags or naming conventions to separate environments and focus coverage rules on production services

Microservices architectures:

  • Problem: Many small services make 100% coverage challenging to achieve and maintain
  • Solution: Prioritize customer-facing services and critical dependencies, and use service maps to identify key components

Third-party dependencies:

  • Problem: External services aren't under your control but impact your applications
  • Solution: Create synthetic monitors to test critical third-party integrations and APIs

Advanced considerations

Customizing coverage rules

You may need to adjust the scorecard rule if:

  • Different service types: Your architecture includes other entity types (Lambda functions, databases, message queues)
  • Business criticality levels: Some services are more critical than others and require different alert strategies
  • Deployment patterns: Canary deployments or blue-green deployments may temporarily affect coverage

Alert coordination and dependencies

For complex service architectures:

  • Service dependencies: Configure alerts to account for upstream service failures
  • Alert correlation: Group related alerts to avoid notification storms during incidents
  • Intelligent alerting: Use machine learning features to reduce false positives and improve signal quality

Important considerations

  • Customer impact focus: Prioritize alerts for issues that directly affect customer experience
  • Balance coverage with quality: Ensure comprehensive coverage doesn't create alert fatigue
  • Regular maintenance: Review and update alert conditions as your applications evolve
  • Cross-team coordination: Ensure development and operations teams collaborate on alert strategy

Next steps

  1. Immediate action: Set up basic alerts for any services currently lacking coverage
  2. Ongoing monitoring: Review this scorecard rule weekly to maintain coverage as services change
  3. Quality improvement: Focus on alert effectiveness and reducing false positives
  4. Advance to Level 2: Once service delivery alerting is established, focus on proactive monitoring practices

For detailed guidance on application monitoring setup, see our documentation for APM, browser monitoring, mobile monitoring, and synthetic monitoring.
