SLA Uptime Calculator

Calculate your downtime budget, on-call burden, and incident capacity. See what "five nines" really means for your team.

Your Setup

Fill in what you know. Unknown fields will be estimated or skipped. Hover the icons for explanations.

Uptime / SLA Target * The percentage of time your service must be available. Each additional "nine" (for example, 99.9% to 99.99%) cuts allowed downtime to one tenth.

Coverage Model * When your team provides support. Business hours (8x5) means weekdays 9-5. Extended (12x5) adds evening coverage. 24/7 means round-the-clock.

Measurement basis: SLA based on calendar time, or SLA based on coverage hours

Team Size Total engineers who can participate in on-call rotation. Larger teams can spread the load more sustainably.

Will estimate

Rotation Size How many people actively rotate through on-call duty. Should be less than or equal to team size.

Will estimate

Alerts / Month Total alerts your monitoring systems generate monthly. Research suggests >40/person/month leads to alert fatigue.

Will estimate

Incidents / Month Actual incidents requiring response (not just alerts). Typically 10-20% of total alerts become real incidents.

Will estimate

Incident Duration Average time from incident start to resolution (MTTR). Used to calculate how many incidents fit in your downtime budget.

Will estimate (15 min default)

Response Window Maximum time to acknowledge and begin responding to an incident. Tighter windows require more available responders.

Will estimate (15 min default)

MTTR Target Mean Time To Recovery target. Your goal for how quickly incidents should be resolved on average.

Not set

Escalation Levels Tiers of responders (L1, L2, L3). More levels provide backup but add coordination overhead.

Not set

After-Hours Incidents Percentage of incidents occurring outside business hours. Higher values increase on-call burden and burnout risk.

Will estimate based on coverage model

Results

Downtime Budget Maximum allowed downtime before breaching your SLA. Calculated from uptime target and measurement basis.

52 minutes

per month

1.71 minutes

daily

12 minutes

weekly

2.6 hours

quarterly

10.4 hours

yearly

Incidents Before Breach How many incidents of the specified duration can occur before exceeding your monthly downtime budget.

set duration to calculate

On-Call Load Days per year each person spends on-call. Research suggests >84 days/year for 24/7 coverage leads to burnout.

days/person/year

Recommended Rotation Minimum rotation size for sustainable on-call. Based on keeping individual load under burnout thresholds.

people minimum

On-Call Hours Weekly hours each person spends on-call based on coverage model and rotation size.

hours/person/week

Assumptions made:

• Incident duration estimated at 15 minutes
• Team size estimated at 5 people
• Rotation size estimated at 3 people
• Alert volume estimated at 40/month
• Incidents estimated at 2/month
• After-hours incidents estimated at 20%

Lower operational risk

Based on the inputs provided, this setup appears sustainable.

Summary

A 99.5% uptime target (measured against coverage hours) with business hours coverage allows 52 minutes of downtime per month. This configuration appears sustainable with lower operational risk.


Understanding SLA Uptime Percentages

Service Level Agreements (SLAs) define uptime targets as percentages, but these numbers can be deceptive. The difference between 99.9% and 99.99% uptime might seem small, but it represents a 10x reduction in allowed downtime.

Common SLA Tiers ("The Nines")

SLA	Name	Downtime/Month	Downtime/Year
99%	Two nines	7.3 hours	3.65 days
99.9%	Three nines	43.8 minutes	8.76 hours
99.95%	Three and a half nines	21.9 minutes	4.38 hours
99.99%	Four nines	4.38 minutes	52.6 minutes
99.999%	Five nines	26.3 seconds	5.26 minutes
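The figures in the table follow from one multiplication. A minimal sketch, assuming a 365-day year measured against calendar time and a month taken as one twelfth of a year (the function and constant names are illustrative, not part of the calculator):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def allowed_downtime_minutes(uptime_percent: float, period_hours: float) -> float:
    """Minutes of downtime permitted in a period at a given uptime target."""
    downtime_fraction = 1 - uptime_percent / 100
    return period_hours * 60 * downtime_fraction


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year = allowed_downtime_minutes(target, HOURS_PER_YEAR)
    per_month = per_year / 12  # a month is treated as 1/12 of a year
    print(f"{target}%: {per_month:.2f} min/month, {per_year:.2f} min/year")
```

After converting units, the printed values should line up with the table: 99.9% gives 43.8 minutes per month and 525.6 minutes (8.76 hours) per year.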

Beyond the Math: Operational Reality

This calculator goes beyond simple downtime math. It helps you understand the real-world implications of your SLA target:

Incident budget: How many incidents can you have before breaching your SLA?
On-call burden: How many days per year will each team member be on-call?
Team sustainability: Is your rotation size adequate to prevent burnout?
Alert fatigue: Are your alert volumes sustainable for your team?

Calendar Time vs Coverage Hours

SLAs can be measured two ways. Calendar time means 24/7/365 - the standard for most cloud services and SaaS products. Coverage hours means only during your support window - common for internal tools or B2B services with defined support hours.

A 99.9% SLA measured against calendar time allows 43.8 minutes of downtime per month. The same 99.9% measured against business hours (8x5) allows only 10.4 minutes - because there are fewer total hours to measure against.
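To see why the measurement basis matters, here is a rough sketch of both calculations. It assumes a calendar month of 730 hours (8,760 / 12) and a coverage month of weekly coverage hours times 52 / 12, which reproduces the figures above; the function names are hypothetical.

```python
def calendar_budget_minutes(uptime_percent: float) -> float:
    """Monthly downtime budget when the SLA is measured 24/7."""
    hours_per_month = 365 * 24 / 12  # 730 hours
    return hours_per_month * 60 * (1 - uptime_percent / 100)


def coverage_budget_minutes(uptime_percent: float, hours_per_week: float) -> float:
    """Monthly downtime budget when only coverage hours are measured."""
    hours_per_month = hours_per_week * 52 / 12  # 8x5 -> about 173.3 hours
    return hours_per_month * 60 * (1 - uptime_percent / 100)


print(round(calendar_budget_minutes(99.9), 1))      # 43.8
print(round(coverage_budget_minutes(99.9, 40), 1))  # 10.4 (8x5 = 40 h/week)
```

The percentage is the same in both calls; only the number of hours it is applied to changes.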

Understanding the Calculator Fields

Team Size vs Rotation Size

Team size is everyone who could potentially be on-call. Rotation size is how many people actively rotate through on-call duty. For example, a 10-person team might have a 5-person rotation if some members are exempt (managers, specialists, new hires ramping up). The rotation size directly impacts on-call burden - smaller rotations mean more frequent shifts per person.
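The effect of rotation size on individual load can be sketched as follows. This assumes 24/7 coverage with exactly one person on call at a time and shifts shared evenly, which may not match how the calculator weights things; the names are illustrative.

```python
DAYS_PER_YEAR = 365
HOURS_PER_WEEK = 24 * 7


def on_call_days_per_person(rotation_size: int) -> float:
    """Days per year each person is on call, with one person on call at a time."""
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return DAYS_PER_YEAR / rotation_size


def on_call_hours_per_week(rotation_size: int) -> float:
    """Average weekly on-call hours per person under 24/7 coverage."""
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return HOURS_PER_WEEK / rotation_size


print(on_call_days_per_person(5))             # 73.0
print(round(on_call_days_per_person(3), 1))   # 121.7
```

In the 10-person team example, only the 5 people in the rotation count: each carries about 73 days a year, not the 36.5 you would get by dividing across the whole team.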

Alerts vs Incidents

Alerts are notifications from your monitoring systems - not all require action. Incidents are actual problems requiring response. A healthy ratio is roughly 5:1 (alerts to incidents). If nearly every alert turns out to be an incident, your alerting is well-tuned. If you have 100 alerts but only 2 incidents, most of your alerts are noise, and that noise is what drives alert fatigue.

Incident Duration (MTTR)

This is your Mean Time To Recovery - how long incidents typically last from detection to resolution. Industry benchmarks vary: elite teams achieve under 1 hour, high performers under 1 day, and the median is about 1 day to 1 week. This field is crucial for calculating your incident budget - how many incidents fit within your downtime allowance.
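The incident budget is a simple division. A minimal sketch, assuming every minute of an incident counts as downtime and that only whole incidents are counted (the calculator may round differently; the function name is hypothetical):

```python
import math


def incidents_before_breach(budget_minutes: float, incident_minutes: float) -> int:
    """Whole incidents of a given duration that fit inside the downtime budget."""
    if incident_minutes <= 0:
        raise ValueError("incident_minutes must be positive")
    return math.floor(budget_minutes / incident_minutes)


print(incidents_before_breach(43.8, 15))  # 2
print(incidents_before_breach(43.8, 60))  # 0
```

With a 43.8-minute monthly budget, two 15-minute incidents fit, while a single one-hour incident already breaches the SLA.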

Response Window

The maximum time to acknowledge and begin responding to an incident. This isn't resolution time - it's how quickly someone starts working on it. Typical targets: Sev1 (critical) under 5 minutes, Sev2 (major) under 15 minutes, Sev3 (minor) under 1 hour. Tighter windows require more available responders and better alerting infrastructure.

Escalation Levels

Tiers of responders (L1, L2, L3). L1 handles initial response and common issues. L2 handles complex problems requiring deeper expertise. L3 is typically senior engineers or specialists for the hardest problems. More levels provide backup but add coordination overhead. Most teams use 2-3 levels.

Sustainability Thresholds

On-call days

More than 84 days/year (roughly 1 week per month) for 24/7 coverage leads to elevated burnout risk

Alerts per person

More than 40/month (~10/week) starts causing alert fatigue; above 80/month is high risk

Team utilization

Spending more than 15-20% of team capacity on incident response indicates unsustainable load

Rotation size

24/7 coverage with fewer than 4 people creates significant burnout risk
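These thresholds can be turned into a quick self-check. The sketch below is illustrative only: it assumes alerts are split evenly across the rotation and that one person is on call at a time, and it leaves out the team utilization threshold because that needs time-tracking data. The function name and messages are hypothetical.

```python
def sustainability_warnings(rotation_size: int, alerts_per_month: int, is_24x7: bool) -> list[str]:
    """Return a list of threshold warnings; an empty list means none were tripped."""
    warnings = []

    # Alerts per person, assuming an even split across the rotation.
    alerts_per_person = alerts_per_month / rotation_size
    if alerts_per_person > 80:
        warnings.append("alerts per person above 80/month: high risk")
    elif alerts_per_person > 40:
        warnings.append("alerts per person above 40/month: alert fatigue")

    # The on-call day and rotation size thresholds apply to 24/7 coverage.
    if is_24x7:
        if 365 / rotation_size > 84:
            warnings.append("more than 84 on-call days per person per year")
        if rotation_size < 4:
            warnings.append("fewer than 4 people covering 24/7")

    return warnings


print(sustainability_warnings(rotation_size=5, alerts_per_month=100, is_24x7=True))  # []
```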

Frequently Asked Questions

What's the difference between uptime and availability?

They're often used interchangeably, but technically: uptime is whether the system is running, while availability is whether it's accessible and functioning correctly for users.

A system can be "up" but unavailable due to network issues, degraded performance, or partial failures. Most SLAs measure availability from the user's perspective.

How do I choose the right SLA target?

Consider your dependencies first - without redundancy, you can't be more reliable than your least reliable dependency. If your cloud provider offers 99.9%, achieving 99.99% requires significant redundancy investment.

Also consider business impact: a 99.9% SLA for an internal tool is very different from 99.9% for a payment system. Start conservative and tighten as you build operational maturity.
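To get a feel for what redundancy buys, here is a rough sketch of the standard formula for redundant copies of a dependency. It assumes the copies fail independently and that failover is instant and perfect, which real systems rarely achieve, so treat the result as an upper bound. The function name is illustrative.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of N independent copies where any one copy is enough."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1 - (1 - single_availability) ** copies


print(f"{redundant_availability(0.999, 2):.6f}")  # 0.999999
```

Two independent 99.9% copies could in theory reach 99.9999%, but correlated failures (shared region, shared deploy, shared bug) usually keep the real number well below that.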

What's a sustainable on-call rotation?

For 24/7 coverage, research suggests a minimum of 4-6 people in rotation to avoid burnout. This keeps individual on-call duty to roughly 1 week per month or less.
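The minimum rotation size can be derived from the on-call day threshold. A minimal sketch, assuming one person on call at a time under 24/7 coverage and the 84 days/year limit used elsewhere on this page (the function name is hypothetical):

```python
import math


def minimum_rotation_size(max_days_per_person: float = 84) -> int:
    """Smallest rotation that keeps each person at or under the yearly day limit."""
    if max_days_per_person <= 0:
        raise ValueError("max_days_per_person must be positive")
    return math.ceil(365 / max_days_per_person)


print(minimum_rotation_size())  # 5
```

Under these assumptions a 4-person rotation lands at about 91 days per person, just over the threshold, so 5 is the first size that stays under it.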

For business hours coverage, 3-4 people can work sustainably. Also consider: compensatory time off, on-call pay, and limiting consecutive on-call days.

How do I reduce alert fatigue?

Target fewer than 10 actionable alerts per on-call shift.

Key strategies: delete alerts that never lead to action, consolidate related alerts, tune thresholds based on actual impact, and implement alert deduplication.

Every alert should have a clear runbook - if you don't know what to do when it fires, it shouldn't page.

Should scheduled maintenance count against SLA?

It depends on your SLA definition. Many SLAs exclude "scheduled maintenance windows" if announced in advance (typically 24-72 hours).

However, modern best practices favor zero-downtime deployments and maintenance. If you need maintenance windows, consider whether your SLA should be measured against coverage hours rather than calendar time.

Related Concepts

Error Budgets

Your downtime budget can be reframed as an "error budget" - a resource to spend on velocity. If you're well under budget, you can take more risks with deployments. If you're close to breaching, slow down and focus on reliability.
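Tracking the budget is a matter of comparing downtime so far against the allowance. A small sketch, with an illustrative function name and sample numbers that are not taken from the calculator:

```python
def error_budget_remaining(budget_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLA is breached."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return 1 - downtime_minutes / budget_minutes


# Example: a 43.8-minute monthly budget with 30 minutes of downtime so far.
remaining = error_budget_remaining(43.8, 30)
print(f"{remaining:.1%} of the budget left")  # 31.5% of the budget left
```

What counts as "close to breaching" is a team decision; the function only reports the fraction left.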

SLOs vs SLAs

SLOs (Service Level Objectives) are internal targets, typically stricter than SLAs. SLAs are contractual commitments with consequences for breach. Set your SLO tighter than your SLA so you have a buffer before a miss becomes a contractual problem.

MTTR vs MTTD vs MTTF

MTTR (Mean Time To Recovery) - how long it takes to restore service after a failure. MTTD (Mean Time To Detect) - how long until you know there's a problem. MTTF (Mean Time To Failure) - how long the system runs, on average, before it fails.
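These metrics connect back to uptime through the standard steady-state availability formula, MTTF / (MTTF + MTTR). A minimal sketch, assuming MTTR here covers the whole outage including detection time (the function name is illustrative):

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: time running divided by total time."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Example: one failure every 999 hours of operation, one hour to recover.
print(f"{availability(999, 1):.3%}")  # 99.900%
```

Halving MTTR or doubling MTTF moves availability by a similar amount, which is why faster detection and recovery are often the cheaper route to an extra nine.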

Composite SLAs

When services depend on each other, multiply their availabilities. Two 99.9% services in sequence = 99.8% combined. This is why distributed systems are hard - each dependency reduces your maximum achievable reliability.
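The multiplication can be sketched in a few lines. This assumes every service in the chain is a hard dependency and that failures are independent; the function name is illustrative.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Combined availability when every service in the chain must be up."""
    return math.prod(availabilities)


print(f"{serial_availability([0.999, 0.999]):.4%}")       # 99.8001%
print(f"{serial_availability([0.999] * 5):.2%}")          # 99.50%
```

Five 99.9% services in sequence already drop you to roughly 99.5%, before counting any failures of your own.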
