SLA Uptime Calculator

Calculate your downtime budget, on-call burden, and incident capacity. See what "five nines" really means for your team.

Understanding SLA Uptime Percentages

Service Level Agreements (SLAs) define uptime targets as percentages, but these numbers can be deceptive. The difference between 99.9% and 99.99% uptime might seem small, but it represents a 10x reduction in allowed downtime.

Common SLA Tiers ("The Nines")

SLA        Name                         Downtime/Month    Downtime/Year
99%        Two nines                    7.3 hours         3.65 days
99.9%      Three nines                  43.8 minutes      8.76 hours
99.95%     Three and a half nines       21.9 minutes      4.38 hours
99.99%     Four nines                   4.38 minutes      52.6 minutes
99.999%    Five nines                   26.3 seconds      5.26 minutes
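
The figures in the table can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 365-day year and an average month of one twelfth of that (730 hours); the function and constant names are illustrative and are not taken from the calculator itself.

```python
# Period lengths in hours. Assumption: 365-day year, month = year / 12.
HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12


def downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Return the allowed downtime in minutes for an SLA over a period."""
    unavailable_fraction = 1 - sla_percent / 100
    return unavailable_fraction * period_hours * 60


# Print the monthly and yearly budget for each tier in the table.
for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_month = downtime_minutes(sla, HOURS_PER_MONTH)
    per_year = downtime_minutes(sla, HOURS_PER_YEAR)
    print(f"{sla}%: {per_month:.2f} min/month, {per_year:.2f} min/year")
```

Each extra nine divides the unavailable fraction by ten, which is where the 10x reduction in allowed downtime comes from.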

Beyond the Math: Operational Reality

This calculator goes beyond simple downtime math. It helps you understand the real-world implications of your SLA target:

  • Incident budget: How many incidents can you have before breaching your SLA?
  • On-call burden: How many days per year will each team member be on-call?
  • Team sustainability: Is your rotation size adequate to prevent burnout?
  • Alert fatigue: Are your alert volumes sustainable for your team?

Calendar Time vs Coverage Hours

SLAs can be measured two ways. Calendar time means 24/7/365 - the standard for most cloud services and SaaS products. Coverage hours means only during your support window - common for internal tools or B2B services with defined support hours.

A 99.9% SLA measured against calendar time allows 43.8 minutes of downtime per month. The same 99.9% measured against business hours (8x5) allows only 10.4 minutes - because there are fewer total hours to measure against.
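
The two measurement bases differ only in how many hours the percentage is applied to. A rough sketch of that comparison, assuming 365/7 weeks per year and no holidays (both simplifications introduced here, as are the function and constant names):

```python
# Assumption: a year is 365 days, so it contains 365 / 7 weeks.
WEEKS_PER_YEAR = 365 / 7


def monthly_budget_minutes(sla_percent: float,
                           hours_per_day: float,
                           days_per_week: float) -> float:
    """Allowed downtime per average month for a given coverage window."""
    hours_per_month = hours_per_day * days_per_week * WEEKS_PER_YEAR / 12
    return (1 - sla_percent / 100) * hours_per_month * 60


calendar = monthly_budget_minutes(99.9, hours_per_day=24, days_per_week=7)
business = monthly_budget_minutes(99.9, hours_per_day=8, days_per_week=5)
print(f"24x7: {calendar:.1f} min/month, 8x5: {business:.1f} min/month")
```

The percentage stays the same, but the 8x5 window covers less than a quarter of the calendar hours, so the budget shrinks in proportion.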

Understanding the Calculator Fields

Team Size vs Rotation Size

Team size is everyone who could potentially be on-call. Rotation size is how many people actively rotate through on-call duty. For example, a 10-person team might have a 5-person rotation if some members are exempt (managers, specialists, new hires ramping up). The rotation size directly impacts on-call burden - smaller rotations mean more frequent shifts per person.
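
To make the effect of rotation size concrete, here is a simple sketch. It assumes a single primary on-call slot shared evenly across the rotation for every day of the year; the calculator may use a different model, and the function name is hypothetical.

```python
def on_call_days_per_person(rotation_size: int,
                            days_covered_per_year: float = 365) -> float:
    """Days per year each person is on-call, assuming one on-call slot
    split evenly across the rotation."""
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return days_covered_per_year / rotation_size


# A 10-person team where only 5 people rotate:
print(on_call_days_per_person(5))   # burden with the actual rotation
print(on_call_days_per_person(10))  # burden if everyone rotated
```

Halving the rotation doubles each person's on-call days, regardless of how large the wider team is.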

Alerts vs Incidents

Alerts are notifications from your monitoring systems - not all require action. Incidents are actual problems requiring response. A healthy ratio is roughly 5:1 (alerts to incidents). If every alert is an incident, your alerting is well-tuned. If you have 100 alerts but only 2 incidents, most of your alerts are noise and you likely have alert fatigue problems.

Incident Duration (MTTR)

This is your Mean Time To Recovery - how long incidents typically last from detection to resolution. Industry benchmarks vary: elite teams achieve under 1 hour, high performers under 1 day, and the median is about 1 day to 1 week. This field is crucial for calculating your incident budget - how many incidents fit within your downtime allowance.
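
One plausible way to turn MTTR into an incident budget is to divide the downtime allowance by the typical incident duration. This sketch assumes every incident is a full outage lasting exactly the MTTR, which is a simplification; the function name is illustrative.

```python
import math


def incident_budget(downtime_budget_minutes: float,
                    mttr_minutes: float) -> int:
    """Number of whole incidents that fit inside the downtime allowance,
    assuming each one causes a full outage lasting mttr_minutes."""
    if mttr_minutes <= 0:
        raise ValueError("mttr_minutes must be positive")
    return math.floor(downtime_budget_minutes / mttr_minutes)


# 99.9% over a month allows 43.8 minutes of downtime.
print(incident_budget(43.8, mttr_minutes=30))
print(incident_budget(43.8, mttr_minutes=10))
```

Under this model, shortening MTTR raises the number of incidents you can absorb without changing the SLA target at all.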

Response Window

The maximum time to acknowledge and begin responding to an incident. This isn't resolution time - it's how quickly someone starts working on it. Typical targets: Sev1 (critical) under 5 minutes, Sev2 (major) under 15 minutes, Sev3 (minor) under 1 hour. Tighter windows require more available responders and better alerting infrastructure.

Escalation Levels

Tiers of responders (L1, L2, L3). L1 handles initial response and common issues. L2 handles complex problems requiring deeper expertise. L3 is typically senior engineers or specialists for the hardest problems. More levels provide backup but add coordination overhead. Most teams use 2-3 levels.

Sustainability Thresholds

On-call days

More than 84 days/year (roughly 1 week per month) for 24/7 coverage leads to elevated burnout risk

Alerts per person

More than 40/month (~10/week) starts causing alert fatigue; above 80/month is high risk

Team utilization

Spending more than 15-20% of team capacity on incident response indicates unsustainable load

Rotation size

24/7 coverage with fewer than 4 people creates significant burnout risk
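
The thresholds above can be expressed as a simple checklist. The sketch below is illustrative only: the function and parameter names are invented here, and it treats the lower end of the 15-20% utilization range as the trigger, which is a choice made for this example.

```python
def sustainability_warnings(on_call_days: float,
                            alerts_per_person_month: float,
                            utilization: float,
                            rotation_size: int,
                            is_24x7: bool = True) -> list[str]:
    """Return a warning for every threshold from the text that is exceeded.

    utilization is the fraction of team capacity spent on incident
    response, e.g. 0.12 for 12%.
    """
    warnings = []
    if is_24x7 and on_call_days > 84:
        warnings.append("on-call days: elevated burnout risk")
    if alerts_per_person_month > 80:
        warnings.append("alerts per person: high risk")
    elif alerts_per_person_month > 40:
        warnings.append("alerts per person: alert fatigue likely")
    if utilization > 0.15:  # assumption: lower bound of the 15-20% range
        warnings.append("team utilization: unsustainable load")
    if is_24x7 and rotation_size < 4:
        warnings.append("rotation size: significant burnout risk")
    return warnings


print(sustainability_warnings(on_call_days=73, alerts_per_person_month=30,
                              utilization=0.10, rotation_size=5))
print(sustainability_warnings(on_call_days=120, alerts_per_person_month=90,
                              utilization=0.25, rotation_size=3))
```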

Frequently Asked Questions

What's the difference between uptime and availability?

They're often used interchangeably, but technically: uptime is whether the system is running, while availability is whether it's accessible and functioning correctly for users.

A system can be "up" but unavailable due to network issues, degraded performance, or partial failures. Most SLAs measure availability from the user's perspective.

How do I choose the right SLA target?

Consider your dependencies first - without redundancy, you can't be more reliable than your least reliable dependency. If your cloud provider offers 99.9%, achieving 99.99% requires significant redundancy investment.

Also consider business impact: a 99.9% SLA for an internal tool is very different from 99.9% for a payment system. Start conservative and tighten as you build operational maturity.
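
To see why redundancy is the usual route past a dependency's own availability, consider the standard formula for independent replicas where only one needs to be up. This is an idealized sketch: real failures are often correlated, failover is not instant, and the function name is hypothetical.

```python
def redundant_availability(replica_availability: float,
                           replicas: int) -> float:
    """Availability of N replicas in parallel, where the service is up
    if at least one replica is up. Assumes independent failures."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - replica_availability) ** replicas


for n in (1, 2, 3):
    print(n, f"{redundant_availability(0.999, n):.9f}")
```

The idealized numbers look generous, which is why they should be read as an upper bound rather than a target you can commit to.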

What's a sustainable on-call rotation?

For 24/7 coverage, research suggests a minimum of 4-6 people in rotation to avoid burnout. This keeps individual on-call duty to roughly 1 week per month or less.

For business hours coverage, 3-4 people can work sustainably. Also consider: compensatory time off, on-call pay, and limiting consecutive on-call days.

How do I reduce alert fatigue?

Target fewer than 10 actionable alerts per on-call shift.

Key strategies: delete alerts that never lead to action, consolidate related alerts, tune thresholds based on actual impact, and implement alert deduplication.

Every alert should have a clear runbook - if you don't know what to do when it fires, it shouldn't page.

Should scheduled maintenance count against SLA?

It depends on your SLA definition. Many SLAs exclude "scheduled maintenance windows" if announced in advance (typically 24-72 hours).

However, modern best practices favor zero-downtime deployments and maintenance. If you need maintenance windows, consider whether your SLA should be measured against coverage hours rather than calendar time.

Related Concepts

Error Budgets

Your downtime budget can be reframed as an "error budget" - a resource to spend on velocity. If you're well under budget, you can take more risks with deployments. If you're close to breaching, slow down and focus on reliability.
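
As a small illustration, the remaining error budget can be tracked as a fraction of the period's allowance. The function name and the example figures are made up for this sketch.

```python
def error_budget_remaining(budget_minutes: float,
                           downtime_so_far_minutes: float) -> float:
    """Fraction of the error budget still unspent in the current period.

    1.0 means untouched, 0.0 means fully spent, and a negative value
    means the budget has been exceeded.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return 1 - downtime_so_far_minutes / budget_minutes


# 99.9% monthly budget of 43.8 minutes, with 21.9 minutes already used:
print(f"{error_budget_remaining(43.8, 21.9):.0%} of the budget remains")
```

How much remaining budget justifies a risky deployment is a team policy decision, not something this calculation determines.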

SLOs vs SLAs

SLOs (Service Level Objectives) are internal targets, typically stricter than SLAs. SLAs are contractual commitments with consequences for breach. Set your SLO tighter than your SLA to have buffer before contractual impact.

MTTR vs MTTD vs MTTF

MTTR (Mean Time To Recovery) - how long it takes to fix an issue. MTTD (Mean Time To Detect) - how long until you know there's a problem. MTTF (Mean Time To Failure) - how long the system typically runs before it fails.
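
MTTF and MTTR connect back to uptime percentages through the standard steady-state availability formula, MTTF / (MTTF + MTTR). A minimal sketch, assuming both values are long-run averages in the same unit and that MTTR includes detection time; the function name is illustrative.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the share of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive, mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# A system that runs 999 hours between failures and takes 1 hour to recover:
print(f"{availability(999, 1):.3%}")
```

The formula shows two levers for the same target: fail less often (raise MTTF) or recover faster (lower MTTR).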

Composite SLAs

When services depend on each other, multiply their availabilities. Two 99.9% services in sequence give about 99.8% combined (0.999 x 0.999 = 0.998001). This is why distributed systems are hard - each dependency reduces your maximum achievable reliability.
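
The multiplication rule is easy to check in code. This sketch assumes every service in the chain is required and that failures are independent; the function name is illustrative.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Combined availability of services that must all be up at once."""
    return math.prod(availabilities)


two_services = serial_availability([0.999, 0.999])
ten_services = serial_availability([0.999] * 10)
print(f"2 services: {two_services:.4%}, 10 services: {ten_services:.4%}")
```

With ten such dependencies in the request path, the combined figure drops to roughly 99%, a full nine lower than any single service.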

nines.fyi - Free SLA uptime calculator for DevOps and SRE teams