SLA Uptime Calculator
Calculate your downtime budget, on-call burden, and incident capacity. See what "five nines" really means for your team.
SLA Uptime Report
Generated from nines.fyi
Results
- Incident duration estimated at 15 minutes
- Team size estimated at 5 people
- Rotation size estimated at 3 people
- Alert volume estimated at 40/month
- Incidents estimated at 2/month
- After-hours incidents estimated at 20%
Based on the inputs provided, this setup appears sustainable.
A 99.5% uptime target measured against business-hours coverage allows 52 minutes of downtime per month. This configuration appears sustainable, with lower operational risk.
Understanding SLA Uptime Percentages
Service Level Agreements (SLAs) define uptime targets as percentages, but these numbers can be deceptive. The difference between 99.9% and 99.99% uptime might seem small, but it represents a 10x reduction in allowed downtime.
Common SLA Tiers ("The Nines")
| SLA | Name | Downtime/Month | Downtime/Year |
|---|---|---|---|
| 99% | Two nines | 7.3 hours | 3.65 days |
| 99.9% | Three nines | 43.8 minutes | 8.76 hours |
| 99.95% | Three and a half nines | 21.9 minutes | 4.38 hours |
| 99.99% | Four nines | 4.38 minutes | 52.6 minutes |
| 99.999% | Five nines | 26.3 seconds | 5.26 minutes |
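The figures in the table follow from one multiplication: the length of the period times the fraction of time allowed to be down. A minimal sketch, assuming a 365-day year split into 12 equal months (730 hours each), which is what the table's numbers imply; the function and constant names are illustrative, not part of the calculator:

```python
HOURS_PER_YEAR = 365 * 24               # 8760, leap years ignored (assumption)
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730, an average month (assumption)


def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Minutes of downtime permitted in a period at the given SLA percentage."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * 60 * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Print the monthly and yearly budgets for the tiers in the table.
    for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
        monthly = allowed_downtime_minutes(sla, HOURS_PER_MONTH)
        yearly = allowed_downtime_minutes(sla, HOURS_PER_YEAR)
        print(f"{sla}%: {monthly:.2f} min/month, {yearly:.2f} min/year")
```

For 99.9% this gives about 43.8 minutes per month and about 525.6 minutes (8.76 hours) per year, matching the "Three nines" row.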
Beyond the Math: Operational Reality
This calculator goes beyond simple downtime math. It helps you understand the real-world implications of your SLA target:
- Incident budget: How many incidents can you have before breaching your SLA?
- On-call burden: How many days per year will each team member be on-call?
- Team sustainability: Is your rotation size adequate to prevent burnout?
- Alert fatigue: Are your alert volumes sustainable for your team?
Calendar Time vs Coverage Hours
SLAs can be measured two ways. Calendar time means 24/7/365 - the standard for most cloud services and SaaS products. Coverage hours means only during your support window - common for internal tools or B2B services with defined support hours.
A 99.9% SLA measured against calendar time allows 43.8 minutes of downtime per month. The same 99.9% measured against business hours (8x5) allows only 10.4 minutes - because there are fewer total hours to measure against.
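The gap between the two figures comes entirely from the size of the measurement window. A small sketch, assuming a business-hours window of 8 hours x 5 days x 52 weeks per year with no holidays removed, and a calendar month of 730 hours; these constants and function names are assumptions chosen to reproduce the numbers above, not the calculator's actual code:

```python
CALENDAR_HOURS_PER_MONTH = 365 * 24 / 12   # 730 hours (assumption)
WEEKS_PER_YEAR = 52                        # holidays not excluded (assumption)


def coverage_hours_per_month(hours_per_day: float, days_per_week: float) -> float:
    """Average monthly hours inside a weekly support window."""
    return hours_per_day * days_per_week * WEEKS_PER_YEAR / 12


def downtime_budget_minutes(sla_percent: float, window_hours: float) -> float:
    """Downtime allowed within the measurement window, in minutes."""
    return window_hours * 60 * (1 - sla_percent / 100)


business_hours = coverage_hours_per_month(8, 5)   # about 173.3 hours/month

calendar_budget = downtime_budget_minutes(99.9, CALENDAR_HOURS_PER_MONTH)
coverage_budget = downtime_budget_minutes(99.9, business_hours)
```

Here `calendar_budget` comes out at about 43.8 minutes and `coverage_budget` at about 10.4 minutes. The same function with a 99.5% target and the business-hours window gives 52 minutes.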
Understanding the Calculator Fields
Team Size vs Rotation Size
Team size is everyone who could potentially be on-call. Rotation size is how many people actively rotate through on-call duty. For example, a 10-person team might have a 5-person rotation if some members are exempt (managers, specialists, new hires ramping up). The rotation size directly impacts on-call burden - smaller rotations mean more frequent shifts per person.
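The effect of rotation size on on-call burden can be approximated by dividing the covered days evenly across the rotation. This is a rough sketch, assuming one primary responder at a time and perfectly even shifts; the function name and the default of 365 covered days are illustrative assumptions:

```python
def on_call_days_per_person(rotation_size: int, covered_days_per_year: float = 365) -> float:
    """Days per year each person in the rotation spends on-call.

    Assumes a single primary on-call at any time and evenly shared shifts.
    """
    if rotation_size < 1:
        raise ValueError("rotation_size must be at least 1")
    return covered_days_per_year / rotation_size


# The 10-person team with a 5-person rotation from the example above:
five_person_rotation = on_call_days_per_person(5)    # 73.0 days/year each
# Shrinking the same rotation to 3 people raises the load per person:
three_person_rotation = on_call_days_per_person(3)   # about 121.7 days/year each
```

Note that team size does not appear in the calculation at all: only the people who actually rotate share the load.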
Alerts vs Incidents
Alerts are notifications from your monitoring systems - not all require action. Incidents are actual problems requiring response. A healthy ratio is roughly 5:1 (alerts to incidents). If nearly every alert corresponds to an incident, your alerting is well-tuned. If you have 100 alerts but only 2 incidents (50:1), most alerts are noise and you likely have an alert fatigue problem.
Incident Duration (MTTR)
This is your Mean Time To Recovery - how long incidents typically last from detection to resolution. Industry benchmarks vary widely: elite teams recover in under 1 hour, high performers in under 1 day, and medium performers take between 1 day and 1 week. This field is crucial for calculating your incident budget - how many incidents of this length fit within your downtime allowance.
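The incident budget is a simple division of the downtime allowance by the typical incident length. A minimal sketch, assuming every incident is a full outage for its entire duration (partial degradation would need a weighting the page does not describe); the function name is hypothetical:

```python
import math


def incident_budget(downtime_budget_minutes: float, mttr_minutes: float) -> int:
    """Whole incidents of average length that fit inside the downtime budget."""
    if mttr_minutes <= 0:
        raise ValueError("mttr_minutes must be positive")
    return math.floor(downtime_budget_minutes / mttr_minutes)


# 52 minutes of monthly budget with 15-minute incidents:
per_month = incident_budget(52, 15)        # 3 incidents
# A 43.8-minute budget cannot absorb even one 60-minute incident:
long_incidents = incident_budget(43.8, 60) # 0 incidents
```

With the report's inputs (52 minutes of budget, 15-minute incidents, 2 incidents per month), the expected 30 minutes of downtime stays inside the budget.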
Response Window
The maximum time to acknowledge and begin responding to an incident. This isn't resolution time - it's how quickly someone starts working on it. Typical targets: Sev1 (critical) under 5 minutes, Sev2 (major) under 15 minutes, Sev3 (minor) under 1 hour. Tighter windows require more available responders and better alerting infrastructure.
Escalation Levels
Tiers of responders (L1, L2, L3). L1 handles initial response and common issues. L2 handles complex problems requiring deeper expertise. L3 is typically senior engineers or specialists for the hardest problems. More levels provide backup but add coordination overhead. Most teams use 2-3 levels.
Sustainability Thresholds
- On-call days: More than 84 days/year per person (roughly 1 week per month) for 24/7 coverage leads to elevated burnout risk
- Alert volume: More than 40/month (~10/week) starts causing alert fatigue; above 80/month is high risk
- Incident load: Spending more than 15-20% of team capacity on incident response indicates unsustainable load
- Rotation size: 24/7 coverage with fewer than 4 people creates significant burnout risk
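The thresholds above can be expressed as a small set of checks. This is only a sketch of how they might be combined: the function name, its parameters, and the choice to use the lower bound (15%) of the capacity range are assumptions, and how the calculator actually weighs these signals is not stated on this page.

```python
def sustainability_warnings(
    on_call_days_per_year: float,
    alerts_per_month: float,
    incident_capacity_fraction: float,
    rotation_size: int,
    is_24x7: bool,
) -> list[str]:
    """Return one warning per threshold exceeded (empty list if none)."""
    warnings = []
    # The on-call day and rotation size thresholds are stated for 24/7 coverage only.
    if is_24x7 and on_call_days_per_year > 84:
        warnings.append("on-call days: elevated burnout risk")
    if alerts_per_month > 80:
        warnings.append("alert volume: high risk")
    elif alerts_per_month > 40:
        warnings.append("alert volume: alert fatigue starting")
    # Assumption: flag at the lower end of the 15-20% range.
    if incident_capacity_fraction > 0.15:
        warnings.append("incident load: unsustainable")
    if is_24x7 and rotation_size < 4:
        warnings.append("rotation size: significant burnout risk")
    return warnings


# Business-hours coverage, 3-person rotation, 40 alerts/month, light incident load:
print(sustainability_warnings(121.7, 40, 0.01, 3, is_24x7=False))
# The same team stretched to 24/7 coverage:
print(sustainability_warnings(121.7, 40, 0.01, 3, is_24x7=True))
```

The first call returns no warnings, while the second flags both on-call days and rotation size, which illustrates how much the coverage model matters for a small rotation.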
Frequently Asked Questions
What's the difference between uptime and availability?
They're often used interchangeably, but technically: uptime is whether the system is running, while availability is whether it's accessible and functioning correctly for users.
A system can be "up" but unavailable due to network issues, degraded performance, or partial failures. Most SLAs measure availability from the user's perspective.
How do I choose the right SLA target?
Consider your dependencies first - you can't be more reliable than your least reliable dependency. If your cloud provider offers 99.9%, achieving 99.99% requires significant redundancy investment.
Also consider business impact: a 99.9% SLA for an internal tool is very different from 99.9% for a payment system. Start conservative and tighten as you build operational maturity.
What's a sustainable on-call rotation?
For 24/7 coverage, research suggests a minimum of 4-6 people in rotation to avoid burnout. This keeps individual on-call duty to roughly 1 week per month or less.
For business hours coverage, 3-4 people can work sustainably. Also consider: compensatory time off, on-call pay, and limiting consecutive on-call days.
How do I reduce alert fatigue?
Target fewer than 10 actionable alerts per on-call shift.
Key strategies: delete alerts that never lead to action, consolidate related alerts, tune thresholds based on actual impact, and implement alert deduplication.
Every alert should have a clear runbook - if you don't know what to do when it fires, it shouldn't page.
Should scheduled maintenance count against SLA?
It depends on your SLA definition. Many SLAs exclude "scheduled maintenance windows" if announced in advance (typically 24-72 hours).
However, modern best practices favor zero-downtime deployments and maintenance. If you need maintenance windows, consider whether your SLA should be measured against coverage hours rather than calendar time.
Related Concepts
Error Budgets
Your downtime budget can be reframed as an "error budget" - a resource to spend on velocity. If you're well under budget, you can take more risks with deployments. If you're close to breaching, slow down and focus on reliability.
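One way to act on an error budget is to track how much of it has been consumed in the current period. A minimal sketch; the 80% caution threshold, the function name, and the status labels are hypothetical values chosen for illustration, not a standard:

```python
def error_budget_status(
    budget_minutes: float,
    downtime_minutes: float,
    caution_at: float = 0.8,   # hypothetical threshold for "close to breaching"
) -> str:
    """Classify the current period by the fraction of error budget consumed."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    consumed = downtime_minutes / budget_minutes
    if consumed >= 1.0:
        return "breached"
    if consumed >= caution_at:
        return "slow down"
    return "room to take risks"


# 30 minutes used out of a 43.8-minute monthly budget (about 68% consumed):
print(error_budget_status(43.8, 30))
```

Teams typically pick their own caution threshold and attach a policy to each status, such as pausing risky deployments once the budget is nearly spent.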
SLOs vs SLAs
SLOs (Service Level Objectives) are internal targets, typically stricter than SLAs. SLAs are contractual commitments with consequences for breach. Set your SLO tighter than your SLA to have buffer before contractual impact.
MTTR vs MTTD vs MTTF
MTTR (Mean Time To Recovery) - how long it takes to fix an issue. MTTD (Mean Time To Detect) - how long until you know there's a problem. MTTF (Mean Time To Failure) - how long the system runs before it fails.
Composite SLAs
When services depend on each other, multiply their availabilities. Two 99.9% services in sequence give 0.999 × 0.999 ≈ 0.998, or about 99.8% combined. This is why distributed systems are hard - each dependency reduces your maximum achievable reliability.
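The multiplication rule extends to any number of services in a chain. A short sketch, assuming failures are independent and every service is a hard dependency (no redundancy or fallback); the function name is illustrative:

```python
from math import prod


def serial_availability(percentages: list[float]) -> float:
    """Combined availability (in percent) of services that must all be up.

    Assumes independent failures and no redundancy between services.
    """
    return prod(p / 100 for p in percentages) * 100


two_services = serial_availability([99.9, 99.9])       # about 99.8%
five_services = serial_availability([99.9] * 5)        # about 99.5%
```

Five 99.9% services chained together already drop to roughly 99.5%, which shows how quickly a long dependency chain erodes the achievable target.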