Observability in Action: SLI, SLO, SLA, and Error Budgets

Chandan Kumar
3 min read · Nov 21, 2024

SLIs, SLOs, SLAs, and Error Budgets are key concepts in observability that help teams measure, manage, and balance system performance and reliability. By tracking these metrics, teams can ensure they meet customer expectations while maintaining the flexibility to innovate.

1. SLI (Service Level Indicator)

  • What it is: A measurement of a specific aspect of service performance.
  • Think of it as: The raw numbers or metrics you are tracking.

Example:

  • “The percentage of requests served under 200ms latency.”
  • “The uptime percentage in the last month.”
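
As a rough illustration of the first example, a latency SLI can be computed from raw request timings. This is a minimal sketch: the function name `latency_sli`, the sample data, and the choice to count only requests strictly under the threshold are assumptions made for illustration, not part of any particular monitoring tool.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """Return the percentage of requests served strictly under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    # A request counts as "good" when its latency is below the threshold.
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * good / len(latencies_ms)


# Hypothetical sample of request latencies in milliseconds.
sample = [120, 95, 310, 180, 199, 250, 60, 140, 175, 90]
print(f"{latency_sli(sample):.1f}% of requests under 200ms")  # prints "80.0% of requests under 200ms"
```

In practice the timings would come from logs or a metrics system rather than a hard-coded list, but the ratio of good events to total events is the same idea.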

2. SLO (Service Level Objective)

  • What it is: A target or goal based on the SLI.
  • Think of it as: What you aim to achieve to maintain service quality.

Example:

  • “We aim for 99.9% of requests to be served under 200ms latency.”
  • “We aim for 99.95% uptime per month.”
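
Checking an SLO amounts to comparing the measured SLI with the target. The sketch below assumes you already have counts of good and total requests; the name `meets_slo` and the request counts are hypothetical.

```python
def meets_slo(good_events, total_events, target_percent):
    """Return (is_met, sli_percent) for a measured SLI against an SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli_percent = 100.0 * good_events / total_events
    return sli_percent >= target_percent, sli_percent


# Hypothetical month: 1,000,000 requests, 999,412 of them under 200ms.
is_met, sli = meets_slo(999_412, 1_000_000, target_percent=99.9)
print(f"SLI = {sli:.4f}%, SLO met: {is_met}")
```

With these made-up numbers the SLI is about 99.94%, which clears the 99.9% objective.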

3. SLA (Service Level Agreement)

  • What it is: A formal agreement between a provider and a customer about the service level they can expect, often including penalties for failing to meet it.
  • Think of it as: The promise you make and the consequences if you break it.

Example:

  • “We guarantee 99.95% uptime monthly. If we don’t meet it, we’ll give you a refund.”

Analogy:

Imagine a pizza delivery service:

  • SLI: “It took 25 minutes to deliver the pizza.” (the actual measurement)
  • SLO: “We aim to deliver pizzas within 30 minutes 95% of the time.” (the goal)
  • SLA: “If we deliver late more than 5% of the time in a month, you get a free pizza.” (the contract)

4. Error Budget

An error budget is a concept that helps teams balance reliability with speed of innovation by quantifying how much failure is acceptable for a system within a specific period.

In Simple Terms

An error budget is the amount of “wiggle room” your system has to experience failures (like downtime or slow performance) without breaching your Service Level Objective (SLO).

How It Works

Start with the SLO:

  • Suppose your SLO is 99.9% uptime in a month.
  • Total minutes in a 30-day month = 43,200 minutes.
  • 99.9% uptime means you can have 0.1% downtime, or 43.2 minutes of acceptable downtime.

Error Budget:
The 43.2 minutes is your error budget for the month. You can “spend” this budget on unexpected downtime, incidents, or experiments that might cause instability.
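
The arithmetic above can be sketched in a few lines. The function name `error_budget_minutes` and the default 30-day window are assumptions for illustration.

```python
def error_budget_minutes(slo_percent, days=30):
    """Return the allowed downtime in minutes for an uptime SLO over `days` days."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = days * 24 * 60  # 43,200 minutes for a 30-day month
    # The budget is the share of the window that the SLO does NOT cover.
    return total_minutes * (100.0 - slo_percent) / 100.0


print(f"{error_budget_minutes(99.9):.1f} minutes")   # about 43.2 minutes
print(f"{error_budget_minutes(99.95):.1f} minutes")  # about 21.6 minutes
```

Note how quickly the budget shrinks as the target tightens: moving from 99.9% to 99.95% halves the allowed downtime.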

Why It’s Useful

Encourages Innovation:
Teams can take calculated risks (e.g., deploying new features) as long as they stay within the error budget.

Prevents Overengineering:

If the reliability target is higher than users actually need, teams slow down innovation unnecessarily. An error budget keeps reliability balanced against delivering new features.

Guides Decision-Making:

If you’ve already “spent” most of the error budget, you might decide to:

  • Pause risky deployments.
  • Focus on improving system stability.
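
One way to turn this into a repeatable rule is a small policy function. This is only a sketch: the 75% caution threshold and the three labels are assumptions chosen for illustration, and each team would pick its own values.

```python
def deployment_guidance(budget_minutes, spent_minutes, caution_threshold=0.75):
    """Suggest a release posture from the fraction of error budget already spent."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    spent_fraction = spent_minutes / budget_minutes
    if spent_fraction >= 1.0:
        return "freeze"      # budget exhausted: pause risky deployments
    if spent_fraction >= caution_threshold:
        return "stabilize"   # most of the budget is gone: focus on stability
    return "proceed"         # plenty of budget left: keep shipping


# Hypothetical readings against a 43.2-minute monthly budget.
for spent in (5, 40, 50):
    print(spent, "minutes spent ->", deployment_guidance(43.2, spent))
```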

Example

Imagine you run a music streaming service with an SLO of 99.95% uptime:

  • Total error budget for the month: ~21.6 minutes of downtime (0.05% of the 43,200 minutes in a 30-day month).

Scenario 1: Error budget is healthy.

  • In the first week, there are only 5 minutes of downtime.
  • The remaining budget allows you to deploy new features confidently.

Scenario 2: Error budget is almost exhausted.

  • After a big outage, you’ve used 20 of your ~21.6 minutes by mid-month.
  • Now, you prioritise fixing stability issues instead of releasing new features.
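
The two scenarios can be compared with a short calculation. The sketch below assumes a 30-day month, treats "first week" as day 7 and "mid-month" as day 15, and adds a simple burn rate (budget spent relative to time elapsed), which is an extra illustration rather than something defined in this article.

```python
BUDGET_MINUTES = 21.6  # 99.95% uptime SLO over a 30-day month


def budget_status(spent_minutes, day_of_month,
                  budget_minutes=BUDGET_MINUTES, days_in_month=30):
    """Return (remaining_minutes, burn_rate) for the current point in the month."""
    remaining = budget_minutes - spent_minutes
    spent_fraction = spent_minutes / budget_minutes
    elapsed_fraction = day_of_month / days_in_month
    # Burn rate above 1.0 means the budget is being spent faster than time passes.
    burn_rate = spent_fraction / elapsed_fraction
    return remaining, burn_rate


scenarios = {"Scenario 1": (5, 7), "Scenario 2": (20, 15)}
for name, (spent, day) in scenarios.items():
    remaining, burn_rate = budget_status(spent, day)
    print(f"{name}: {remaining:.1f} min left, burn rate {burn_rate:.2f}")
```

Under these assumptions, Scenario 1 leaves about 16.6 minutes with a burn rate just under 1, while Scenario 2 leaves about 1.6 minutes with a burn rate close to 1.85, which matches the decision to stop shipping and stabilise.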

Key Benefits

  1. Collaboration Between Teams:
    Both developers and reliability engineers can agree on acceptable levels of risk.
  2. Predictable System Performance:
    It ensures reliability levels meet customer expectations without unnecessary overinvestment.
  3. Data-Driven Choices:
    Decisions about reliability and innovation are made based on measurable metrics, not guesswork.
