Lies, damned lies, and SLAs
B2B SaaS SLAs aren't lies, but they are carefully worded legal statements intended to mislead a buyer. This isn't to say that buyers are being cheated; in fact, buyers often want to be misled. Nonetheless, I think we can do better.
A business buying software isn't just looking for someone to solve their pain point. They're looking for a relationship, trust, reliability, risk management, and compliance. The buyer may be required by law, or by their own customers, to meet various standards of risk management.
This is one of the reasons why we end up with Dilbert-like experiences at large enterprises: a Kafkaesque labyrinth of legal, security, risk, and other teams who must sign off on anything that gets done.
SLAs are one of the features SaaS buyers scrutinize. Buyers want to know that this important business system is reliable, that they can trust it, and that the vendor is punished for failing.
However, sellers intentionally undermine them, and buyers who desperately need a tool they can shove through their own complex procurement process ignore the weaknesses. As a result, SLAs operate mostly as checkbox compliance rather than enforcement of service quality.
This comes down to three key points, illustrated with a short sketch after the list:
- The vendor defines what constitutes an outage (the SLIs) so restrictively that failures a reasonable user would call an outage don't count as one.
- The vendor controls how any given outage's duration is measured and calculated, which means they can sweep an outage under the rug entirely or shorten it on paper. AWS has repeatedly been shown to do this, for example in https://ably.com/blog/honest-status-reporting-aws-service
- When an outage does happen, the penalty the vendor faces has no punitive component.
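To make these concrete, here is a minimal sketch with entirely made-up incident numbers showing how the first two points compound: a narrow SLI excludes whole failures, and generous duration accounting shrinks the rest.

```python
# Hypothetical month of incidents as a user experiences them, versus what
# survives the vendor's SLI definition and duration accounting.
MONTH_MINUTES = 30 * 24 * 60  # 43,200

incidents = [
    # (description, user-observed minutes, counted by vendor?, vendor-recorded minutes)
    ("full outage",    45,  True,  30),  # status page opened late, closed early
    ("search broken",  180, False, 0),   # "individual feature" -> excluded by the SLI
    ("login failures", 60,  False, 0),   # auth often outside the covered scope
]

user_downtime = sum(mins for _, mins, _, _ in incidents)
vendor_downtime = sum(rec for _, _, counted, rec in incidents if counted)

print(f"user-perceived uptime:  {100 * (1 - user_downtime / MONTH_MINUTES):.3f}%")   # 99.340%
print(f"vendor-reported uptime: {100 * (1 - vendor_downtime / MONTH_MINUTES):.3f}%")  # 99.931%
```

Same month, same incidents; only the definitions differ.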
Let's look at a few SLA policy docs to see what they say.
Slack
You can find Slack's SLA document here: https://slack.com/terms/service-level-agreement.
The important parts to note are:
Downtime excludes the following: Slowness or other performance issues with individual features (link expansions, search, file uploads, etc.)
To review current and historical Uptime, visit Slack Status.
If we fall short of our Uptime commitment, we’ll apply a credit to each affected account equal to 10 times the amount that the workspace (or, as applicable, org) paid during the period Slack was down (we call these Service Credits).
So, let's apply our 3 checks above:
- Outages would require all of Slack to be down; core functionality like search being broken would not trigger the SLA.
- Outages are defined based on the status page that Slack operates and are of questionable veracity.
- Unlike the others I've reviewed, Slack has done a good job here and imposes a meaningful punitive penalty on itself (see the quick math below).
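To see why the 10x multiplier matters, here is rough math with a hypothetical $10,000/month workspace (the fee is made up; the 10x factor is from Slack's policy):

```python
# Rough math on Slack's remedy for a 2-hour outage.
monthly_fee = 10_000
month_hours = 730  # ~= 365.25 * 24 / 12
downtime_hours = 2

fee_paid_during_outage = monthly_fee * downtime_hours / month_hours
service_credit = 10 * fee_paid_during_outage

print(f"paid for the outage window: ${fee_paid_during_outage:,.2f}")  # $27.40
print(f"service credit (10x):       ${service_credit:,.2f}")          # $273.97
```

A plain refund would only return the $27.40; the 10x multiplier is what gives the remedy teeth.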
Atlassian
You can find the Atlassian SLA document at https://www.atlassian.com/legal/sla, with a couple of important linked pages at https://www.atlassian.com/legal/sla/service-credits and https://www.atlassian.com/legal/sla/covered-experiences
The important parts to note are:
If we confirm there is a failure to meet the Service Level Commitment in a particular calendar month and you make a request for service credit within fifteen (15) days after the end of such calendar month, ...
Less than 95.0% Monthly Uptime Percentage - 50% Service Credit (% of monthly fees for affected Cloud Product)
Covered Experiences (Jira Cloud Premium or Enterprise): View Issue, Create Issue, Edit Issue, View Board... Covered Experiences include browser-based experiences only (not, e.g., integrations, API calls or mobile versions).
Again, let's look at how this applies in practice:
- Only failures of the most basic Jira features count as outages (the same applies to other Atlassian products, but I didn't want the quoted portion to be huge). In theory, even authentication being down wouldn't be included in the SLA calculation.
- Outages are defined by Atlassian's status page, which customers are expected to take at face value. On top of that, you must manually contact them within 15 days to claim a credit.
- Even if you're down for an entire month, Atlassian has limited its liability to 50% of what you paid. I suspect that in catastrophic cases Atlassian would go beyond its own SLA policy, as it did in last year's incident. But if that's what you're going to do, why not put it in the SLA in the first place? (Quick math below.)
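To put the cap in perspective, here is the same hypothetical $10,000/month bill run through Atlassian's worst-case tier, contrasted with a Slack-style 10x remedy:

```python
# Atlassian's worst-case remedy, using a hypothetical $10,000/month Jira bill.
monthly_fee = 10_000

# Below 95% monthly uptime (36.5+ hours down), the credit maxes out at 50%:
atlassian_max_credit = 0.50 * monthly_fee

# A Slack-style 10x remedy for a full month of downtime, for contrast:
slack_style_credit = 10 * monthly_fee

print(f"Atlassian, month-long outage: ${atlassian_max_credit:,.2f}")  # $5,000.00
print(f"10x-style, month-long outage: ${slack_style_credit:,.2f}")    # $100,000.00
```

Under the cap, a vendor that vanished for a month would still keep half of that month's revenue.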
Snowflake
You can find Snowflake's SLA document at https://www.snowflake.com/legal/support-policy-and-service-level-agreement/
The important parts to note are:
“Unavailable” is defined as an Error Rate greater than the relevant Error Rate Threshold over a one-minute interval calculated across all Accounts within each applicable Cloud Provider Region
“Error Rate” is defined as the number of Failed Operations, divided by the total number of Valid Operations. Repeated identical Failed Operations do not count towards the Error Rate
“Failed Operations” is defined as Valid Operations where the Service returns an internal error to Customer, subject to Section III (SLA Exclusions) below.
[Image: The Service Level Credit table Snowflake offers to customers]
Applying the three checks once more (with a worked sketch after the list):
- Failures only count if they're internal backend errors, and basic things like authentication failures are excluded.
- Successful operations increase the denominator, but repeated failures don't increase the numerator.
- Assuming a customer is at 99.0% monthly availability, they've had 7h 14m 41s of downtime. So the punishment to Snowflake is roughly 3x what the customer suffers: not terrible, but also not great.
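Here is a rough sketch of how the deduplication rule flatters the number, with made-up traffic. One assumption is flagged in the comments: the excerpt above doesn't spell out whether retried failures still pad the denominator.

```python
# Sketch of the Error Rate definition with made-up traffic. I'm assuming retried
# failed operations still count as Valid Operations in the denominator; the SLA
# text only says repeats don't count toward the numerator.
successes = 99_000
distinct_failures = 10      # 10 different queries hitting internal errors
retries_per_failure = 100   # each failing query retried 100 times

failed_attempts = distinct_failures * retries_per_failure  # 1,000 as users see it
valid_operations = successes + failed_attempts             # 100,000

user_rate = failed_attempts / valid_operations   # share of attempts that failed
sla_rate = distinct_failures / valid_operations  # after deduplicating retries

print(f"user-experienced error rate: {user_rate:.2%}")  # 1.00%
print(f"SLA error rate:              {sla_rate:.4%}")   # 0.0100%
```

The user retrying a broken query a hundred times experiences a 1% failure rate; the SLA sees one-hundredth of that.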
Why does this happen?
My guess is that this is some corollary of Goodhart's Law:
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
or, more commonly reworded as:
When a measure becomes a target, it ceases to be a good measure.
Teams are expected to provide a high level of service to customers. They are asked to measure that, and quickly realize it's really hard both to measure and to meet those expectations. So instead they define metrics that look good on the surface but are, at best, misaligned with customer value.
Snowflake explicitly says on their blog:
These industry-standard SLA thresholds are not particularly good measures of user experience. The underlying service-level indicators (SLIs) focus on query execution, which misses many critical components of the actual user workflow, from client library behaviors to the correct results being served. They also are implicitly dependent upon the user being able to reach and authenticate to Snowflake.
They go on to talk about how other things are being done to focus on UX, but if that's the case, then why is this what their SLA measures?
Can the situation be improved?
I would love to see someone create a tool which automates SLA monitoring and claims against vendors in bulk, akin to what StatusGator is trying to do at https://statusgator.com/blog/sla-monitor/ and what Downhound is doing at https://www.downhound.com/about (no affiliation with either). If you can detect that a given product is down from the customer's perspective (I call this antagonistic monitoring), then demonstrate that the SLA definition is too narrow and the SLA metrics are dishonest, the buyer ends up in a powerful position during renewal conversations. My hope is that enough vendors getting hammered by buyers over bad SLA performance could raise the bar for the whole industry.
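A minimal sketch of what I mean by antagonistic monitoring: probe the product the way a customer actually uses it and keep an independent downtime ledger. The URL, interval, and success check below are placeholders, not a real integration with any vendor.

```python
import time
import requests

PROBE_URL = "https://vendor.example.com/login"  # hypothetical endpoint
INTERVAL_SECONDS = 60

checks = 0
failures = 0
while True:
    checks += 1
    try:
        # Treat anything other than a clean 200 as a user-visible failure.
        ok = requests.get(PROBE_URL, timeout=10).status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        failures += 1
    observed_uptime = 100 * (1 - failures / checks)
    print(f"observed uptime: {observed_uptime:.3f}% over {checks} checks")
    time.sleep(INTERVAL_SECONDS)
```

A real version would probe the full user workflow (auth, core actions, API) and diff its ledger against the vendor's status page and SLA definitions at renewal time.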
If this were to exist, I also imagine you could have an honest public status page (I own honeststatuspage.com, so if someone is building this, let me know). This would be an adversarial tactic of shaming vendors into improving product quality, and it could again feed into a procurement process.
Take a moment to follow me at https://twitter.com/NothingEasySite and https://twitter.com/borisberenberg and subscribe below ⬇️