SLO: how to define and measure a Service Level Objective and how that relates to a Service Level Agreement.
This is a complicated field with a lot of moving parts so buckle up and we’ll cover the key concepts.
APImetrics is a tool for monitoring the performance and quality of APIs. There are lots of good reasons to do that, but a particularly important one is to see whether an API meets its Service Level Agreement (SLA). But as a team of Google engineers points out in their O’Reilly book Site Reliability Engineering: How Google Runs Production Systems, an SLA is actually a legal document (or, in the case of an internal API, a formal agreement between business units).
So, the SLA places a legal wrapper around one or more Service Level Objectives (SLOs). For an API, examples of SLOs might be:
- 99.9% availability
- a median latency less than 300 ms from all cloud locations
- always returning a payload with a size greater than 25 kB
Or you could use the APImetrics CASC score, as this provides a blended quality measure that takes into account a number of different metrics. Thus, another example of an SLO might be a weekly CASC score greater than 7.5. That said, we would recommend using CASC to spot potential SLO problems rather than as a metric to track directly!
API monitoring systems are not yet in the business of determining whether or not a contract has been fully honored in the legal sense. But they are very good at measuring the metrics associated with SLOs. Now, measuring API metrics is a non-trivial thing to do. In a situation in which there is a formal SLA in place, you don’t want to be doing it yourself with a cURL script and a cron job, not least because it’s bad practice to self-certify that you’re doing what you say you are.
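The three example SLOs in the list above can be checked mechanically once you have a set of measurements. The following is a minimal sketch, not how APImetrics itself does it: the `Call` record, its field names, and the `evaluate_slos` function are hypothetical, and it assumes 1 kB = 1000 bytes and that "from all cloud locations" means the median must hold for each location separately.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Call:
    """One monitored API call (hypothetical record shape)."""
    ok: bool             # did the call succeed?
    latency_ms: float    # total response time in milliseconds
    payload_bytes: int   # size of the returned body
    location: str        # cloud location the call was made from


def evaluate_slos(calls):
    """Return a pass/fail flag for each of the three example SLOs."""
    if not calls:
        raise ValueError("no measurements to evaluate")

    successful = [c for c in calls if c.ok]
    availability_pct = 100.0 * len(successful) / len(calls)

    # Group successful latencies by location so each one is judged on its own.
    by_location = {}
    for c in successful:
        by_location.setdefault(c.location, []).append(c.latency_ms)

    return {
        # SLO 1: 99.9% availability
        "availability": availability_pct >= 99.9,
        # SLO 2: median latency under 300 ms from every location
        "median_latency": bool(by_location)
        and all(median(v) < 300 for v in by_location.values()),
        # SLO 3: every successful payload larger than 25 kB (assumed 25,000 bytes)
        "payload_size": bool(successful)
        and all(c.payload_bytes > 25_000 for c in successful),
    }
```

Checking each location separately is a design choice: a single global median could look healthy while one region is consistently slow.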
Independent Measurement aka don’t mark your own work!
Using a third-party product doesn’t only mean that there is an independent arbiter providing the information to decide whether or not an SLO has been met. It also means that you don’t run the risk of having somebody internally making the numbers look better than they actually are, and thereby creating problems for the Customer Success teams.
Something else that can be done with APImetrics is to determine SLOs. If you are an API provider, you want to know how your API performs in order to determine what kind of SLO might be reasonable to agree to. And just as a provider doesn’t want to agree to an SLO that is too onerous, an API consumer can use APImetrics to make sure that an SLO isn’t too lax.
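One way to judge whether an availability target is too onerous or too lax is to translate it into the downtime it actually allows. The sketch below is a simple illustration; the function name and the 30-day period are assumptions made for the example, not part of any APImetrics feature.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime permitted by an availability target over a period.

    period_days defaults to 30 as an assumed "month"; adjust to match
    whatever window your SLO is defined over.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Over a 30-day window, 99.9% allows roughly 43 minutes of downtime, while 99.999% allows under half a minute, which gives both provider and consumer a concrete figure to compare against observed outages.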
APIs are increasingly the glue of the digital age. More and more companies are depending on the APIs they expose and consume to enable time-sensitive, mission-critical business activities. With organizations relying on APIs, SLAs become a necessity.
APImetrics allows companies to understand what reasonable SLOs might go in the SLA, and to avoid finger-pointing when it comes to whether or not the SLOs have been met.
Measuring the Service Level Objectives
When you set up monitoring on your APIs, and before you set objectives, it is essential to see how they function normally. We recommend that you follow these key guidelines:
- Measure from where your customers are based – so you can see things that might impact your ability to deliver that can be fixed or accounted for in your objectives
- Don’t set unrealistic goals – 99.999% uptime is a fine aspiration, but if the API can only realistically deliver 99.9% over a given month, there’s not much point in committing to a metric you can’t hit.
- Set percentile-based goals for latency, not averages – not to be MEAN about this, but you can hide a lot of bad stuff under the carpet of an average. Set the threshold on the average high enough and the API can essentially be failing for hours a day and still not trip any alerts.
Focus on the KPIs that are meaningful for you and fix the ones that are underperforming.
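To illustrate the last point, here is a small sketch comparing the mean with percentiles on made-up latency data. The sample values, the 1000 ms alert threshold, and the `latency_summary` helper are all invented for the example.

```python
from statistics import mean, quantiles


def latency_summary(latencies_ms):
    """Return the mean alongside the 50th, 95th and 99th percentiles."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles,
    # so index 49 is p50, index 94 is p95 and index 98 is p99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": mean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # Invented sample: 90 fast calls and 10 very slow ones (1 in 10 is bad).
    sample = [100] * 90 + [5000] * 10
    summary = latency_summary(sample)
    threshold_ms = 1000  # assumed alert threshold

    for name, value in summary.items():
        status = "ALERT" if value > threshold_ms else "ok"
        print(f"{name}: {value:.0f} ms ({status})")
```

With this sample the mean stays well under the assumed threshold even though one call in ten takes five seconds, whereas a 95th-percentile goal would flag the problem.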