Why are we told to ignore the p(50) and concentrate on the p(99)? The obvious reason is that if the p(99) is fine, the p(50) will be fine.
If we consider a perfect endpoint whose latency is entirely normally distributed with a mean/median/mode of 300 ms and a standard deviation of 20 ms, then three standard deviations is 60 ms and ~99.87% of calls will be faster than 360 ms (we don’t normally care about fast outliers).
Three-sigma is actually a sterner test than p(99): it covers ~99.87% of calls rather than 99%. If my three-sigma (or p(99)) goes up to 400 ms, I might still not care, because that is still within design tolerances/service level objectives. If it goes up to 2000 ms, though, something is probably wrong.
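The arithmetic can be checked with the Python standard library. This is a sketch for the hypothetical 300 ms / 20 ms endpoint above, nothing more:

```python
from statistics import NormalDist

# Hypothetical perfect endpoint: normally distributed latency,
# mean 300 ms, standard deviation 20 ms.
latency = NormalDist(mu=300, sigma=20)

# Fraction of calls faster than mean + 3 sigma (360 ms): ~99.87%.
print(f"P(latency < 360 ms) = {latency.cdf(360):.4%}")

# The p(99) of the same distribution sits lower, at ~346.5 ms,
# which is why three-sigma is the sterner test.
print(f"p(99) = {latency.inv_cdf(0.99):.1f} ms")
```

For a true normal distribution, the p(99) cutoff (~2.33 sigma) is always inside the three-sigma cutoff.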
In practice, it’s a lot more complicated than that. Firstly, the latency distribution of an endpoint almost certainly isn’t going to be anything close to entirely normal. It might look a bit like a skewed Gaussian, at least to a first approximation.
But that is exactly why we measure the p(99) rather than the three-sigma: we can’t usually get a good estimate of the standard deviation or the skewness.
Secondly, the p(99) is generally going to be quite a lot bigger than the p(50). If the p(99) is 600 ms, you might think that’s fine: 99% of calls are adequately fast. If it’s 1100 ms, perhaps not so much.
Then again, it’s only 1% of calls. Do you care?
In both cases the p(50) could be the same, or in the second case the p(50) could be 1000 ms. It might not be 1% of calls that are slow, but 50% (or more). Unless you also measure the p(50), you are never going to know that. What is the p(99) really telling us, then?
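A small synthetic example makes the point. Both of these hypothetical endpoints have exactly the same p(99), but one has a p(50) more than three times the other (the percentile function is a simple nearest-rank sketch):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    return s[math.ceil(p / 100 * len(s)) - 1]

# Two hypothetical endpoints, 1000 calls each (latencies in ms).
# Endpoint A: almost every call is fast, with a small slow tail.
a = [300] * 985 + [1100] * 15
# Endpoint B: the *typical* call is slow, with the same tail.
b = [1000] * 985 + [1100] * 15

print(percentile(a, 99), percentile(b, 99))  # 1100 1100 -- identical p(99)
print(percentile(a, 50), percentile(b, 50))  # 300 1000  -- very different p(50)
```

Watching the p(99) alone, the two services are indistinguishable; the p(50) is what separates them.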
99% is a nice, round number, and we should be wary of nice, round numbers chosen just because they are convenient. The crucial thing is: what do we actually care about from a business operations perspective? And that is going to be different for every organization.
So if 1% of calls take longer than 1100 ms, does that matter? Say you are getting thousands of hits a second on your endpoint. That’s tens of slow calls a second, or millions of calls a day getting a slow service. Perhaps the users don’t notice or care. Or perhaps they do.
What’s the business value of each call to you and your users? 1% can be a lot of users. And you have arbitrarily decided that if 99% of calls are fine, as shown by the p(99), you can ignore the other 1%. But you can’t.
How do you know that one in 200 calls or one in 1000 or one in a million isn’t taking 120 seconds to return or timing out altogether? In time-critical situations, say on trading platforms, every millisecond counts. If one call in a million times out and you get a million hits a day that could be one disaster a day for a user with untold economic consequences for both of you. And the p(99) will tell you nothing about that because the p(99) tells you nothing about what lies beyond.
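To illustrate how blind the p(99) is to what lies beyond it, here is a hypothetical day of traffic in which 0.15% of calls hit a 120-second timeout (again using a simple nearest-rank percentile):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    return s[math.ceil(p / 100 * len(s)) - 1]

# Hypothetical traffic: 10,000 calls at ~300 ms,
# except 15 calls (0.15%) that hit a 120-second timeout.
calls = [300] * 9985 + [120_000] * 15

print(percentile(calls, 99))    # 300    -- the p(99) is oblivious
print(percentile(calls, 99.9))  # 120000 -- the p(99.9) catches it
print(max(calls))               # 120000 -- so does the maximum
```

This is why higher percentiles, and the maximum, belong on the dashboard alongside the p(99).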
This is why you have to characterize the distribution of your latency.
It’s not about p(99)s.
It’s never about p(99)s on their own. Yes, if your p(99) keeps going up, it indicates there’s a problem, but so does your p(50) going up and it might matter more because the p(50) is going to affect more users. It’s possible to have situations in which the p(50) shoots up and the entire performance of the service is seriously degraded (because almost everyone is now getting something close to the 1% service), but the p(99) doesn’t change.
It is about both outliers and p(50)s. You need to know which calls are actually your outliers. Outliers are never just the 1% of slowest calls or those beyond three-sigma. The outliers are the population (or populations) that behave differently from the main population. You need to find where your outliers begin. It might be at p(95.9) or it might be p(99.5) or even p(99.95), you won’t know unless you are constantly monitoring your endpoints (from different cloud locations and also looking at the different latency components) and looking for outliers.
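One crude way to find where the outliers begin is to walk up the percentile curve and look for the point where latency first jumps well clear of the median. This sketch uses a made-up endpoint whose misbehaving population (3% of calls at ~5000 ms) starts at about p(97); the 2× threshold and the 0.1-percentile step are arbitrary choices, not a standard method:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    return s[math.ceil(p / 100 * len(s)) - 1]

# Hypothetical endpoint: a main population around 300 ms plus a
# second, misbehaving population around 5000 ms making up 3% of calls.
calls = [300] * 9700 + [5000] * 300

# Scan from p(90) upwards in 0.1 steps; report the first percentile
# whose value is more than twice the median.
p50 = percentile(calls, 50)
onset = next(t / 10 for t in range(900, 1000)
             if percentile(calls, t / 10) > 2 * p50)
print(f"outliers begin around p({onset})")  # outliers begin around p(97.1)
```

Against real traffic you would run something like this continuously, per endpoint and per vantage point, rather than once over a static sample.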
What matters ultimately are the consequences in terms of business operations of slow calls. You might have a p(99) SLO in which case you might argue that what matters is the p(99) because you’ll suffer financial penalties if you don’t meet it. But should a p(99) ever be used for an SLO?
You might say, “My service works fine at 800 ms. Therefore, if I set a p(99) target of 800 ms, I know 99% of all calls are fine. Some of the 1% of calls might be fine too, and I am happy to live with the consequences of those that are not.” Or you might set a p(99) target of 600 ms. That gives you plenty of buffer room compared to 800 ms.
But you don’t know for sure it does unless you are actually looking beyond p(99). Perhaps there’s a p(99.5) (or p(99.9)) population at 1200 ms or 30 s. That could come back to bite you. And if you are worried about that bite, you need to know whether there’s something out there to bite you or not.
So, the x in p(x) is going to be unique to you and your service. How much the outliers affect you will depend on what the API and its endpoints are doing and the load on the service. A small percentage of a large number is still a large number, and even multiplied by a small economic cost it might be a lot of money. Conversely, a large percentage of a small number is still a small number. Going from a few thousand hits per second to a few hits per day is a 100-million-fold decrease in hits on an endpoint.
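The "small percentage of a large number" arithmetic is worth doing explicitly. All the figures here are hypothetical, including the per-call cost:

```python
# Back-of-the-envelope impact of a slow tail (all numbers hypothetical).
hits_per_second = 3_000
hits_per_day = hits_per_second * 86_400   # 259,200,000 calls/day
slow_fraction = 0.01                      # the 1% beyond the p(99)
cost_per_slow_call = 0.001                # assumed $ cost of one slow call

slow_calls_per_day = hits_per_day * slow_fraction
daily_cost = slow_calls_per_day * cost_per_slow_call
print(f"{slow_calls_per_day:,.0f} slow calls/day -> ${daily_cost:,.0f}/day")
```

Even at a tenth of a cent per slow call, "only 1%" of a busy endpoint is roughly 2.6 million slow calls and $2,600 a day; at a few hits per day, the same 1% rounds to nothing.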
By monitoring your API and understanding how it behaves in its business operations context, an appropriate x can be chosen, be it 50, 95, 98.4, 99, 99.999, any other number, the percentage of outliers, or a combination of metrics that allow the business performance of the API, for all stakeholders, to be maximized.