Hi, Monitoring Industry. We need to talk…
It’s not me, it’s you.
I first wrote a riff on this a couple of years ago, but chickened out of posting it. “Too controversial”, I thought. “I’m sure people get it”, I thought.
My apologies. I should have screamed it from the rooftops.
Here we are in 2020. APIs have become more and more important, especially in Financial Services. And I’m more convinced than ever that not only aren’t people getting it, but the entire industry is enabling people to do it wrong.
I was looking back at old articles I’ve written on LinkedIn. This one – “Face it, you don’t want to know when things don’t work” – from April 2018 is as true today as it was then.
So, here we go. Buckle up, I might get sweary.
Let’s talk. Hear me out. This is important.
Monitoring industry: we’re doing it wrong, and we’re too self-serving and focused on the wrong audiences.
There, I’ve said it.
It feels like our industry exists to tell silos what they want to know and, I think, to monitor what is easy, because vendors have products that can produce pretty graphs that keep management happy.
Why do I think that? Because time and time again, we are called in to referee fights between customers, Customer Success teams, and DevOps – and that shouldn’t be a thing.
Which raises my REAL question here: who should be doing the monitoring, and why?
It’s Operations, right? They monitor the IT setup. But what are their motivations? What do they want to know? What is it, precisely, that they’re overseeing?
You have an API Operations Team; so, what are they looking at? The gateway? Networking? The stack? Maybe. But does their role end where the stack plugs into the gateway?
What about all the Legacy systems lurking there in the dark like some ancient monster?
The short answer is, after five years doing this, I’m still not sure and we have multiple customers who aren’t sure, either.
How about Customer Success groups? “Well”, I hear you say, “Customer Success groups aren’t technical, so they shouldn’t be looking at our systems”. And there’s some truth in that. But they’re the ones who will get it in the neck when something doesn’t work.
Who helps when DevOps insists it’s a customer problem? Are they expected to tell Sales to tell the customer, “It’s nothing to do with us”?
This is made worse in a multi-cloud universe. What if the problems are localized to a particular AWS data center that just doesn’t communicate very well with your setup? What if the customer has done something wrong with their configuration, but the instructions you gave them were wrong in the first place?
A real-life example
What do you do when people don’t want you to monitor production systems? In financial services, we hear this all the time: “We can’t let people have access to a production system because it’s a real account, and that potentially opens up a number of risk profiles we don’t want to consider.”
What is riskier than not knowing if your production systems are up or down?!
Would you put a website up and then never test it against the production environment? I don’t think so.
Clearly, people verify that their API Gateway and servers are functional using monitoring tools. But that’s the BARE MINIMUM of what should be done. What if a user or application can’t get to the server? What if systems are returning errors that you tune out of your monitoring setup because, well, frankly, they’re somebody else’s problem – like a database error from that ancient, possibly fire-breathing, legacy system owned by another part of the operation?
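To make that concrete, here is a minimal sketch in Python (standard library only) of a check that refuses to tune those errors out: it treats an application-level fault buried in a 200 OK response as a failure, not as somebody else’s problem. The `"error"` field convention and the endpoint URL are assumptions for illustration, not a standard any real gateway mandates.

```python
import json
import urllib.error
import urllib.request


def classify_response(status, body):
    """Classify a response beyond "the gateway answered".

    A 200 whose payload carries a backend fault (say, a database error
    from a legacy system) is still a failure for the customer.
    """
    if status >= 500:
        return "server_error"        # the stack itself is failing
    if status >= 400:
        return "client_error"        # or maybe the docs we shipped were wrong
    try:
        payload = json.loads(body)
    except ValueError:
        return "malformed_payload"   # 200 OK, but the body is unusable
    # Hypothetical convention: backends surface internal faults in an
    # "error" field even when the HTTP layer reports success.
    if payload.get("error"):
        return "application_error"
    return "ok"


def check_endpoint(url, timeout=10):
    """End-to-end check: can a real client actually get a usable answer?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_response(resp.status, resp.read().decode())
    except urllib.error.HTTPError as exc:
        return classify_response(exc.code, "")
    except (urllib.error.URLError, TimeoutError):
        return "unreachable"         # the case a gateway-only check never sees


# e.g. check_endpoint("https://api.example.com/v1/accounts")  # hypothetical URL
```

The point of the design is the `application_error` and `unreachable` branches: they are exactly the scenarios a “my bit was fine” dashboard quietly ignores.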
Is it really acceptable to sit in a meeting and say, “My bit was working fine, it’s them”?
No. It is not.
Which is why I felt I had to write this. Monitoring needs to be a holistic, end-to-end approach: not one designed just to verify that everything is up and that the easy-to-capture scenarios are functioning, but one that actually works out where the problems are, live, and deals with them before they become a real crisis.
In a multi-cloud, containerized world of Third-Party Payment Providers and Account Information Service Providers – and government regulators – can anybody afford not to start thinking outside the box and expanding what they think monitoring means?
To conclude: monitor everything you deliver, even the bits that are outside of your direct control or hard to measure, and schedule some time to have hard discussions internally with risk management groups.
What’s the bigger risk: losing control of a single test account that you should have heavily locked down anyway, or going down completely because you gave yourself a false sense of security?
On a final note, I know I didn’t answer the question of where this holistic approach should live. I don’t have an answer to that – but it is something CIOs and VPs of Engineering should be asking themselves and their digital teams. Please let me know what you decide.
David O’Neill is CEO of end-to-end API monitoring solution APImetrics, and if you think this article is slightly self-serving, then you’re right. That doesn’t make him wrong, though! He will be giving regular API Rants about the state of the industry and the API economy.