Anatomy of an Outage: Auth0

Auth0 suffered a major outage on June 21. It was part of a wider issue with Cloudflare, their cloud edge provider.

Their status page tracks the problems reported, and now states the issue began at 06:27 UTC. But their Twitter feed (@auth0Status) first reported the issue at 07:05 UTC – 38 minutes later.

Enter Serinus

Our Serinus monitors are lightweight HTTPS calls that verify a domain is available. They check for two metrics:

  • Failing APIs – HTTP calls that either failed to connect, or connected and returned a 5XX server error.
  • Slow calls – calls that take longer than expected for the DNS lookup or to make a TCP connection.
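To make the two metrics concrete, here is a minimal sketch of what one such probe could look like in Python. The 2-second "slow" threshold and the function names are assumptions for illustration; Serinus' actual implementation and thresholds are not public.

```python
import time
import urllib.request
import urllib.error
from typing import Optional

SLOW_THRESHOLD_S = 2.0  # assumed threshold; the real value is not public

def classify(connected: bool, status: Optional[int], elapsed: float) -> str:
    """Map one HTTPS call's result to the two failure metrics."""
    if not connected or (status is not None and status >= 500):
        return "failing"   # failed to connect, or got a 5XX server error
    if elapsed > SLOW_THRESHOLD_S:
        return "slow"      # passed, but took longer than expected
    return "ok"

def probe(url: str, timeout: float = 10.0) -> str:
    """Make one lightweight HTTPS call and classify the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            connected, status = True, resp.status
    except urllib.error.HTTPError as e:
        connected, status = True, e.code   # connected, got an error status
    except (urllib.error.URLError, OSError):
        connected, status = False, None    # never connected at all
    return classify(connected, status, time.monotonic() - start)
```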

Note: instead of reporting a failure rate (e.g. 4%), we report the inverse – the availability (e.g. 96%). Our incident monitor looks at a rolling 15-minute window of results to act as an early warning system for major issues with public APIs. We use our global network of agent locations and report if one specific cloud or region is worse affected than the others.
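The rolling-window calculation could look something like this sketch. The 15-minute window comes from the text; the data layout ((timestamp, passed) pairs) is an assumption.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_S = 15 * 60  # rolling 15-minute window, in seconds

def availability(results: Deque[Tuple[float, bool]], now: float) -> float:
    """Availability (% of passing calls) over the last 15 minutes.

    `results` holds (timestamp, passed) pairs, oldest first; entries
    older than the window are dropped in place.
    """
    while results and results[0][0] < now - WINDOW_S:
        results.popleft()
    if not results:
        return 100.0  # no recent data: nothing to report against
    passed = sum(1 for _, ok in results if ok)
    return 100.0 * passed / len(results)
```

For example, 96 passing and 4 failing calls in the window reports 96% availability, the inverse of a 4% failure rate as described above.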

Our monitoring first categorized the Auth0 APIs as having an issue at 06:29 UTC, when the availability dropped to 98%. These first issues were spotted in North America and from our IBM locations.

By 06:33 UTC, we raised our categorization of the problem to a minor outage – under 75% availability – and it was affecting all regions and clouds.

Well, that escalated quickly

This escalated to a Major Outage (under 50% availability) at 06:38 and a Critical Outage (under 25% availability) by 06:44. This state continued until 07:11, then recovered back to a minor outage at 07:21, then to a state of concern at 07:27, and finally the incident was resolved at 07:57.
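The severity bands used above map onto availability roughly like this sketch. The post only gives explicit cutoffs for minor (75%), major (50%), and critical (25%); treating anything below 100% as "of concern" is an assumption.

```python
def severity(availability_pct: float) -> str:
    """Map a 15-minute availability figure to an incident severity."""
    if availability_pct < 25:
        return "critical outage"
    if availability_pct < 50:
        return "major outage"
    if availability_pct < 75:
        return "minor outage"
    if availability_pct < 100:
        return "of concern"   # assumed cutoff; not stated in the post
    return "operational"
```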

The initial period of escalating outage involved failing APIs – HTTP calls that either failed to connect or connected and returned a 5XX server error.

In recovery

During the recovery period, from about 06:45 UTC, some API calls started passing but responded slowly, at a rate of about 1.5% of calls.

  • Between 06:45 and 07:00 European and IBM locations were affected more than other locations.
  • Between 07:06 and 07:11 Azure and Asian locations were worst affected.
  • Between 07:14 and 07:51 API calls from our Google locations were affected more than other locations.

The regions that were worst affected changed during this time, starting first in Asia, then Oceania, and finally North America.

The lowest availability for a 15-minute window was 17.52%, between 06:30 and 06:45, and availability did not improve noticeably until around 07:10.

Summary of events

  • Concern: 06:29 – 06:32
  • Minor: 06:33 – 06:36
  • Major: 06:37 – 06:42
  • Critical: 06:43 – 07:10
  • Major: 07:11 – 07:19
  • Minor: 07:20 – 07:26
  • Concern: 07:27 – 07:55

Total incident time: 86 minutes
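The 86-minute total is simply the span from the first concern at 06:29 to the end of the final concern window at 07:55:

```python
from datetime import datetime

# First detection and end of the last "concern" window, per the summary
start = datetime.strptime("06:29", "%H:%M")
end = datetime.strptime("07:55", "%H:%M")
total_minutes = int((end - start).total_seconds() // 60)  # 86 minutes
```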

Sign up to learn more about Serinus!

Serinus is currently in a closed Beta but leave your details and we will get back to you. You can also follow Serinus on Twitter.
