- 18:44 UTC: A series of errors alerted our team to issues with our backend database.
- 18:52 UTC: Our on-call engineer escalated the issue after confirming that our backend database was down. This affected dependent services such as our admin interface and our authentication caching layer.
- 19:05 UTC: Our on-call support team started responding to affected customers individually and opened a public incident.
- 19:12 UTC: Our ops team identified the cause as too many open connections to the database and increased the connection limit (see the sketch after this timeline).
- 19:20 UTC: Additionally, the internal services responsible for the high number of database connections were restarted or redeployed.
- 19:32 UTC: Our backend database recovered and became fully operational.
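For context, the check at 19:12 boils down to comparing the number of open connections against the configured limit. Below is a minimal sketch of that comparison, assuming a PostgreSQL backend accessed via psycopg2; the DSN is a hypothetical placeholder, and other databases expose equivalent settings.

```python
# Minimal sketch of a connection-usage check, assuming a PostgreSQL
# backend accessed via psycopg2; the DSN is a hypothetical placeholder.
import psycopg2

conn = psycopg2.connect("dbname=backend user=ops")  # hypothetical DSN
try:
    with conn.cursor() as cur:
        # Compare current open connections against the configured limit.
        cur.execute(
            "SELECT count(*), current_setting('max_connections')::int"
            " FROM pg_stat_activity"
        )
        open_conns, limit = cur.fetchone()
        print(f"{open_conns}/{limit} connections in use")
finally:
    conn.close()
```

In PostgreSQL, raising the limit means increasing max_connections, which only takes effect after a restart; the exact mechanism on our backend may differ.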
Root cause analysis
Our backend database exceeded its limit of allowed connections because a back-office process opened an excessive number of connections following some indexing changes on our backend.
This issue resulted in our admin console being down for around 45 minutes.
Additionally, a small percentage of customers' databases were affected: those whose authentication entries had expired in our authentication caching layer.
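To illustrate why only expired entries were affected: in a TTL-style cache, a valid cached credential can be served without touching the backend database, while an expired or missing entry forces a backend lookup. The following is a simplified sketch under that assumption; the names and the backend call are hypothetical.

```python
# Simplified sketch of a TTL-style authentication cache; the names and
# the backend lookup are hypothetical, for illustration only.
import time

CACHE_TTL = 3600  # seconds; illustrative value
_cache: dict[str, tuple[str, float]] = {}  # customer_id -> (token, expiry)

def lookup_in_backend_db(customer_id: str) -> str:
    """Hypothetical backend lookup; fails while the database is down."""
    raise ConnectionError("backend database unavailable")

def authenticate(customer_id: str) -> str:
    entry = _cache.get(customer_id)
    if entry and entry[1] > time.time():
        # Cache hit: served even while the backend database is down.
        return entry[0]
    # Expired or missing entry: requires the backend database, so these
    # customers were the ones affected during the outage.
    token = lookup_in_backend_db(customer_id)
    _cache[customer_id] = (token, time.time() + CACHE_TTL)
    return token
```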
We are taking a number of actions to prevent this from happening again; some of them have already been implemented:
- We're going to deprecate the internal service that was responsible for the spike in the number of connections.
- We've configured a much higher default limit on open connections for our backend database.
- We've added connection-related metrics to our monitoring infrastructure so we get alerted if a similar problem happens in the future (a sketch of such a check follows this list).
- We are moving to native authentication in the coming weeks, so incidents involving our backend database will no longer affect customers' deployments.
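As a rough illustration of the monitoring addition, here is a sketch of a periodic check that alerts once connection usage crosses a threshold. It assumes the same hypothetical PostgreSQL backend as above; the threshold, DSN, and alert hook are placeholders rather than our actual tooling.

```python
# Sketch of a connection-usage alert loop, assuming a PostgreSQL backend.
# The threshold, DSN, and alerting hook are illustrative placeholders.
import time
import psycopg2

ALERT_THRESHOLD = 0.8  # alert once 80% of the connection limit is in use

def connection_usage(dsn: str) -> float:
    """Return the fraction of the connection limit currently in use."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT count(*)::float / current_setting('max_connections')::int"
                " FROM pg_stat_activity"
            )
            return cur.fetchone()[0]
    finally:
        conn.close()

while True:
    usage = connection_usage("dbname=backend user=monitor")  # hypothetical DSN
    if usage >= ALERT_THRESHOLD:
        # Stand-in for a real pager/alerting integration.
        print(f"ALERT: {usage:.0%} of database connections in use")
    time.sleep(60)
```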