Admin console and API issues
Incident Report for GrapheneDB
Postmortem

Timeline

  • 18:44 UTC: A series of errors alerted our team to issues with our backend database.
  • 18:52 UTC: Our on-call engineer escalated the issue after confirming that our backend database was down. This affected services such as our admin interface and our authentication caching layer.
  • 19:05 UTC: Our on-call support team started responding to affected customers personally and opened a public incident.
  • 19:12 UTC: Our ops team identified the cause as too many open connections to the database and proceeded to increase the limit.
  • 19:20 UTC: Additionally, some internal services responsible for the high number of connections to the database were restarted or redeployed.
  • 19:32 UTC: Our backend database recovered and became fully operational.

Root cause analysis
Our backend database exceeded its limit of allowed connections because of a back-office process, following some indexing changes on our backend.
This issue resulted in our admin console being down for around 45 minutes.
Additionally, a small percentage of customers' databases were affected: those whose authentication had expired on our authentication caching layer.
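
To illustrate why only databases with expired cached authentication were affected, here is a minimal sketch of a time-to-live cache in front of a backend lookup. It is an assumption for illustration only: the report does not describe the internals of the caching layer, and `AuthCache`, `BackendDown`, and the `lookup` callable are hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class BackendDown(Exception):
    """Raised by the hypothetical backend lookup when the database is unavailable."""


class AuthCache:
    """Hypothetical TTL cache in front of a backend credential lookup."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        # database id -> (cached value, expiry time)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, database_id: str) -> str:
        now = self._clock()
        entry = self._entries.get(database_id)
        if entry is not None and entry[1] > now:
            # Fresh entry: served from the cache, backend not contacted.
            return entry[0]
        # Missing or expired entry: the backend must be consulted.
        # If the backend is down, the exception propagates to the caller.
        value = self._lookup(database_id)
        self._entries[database_id] = (value, now + self._ttl)
        return value
```

Under this model, a backend outage only surfaces for entries that expire while the backend is unavailable; entries that are still fresh keep working.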

Remediation
We are taking a number of actions to prevent this from happening in the future, some of which have already been implemented:

  1. We're going to deprecate the internal service that was responsible for the increase in the number of connections.
  2. We've set a much higher default limit on open connections for our backend database.
  3. We've added more related metrics to our monitoring infrastructure so that we get alerted if a similar problem happens in the future.
  4. We are moving to native authentication in the coming weeks, so incidents related to our backend database will no longer affect customers' deployments.
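
As an illustration of the kind of connection metric mentioned in item 3, here is a minimal sketch that computes how close a MongoDB server is to its connection limit. It assumes a `serverStatus`-style document whose `connections` section reports `current` and `available` counts. The function names and the 0.8 threshold are hypothetical and are not taken from our actual monitoring setup.

```python
from typing import Any, Mapping


def connection_utilization(server_status: Mapping[str, Any]) -> float:
    """Return the fraction of the connection limit currently in use.

    Expects a serverStatus-style document with a "connections" section
    containing "current" (open connections) and "available" (unused slots).
    """
    conns = server_status["connections"]
    current = conns["current"]
    available = conns["available"]
    limit = current + available
    if limit == 0:
        # No headroom reported at all: treat as fully utilized.
        return 1.0
    return current / limit


def should_alert(server_status: Mapping[str, Any], threshold: float = 0.8) -> bool:
    """Return True when utilization reaches the (hypothetical) alert threshold."""
    return connection_utilization(server_status) >= threshold
```

With the PyMongo driver, such a document can be obtained with `client.admin.command("serverStatus")` and passed to `should_alert` on a schedule.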
Posted Apr 17, 2020 - 20:05 BST

Resolved
After 8 hours of monitoring, we are resolving this incident. As soon as we have gathered all the details, we will append a postmortem to this incident.
Posted Apr 17, 2020 - 08:57 BST
Monitoring
We have managed to resolve the problem. We're monitoring it closely now.
Posted Apr 16, 2020 - 20:35 BST
Identified
We have identified the issue to be with our MongoDB database, which we use for our admin console. We're also looking into how this may be affecting customers' databases.
Posted Apr 16, 2020 - 20:25 BST
Update
We are continuing to investigate this issue.
Posted Apr 16, 2020 - 20:20 BST
Update
Some customers are experiencing issues with their databases too. We're also looking into this and will update this incident as soon as there is any news.
Posted Apr 16, 2020 - 20:19 BST
Investigating
We're looking into a problem that is affecting our API and admin console. Updates to follow.
Posted Apr 16, 2020 - 20:05 BST
This incident affected: User interface and public API, Standard Database Tier, and Performance Database Tier.