Availability issue impacting a small number of Standard databases hosted in EU (Frankfurt) Region
Incident Report for GrapheneDB
Postmortem

Timeline of events:

  • At 18:18 UTC our automated monitoring system starts receiving 500 error codes in response to its health checks on a small number of databases.
  • Our active monitoring system is designed to interpret this as a malfunctioning Neo4j instance that is unable to serve requests, so it triggers a restart of the affected databases (a sketch of this logic follows the timeline).
  • These failed checks and the resulting automated restarts leave a number of Standard databases unavailable.
  • All affected databases are hosted on the same multi-tenant server in the AWS EU (Frankfurt) region.
  • At 18:49 UTC the incident is escalated to our on-call engineering team, which starts working on the issue immediately.
  • The issue is pinpointed to a cache problem in the component responsible for routing requests and validating authentication credentials.
  • Upon restarting this component at 19:04 UTC, the affected databases come back up and resume responding to incoming requests and monitoring checks.
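
To clarify the automated behavior described above, here is a minimal sketch of an active-monitoring loop of the kind the timeline describes. It is an illustration under stated assumptions only: the names (check_health, restart_database, monitor), the failure threshold, and the check interval are hypothetical, not GrapheneDB's actual implementation.

    import time
    import urllib.request
    import urllib.error

    FAILURE_THRESHOLD = 3   # consecutive failed checks before restarting (assumed)
    CHECK_INTERVAL_S = 30   # seconds between health checks (assumed)

    def check_health(endpoint: str) -> bool:
        """Return True if the database answers its health check with HTTP 2xx."""
        try:
            with urllib.request.urlopen(endpoint, timeout=5) as resp:
                return 200 <= resp.status < 300
        except urllib.error.HTTPError:
            # A 500 response lands here; this is what the monitor saw at 18:18 UTC.
            return False
        except urllib.error.URLError:
            # Connection-level failures also count as failed checks.
            return False

    def restart_database(db_id: str) -> None:
        """Placeholder for the restart action the monitor triggers."""
        print(f"restarting {db_id}")

    def monitor(db_id: str, endpoint: str) -> None:
        failures = 0
        while True:
            if check_health(endpoint):
                failures = 0
            else:
                failures += 1
                # Repeated failures are interpreted as a malfunctioning Neo4j
                # instance, so the database is restarted.
                if failures >= FAILURE_THRESHOLD:
                    restart_database(db_id)
                    failures = 0
            time.sleep(CHECK_INTERVAL_S)

In this incident that interpretation was wrong: the 500s originated in the shared routing/authentication component rather than in the Neo4j instances themselves, so restarting the databases could not restore availability until that component was itself restarted at 19:04 UTC.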

Root cause:

This was an isolated incident. Our engineers are continuing to investigate the root cause.

Posted Apr 23, 2018 - 14:24 BST

Resolved
This incident has been resolved. Please see the postmortem for details.
Posted Apr 22, 2018 - 20:04 BST