Signalling unavailable
Incident Report for AVStack
Postmortem

On 2021-10-14, for approximately 15 minutes (05:20 to 05:35 UTC), signalling was unavailable globally on AVStack. As a result, new conferences failed to start and new participants were unable to join existing conferences.

During this time, our API and console were also unavailable.

Participants who were already in a conference were not affected, although their metrics were not collected during the 15-minute outage.

We deeply apologise for the impact of this outage on our customers' businesses. The details below explain what happened and the steps we will take to prevent a recurrence of any similar issue.

Background

AVStack uses databases in each signalling location (currently Singapore, Germany and eastern USA) to store stack-level data. This data is kept local to the signalling location, both for performance reasons and to comply with data locality requirements in certain jurisdictions.

We also use a separate database in Singapore to store account-level data and usage data for billing purposes.

At each of these locations, we use replication to provide redundancy across AWS availability zones. An active database server continuously replicates all changes to a standby database server located in a different availability zone. In the event that the active database server fails, the standby can be “promoted” to take its place. This process is completely automated.
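
For illustration only: this report does not name the specific AWS service involved, but the behaviour described (a standby in a second availability zone with automatic promotion) matches an AWS RDS Multi-AZ deployment. The sketch below shows how such a deployment might be provisioned with boto3; the identifier, engine, instance class and credentials are placeholder values, not our actual configuration.

    import boto3

    # Illustrative sketch: provision a Multi-AZ relational database. AWS then
    # maintains a synchronous standby in a second availability zone and
    # promotes it automatically if the active instance fails.
    rds = boto3.client("rds", region_name="ap-southeast-1")

    rds.create_db_instance(
        DBInstanceIdentifier="example-signalling-db",  # placeholder name
        Engine="postgres",                             # assumed engine
        DBInstanceClass="db.m5.large",                 # placeholder size
        AllocatedStorage=100,
        MasterUsername="example_admin",
        MasterUserPassword="change-me-now",            # placeholder credential
        MultiAZ=True,  # enables the cross-AZ standby and automated failover
    )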

A project started earlier this month is implementing replication of the account-level database to each signalling location, to allow local queries of this information and to provide read-only global redundancy in the event of a failure. Once complete, this project will reduce the impact of this type of failure.

What happened in this outage

The “active” database server for the primary database in Singapore went into a degraded state, probably due to a failure of the hardware underlying the VM instance. AWS automatically initiated a failover of the database to one of the replicas, which under normal circumstances takes 1 to 2 minutes. However, at the time the failure occurred, a batch process had just committed a large transaction that had not yet been fully written to persistent storage. That transaction had to be replayed before the failover could complete, extending the failover to approximately 8 minutes.

During these 8 minutes, the primary database was unavailable for reads or writes. Our authorisation server, which decides whether to permit or deny new participants in conferences, relies on this server cluster for account-level data such as the subscription plan that the account is on. This data is cached in each location, but the cache has a limited lifetime. Due to this caching, stacks which are accessed frequently may have experienced the beginning of the failure a few minutes later than stacks which are accessed infrequently.
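
As a minimal sketch of the caching behaviour described above (the real cache lifetime, key structure and function names are not published here and are assumed), the authorisation server serves account-level data from a short-lived local cache and only queries the database when an entry has expired:

    import time

    CACHE_TTL_SECONDS = 300  # assumed cache lifetime; the real value differs
    _cache = {}              # account_id -> (timestamp, account data)

    def get_account_data(account_id, fetch_from_db):
        """Return cached account data if still fresh, otherwise refresh from the database."""
        now = time.monotonic()
        entry = _cache.get(account_id)
        if entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]  # cache hit: the account-level database is not touched
        data = fetch_from_db(account_id)  # raises if the account-level database is down
        _cache[account_id] = (now, data)
        return data

A stack whose data was cached shortly before the outage therefore keeps authorising joins until its cache entry expires, which is why frequently-used stacks saw the failure begin a few minutes later than infrequently-used ones.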

Because the authorisation server could not access the primary database, it responded with an error to authorisation requests from each stack’s XMPP server, which caused the XMPP servers to deny new participants.

Once the failover completed, stacks started to allow joins again at differing rates as their backoff timers expired. All components that access our databases implement short backoffs when they fail to connect, to prevent any minor failure from being amplified into a larger overload failure by a large quantity of rapid retries.
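
The exact retry strategy used by each component is not detailed in this report; the sketch below illustrates the general pattern of a short, capped, jittered backoff that prevents every client from reconnecting at the same instant once the database recovers. The function name, delays and cap are illustrative values.

    import random
    import time

    def connect_with_backoff(connect, max_delay_seconds=30.0):
        """Retry a failing connection attempt with capped, jittered delays.

        `connect` is a placeholder for any callable that raises
        ConnectionError on failure.
        """
        delay = 1.0
        while True:
            try:
                return connect()
            except ConnectionError:
                # Wait before retrying so that many clients do not reconnect
                # at exactly the same moment once the database comes back.
                time.sleep(delay + random.uniform(0, delay))
                delay = min(delay * 2, max_delay_seconds)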

The period of time during which each stack was unable to authorise new participants was between 8 and 13 minutes.

Our engineers identified the cause of the failure within a few minutes of its onset. However, as there was no action they could take to speed up transaction replay on the database server being promoted from standby to active, engineering activity during this failure was limited to monitoring the automatic failover and recovery process.

Analysis of factors contributing to the outage

Our current database architecture was designed to provide a maximum downtime of 2 minutes in the event of the failure of a single AWS availability zone. Our existing improvement project, which adds replication to each signalling location, is designed to confine the impact of such downtime to the API and console, allowing signalling (and thus conferences) to continue to operate even during a database outage.

In the specific circumstances of this failure, this 2-minute design goal was not achieved due to the access pattern of one of our batch jobs causing a longer failover time. This access pattern could be modified to produce smaller transactions, reducing or removing the impact on failover time if a failover occurs soon after the batch job commits.
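
A sketch of what such a modification could look like (the batch job, table and column names below are invented for illustration): committing in small chunks bounds the amount of un-persisted work a standby must replay if a failover begins immediately after a commit.

    def write_usage_records(conn, records, chunk_size=1000):
        """Insert records in many small transactions instead of one large one."""
        cur = conn.cursor()
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) >= chunk_size:
                cur.executemany(
                    "INSERT INTO usage (account_id, stack_id, minutes) VALUES (%s, %s, %s)",
                    batch,
                )
                conn.commit()  # each transaction stays small, limiting replay time
                batch = []
        if batch:
            cur.executemany(
                "INSERT INTO usage (account_id, stack_id, minutes) VALUES (%s, %s, %s)",
                batch,
            )
            conn.commit()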

The authorisation server currently returns an error whenever it is unable to authorise a participant due to an internal failure, such as being unable to access the database. This could be improved by differentiating between failures of the two databases (stack-level and account-level). If the stack-level database is not accessible, the current behaviour should be preserved: the stack may have specific authorisation requirements (for example JWT), and it could be unsafe to allow the participant to join when these requirements cannot be applied. However, if only the account-level database is not accessible, the authorisation could “fail open”, since the only potential impact would be allowing an account to exceed its seat limit during the outage.
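
To illustrate the proposed behaviour (the class, function and attribute names below are placeholders, not our actual code), the authorisation decision would distinguish which database is unreachable and fail open only for the account-level one:

    class StackDatabaseUnavailable(Exception):
        """The stack-level (local) database cannot be reached."""

    class AccountDatabaseUnavailable(Exception):
        """The account-level (Singapore) database cannot be reached."""

    def authorise_join(request, stack_store, account_store):
        """Decide whether a new participant may join a conference."""
        try:
            stack_config = stack_store.get_config(request.stack_id)
        except StackDatabaseUnavailable:
            # Fail closed: stack-specific requirements such as JWT cannot be
            # checked, so admitting the participant would be unsafe.
            return False

        if not stack_config.permits(request):
            return False

        try:
            account = account_store.get_account(stack_config.account_id)
        except AccountDatabaseUnavailable:
            # Fail open: the worst case is an unenforced seat limit for the
            # duration of the outage.
            return True

        return account.within_seat_limit()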

Actions we will take to prevent a recurrence

  • We will continue the project to replicate account-level data to the signalling locations. Once complete, this project will confine the impact of this type of failure to the API and console; signalling (and thus conferences) will be unaffected. We expect to complete this project by 2021-10-21.
  • We will investigate options for altering the way our batch jobs interact with the database so that they produce smaller transactions, reducing the impact on failover time in the unlikely event that a failover occurs soon after batch processing.
  • We will modify the authorisation server component to “fail open” in the event that it cannot access account-level data. Once complete, this type of failure will merely prevent seat limits from being enforced during the failure period and will otherwise have no visible customer impact. (UPDATE 2021-10-15: This modification has been completed.)
Posted Oct 14, 2021 - 07:53 UTC

Resolved
Monitoring is complete and the incident has been confirmed as resolved. The outage period was approximately 05:20 UTC - 05:35 UTC.
Posted Oct 14, 2021 - 06:15 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 14, 2021 - 05:35 UTC
Identified
An issue with backend database servers provided by AWS is affecting signalling availability in all locations, as well as the API and console. We have identified the issue and are implementing a fix.
Posted Oct 14, 2021 - 05:20 UTC
This incident affected: API, Console and Managed Jitsi Meet Platform (Signalling - Singapore, Signalling - Frankfurt, Signalling - Virginia).