SPM-2024-002 - Databases and network unavailability on osc-fr1

TL;DR

On January 29th, at 20:08 (CET), one of our database nodes became unresponsive due to a sudden RAM spike. Most databases recovered automatically after a reboot, but one required manual intervention. During that intervention, an operator error led to the execution of a command that deleted a large range of metadata, disrupting network traffic and causing platform downtime. We manually reconstructed the metadata from other sources, and the issue was resolved by Jan. 30 01:57.

The disruption also left our internal DNS resolution service in a failed state, which initially went unnoticed. This led to increased network traffic and latency by Jan. 30 09:00. We restored complete network stability by 09:52.

While this incident affected network access to your databases, the integrity and confidentiality of your data and applications were maintained.

What happened?

On Monday, Jan. 29 20:08 (CET), one of our database nodes became unresponsive. This was due to a sudden spike in RAM demand on the node. The system attempted to allocate multiple gigabytes of memory in response and tried to rapidly reclaim the disk cache to do so. However, it was unable to free the required memory fast enough, and the host froze.

An operator was alerted through our paging system to address the issue. Our response procedure for such incidents is well documented and largely automated. Once the host was rebooted, an automated process restarted all databases on that node. This worked as expected for most of the databases. However, one database was still detected as down, requiring manual intervention by the operator.

At Scalingo, all databases run in a private network managed by a piece of software called Sand. In this case, the persistent downtime of the database was caused by an error in Sand's metadata. Our database backup process involves starting a new container within the Sand network, performing the backup, and then removing the container. During the previous night, this routine had failed: the operation to destroy the backup container did not complete. This left the Sand database in an inconsistent state, with some metadata entries deleted while others remained.

The standard way to resolve such a situation is to use the Sand Command Line Interface (CLI) to delete the leaked metadata. However, because the container had not been fully deleted, the CLI method failed. Our operator therefore had to manually clean up the erroneous metadata inside the Sand database.

Sand relies on a storage technology called etcd, which is managed with a CLI called etcdctl. To delete a key from etcd, you use the etcdctl del command. When executed with a single argument (e.g., etcdctl del a), it deletes only the specific key a. But if given two arguments, like etcdctl del a z, it performs a range delete, erasing not just a but every key in the range between a and z. This user-experience pitfall was already reported to the etcd team in 2022.
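
For reference, the same two behaviors can be reproduced with the official etcd Go client (go.etcd.io/etcd/client/v3). The snippet below is a minimal illustration, not our tooling; the endpoint address is a placeholder.

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Single argument: deletes only the key "a".
        single, err := cli.Delete(ctx, "a")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("keys deleted:", single.Deleted)

        // Two arguments: range delete over [a, z), i.e. "a" and every key
        // sorting before "z". This is the equivalent of "etcdctl del a z".
        ranged, err := cli.Delete(ctx, "a", clientv3.WithRange("z"))
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("keys deleted:", ranged.Deleted)
    }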

Due to a bad copy/paste, an extra space ended up inside the command, splitting a single key into two arguments and triggering the deletion of an extensive range of Sand metadata.

All Sand instances running on our hosts detected the changes in their databases and began applying the updated configuration. This new configuration included the removal of entries in the ARP/FDB tables, which the system uses to determine where and how to route packets in the private networks. Without this configuration, traffic inside the private database networks was no longer routed, causing downtime.
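
To give an idea of what this configuration looks like at the host level, the sketch below lists the ARP (neighbour) and FDB (forwarding) entries of an interface using the vishvananda/netlink Go library. This is not Sand's code, and the interface name is a placeholder; the actual interfaces managed by Sand are internal to our infrastructure.

    package main

    import (
        "fmt"
        "log"
        "syscall"

        "github.com/vishvananda/netlink"
    )

    func main() {
        // "vxlan0" is a placeholder interface name.
        link, err := netlink.LinkByName("vxlan0")
        if err != nil {
            log.Fatal(err)
        }

        // FDB entries map MAC addresses to the remote endpoint they live behind.
        fdb, err := netlink.NeighList(link.Attrs().Index, syscall.AF_BRIDGE)
        if err != nil {
            log.Fatal(err)
        }
        for _, n := range fdb {
            fmt.Printf("FDB: %s -> %s\n", n.HardwareAddr, n.IP)
        }

        // ARP (neighbour) entries map IP addresses to MAC addresses.
        arp, err := netlink.NeighList(link.Attrs().Index, netlink.FAMILY_V4)
        if err != nil {
            log.Fatal(err)
        }
        for _, n := range arp {
            fmt.Printf("ARP: %s -> %s\n", n.IP, n.HardwareAddr)
        }
    }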

For the etcd databases, our choice was not to keep backups, based on the following rationale:

  • The database is designed for high availability, with redundancy across multiple hosts and robust monitoring systems in place to detect any issues.
  • A daily backup of the database would not help much: the infrastructure changes significantly within a day, so restoring a valid state would require too much work to be worth it.
  • In case of a major incident, it is possible to rebuild the entire database using alternative metadata sources.

However, we did not anticipate this scenario, where the entire infrastructure would remain operational but the data within Sand would become invalid. Thus, we had neither the tools nor the procedures to effectively manage this specific situation.

At this point, the on-call team had to make a decision. We had two options: we could either wipe the rest of the data in etcd and rebuild it from our other sources, or attempt to manually reconstruct the database content from the metadata present in the running containers.

Ultimately, we opted for the second option, since the alternative would have required restarting every database in the region, a process that could take many hours and potentially lead to more extensive platform downtime.

Our engineers focused on developing a method to recover the Sand metadata from the running containers. Two hours later, they executed a script across the entire platform to reconstruct this metadata.

Once all the metadata was fetched, we initiated the import scripts, which took an additional 30 minutes.
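
To give a rough idea of what these two steps looked like, here is a purely illustrative sketch: collect each running container's network identity, then write it back into etcd. The key layout, the ContainerInfo type and the listContainers helper are hypothetical; the real Sand schema and collection tooling are internal.

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // ContainerInfo is a hypothetical summary of what was read from the
    // running containers (network namespace inspection).
    type ContainerInfo struct {
        ID  string
        IP  string
        MAC string
    }

    // listContainers stands in for the platform-specific collection step.
    func listContainers() []ContainerInfo {
        return []ContainerInfo{{ID: "web-1", IP: "10.0.0.12", MAC: "aa:bb:cc:dd:ee:ff"}}
    }

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()

        for _, c := range listContainers() {
            // Hypothetical key layout: one key per container holding its
            // network configuration.
            key := fmt.Sprintf("/sand/containers/%s", c.ID)
            value := fmt.Sprintf(`{"ip":%q,"mac":%q}`, c.IP, c.MAC)
            if _, err := cli.Put(ctx, key, value); err != nil {
                log.Fatalf("failed to restore %s: %v", key, err)
            }
        }
    }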

The operation was successful, resulting in all databases on the platform being fully operational again.

For the next two hours, we conducted post-incident checks to ensure system stability and functionality. The incident was officially declared resolved at 01:57 (CET).

However, one issue was missed during the post-incident checks: one of the affected databases was a Redis instance used by our internal DNS resolution service.

If you try to resolve a database domain on the Scalingo network, the request is handled by a service called GoDNS. This server relies on a local Redis database to determine the internal network address of a database. However, if this service is non-functional or if the value is not found inside the Redis database, the domain name resolution defaults to our public DNS servers.
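
The sketch below illustrates this resolution order: answer from the local Redis cache when possible, otherwise fall back to public DNS. It is not GoDNS's actual code; the key naming ("dns:<domain>"), the Redis address and the domain are assumptions made for the example.

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"

        "github.com/redis/go-redis/v9"
    )

    // resolve looks the domain up in the local Redis cache first and falls
    // back to public DNS when the service or the key is unavailable.
    func resolve(ctx context.Context, rdb *redis.Client, domain string) ([]string, error) {
        ip, err := rdb.Get(ctx, "dns:"+domain).Result()
        if err == nil {
            return []string{ip}, nil // answered from the local cache
        }
        // Key missing (redis.Nil) or Redis unreachable: fall back to public DNS.
        return net.DefaultResolver.LookupHost(ctx, domain)
    }

    func main() {
        rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"}) // placeholder address
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        addrs, err := resolve(ctx, rdb, "my-db.example.com") // placeholder domain
        if err != nil {
            fmt.Println("resolution failed:", err)
            return
        }
        fmt.Println(addrs)
    }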

After this incident, GoDNS was non-operational. The issue went undetected, as our DNS stack continued providing valid responses through public DNS servers, allowing applications to maintain database connectivity.

GoDNS is mainly used for resilience and caching purposes. Publicly exposed databases normally resolve to our public TCP gateway through public DNS. For internal communication this would add significant overhead, since traffic between an application and its database would have to go through the internet. Thus, even for internet-exposed databases, GoDNS typically resolves domains to internal network IPs.

With GoDNS down, internal traffic to internet-exposed databases began routing through our public TCP gateway. This went unnoticed at the time due to low traffic volume. However, on Jan. 30 08:00 (CET), the amount of traffic going through those endpoints started to saturate the component responsible for handling our network at our infrastructure provider. This manifested as rising latency and eventually packet loss across the entire osc-fr1 region. By 09:00, our operators were paged due to an issue on the platform; ten minutes later, GoDNS was restarted.

After restarting GoDNS, applications with publicly accessible databases were also rebooted. On Jan. 30 09:52, network stability was completely restored.

Timeline of the incident

All times given are in CET (UTC+1)

Jan. 29 20:05 One of our database nodes is misbehaving.
Jan. 29 20:17 The databases hosted on this node are being restarted.
Jan. 29 20:38 The automated process finished. One database still had to be manually handled.
Jan. 29 20:47 The etcdctl del command is executed. Routing stops in many database private networks.
Jan. 29 20:51 We are notified of an elevated error rate on many components.
Jan. 29 20:52 The incident severity is raised. Two more operators are called to help with the incident management.
Jan. 29 20:52 The source of the incident is found: the Sand database is missing data.
Jan. 29 21:35 The resolution path is chosen; our operators start working on recovery scripts.
Jan. 29 23:24 The reconstruction of the missing metadata is started on the platform.
Jan. 30 00:08 Start of the ingestion of the missing metadata.
Jan. 30 00:35 The ingestion is complete. All databases are back up and running.
Jan. 30 01:57 Post-incident checks are finished; the incident is considered over.
Jan. 30 08:00 First signs of network unavailability.
Jan. 30 09:06 The network on osc-fr1 is heavily degraded, alerts are triggered and errors are reported by customers to support.
Jan. 30 09:09 The incident is re-opened.
Jan. 30 09:11 The component in charge of our internal DNS Resolution is restarted.
Jan. 30 09:17 Applications with a database that is publicly available are being restarted.
Jan. 30 09:52 The network performance is stable again.
Jan. 30 10:16 Incident is closed.

Impact

There was no impact for our customers on osc-secnum-fr1. On osc-fr1:

  • Databases were unavailable for 3 hours and 38 minutes (from Jan. 29 20:47 to Jan. 30 00:35).
  • Network access to the applications was degraded for 52 minutes (from Jan. 30 09:00 to Jan. 30 09:52).

Remediation plan in progress

  • etcd backups will be enabled. Our goal is to set up regular etcd backups and give our operators an easy way to back up and restore our etcd database (a minimal backup sketch is shown after this list).
  • We will publish a fix to our Sand API to be able to delete metadata linked to a container even if some parts of the metadata have already been deleted. This is to avoid manual interventions on the etcd database.
  • Our procedures regarding manual actions on etcd will be updated to require a backup before any action on the database and to prevent the use of etcdctl del with more than one argument.
  • We are working on a Private Network feature. A replacement for the Sand software is currently being written with a focus on reliability. Our long-term goal is to replace Sand with this new software.
  • GoDNS is being deprecated and will be replaced by CoreDNS. The goal of this replacement is to build our network stack on more modern and reliable software. This project includes requirements that will prevent this kind of DNS incident and make similar failures much easier to detect.
  • A new team was staffed last year to work specifically on improving the underlying infrastructure components. Our goal is to move from one large region to multiple smaller ones. This approach will keep the impact of such incidents more localized and reduce the blast radius of this kind of manipulation.
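
As referenced in the first item of this list, here is a minimal sketch of taking an etcd snapshot with the Go client (etcdctl snapshot save provides the same functionality from the command line). The endpoint and file path are placeholders.

    package main

    import (
        "context"
        "io"
        "log"
        "os"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        // Stream a consistent snapshot of the keyspace to a local file.
        rc, err := cli.Snapshot(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        defer rc.Close()

        f, err := os.Create("/var/backups/etcd-snapshot.db") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if _, err := io.Copy(f, rc); err != nil {
            log.Fatal(err)
        }
    }

Restoring from such a snapshot file is handled by the etcdctl / etcdutl snapshot tooling.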

Financial compensation

This incident may have caused up to 4 hours and 30 minutes of interruption to your applications and databases, which is beyond our 99.9% monthly SLA for applications using 2 containers or more and for addons on Business plans.

Changelog

2024-01-31: Initial version

