SPM-2024-003 - DNS Service Disruption
TL;DR
The Scalingo platform provides a DNS (Domain Name System) resolution service; DNS is the network protocol applications use to resolve host names to IP addresses. On July 24th, a DNS service disruption occurred, affecting 50% of DNS traffic on the osc-fr1 and osc-secnum-fr1 regions for 40 minutes, followed by a full disruption on osc-secnum-fr1 for 5 minutes. Applications heavily reliant on DNS resolution may have been affected.
Integrity and confidentiality of your data and applications were never at risk during the incident.
What happened?
At Scalingo, we provide a Platform as a Service (PaaS) that includes DNS resolution functionality. DNS resolution involves translating domain names into IP addresses. This allows hosted apps to establish connections to third-party services such as databases and external APIs.
On Wednesday, July 24th, our internal DNS resolver infrastructure experienced a partial service disruption. The osc-fr1 and osc-secnum-fr1 regions faced a service degradation from 08:20 to 09:00 AM, followed by a brief full disruption on osc-secnum-fr1 from 09:00 to 09:05 AM.
The root cause was traced back to an upgrade of the bind component. The DNS servers run on the Ubuntu Server operating system, which provides a mechanism called unattended-upgrades that automatically installs security updates to maintain system integrity without manual oversight. This mechanism was enabled for the bind component and had already run smoothly multiple times. However, on July 24th at 08:20 AM on the osc-fr1 region, and a bit later on the osc-secnum-fr1 region, this process was triggered and caused a partial disruption of the service. The upgrade failed because the newly deployed version of bind dropped support for a configuration option named dnssec-enable. Notably, this parameter has been deprecated because DNSSEC is now enabled by default in recent bind releases. The obsolete option triggered a configuration error and prevented the service from being restarted, causing the service disruption.
Each region within our platform is configured with two DNS resolvers, and DNS requests on the platform are evenly distributed between them. This is done via the resolv.conf file, which is the standard way on Linux systems to define the DNS resolution configuration. Among other things, it lists the DNS servers to use for DNS resolution. This file usually looks like:
nameserver 10.255.0.10
nameserver 10.255.0.11
Where 10.255.0.10 and 10.255.0.11 are the IPs of the two DNS resolvers. This file is then used by applications (either directly or via libraries such as libc) to perform DNS resolutions. Most libraries use the same core mechanism: first try the resolution on the first nameserver, then, if it fails, try it on the second nameserver. However, detecting that the first resolver is down can take a long time, so the failover may add considerable latency. Finally, the order of the two nameserver entries is randomized on the platform, so only 50% of applications were directly impacted (either completely disrupted or experiencing high latency in DNS resolutions, depending on the client implementation). The other 50% of applications were not affected.
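For completeness, the glibc resolver exposes a few resolv.conf options that influence this failover and ordering behaviour. The snippet below is purely illustrative; the values shown are not the platform's actual settings:
nameserver 10.255.0.10
nameserver 10.255.0.11
# Illustrative values only:
# timeout:1  - wait at most 1 second for a nameserver before trying the next one
# attempts:2 - go through the nameserver list up to 2 times before giving up
# rotate     - spread queries round-robin across the listed nameservers
options timeout:1 attempts:2 rotate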
Furthermore, the DNS resolvers are designed so that, if one resolver were to fail, the other could take over the entire DNS resolution load. Unfortunately, this mechanism was designed for a complete hardware failure rather than a single software failure, and thus was not triggered during the incident. We've identified this shortfall and are actively working on improving our failover strategies, as outlined in the action plans further in this document.
The anomaly was detected at 08:40 AM by our monitoring systems in osc-fr1. By 08:45 AM, our operations team manually adjusted the configuration by removing the outdated parameter. Normal service resumed at 09:00 AM. This workaround was immediately replicated when a similar issue occurred at 09:00 AM in the osc-secnum-fr1 region.
To make these changes persistent, they were propagated to all servers via our configuration management tool, which runs continuously on the platform to keep server configurations consistent.
There was a 20-minute delay between the onset of the incident on our platform and the time we were alerted, which was caused by inadequate monitoring. Our system alerted us to the consequences of the incident rather than its root cause. We’ve identified this issue and are currently working on enhancing our monitoring capabilities to address it.
Timeline of the incident
All times given are in CEST (UTC+2)
Jul. 24 08:20 | Unattended update of bind on one of our 2 DNS resolvers in the osc-fr1 region |
Jul. 24 08:34 | Unattended update of bind on one of our 2 DNS resolvers in the osc-secnum-fr1 region |
Jul. 24 08:40 | Monitoring system triggers a call to the on-call operator |
Jul. 24 08:45 | The on-call operator begins investigations |
Jul. 24 08:59 | Deployment of a workaround on osc-fr1 |
Jul. 24 08:59 | Unattended update of bind on the other DNS resolver in the osc-secnum-fr1 region. Both DNS resolvers in this region are now unreachable |
Jul. 24 09:01 | End of the incident on osc-fr1 |
Jul. 24 09:04 | Deployment of the workaround on osc-secnum-fr1 (same as on osc-fr1) |
Jul. 24 09:05 | End of the incident on osc-secnum-fr1 |
Impact
- On osc-fr1, 50% of the DNS traffic was impacted for 40 minutes. Depending on your applications, this may have been noticeable (errors when establishing a connection to your databases or third-party services, for example).
- On osc-secnum-fr1, 50% of the DNS traffic was impacted for 27 minutes, followed by a full disruption for 5 minutes. Depending on your applications, this may have been noticeable (errors when establishing a connection to your databases or third-party services, for example).
Immediate Actions Taken
- Manual removal of the obsolete dnssec-enable parameter from the bind configuration file
- Removal of this parameter from the bind configuration file deployed through our infrastructure configuration management tool
Actions in progress
- Improvement of our monitoring system to detect the loss of a service immediately
- Improvement of our DNS failover process so that the loss of the bind service triggers a failover
- Analysis of how we can ensure that packages updated by Ubuntu's unattended-upgrades process are first applied to our test environments before being deployed in production (see the sketch below for one possible mechanism)
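As an illustration of one possible direction for that last point, unattended-upgrades can be told to skip specific packages so that their updates go through a controlled rollout instead of being applied automatically. This is only a sketch of the mechanism, not a decision we have made; the file path shown is the tool's standard location on Ubuntu:
// /etc/apt/apt.conf.d/50unattended-upgrades (illustrative excerpt)
// Packages listed here are never upgraded automatically; their updates would
// instead be validated in test environments and rolled out manually.
Unattended-Upgrade::Package-Blacklist {
    "bind9";
};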
Financial compensation
Our service level agreement (SLA) guarantees 99.9% availability for applications deployed across two or more containers, which translates to a maximum of 43 minutes of downtime per month. This incident, which may have impacted your applications for up to 40 minutes, is within the bounds of our SLA commitments.
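For reference, 99.9% availability allows 0.1% downtime, and 0.1% of a 30-day month is 30 × 24 × 60 × 0.001 ≈ 43 minutes.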
If you consider that you have experienced more than 43 minutes of unavailability over the month of July, please contact our support team.
Changelog
2024-08-02: Initial version