SPM-2024-003 - DNS Service Disruption
TL;DR
The Scalingo platform provides a DNS (Domain Name System) resolution service; DNS is the network protocol applications use to resolve host names to IP addresses. On July 24th, a DNS service disruption occurred, affecting 50% of DNS traffic on the osc-fr1 and osc-secnum-fr1 regions for 40 minutes, followed by a full disruption on osc-secnum-fr1 for 5 minutes. Applications heavily reliant on DNS resolution may have been affected.
Integrity and confidentiality of your data and applications were never at risk during the incident.
What happened?
At Scalingo, we provide a Platform as a Service (PaaS) that includes DNS resolution functionality. DNS resolution involves translating domain names into IP addresses. This allows hosted apps to establish connections to third-party services such as databases and external APIs.
On Wednesday, July 24th, our internal DNS resolver infrastructure experienced a partial service disruption. The osc-fr1 and osc-secnum-fr1 regions faced a service degradation from 08:20 to 09:00 AM, followed by a brief full disruption on osc-secnum-fr1 from 09:00 to 09:05 AM.
The root cause was traced back to an upgrade of the bind component. The DNS servers run on the Ubuntu Server operating system, which provides a mechanism called unattended-upgrades that automatically installs security updates to maintain system integrity without manual oversight. This mechanism was enabled for the bind component and had already run smoothly multiple times. However, on July 24th at 08:20 AM on the osc-fr1 region, and a bit later on the osc-secnum-fr1 region, this process was triggered and caused a partial disruption of the service. The upgrade failed because the newly deployed version of bind dropped support for a configuration option named dnssec-enable. Notably, this parameter has been deprecated because DNSSEC is now enabled by default in recent bind releases. The obsolete option triggered a configuration error and prevented the service from being restarted, causing the service disruption.
Each region within our platform is configured with two DNS resolvers, and DNS requests on the platform are evenly distributed between them. This is done via the resolv.conf file, which is the standard way on Linux systems to define the DNS resolution configuration. Among other things, it lists the DNS servers to use for DNS resolution. This file usually looks like:
nameserver 10.255.0.10
nameserver 10.255.0.11
Where 10.255.0.10 and 10.255.0.11 are the IPs of the two DNS resolvers. This file is then used by applications (either directly or via libraries such as libc) to perform DNS resolutions. Most libraries use the same core mechanism: first try the resolution on the first nameserver, then, if it fails, try it on the second nameserver. However, detecting that the first resolver is down can take a long time, so the failover may add considerable latency. Finally, the order of the two nameserver entries is randomized on the platform, so only 50% of applications were directly impacted (either completely disrupted or experiencing high latency in DNS resolutions, depending on the client implementation). The other 50% of applications were not affected.
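For completeness, the glibc resolver exposes a few resolv.conf options that influence this failover and ordering behaviour. The snippet below is purely illustrative; the values shown are not the platform's actual settings:
nameserver 10.255.0.10
nameserver 10.255.0.11
# Illustrative values only:
# timeout:1  - wait at most 1 second for a nameserver before trying the next one
# attempts:2 - go through the nameserver list up to 2 times before giving up
# rotate     - spread queries round-robin across the listed nameservers
options timeout:1 attempts:2 rotate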
Furthermore, the DNS resolvers are designed so that, if one resolver were to fail, the other could take over the entire DNS resolution load. Unfortunately, this mechanism was designed for a complete hardware failure rather than a single software failure, and thus was not triggered during the incident. We've identified this shortfall and are actively working on improving our failover strategies, as outlined in the action plans further in this document.
The anomaly was detected at 08:40 AM by our monitoring systems in osc-fr1. By 08:45 AM, our operations team manually adjusted the configuration by removing the outdated parameter. Normal service resumed at 09:00 AM. This workaround was immediately replicated when a similar issue occurred at 09:00 AM in the osc-secnum-fr1 region.
To make these changes persistent, they were propagated to all servers via our configuration management tool, which runs continuously on the platform to keep server configurations consistent.
There was a 20-minute delay between the onset of the incident on our platform and the time we were alerted, which was caused by inadequate monitoring. Our system alerted us to the consequences of the incident rather than its root cause. We’ve identified this issue and are currently working on enhancing our monitoring capabilities to address it.
Timeline of the incident
All times given are in CEST (UTC+2)
Jul. 24 08:20 | Unattended update of bind on one of our 2 DNS resolvers in the osc-fr1 region |
Jul. 24 08:34 | Unattended update of bind on one of our 2 DNS resolvers in the osc-secnum-fr1 region |
Jul. 24 08:40 | Monitoring system triggers a call to the on-call operator |
Jul. 24 08:45 | The on-call operator begins investigations |
Jul. 24 08:59 | Deployment of a workaround on osc-fr1 |
Jul. 24 08:59 | Unattended update of bind on the other DNS resolver in the osc-secnum-fr1 region. Both DNS resolvers in this region are now unreachable |
Jul. 24 09:01 | End of the incident on osc-fr1 |
Jul. 24 09:04 | Deployment of the workaround on osc-secnum-fr1 (same as on osc-fr1) |
Jul. 24 09:05 | End of the incident on osc-secnum-fr1 |
Impact
- On osc-fr1, 50% of the DNS traffic was impacted for 40 minutes. Depending on your applications, this may have been noticeable (errors when establishing a connection to your databases or third-party services, for example).
- On osc-secnum-fr1, 50% of the DNS traffic was impacted for 27 minutes, followed by a full disruption for 5 minutes. Depending on your applications, this may have been noticeable (errors when establishing a connection to your databases or third-party services, for example).
Immediate Actions Taken
- Manual removal of the obsolete dnssec-enable parameter from the bind configuration file
- Removal of this parameter from the bind configuration file deployed through our infrastructure configuration management tool
Actions in progress
- Improvement of our monitoring system to detect the loss of a service immediately
- Improvement of our DNS failover process so that the loss of the bind service triggers a failover
- Analysis of how we can ensure that packages updated by Ubuntu's unattended-upgrades process are first applied to our test environments before being deployed in production (see the sketch below for one possible mechanism)
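As an illustration of one possible direction for that last point, unattended-upgrades can be told to skip specific packages so that their updates go through a controlled rollout instead of being applied automatically. This is only a sketch of the mechanism, not a decision we have made; the file path shown is the tool's standard location on Ubuntu:
// /etc/apt/apt.conf.d/50unattended-upgrades (illustrative excerpt)
// Packages listed here are never upgraded automatically; their updates would
// instead be validated in test environments and rolled out manually.
Unattended-Upgrade::Package-Blacklist {
    "bind9";
};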
Financial compensation
Our service level agreement (SLA) guarantees 99.9% availability for applications deployed across two or more containers, which translates to a maximum of 43 minutes of downtime per month. This incident, which may have impacted your applications for up to 40 minutes, is within the bounds of our SLA commitments.
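For reference, 99.9% availability allows 0.1% downtime, and 0.1% of a 30-day month is 30 × 24 × 60 × 0.001 ≈ 43 minutes.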
If you consider that you have experienced more than 43 minutes of unavailability over the month of July, please contact our support team.
Changelog
2024-08-02: Initial version