SPM-2023-001 - Partial unavailability of hosted applications on osc-secnum-fr1
On Wednesday, 4th October, a disruption affected Cloudflare’s DNS resolution on resolver 1.1.1.1. This had a side effect on the Scalingo platform in the osc-secnum-fr1 region: 2 of our 4 public IP addresses stopped accepting requests for a period of 50 minutes.
While this incident affected network access to your apps, the integrity and security of your data and applications were maintained. Access to databases directly accessible on the Internet was not affected.
Root Cause: DNS errors were not detected while updating the load balancers' upstream server lists.
Our platform uses load balancer servers to redirect traffic to backend servers, which we refer to as routers. These load balancers retrieve the available router servers through a service discovery process.
This process is based on the DNS protocol: we use an SRV record to store the list of currently active routers on the platform.
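As an illustration, building an upstream list from SRV answers can be sketched as follows. The record contents and router hostnames are invented for the example; this is not Scalingo's actual implementation.

```python
def parse_srv_records(records):
    """Order SRV answers (priority, weight, port, target) by priority,
    then by descending weight, and build an upstream list.
    (Real SRV handling picks weighted-randomly within a priority level;
    a plain sort keeps this sketch deterministic.)"""
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    return [f"{target}:{port}" for _, _, port, target in ordered]

# Hypothetical SRV answers as a resolver might return them:
records = [
    (10, 50, 443, "router-2.internal"),
    (10, 60, 443, "router-1.internal"),
]
print(parse_srv_records(records))
# → ['router-1.internal:443', 'router-2.internal:443']
```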
For DNS resolution, we primarily use Cloudflare's DNS resolver 1.1.1.1, with 8.8.8.8 as a secondary option.
During the Cloudflare outage, 1.1.1.1 randomly returned SERVFAIL DNS errors. Ideally, our system should have detected this and switched to 8.8.8.8. However, the errors were not detected, causing 2 of our 4 load balancers to be configured with an empty list of routers.
Each load balancer is assigned to one of our public IP addresses, meaning that 50% of the incoming traffic in the osc-secnum-fr1 region was no longer processed. Due to the configuration of our monitoring system at the time, we did not receive any alerts about this event.
As a quick remedial solution, we made 8.8.8.8 our main DNS resolver for the service discovery process, which brought the platform back to full operating capacity. Later that day, we deployed a fix that prevents updating the router list on a load balancer when service discovery returns an empty list. We are also taking steps to detect such incidents faster and to increase our resilience to DNS resolution disruptions.
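The deployed fix follows a simple principle: a discovery result that comes back empty is more likely a resolution failure than an actually empty platform, so the previous upstream list should be kept. A minimal sketch with hypothetical names:

```python
def update_upstreams(current, discovered):
    """Replace the upstream list only when discovery returned something.
    An empty answer is treated as a resolution failure, not as
    'the platform has no routers'."""
    if not discovered:
        return current  # keep serving with the last known-good list
    return discovered

routers = ["10.1.0.1:443", "10.1.0.2:443"]  # hypothetical router addresses
routers = update_upstreams(routers, [])      # a SERVFAIL yielded an empty answer
print(routers)
# → ['10.1.0.1:443', '10.1.0.2:443']
```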
Timeline of the incident
All times given are in CEST (UTC+2)
| Time | Event |
| --- | --- |
| 09:24 | Alerts are raised by our monitoring stack. They concern only internal mechanisms with no impact on our customers. |
| 10:00 | We identify that those alerts are related to DNS resolution issues, linked to an ongoing incident at Cloudflare. The platform remains stable, with no user impact identified. |
| 11:30 | Our operators observe that the DNS resolution problems on 1.1.1.1 persist. |
| 11:50 | First customers reach support, telling us that they have had random difficulties reaching their applications since 11:40. We wrongly assume that these customers are affected by the Cloudflare problem and that the reported issues may come from their own DNS configuration using resolver 1.1.1.1. |
| 11:59 | We declare the incident on our status page, indicating that Cloudflare is experiencing an outage and that customers using address 1.1.1.1 as their DNS resolver may be impacted. |
| 12:00 | We identify a drop in the number of requests per minute processed on osc-secnum-fr1. |
| 12:06 | We detect that we are indeed impacted by a problem affecting our front-end load balancers: on osc-secnum-fr1, 50% of incoming traffic was not redirected to applications, generating errors on the browser side; on osc-fr1, no impact for our customers. |
| 12:17 | An emergency fix is deployed to prioritize the use of 8.8.8.8. |
| 12:19 | Traffic processed on osc-secnum-fr1 is back to normal. |
| 12:36 | Cloudflare reports on its status page that a fix has been deployed on their side and that DNS resolver 1.1.1.1 is recovering. |
| 12:48 | We update our status page to indicate that the incident concerning DNS resolver 1.1.1.1 is resolved. |
When we performed the incident review, we noticed that our communication through our status page was incomplete and misleading: we did not mention that the platform itself was also impacted by the Cloudflare incident. This was a missed step in our incident process, and we are improving our tooling to prevent such miscommunication from happening again.
Immediate Actions Taken
- Emergency workaround: set 8.8.8.8 as our main DNS resolver for the service discovery process instead of 1.1.1.1. Before the incident, 1.1.1.1 had first priority, while 8.8.8.8 had secondary priority.
- A fix was deployed on October 4 at 15:00 to prevent updating the router list on a load balancer when service discovery returns an empty list.
Actions in progress
- Improve supervision to identify any front-end server failures more quickly.
- Improve our incident management tooling to prevent future miscommunication.
- Improve our DNS failover techniques to better handle SERVFAIL answers.
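Resilience to SERVFAIL answers essentially means trying the next resolver instead of accepting the error. A minimal sketch of such a failover loop, using simulated resolvers rather than real DNS queries (all names here are illustrative):

```python
class ServFail(Exception):
    """Raised when a resolver answers SERVFAIL."""

def resolve_with_failover(name, resolvers):
    """Query each resolver in priority order; fall back on SERVFAIL."""
    for resolver in resolvers:
        try:
            return resolver(name)
        except ServFail:
            continue  # this resolver is unhealthy, try the next one
    raise ServFail(f"all resolvers returned SERVFAIL for {name}")

def primary(name):    # stands in for a failing resolver, as 1.1.1.1 was
    raise ServFail(name)

def secondary(name):  # stands in for a healthy secondary resolver
    return ["router-1.internal:443"]

print(resolve_with_failover("_routers._tcp.example", [primary, secondary]))
# → ['router-1.internal:443']
```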
This incident may have caused up to 50 minutes of interruption for your applications, which is beyond the range of our 99.9% monthly SLA for applications using 2 containers or more.
Impacted customers are invited to contact us through support to request compensation.
2023-10-20: Initial version