SPM-2023-001 - Partial unavailability of hosted applications on osc-secnum-fr1

TL;DR

On Wednesday, October 4th, 2023, a disruption affected Cloudflare’s public DNS resolver 1.1.1.1. This had a side effect on the Scalingo platform in the osc-secnum-fr1 region: 2 of our 4 public IP addresses stopped accepting requests for a period of 50 minutes.

While this incident affected network access to your apps, the integrity and security of your data and applications were maintained. Access to databases directly accessible on the Internet was not affected.

What happened?

Root Cause: DNS errors were not detected while updating the load balancers’ upstream server list.

Our platform uses load balancer servers to redirect traffic to backend servers, which we refer to as routers. These load balancers retrieve the available router servers through a service discovery process.

This process is based on the DNS protocol: we use an SRV record to store the list of currently active routers on the platform.
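
To illustrate what such a lookup returns, here is a minimal Go sketch of an SRV-based service discovery query. The record name used here (_router._tcp.example.internal) is a placeholder, not our actual record.

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    func main() {
        // Placeholder SRV name: _router._tcp.example.internal (not our real record).
        // Each SRV answer carries a target host, a port, a priority and a weight,
        // which together describe one currently active router.
        _, records, err := net.LookupSRV("router", "tcp", "example.internal")
        if err != nil {
            log.Fatalf("service discovery lookup failed: %v", err)
        }
        for _, r := range records {
            fmt.Printf("router %s:%d (priority=%d, weight=%d)\n", r.Target, r.Port, r.Priority, r.Weight)
        }
    }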

For DNS resolution, we primarily use the DNS resolver 1.1.1.1, with 8.8.8.8 as a secondary option.

During the Cloudflare outage, 1.1.1.1 intermittently returned SERVFAIL DNS errors. Ideally, our system should have detected this and switched to using 8.8.8.8. However, this didn’t happen: the errors went undetected, causing 2 of our 4 load balancers to be configured with an empty list of routers.
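
For illustration, the intended behaviour can be sketched as follows in Go, reusing the same placeholder SRV name: a resolution error from the primary resolver, such as a SERVFAIL answer, triggers a retry against the secondary resolver instead of silently producing an empty router list. The helper names are assumptions, not our real implementation.

    package main

    import (
        "context"
        "fmt"
        "log"
        "net"
        "time"
    )

    // lookupRouters queries the placeholder router SRV record through one
    // specific resolver address instead of the system default.
    func lookupRouters(ctx context.Context, resolverAddr string) ([]*net.SRV, error) {
        r := &net.Resolver{
            PreferGo: true,
            Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
                d := net.Dialer{Timeout: 2 * time.Second}
                return d.DialContext(ctx, network, resolverAddr)
            },
        }
        _, records, err := r.LookupSRV(ctx, "router", "tcp", "example.internal")
        return records, err
    }

    // discoverRouters tries each resolver in priority order and only accepts a
    // successful, non-empty answer: a SERVFAIL from the primary leads to a
    // retry on the secondary rather than an empty upstream list.
    func discoverRouters(ctx context.Context) ([]*net.SRV, error) {
        resolvers := []string{"1.1.1.1:53", "8.8.8.8:53"} // primary, then secondary
        var lastErr error
        for _, addr := range resolvers {
            records, err := lookupRouters(ctx, addr)
            if err != nil {
                lastErr = fmt.Errorf("resolver %s: %w", addr, err)
                continue
            }
            if len(records) == 0 {
                lastErr = fmt.Errorf("resolver %s returned an empty router list", addr)
                continue
            }
            return records, nil
        }
        return nil, fmt.Errorf("service discovery failed on all resolvers: %w", lastErr)
    }

    func main() {
        records, err := discoverRouters(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        for _, r := range records {
            fmt.Printf("router %s:%d\n", r.Target, r.Port)
        }
    }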

Each load balancer is assigned one of our public IP addresses, meaning that 50% of the incoming traffic in the osc-secnum-fr1 region was no longer processed. Due to the configuration of our monitoring system at the time, we did not receive any alerts about this event.

As a quick remedial solution, we made 8.8.8.8 our main DNS resolver for the service discovery process, which brought the platform back to full operating capacity. Later that day, we deployed a fix to prevent updating the router list on a load balancer when service discovery returns an empty list. We are also taking steps to detect incidents faster and to increase our resilience to DNS resolution disruptions.
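
As an indication of what such a safeguard can look like, here is a minimal Go sketch with a hypothetical LoadBalancer type: an empty discovery result leaves the last known good upstream list in place instead of being applied.

    package main

    import "log"

    // LoadBalancer is a hypothetical type standing in for the real load
    // balancer configuration; only the upstream router list matters here.
    type LoadBalancer struct {
        routers []string
    }

    // applyRouterList refuses to replace a working upstream list with an
    // empty one: an empty discovery result (e.g. after a SERVFAIL) keeps the
    // last known good configuration in place.
    func (lb *LoadBalancer) applyRouterList(discovered []string) {
        if len(discovered) == 0 {
            log.Println("service discovery returned no router; keeping previous list")
            return
        }
        lb.routers = discovered
        // a reload of the load balancer configuration would normally follow here
    }

    func main() {
        lb := &LoadBalancer{routers: []string{"10.0.0.1:80", "10.0.0.2:80"}}
        lb.applyRouterList(nil) // empty result: previous routers are preserved
        log.Printf("active routers: %v", lb.routers)
    }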

Timeline of the incident

All times given are in CEST (UTC+2)

09:24 Alerts are raised by our monitoring stack. These alerts concern only internal mechanisms and have no impact on our customers.
10:00 We identify that those alerts are related to DNS resolution issues, linked to an ongoing incident at Cloudflare. The platform remains stable, with no user impact identified.
11:27 On osc-secnum-fr1, 2 of our 4 load balancers are no longer operating correctly: incoming traffic on these 2 servers, which each hold one of our public IP addresses, is no longer redirected to our backend servers (routers).
11:30 Our operators observe that the DNS resolution problems on 1.1.1.1 have become severe, with 2 out of 3 requests returning a SERVFAIL DNS error.
11:50 The first customers reach support, telling us that they have been experiencing intermittent difficulties reaching their applications since 11:40. We wrongly assume that these customers are affected by the Cloudflare problem and that the reported issues come from their own DNS configuration using resolver 1.1.1.1.
11:59 We declare the incident on our status page, indicating that Cloudflare is experiencing an outage and that customers using address 1.1.1.1 as DNS resolver may be affected.
12:00 We identify a drop in the number of requests per minute processed on osc-secnum-fr1 and wrongly assume that it may be linked to customers using resolver 1.1.1.1. However, the drop is very sudden, so we decide to investigate further to make sure there isn’t a more serious problem on our side.
12:06 We detect that we are indeed impacted by a problem affecting our front-end load balancers on osc-secnum-fr1. The affected load balancers have lost their list of available backend servers, and traffic is no longer being processed on these 2 servers, which host 2 of the 4 public IPs available in this region.
12:17 An emergency fix is deployed to prioritize the use of 8.8.8.8 as a DNS resolver instead of 1.1.1.1.
12:19 Traffic processed on osc-secnum-fr1 returns to its nominal baseline.
12:36 Cloudflare reports on its status page that a fix has been deployed on their side and that DNS resolver 1.1.1.1 is fully functional again.
12:48 We update our status page to indicate that the incident concerning DNS resolver 1.1.1.1 has been resolved by Cloudflare.

Impact

On osc-secnum-fr1, 50% of incoming traffic was not redirected to applications, resulting in errors on the client side.

On osc-fr1, there was no impact for our customers.

Communication

During the incident review, we noticed that our communication through our status page was incomplete and misleading, as we did not mention that the platform itself was also impacted by the Cloudflare incident. This was a missed step in our incident process. We are improving our incident tooling to prevent such miscommunication from happening again.

Immediate Actions Taken

  • Emergency workaround: we set 8.8.8.8 as our main DNS resolver for the service discovery process instead of 1.1.1.1. Before the incident, 1.1.1.1 had first priority and 8.8.8.8 was the fallback.
  • A fix was deployed on October 4 at 15:00 to prevent updating the router list on a load balancer when service discovery returns an empty list.

Actions in progress

  • Improve monitoring to identify front-end server failures more quickly.
  • Improve our incident management tooling to prevent future miscommunication.
  • Improve our DNS failover techniques to better handle SERVFAIL answers.

Financial compensation

This incident may have caused up to 50 minutes of interruption for your applications, which exceeds our 99.9% monthly SLA for applications running 2 containers or more.

Impacted customers are invited to contact our support team to request compensation.

Changelog

2023-10-20: Initial version

