SPM-2025-003 Partial Outage in osc-fr1 due to DDoS Attack

TL;DR

On Saturday, August 2, 2025, starting at 19:42 CEST, a large and unexpected DDoS attack targeted one of our public IP addresses serving the web hosting layer of our PaaS platform in the osc-fr1 region.

This led to significantly degraded accessibility for 33 minutes, with full recovery achieved within 50 minutes.

Mitigations and playbooks developed from previous incidents allowed us to respond quickly, isolate the attack, and maintain overall platform stability. They also enabled us to more precisely identify the specific type of attack after the fact. Based on this analysis, we are planning several improvements (outlined below) to better handle this attack pattern in the future.

While network access to some applications and databases was impacted, there was no compromise to the integrity or confidentiality of customer data.

What happened?

On August 2, 2025 at 19:40 CEST, we observed a sudden and severe network degradation in the osc-fr1 region, with up to 70% packet loss. This caused widespread unavailability of customer applications hosted on the platform. Many of our internal probes, which continuously monitor the health of the platform, began triggering alerts in rapid succession. This led to the immediate intervention of an on-call operator.
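
For illustration, a minimal sketch of this kind of HTTP health probe might look as follows; the endpoints, failure threshold, and alerting hook are hypothetical placeholders, not our actual probing setup.

    # Minimal sketch of an HTTP health probe. The endpoint list, threshold and
    # alerting hook are hypothetical placeholders, not the platform's real probes.
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://frontend-probe.example.com/health",  # hypothetical probe targets
        "https://database-probe.example.com/health",
    ]
    FAILURE_THRESHOLD = 0.5  # alert when more than half of the probes fail

    def probe(url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers with an HTTP 2xx within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return 200 <= response.status < 300
        except (urllib.error.URLError, OSError):
            return False

    def run_probes() -> None:
        failures = sum(1 for url in ENDPOINTS if not probe(url))
        if failures / len(ENDPOINTS) > FAILURE_THRESHOLD:
            # In a real setup this would page the on-call operator.
            print(f"ALERT: {failures}/{len(ENDPOINTS)} probes failing")

    if __name__ == "__main__":
        run_probes()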

A DDoS attack was suspected, but the detection tools we had previously implemented for SYN flood scenarios did not allow us to clearly identify the source IPs involved in this case. We will come back to this point later in the analysis.

As we continued our investigation and focused on the load balancer servers responsible for handling inbound TCP traffic to web applications hosted on the platform, we identified one node exhibiting abnormally high (though not critical) CPU usage. At 20:15, we decided to detach the public IP address associated with this node (148.253.75.120). This action led to an immediate and significant improvement in platform accessibility, which led us to conclude that the attack was specifically targeting this public IP address.

At that point, a small portion of traffic (~25%) was still affected due to a routing configuration involving a now-detached ingress IP. Between 20:15 and 20:28, platform checks were performed. The public IP 148.253.75.120 was reattached at 20:28 without further incident, indicating that the attack had ceased. Packet loss dropped back to 0%, and all alerts resolved.

Once access was fully restored, we reviewed the top traffic sources on the impacted node. A range of IP addresses from an IoT network provider stood out due to a high volume of incoming TCP SYN packets. It was deemed suspicious and subsequently blocked at 21:37. The change took up to one hour to propagate across the platform.
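
For illustration, this kind of per-source aggregation can be sketched as follows, assuming a packet capture taken on the impacted node and the scapy library; it is not the tooling we actually used.

    # Sketch: rank source IPs by the number of pure TCP SYN packets they sent,
    # from a packet capture taken on the impacted node. The capture file name is
    # a placeholder and this is an illustration, not our actual analysis tooling.
    from collections import Counter
    from scapy.all import IP, TCP, rdpcap

    SYN = 0x02
    ACK = 0x10

    def top_syn_sources(pcap_path: str, limit: int = 10) -> list[tuple[str, int]]:
        """Count pure SYN packets (SYN set, ACK clear) per source IP."""
        counts: Counter[str] = Counter()
        for pkt in rdpcap(pcap_path):
            if IP in pkt and TCP in pkt:
                flags = int(pkt[TCP].flags)
                if flags & SYN and not flags & ACK:
                    counts[pkt[IP].src] += 1
        return counts.most_common(limit)

    if __name__ == "__main__":
        for src, count in top_syn_sources("ingress-node.pcap"):
            print(f"{src:15s} {count}")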

Meanwhile, post-incident recovery actions were underway. We performed a thorough check of the entire platform and all our monitoring tools to ensure that all systems had returned to a healthy state and that no residual issues remained. Some applications required manual restarts to recover from persistent HTTP 5xx errors, as identified through internal monitoring.
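
As an illustration of that detection step, the sketch below tallies HTTP 5xx responses per application from access-log lines; the log format, application names, and threshold are hypothetical and do not reflect our internal monitoring implementation.

    # Sketch: flag applications that kept returning HTTP 5xx responses after the
    # incident. The log format ("<app-name> <status> ...") and the threshold are
    # hypothetical; the real detection relies on our internal monitoring.
    from collections import Counter

    ERROR_THRESHOLD = 100  # hypothetical number of 5xx responses since the incident

    def apps_to_restart(log_lines: list[str]) -> list[str]:
        """Return the applications whose 5xx count exceeds the threshold."""
        errors: Counter[str] = Counter()
        for line in log_lines:
            parts = line.split()
            if len(parts) >= 2 and parts[1].startswith("5"):
                errors[parts[0]] += 1
        return [app for app, count in errors.items() if count >= ERROR_THRESHOLD]

    if __name__ == "__main__":
        sample = ["my-app 502 GET /", "my-app 503 GET /health", "other-app 200 GET /"]
        print(apps_to_restart(sample))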

Post-Incident Analysis

Follow-up analysis on August 4 confirmed that this wasn’t a typical SYN flood, but rather an ACK-PSH flood, a type of DDoS attack we hadn’t encountered before. The packets didn’t match the usual patterns we monitor for: they lacked the session flags we expected and weren’t showing up in our dashboards because of how our logs were parsed. As a result, identifying the source of the attack was harder than usual, and some legitimate traffic was mistakenly flagged as suspicious. Specifically, the range of IP addresses from an IoT network provider, which had been blocked because of its high number of initiated TCP sessions (TCP SYN packets), was later determined to carry legitimate application traffic and was unblocked at 12:00 CEST the same day. Our evaluation criteria for identifying DDoS sources had been biased by our experience with previous attacks.
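
To make the distinction concrete, the sketch below tallies TCP packets by flag pattern, separating pure SYN segments (what a classic SYN flood produces) from ACK-PSH segments (what this attack produced). It assumes a packet capture and the scapy library; it illustrates the flag patterns only and is not our detection pipeline.

    # Sketch: distinguish a classic SYN flood from an ACK-PSH flood by tallying
    # TCP flag combinations found in a capture. The capture file is a placeholder;
    # this is an illustration of the flag patterns, not our detection pipeline.
    from collections import Counter
    from scapy.all import TCP, rdpcap

    SYN, PSH, ACK = 0x02, 0x08, 0x10

    def classify_flags(pcap_path: str) -> Counter:
        """Tally TCP packets per flag pattern of interest."""
        patterns: Counter[str] = Counter()
        for pkt in rdpcap(pcap_path):
            if TCP not in pkt:
                continue
            flags = int(pkt[TCP].flags)
            if flags & SYN and not flags & ACK:
                patterns["SYN only (connection attempts, classic SYN flood)"] += 1
            elif flags & ACK and flags & PSH and not flags & SYN:
                patterns["ACK+PSH (mid-session-looking segments, ACK-PSH flood)"] += 1
            else:
                patterns["other"] += 1
        return patterns

    if __name__ == "__main__":
        for pattern, count in classify_flags("ingress-node.pcap").most_common():
            print(f"{count:10d}  {pattern}")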

That said, the monitoring work we had done after previous DDoS incidents still paid off. Even though our tools weren’t initially designed to catch this kind of traffic, they gave us enough visibility to eventually piece things together and understand what was going on. This incident showed us that we’re on the right track, but also that our detection needs to evolve to cover a wider range of attack types.

We are planning several follow-up actions (detailed below) to improve detection and mitigation of similar attack patterns in the future.

Timeline of the incident

All times given are in CEST (UTC+2)

Saturday, August 2, 19:42 Start of the attack.
Saturday, August 2, 19:47 Multiple alerts are triggered by our monitoring probes, indicating a major platform disruption.
Saturday, August 2, 19:50 First responder manually escalates to on-call to request additional support.
Saturday, August 2, 20:15 Public IP address 148.253.75.120 is detached from the platform.
Saturday, August 2, 20:16 Platform stability is restored.
Saturday, August 2, 20:30 Public IP address 148.253.75.120 is re-attached to the platform.
Saturday, August 2, 20:31 The platform is fully operational again; HTTP requests are processed correctly.
Saturday, August 2, 20:47 A range of IP addresses from an IoT network provider is identified as emitting a high volume of TCP SYN packets.
Saturday, August 2, 20:59 Applications generating an unusually high number of HTTP 500 errors after the incident are identified and restarted.
Saturday, August 2, 21:30 Inbound traffic from the previously identified IP range is blocked on suspicion that it is part of the attack.
Monday, August 4, 12:00 The IP range previously identified as suspicious is unblocked after confirmation that it was not the source of the attack.

Impact

  • On osc-secnum-fr1, there was no impact for our customers.

  • On osc-fr1, access to applications was severely impacted for 33 minutes (from 19:42 to 20:15).

Immediate Actions Taken

  • The public IP address 148.253.75.120 targeted by the attack was detached from the platform in order to preserve the overall stability of the system.
  • Applications that had been generating an unusually high number of HTTP 500 errors since the incident were restarted.

Actions in progress

  • A redesign of the platform’s overall architecture is underway to achieve greater resilience against this type of incident, with a more segmented and distributed approach. This project has been in progress for several months and continues as part of our long-term infrastructure improvement efforts.
  • Evaluation of internal solutions and configurations that can be deployed within our infrastructure to strengthen protection against large-scale DDoS attacks.
  • Improvement of our monitoring tools to gather more metrics on ingress network traffic, including statistics on the number of TCP RST packets we send, as well as sampling of ICMP traffic, which can also be used as a vector for DDoS attacks (see the sketch after this list).
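
As an example of one such metric source, a Linux host exposes the cumulative number of TCP RST segments it has sent as the OutRsts counter in /proc/net/snmp (ICMP counters live in the same file). The sketch below reads that counter; it is an illustration, not the final monitoring implementation.

    # Sketch: read the cumulative number of TCP RST segments sent by this host
    # from the OutRsts counter in /proc/net/snmp. Illustrative only; not the
    # final monitoring implementation.
    def tcp_out_rsts(snmp_path: str = "/proc/net/snmp") -> int:
        """Return the cumulative number of TCP RST segments sent by this host."""
        header: list[str] = []
        with open(snmp_path) as snmp:
            for line in snmp:
                if not line.startswith("Tcp:"):
                    continue
                fields = line.split()[1:]
                if not header:
                    header = fields  # the first "Tcp:" line lists the field names
                else:
                    return int(dict(zip(header, fields))["OutRsts"])
        raise RuntimeError(f"Tcp counters not found in {snmp_path}")

    if __name__ == "__main__":
        print("TCP RST segments sent:", tcp_out_rsts())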

Changelog

2025-08-08: Initial version

