SPM-2025-002 Service Interruption

TL;DR

On April 7th, 2025, the osc-fr1 region was unavailable from 14:05 to 19:50 UTC due to a network issue in our infrastructure provider’s eu-west-2 region. A virtual machine triggered a multicast traffic flood, saturating internal network interfaces and making ~50% of the VMs of that particular region unreachable.

The osc-secnum-fr1 region was unaffected. The Disaster Recovery Plan was not triggered, as the issue was identified as recoverable and data-safe.

While this incident affected network access to your application and databases, the integrity and confidentiality of your data and applications were maintained.

Our provider has implemented immediate mitigations and plans a full fix by the end of June 2025. Scalingo is also strengthening its architecture with better fault isolation across multiple Availability Zones.

If you have any questions or would like to discuss this further, please don’t hesitate to contact us.

What happened?

The incident originated from a network issue in the eu-west-2 region of our infrastructure provider. A virtual machine, using a specific broadcast protocol, triggered a traffic flood that was amplified by the VXLAN multicast mechanism. This caused an excessive volume of multicast packets to be distributed across multiple Data Centers, saturating internal physical interfaces on several hypervisors.

As a result, virtual machines hosted on the affected hypervisors became unreachable. This led to nearly 50% of VMs in the three Availability Zones experiencing severe connectivity issues, effectively making Scalingo osc-fr1 region inaccessible during the incident window.

What is our infrastructure provider doing to prevent recurrence?

Our provider is taking the following actions:

Immediate traffic storm limitations have been enforced at the router level to reduce the risk of similar issues.
Tina OS is the cloud orchestrator used to manage Outscale infrastructure. With the new version of Tina OS (planned for June 2025), multicast will be limited on virtual switches. A complete implementation will follow the planned **Fabric network upgrade, currently in progress.

What is Scalingo doing to improve resilience?

We are actively redesigning Scalingo’s infrastructure to take advantage of a more robust architecture, based on multiple VPCs distributed across different Availability Zones (AZs). These efforts aim to better isolate failures and improve fault tolerance.

That said, despite these improvements, we remain partially exposed to complete failures in a given infrastructure provider region. Our goal is to continuously reduce this risk.

Why was osc-secnum-fr1 not affected?

The osc-secnum-fr1 region is deployed in a different provider region (cloudgouv-eu-west), which was not impacted by this network issue. This architectural separation helped to isolate the incident.

Why was the Disaster Recovery Plan (DRP) not triggered?

Activating the DRP is a serious decision with significant operational implications. It is only initiated when we are certain that the situation is irrecoverable. In this case, we quickly identified that the issue was limited to network connectivity and that data integrity was not at risk.

Changelog

2025-06-26 : Initial version

Last update: 26 Jun 2025 Suggest edits