SPM-2024-01 - Major network degradation on osc-fr1

TL;DR

On January 18, from 11:40 to 15:38 (CET), the platform was impacted by a network disruption that severely affected access to applications hosted on the osc-fr1 region. This incident was traced back to our infrastructure provider. The issue arose during a maintenance task, when a bug in a network device triggered a degradation of network performance and packet loss across the eu-west-2 Outscale region.

While this incident affected network access to your applications, the integrity and confidentiality of your data and applications were maintained.

What happened?

On January 18th at 11:30 (CET), Outscale (Scalingo’s infrastructure provider since July 2019) started maintenance on their eu-west-2 region (the region that hosts our osc-fr1 region). This maintenance consisted of a data transfer inside Outscale’s infrastructure.

Unfortunately, one of the pieces of network equipment used for this maintenance started to emit malformed frames on their internal network. Those frames were then picked up by the network equipment running their eu-west-2 region, which resulted in packet loss on Scalingo’s internal network.

The packet loss induced by this maintenance first appeared on Scalingo’s network on Jan 18 at 11:40. The disruption continued to grow in severity until Jan 18 at 11:55, when the situation became critical. At that time the first alerts were sent, and Scalingo’s operators were paged on Jan 18 at 12:10.

At first, the incident response team searched for an issue inside our own network. However, on Jan 18 at 12:16 it became apparent that the issue was not on Scalingo’s side, and Outscale support was contacted at 12:24.

On Jan 18 at 14:15, the Outscale team formulated an initial diagnosis. However, due to the nature of the operation, it was not possible to interrupt the maintenance. Outscale therefore quickly made configuration changes meant to contain the network flow induced by the maintenance. Those changes caused network performance to fluctuate between 14:40 and 14:55, followed by a progressive improvement from Jan 18 at 14:55 until Jan 19 at 10:30, when the situation was back to normal.

On Scalingo’s side, service was back to normal on Jan 18 at 15:38, when the packet loss within our internal network reached an acceptable level, allowing us to deliver stable service to our customers.

Outscale’s investigation showed that their network equipment started to misbehave because of a bug in the equipment’s software. The equipment vendor has been contacted and has confirmed the bug.

Here’s a recap of the incident as seen by our monitoring:

[Figure: Packet loss on the osc-fr1 region (Jan 18 2024)]

Timeline of the incident

All times given are in CET (UTC+1)

Jan 18 11:30	Outscale starts maintenance on the `eu-west-2` region.
Jan 18 11:55	Alerts about network problems on the `osc-fr1` region are triggered on Scalingo’s side. Our operators start to investigate.
Jan 18 12:10	On-call alerts are triggered; the entire `osc-fr1` region is affected by this problem.
Jan 18 12:12	Scalingo activates a dedicated incident response team.
Jan 18 12:16	Our investigations show that the problem is related to the network infrastructure. Our operators contact our infrastructure provider, Outscale.
Jan 18 13:15	Our response team contacts our technical account manager at Outscale to get further information.
Jan 18 14:15	Outscale implements configuration changes to help manage traffic flow.
Jan 18 14:40 - 14:55	We observe successive deteriorations and improvements of network performance.
Jan 18 15:06	Outscale informs us that changes have been made.
Jan 18 15:18	We observe some improvements; however, we have not yet returned to a nominal situation.
Jan 18 15:38	The network loss rate is back to an acceptable level and is no longer perceptible to our customers. Our support team confirms that customers are also reporting that the situation is back to normal.
Jan 19 10:30	Performance level has returned to its normal baseline. The incident is considered resolved.
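
The key intervals of this timeline can be derived with simple date arithmetic. The following is a minimal sketch in Python: the timestamps come from this report (the 11:40 impact start is taken from the summary rather than from the table), while the dictionary keys and the helper name are illustrative choices, not part of any Scalingo tooling.

```python
from datetime import datetime

# Timestamps from the report, all in CET. The keys are illustrative labels.
FMT = "%Y-%m-%d %H:%M"
events = {
    "impact_start": "2024-01-18 11:40",
    "first_alerts": "2024-01-18 11:55",
    "on_call_paged": "2024-01-18 12:10",
    "service_stable": "2024-01-18 15:38",
    "fully_resolved": "2024-01-19 10:30",
}


def minutes_between(start_key: str, end_key: str) -> int:
    """Return the whole number of minutes between two named events."""
    start = datetime.strptime(events[start_key], FMT)
    end = datetime.strptime(events[end_key], FMT)
    return int((end - start).total_seconds() // 60)


time_to_first_alert = minutes_between("impact_start", "first_alerts")
time_to_page = minutes_between("impact_start", "on_call_paged")
customer_impact = minutes_between("impact_start", "service_stable")
time_to_baseline = minutes_between("impact_start", "fully_resolved")

print(f"Impact start to first alerts: {time_to_first_alert} min")
print(f"Impact start to on-call page: {time_to_page} min")
print(f"Impact start to stable service: {customer_impact} min")
print(f"Impact start to normal baseline: {time_to_baseline} min")
```

Under these timestamps, the first alerts fired 15 minutes after the packet loss began, and the on-call page followed 15 minutes later.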

Impact

On osc-secnum-fr1: No impact

On osc-fr1: Major network degradation that prevented customers and their clients from accessing their applications.

Actions in progress

On Outscale’s side:
- The bug on the network equipment that triggered the incident has been reported to the equipment vendor. The vendor confirmed the bug’s existence.
- The procedures used to plan and test such maintenance will be updated.
- Alternative ways to perform data transfers that do not involve this network equipment are being considered.
On Scalingo’s side:
- Improve our alerting rules to detect such incidents earlier. The aim is to improve reactivity if a similar situation arises in the future.
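
As an illustration of what an earlier-firing alerting rule could look like, here is a minimal sketch in Python of a sustained packet-loss check. It is not Scalingo's actual alerting configuration: the class name, the 2% threshold, and the three-sample window are all hypothetical values chosen for the example.

```python
from collections import deque


class PacketLossAlert:
    """Fire when packet loss stays at or above a threshold for N samples.

    Hypothetical rule: the threshold and window size are assumptions
    made for this example, not values taken from the report.
    """

    def __init__(self, threshold_pct: float = 2.0, window: int = 3):
        self.threshold_pct = threshold_pct
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def observe(self, loss_pct: float) -> bool:
        """Record one measurement and return True if the alert should fire."""
        self.samples.append(loss_pct)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s >= self.threshold_pct for s in self.samples)


rule = PacketLossAlert(threshold_pct=2.0, window=3)
readings = [0.1, 2.5, 3.0, 4.2, 0.3]  # made-up packet loss percentages
fired = [rule.observe(r) for r in readings]
print(fired)
```

Requiring several consecutive samples above the threshold trades a short detection delay for fewer false positives on transient spikes. A lower threshold or a shorter window detects degradation sooner at the cost of noisier alerts.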

Financial compensation

This incident may have caused up to 3 hours and 28 minutes of network degradation on your databases, which exceeds the downtime allowed by our 99.9% monthly SLA for applications using 2 containers or more.
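
To put these figures in perspective, the downtime budget implied by a 99.9% monthly target can be worked out as below. This is a rough sketch in Python under two assumptions that are not stated in this report: that the SLA is measured over the calendar month (31 days for January), and that the whole degradation period counts as unavailability. The actual contractual calculation may differ.

```python
# Assumption: the SLA window is the calendar month of January (31 days).
DAYS_IN_JANUARY = 31
SLA_TARGET = 0.999  # 99.9% monthly SLA, from the report

minutes_in_month = DAYS_IN_JANUARY * 24 * 60
allowed_downtime_min = minutes_in_month * (1 - SLA_TARGET)

# Up to 3 hours and 28 minutes of degradation, from the report.
degradation_min = 3 * 60 + 28

# Assumption: every degraded minute is counted as fully unavailable.
worst_case_availability = 1 - degradation_min / minutes_in_month

print(f"Minutes in month: {minutes_in_month}")
print(f"Allowed downtime at 99.9%: {allowed_downtime_min:.2f} min")
print(f"Degradation: {degradation_min} min")
print(f"Worst-case availability: {worst_case_availability * 100:.2f}%")
```

Under these assumptions the monthly budget is about 44.64 minutes, so 208 minutes of degradation is more than four times the allowance. The 3 hours and 28 minutes figure matches the interval between the 12:10 on-call alerts and the 15:38 return to stable service.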

Changelog

2024-02-02: Initial version

Last update: 02 Feb 2024