SPM-2026-001 Performance Degradation During Copy.fail Mitigation
TL;DR
On April 30, 2026, both osc-fr1 and osc-secnum-fr1 regions experienced progressive performance degradation starting around 14:30 CEST, escalating to severe disruption from 16:15 to 18:00 CEST during the deployment of mitigations for the Copy.fail vulnerability (CVE-2026-31431). The impact was uneven across the platform: some applications experienced minor slowdowns, while others suffered extreme latency or complete unavailability during the critical phase.
The incident was caused by an overly aggressive node restart campaign that moved a large number of containers in a short time period, leading to severe load imbalance across application nodes and excessive swap usage. The integrity and confidentiality of customer data remained intact throughout the incident.
Eventually, we determined that a complete reboot of the servers was not required and updated our mitigation process accordingly. This allowed us to reduce the impact of the internal rebalancing and return to a nominal state faster.
What happened?
On April 29, 2026, the Xint team publicly disclosed a critical Linux kernel vulnerability known as “Copy.fail” (CVE-2026-31431), affecting all major Linux kernel versions. The vulnerability allowed local privilege escalation to root within containers, though no container escape path was known at the time.
Following our standard security incident response, we began assessing the vulnerability on April 30 at 10:00. By 10:15, we had confirmed the exploit, and by 10:30, we had validated a mitigation strategy: disabling and denylisting the algif_aead kernel module. This mitigation was successfully rolled out across all nodes by 13:00.
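For illustration only, here is a minimal sketch of what this kind of mitigation can look like on a single node. The file path and the script itself are examples, not our production tooling; the `blacklist` and `install ... /bin/false` directives are standard modprobe configuration.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: denylist and unload the algif_aead module on one
node. Paths and error handling are simplified; this is not our production
tooling."""
import subprocess

DENYLIST_FILE = "/etc/modprobe.d/denylist-algif-aead.conf"  # example path


def denylist_algif_aead() -> None:
    # Prevent the module from being loaded again: "blacklist" blocks
    # alias-based autoloading, "install ... /bin/false" makes explicit load
    # attempts fail.
    with open(DENYLIST_FILE, "w") as f:
        f.write("blacklist algif_aead\n")
        f.write("install algif_aead /bin/false\n")

    # Unload the module if it is currently loaded; tolerate the error if not.
    subprocess.run(["modprobe", "-r", "algif_aead"], check=False)


if __name__ == "__main__":
    denylist_algif_aead()
```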
As an additional precautionary measure, we decided to perform emergency restarts of all application hosting and deployment servers in both regions. We posted a maintenance notice on our status page and expected only minor slowdowns during the operation.
The restart campaign began at 12:30 and used our node draining process to migrate containers between nodes. This process takes every container running on a target node and moves it using our zero-downtime deployment strategy. Once a node is empty, it is restarted, and the process moves on to the next node. To speed up the operation, we ran this process on multiple nodes simultaneously.
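As a simplified illustration of this draining pattern, the sketch below models it with a toy Node structure and a stand-in scheduler; none of it is our actual orchestration code.

```python
"""Simplified model of the drain-and-restart procedure described above.
The Node structure and the stand-in scheduler are illustrative, not our
actual orchestration code."""
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    containers: list = field(default_factory=list)


def schedule(nodes, source):
    # Stand-in for the real scheduler: place on the emptiest other node.
    return min((n for n in nodes if n is not source),
               key=lambda n: len(n.containers))


def drain_and_restart(node, nodes):
    # Move each container off the node (in production this uses the
    # zero-downtime strategy: start on the target before stopping on the
    # source), then restart the now-empty node.
    while node.containers:
        container = node.containers.pop()
        schedule(nodes, node).containers.append(container)
    print(f"{node.name} drained, restarting")
    # During the incident, several nodes were drained in parallel, which
    # multiplied the number of simultaneous migrations.


if __name__ == "__main__":
    cluster = [Node(f"node-{i}", [f"app-{j}" for j in range(10)]) for i in range(4)]
    for node in cluster:
        drain_and_restart(node, cluster)
```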
Initially, the platform handled the migrations acceptably, with only the anticipated minor slowdowns. However, an unforeseen issue began to develop: our scheduler has protective mechanisms that temporarily exclude nodes from receiving new containers when too many containers are starting up at once or when container start operations fail. Additionally, the scheduler attempts to distribute containers from the same application across multiple nodes for resilience (for example, spreading 50 web containers across different nodes rather than concentrating them).
As the draining process ran on multiple nodes simultaneously, freshly restarted nodes became the preferred targets for container placement. However, the sudden influx of containers starting on these nodes triggered the scheduler protection mechanisms, causing them to be excluded. This forced the scheduler to place containers on the remaining unprocessed nodes instead. Combined with the distribution logic trying to spread workloads across available nodes, the unprocessed nodes progressively became overloaded. This created a cascading effect where each wave of migrations made the problem worse.
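A toy model of this placement logic, with invented thresholds and data structures, illustrates the feedback loop; it is not our scheduler's actual code.

```python
"""Toy model of the placement behaviour described above. Thresholds and data
structures are invented for illustration; this is not our scheduler's code."""

MAX_CONCURRENT_STARTS = 5  # invented threshold for the example


def eligible(node):
    # Protective mechanism: a node with too many containers starting at once,
    # or with recent start failures, temporarily stops receiving containers.
    return (node["starting"] <= MAX_CONCURRENT_STARTS
            and not node["recent_start_failures"])


def pick_node(nodes, app):
    # Anti-affinity: prefer nodes hosting the fewest containers of the same
    # application, then the least loaded node overall.
    candidates = [n for n in nodes if eligible(n)]
    return min(candidates,
               key=lambda n: (n["containers_per_app"].get(app, 0),
                              n["total_containers"]))


if __name__ == "__main__":
    nodes = [
        {"name": "freshly-restarted", "starting": 12, "recent_start_failures": False,
         "containers_per_app": {}, "total_containers": 12},
        {"name": "not-yet-drained", "starting": 0, "recent_start_failures": False,
         "containers_per_app": {"web": 3}, "total_containers": 60},
    ]
    # The freshly restarted node is excluded (too many simultaneous starts),
    # so the already loaded, not-yet-drained node receives yet another
    # container and the imbalance grows.
    print(pick_node(nodes, "web")["name"])  # -> not-yet-drained
```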
Around 14:30, performance began to noticeably degrade as the cumulative effect of container migrations started to create load imbalances across the cluster.
By 16:15, the situation had become critical. Through our support team, we began receiving customer complaints of major service degradation. Applications were experiencing severe latency, and many were timing out completely. At 16:18, we immediately paused the node restart campaign to stop further migrations. At 16:22, we opened a public incident on our status page and began investigating the root cause.
Our analysis revealed that multiple application nodes were under significant load and consuming more swap memory than usual. The aggressive node draining process had moved large numbers of containers between nodes in a short time window, and the container scheduler had not always selected optimal target nodes for placement. This resulted in severe load imbalance across the cluster. The incident response was further complicated by the fact that our own operational tools (APIs, dashboard, administration interfaces) were also impacted by the degradation, slowing down our ability to investigate and remediate.
We manually migrated containers between nodes to rebalance the load and reduce pressure on the most impacted nodes. At 16:51, we identified excessive swap usage as a key indicator and began migrating our administrative tools off the most impacted nodes. Performance levels gradually improved, and by 18:00 the situation had improved sufficiently in both regions to close the incident on our status page.
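For reference, the swap pressure we used as an indicator can be read directly from /proc/meminfo; a minimal check, with an illustrative threshold, might look like this.

```python
"""Minimal sketch: compute a node's swap usage from /proc/meminfo and flag it
above a threshold. The 20% threshold is illustrative, not our alerting value."""


def swap_usage_ratio(meminfo_path="/proc/meminfo"):
    fields = {}
    with open(meminfo_path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # values are in kB
    total, free = fields["SwapTotal"], fields["SwapFree"]
    return 0.0 if total == 0 else (total - free) / total


if __name__ == "__main__":
    ratio = swap_usage_ratio()
    if ratio > 0.20:
        print(f"WARNING: swap usage at {ratio:.0%}")
```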
At 19:45, we confirmed that the kernel module mitigation was sufficient on its own and that node restarts were unnecessary. We halted the restart operations at 20:04.
Why didn’t we detect it sooner?
Despite the degradation starting around 14:30, we did not immediately detect the severity of the situation. We have monitoring in place for elevated response times and 500 errors on our frontend infrastructure, but the incident did not trigger these alerts. The degradation remained below our detection thresholds, which are intentionally set high to filter out noise from sources like load tests and poorly optimized applications. Similarly, while we monitor CPU and memory usage on nodes, these metrics remained within acceptable ranges despite the developing load imbalance.
Load average alerts did trigger, but we initially treated these as expected behavior given the ongoing maintenance operation. We had anticipated higher than normal load averages during the node restart campaign. The progressive nature of the degradation, combined with the expectation of some performance impact, made it difficult to recognize that a critical situation was developing until customer complaints began arriving in significant numbers.
Timeline of the incident
All times given are in CEST (UTC+2)
| Time | Event |
| --- | --- |
| April 30, 10:00 | Security assessment begins. |
| April 30, 10:15 | Exploit confirmed on Scalingo infrastructure. |
| April 30, 10:30 | Kernel module mitigation strategy validated. |
| April 30, 11:30 | Mitigation rollout begins (disabling algif_aead module). |
| April 30, 11:40 | Maintenance notice posted on status page. |
| April 30, 12:30 | Emergency node restart campaign begins as additional precaution. |
| April 30, 13:00 | Mitigation rollout completed across all nodes. |
| April 30, ~14:30 | Performance degradation becomes noticeable as load imbalances develop. |
| April 30, ~16:15 | Situation becomes critical. Multiple customer complaints received. |
| April 30, 16:18 | Node restart campaign paused immediately. |
| April 30, 16:22 | Public incident opened on status page. |
| April 30, 16:51 | Excessive swap usage identified. Administrative tools migrated to less impacted nodes. |
| April 30, 18:00 | Incident closed on status page. Platform performance restored. |
| April 30, 19:45 | Confirmed that kernel module mitigation alone is sufficient; node restarts were unnecessary. |
| April 30, 20:04 | Node restart operations permanently halted. |
Impact
- On osc-fr1 and osc-secnum-fr1: Performance degradation starting around 14:30 CEST, escalating to severe service disruption from 16:15 CEST. Applications experienced extreme latency, timeouts, and in many cases complete unavailability during this peak period. Performance progressively returned to nominal levels by 18:00 CEST.
Immediate Actions Taken
- Immediately paused the node restart campaign at 16:18 to stop further migrations.
- Manually migrated containers between nodes to rebalance cluster load.
- At 16:51, identified excessive swap usage and migrated administrative tools out of the most impacted nodes.
- Permanently halted node restart operations at 20:04 once we confirmed they were unnecessary.
- Deployed new monitoring alerts to detect when our infrastructure nodes are under excessive load, allowing us to identify similar situations much earlier.
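As an illustration of the kind of signal these new alerts combine, the sketch below flags a node whose load average per CPU or swap usage crosses a threshold; the thresholds are invented for the example and are not our alerting values.

```python
"""Illustrative check combining load average and swap pressure to decide
whether a node is under excessive load. Thresholds are invented for the
example, not our alerting values."""
import os


def node_under_excessive_load(swap_ratio, load_per_cpu_threshold=2.0,
                              swap_ratio_threshold=0.20):
    # Normalise the 5-minute load average by the CPU count so the same
    # threshold applies to nodes of different sizes.
    _, load_5m, _ = os.getloadavg()
    load_per_cpu = load_5m / (os.cpu_count() or 1)
    return load_per_cpu > load_per_cpu_threshold or swap_ratio > swap_ratio_threshold


if __name__ == "__main__":
    # swap_ratio would come from /proc/meminfo, as in the earlier sketch.
    print(node_under_excessive_load(swap_ratio=0.35))
```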
Actions in Progress
- A redesign of the platform’s overall architecture is underway to achieve greater resilience, better scalability, and improved fault isolation. As part of this broader effort, we are moving our operational tools (APIs, dashboard, administration interfaces) to dedicated infrastructure separate from customer workloads. This will ensure that when platform issues occur, our ability to investigate and resolve them is not hindered. This project has been in progress for several months and continues as part of our long-term infrastructure improvement efforts.
- We will implement better visibility into server memory and swap usage to catch resource exhaustion issues before they become critical.
- We will develop automated tools to rebalance workloads across our infrastructure more efficiently (a simplified sketch of the idea follows this list). This includes improvements to how our system decides where to place applications during maintenance operations, as well as better documentation of incident response procedures.
- We will improve our incident management tools to allow faster response times.
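The rebalancing sketch referenced above: a toy heuristic that repeatedly moves one container from the most loaded node to the least loaded one. It illustrates the idea only; the actual tooling is still being designed.

```python
"""Toy rebalancing heuristic: repeatedly move one container from the most
loaded node to the least loaded one until the spread is acceptable. This
sketches the idea only; it is not the tool we are building."""


def rebalance(containers_by_node, move_budget=10, tolerance=5):
    # containers_by_node maps a node name to its container count.
    moves = []
    for _ in range(move_budget):
        busiest = max(containers_by_node, key=containers_by_node.get)
        idlest = min(containers_by_node, key=containers_by_node.get)
        if containers_by_node[busiest] - containers_by_node[idlest] <= tolerance:
            break
        containers_by_node[busiest] -= 1
        containers_by_node[idlest] += 1
        moves.append((busiest, idlest))
    return moves


if __name__ == "__main__":
    print(rebalance({"node-1": 60, "node-2": 12, "node-3": 18}))
```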
Changelog
2026-05-11: Initial version