
Networking issues in use1

Resolved
Major outage
Started 15 days ago · Lasted about 8 hours

Affected

Artie Dashboard

Operational from 12:30 AM to 8:24 AM

AWS US-East-1 Data Plane

Operational from 12:30 AM to 1:48 AM, Major outage from 1:48 AM to 2:30 AM, Operational from 2:30 AM to 8:24 AM

Updates
  • Postmortem

    Incident Overview

    • Date: January 30, 2025

    • Affected AWS region: us-east-1

    • Duration: 6.5 hours of partial outage, 1 hour of full outage

    Summary

    On January 30, a routine infrastructure upgrade to our Kubernetes networking layer caused a service disruption in one of our primary cloud regions. What began as a partial outage escalated into a full outage affecting data pipelines for customers in the us-east-1 region. We sincerely apologize for the impact this had on your operations.

    Service was fully restored the same evening after our team identified and resolved the underlying issue: a version mismatch between critical networking components in our cluster.

    Impact

    We understand how important reliability is to your business, and we are deeply sorry for the disruption. During the outage:

    • Service was degraded or unavailable in the us-east-1 region for a total of 7.5 hours (6.5 hours of partial outage and 1 hour of full outage).

    • A small number of affected customers experienced data replication interruptions, which our team worked to remediate promptly.

    We recognize that any downtime is unacceptable, and we take full responsibility for this incident.

    What happened

    Our engineering team was performing a planned upgrade to a networking add-on (vpc-cni) to increase capacity in our Kubernetes infrastructure. During the upgrade:

    1. The change was applied successfully in a non-production environment but encountered compatibility issues in the production cluster.

    2. A rollback was not possible because the previous component version was no longer supported by the underlying platform.

    3. The issue was compounded by version mismatches across several interdependent networking components, which made diagnosis more complex than expected.

    Our team identified the root cause as an outdated networking component (kube-proxy) that was incompatible with the rest of the upgraded stack. Once the fix was applied, service was restored.
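
    As an illustration of the kind of mismatch involved (a sketch only, not a description of our internal tooling), the short Python script below uses the AWS EKS API via boto3 to compare each managed add-on's installed version against the versions AWS publishes as compatible with the cluster's Kubernetes version; the cluster name is a placeholder.

    import boto3

    eks = boto3.client("eks", region_name="us-east-1")

    def compatible_versions(addon_name, k8s_version):
        # Versions AWS lists as compatible with this Kubernetes minor version.
        resp = eks.describe_addon_versions(addonName=addon_name,
                                           kubernetesVersion=k8s_version)
        return {v["addonVersion"]
                for addon in resp.get("addons", [])
                for v in addon.get("addonVersions", [])}

    def audit_cluster(cluster_name):
        # Flag EKS-managed add-ons (e.g. vpc-cni, kube-proxy, coredns) whose
        # installed version is not listed as compatible with the cluster.
        k8s_version = eks.describe_cluster(name=cluster_name)["cluster"]["version"]
        for name in eks.list_addons(clusterName=cluster_name)["addons"]:
            installed = eks.describe_addon(clusterName=cluster_name,
                                           addonName=name)["addon"]["addonVersion"]
            ok = installed in compatible_versions(name, k8s_version)
            print(f"{cluster_name} {name}: {installed} "
                  f"{'OK' if ok else 'NOT compatible with ' + k8s_version}")

    audit_cluster("prod-us-east-1")  # placeholder cluster name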

    What we're sorry about

    We want to be transparent: this incident was avoidable. We moved forward with a change that, in hindsight, required more thorough validation and coordination. Specifically:

    • The production environment differed from the test environment in ways we did not fully account for.

    • We did not have sufficient safeguards in place to catch the version incompatibility before it caused an outage.

    • Our incident response process was slower than it should have been, and we did not escalate to our cloud provider's engineering team quickly enough.

    We are sorry for the disruption and for falling short of the reliability you expect from us.

    Our plan going forward

    We are actively carrying out a comprehensive set of improvements to prevent incidents like this from happening again:

    Stronger change controls

    • We have strengthened our change management process so that all planned infrastructure migrations require formal approval and a minimum of two engineers working together to apply changes.

    • Non-reversible changes are restricted to business hours (Monday through Thursday) to ensure full team availability and vendor support coverage.

    Improved testing and validation

    • We are auditing and standardizing component versions across all of our clusters to eliminate hidden drift between environments (a sketch of this kind of audit follows this list).

    • Production and non-production clusters are being aligned to ensure test results are representative.
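
    As an illustration of what such an audit can look like (a sketch only, with placeholder cluster names rather than our real ones), the following compares the managed add-on versions installed on a production cluster and a non-production cluster and reports any drift:

    import boto3

    eks = boto3.client("eks", region_name="us-east-1")

    def addon_versions(cluster_name):
        # Map each EKS-managed add-on installed on the cluster to its version.
        return {name: eks.describe_addon(clusterName=cluster_name,
                                         addonName=name)["addon"]["addonVersion"]
                for name in eks.list_addons(clusterName=cluster_name)["addons"]}

    def report_drift(prod, staging):
        # Print any add-on whose version differs (or is missing) between clusters.
        prod_v, staging_v = addon_versions(prod), addon_versions(staging)
        for name in sorted(set(prod_v) | set(staging_v)):
            p, s = prod_v.get(name, "<missing>"), staging_v.get(name, "<missing>")
            if p != s:
                print(f"DRIFT {name}: {prod}={p} {staging}={s}")

    report_drift("prod-us-east-1", "staging-us-east-1")  # placeholder names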

    Faster incident response

    • We are formalizing a war room procedure with clearly defined roles: a primary engineer making changes, a secondary approving them, an incident commander coordinating, and a communications lead keeping customers informed.

    • We are establishing clear severity definitions and escalation timelines so that vendor support is engaged earlier when needed.

    Infrastructure resilience

    • We are working toward running multiple clusters per region so that workloads can be migrated if a single cluster is compromised.

    • We are building the capability to move customer pipelines between clusters with minimal disruption.

    • We are proactively engaging our cloud provider to review and approve infrastructure upgrade plans before they are executed.


    We value your trust and are committed to earning it back through these concrete actions. If you have any questions or concerns about this incident or our remediation plan, please do not hesitate to reach out to your account contact or our support team.

  • Resolved

    We’ve identified the root cause of the DNS instability: an out-of-order VPC CNI upgrade led to an undocumented version mismatch, which caused conflicts in our DNS routing rules. The issue has now been resolved, and we are closely monitoring all pipelines to ensure continued stability.

  • Update

    We are continuing to see intermittent DNS errors. The issue has been escalated with AWS, and an AWS engineer is joining the investigation shortly. We’ll share another update as we learn more.

  • Update

    We are now working with AWS to diagnose the partial DNS failure.

  • Update

    We’ve identified intermittent DNS issues on a specific node and are actively working to resolve them and restore full reliability.

  • Monitoring

    We are seeing a CNI error. Some clients are recovering, and we are still working towards a full recovery.

  • Identified

    A config change has affected networking for pipelines in the us-east-1 data plane. We are continuing to work on a fix for this incident.