
Networking issues in use1

Resolved
Major outage
Started 15 days ago · Lasted about 8 hours

Affected

Artie Dashboard

Operational from 12:30 AM to 8:24 AM

AWS US-East-1 Data Plane

Operational from 12:30 AM to 1:48 AM, Major outage from 1:48 AM to 2:30 AM, Operational from 2:30 AM to 8:24 AM

Updates
  • Postmortem

    Incident Overview

    • Date: January 30, 2025

    • Affected AWS region: us-east-1

    • Duration: 6.5 hours of partial outage, 1 hour of full outage

    Summary

    On January 30, a routine infrastructure upgrade to our Kubernetes networking layer caused a service disruption in one of our primary cloud regions. What began as a partial outage escalated into a full outage affecting data pipelines for customers in the us-east-1 region. We sincerely apologize for the impact this had on your operations.

    Service was fully restored the same evening after our team identified and resolved the underlying issue: a version mismatch between critical networking components in our cluster.

    Impact

    We understand how important reliability is to your business, and we are deeply sorry for the disruption. During the outage:

    • Service was degraded or unavailable in the us-east-1 region for a total of 7.5 hours (6.5 hours of partial outage and 1 hour of full outage).

    • A small number of affected customers experienced data replication interruptions, which our team worked to remediate promptly.

    We recognize that any downtime is unacceptable, and we take full responsibility for this incident.

    What happened

    Our engineering team was performing a planned upgrade to a networking add-on (vpc-cni) to increase capacity in our Kubernetes infrastructure. During the upgrade:

    1. The change was applied successfully in a non-production environment but encountered compatibility issues in the production cluster.

    2. A rollback was not possible because the previous component version was no longer supported by the underlying platform.

    3. The issue was compounded by version mismatches across several interdependent networking components, which made diagnosis more complex than expected.

    Our team identified the root cause as an outdated networking component (kube-proxy) that was incompatible with the rest of the upgraded stack. Once the fix was applied, service was restored.
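
    As an illustration of the kind of mismatch involved (a sketch only, not a description of our internal tooling), the short Python script below uses the AWS EKS API via boto3 to compare each managed add-on's installed version against the versions AWS publishes as compatible with the cluster's Kubernetes version; the cluster name is a placeholder.

    import boto3

    eks = boto3.client("eks", region_name="us-east-1")

    def compatible_versions(addon_name, k8s_version):
        # Versions AWS lists as compatible with this Kubernetes minor version.
        resp = eks.describe_addon_versions(addonName=addon_name,
                                           kubernetesVersion=k8s_version)
        return {v["addonVersion"]
                for addon in resp.get("addons", [])
                for v in addon.get("addonVersions", [])}

    def audit_cluster(cluster_name):
        # Flag EKS-managed add-ons (e.g. vpc-cni, kube-proxy, coredns) whose
        # installed version is not listed as compatible with the cluster.
        k8s_version = eks.describe_cluster(name=cluster_name)["cluster"]["version"]
        for name in eks.list_addons(clusterName=cluster_name)["addons"]:
            installed = eks.describe_addon(clusterName=cluster_name,
                                           addonName=name)["addon"]["addonVersion"]
            ok = installed in compatible_versions(name, k8s_version)
            print(f"{cluster_name} {name}: {installed} "
                  f"{'OK' if ok else 'NOT compatible with ' + k8s_version}")

    audit_cluster("prod-us-east-1")  # placeholder cluster name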

    What we're sorry about

    We want to be transparent: this incident was avoidable. We moved forward with a change that, in hindsight, required more thorough validation and coordination. Specifically:

    • The production environment differed from the test environment in ways we did not fully account for.

    • We did not have sufficient safeguards in place to catch the version incompatibility before it caused an outage.

    • Our incident response process was slower than it should have been, and we did not escalate to our cloud provider's engineering team quickly enough.

    We are sorry for the disruption and for falling short of the reliability you expect from us.

    Our plan going forward

    We are actively carrying out a comprehensive set of improvements to prevent incidents like this from happening again:

    Stronger change controls

    • We have strengthened our change management process so that all planned infrastructure migrations require formal approval and a minimum of two engineers working together to apply changes.

    • Non-reversible changes are restricted to business hours (Monday through Thursday) to ensure full team availability and vendor support coverage.

    Improved testing and validation

    • We are auditing and standardizing component versions across all of our clusters to eliminate hidden drift between environments (a sketch of this kind of audit follows this list).

    • Production and non-production clusters are being aligned to ensure test results are representative.
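
    As an illustration of what such an audit can look like (a sketch only, with placeholder cluster names rather than our real ones), the following compares the managed add-on versions installed on a production cluster and a non-production cluster and reports any drift:

    import boto3

    eks = boto3.client("eks", region_name="us-east-1")

    def addon_versions(cluster_name):
        # Map each EKS-managed add-on installed on the cluster to its version.
        return {name: eks.describe_addon(clusterName=cluster_name,
                                         addonName=name)["addon"]["addonVersion"]
                for name in eks.list_addons(clusterName=cluster_name)["addons"]}

    def report_drift(prod, staging):
        # Print any add-on whose version differs (or is missing) between clusters.
        prod_v, staging_v = addon_versions(prod), addon_versions(staging)
        for name in sorted(set(prod_v) | set(staging_v)):
            p, s = prod_v.get(name, "<missing>"), staging_v.get(name, "<missing>")
            if p != s:
                print(f"DRIFT {name}: {prod}={p} {staging}={s}")

    report_drift("prod-us-east-1", "staging-us-east-1")  # placeholder names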

    Faster incident response

    • We are formalizing a war room procedure with clearly defined roles: a primary engineer making changes, a secondary approving them, an incident commander coordinating, and a communications lead keeping customers informed.

    • We are establishing clear severity definitions and escalation timelines so that vendor support is engaged earlier when needed.

    Infrastructure resilience

    • We are working toward running multiple clusters per region so that workloads can be migrated if a single cluster is compromised.

    • We are building the capability to move customer pipelines between clusters with minimal disruption.

    • We are proactively engaging our cloud provider to review and approve infrastructure upgrade plans before they are executed.


    We value your trust and are committed to earning it back through these concrete actions. If you have any questions or concerns about this incident or our remediation plan, please do not hesitate to reach out to your account contact or our support team.

  • Resolved

    We’ve identified the root cause of the DNS instability: an out-of-order VPC CNI upgrade led to an undocumented version mismatch, which caused conflicts in our DNS routing rules. The issue has now been resolved, and we are closely monitoring all pipelines to ensure continued stability.

  • Update

    We are continuing to see intermittent DNS errors. The issue has been escalated with AWS, and an AWS engineer is joining the investigation shortly. We’ll share another update as we learn more.

  • Update

    We are now working with AWS to diagnose the partial DNS failure.

  • Update

    We’ve identified intermittent DNS issues on a specific node and are actively working to resolve them and restore full reliability.

  • Monitoring

    We are seeing a CNI error. Some clients are recovering, and we are still working towards a full recovery.

  • Identified

    A config change has affected networking for pipelines in the us-east-1 data plane. We are continuing to work on a fix for this incident.