Regional AWS eu-west-1 Cluster Performance Degradation Issue

Incident Report for Upstash Status

Postmortem

Incident Summary

During a maintenance update to the regional Upstash Redis databases in AWS eu-west-1, several databases hosted in that region unnecessarily triggered a full synchronisation between their primary and backup replicas.

Root Cause

A full synchronisation invalidates all data in the target replica and repopulates it from scratch from the source replica. Under normal circumstances, a full synchronisation is required only when data integrity has been lost in one of the replicas, which was not the case here.
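
To illustrate the difference in cost, here is a minimal sketch of a replica that falls back to a full synchronisation only when its integrity is in doubt. It is a toy model, not Upstash's implementation: the `Source` and `Replica` classes, the `intact` flag, and the in-memory write log are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy target replica: a key-value store plus the offset of the last applied write."""
    data: dict = field(default_factory=dict)
    offset: int = 0
    intact: bool = True  # hypothetical flag: False once data integrity is in doubt


@dataclass
class Source:
    """Toy source replica that keeps every write in an in-memory log."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # (key, value) pairs in write order

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


def full_sync(source, target):
    """Invalidate everything on the target, then repopulate it from the source."""
    target.data.clear()
    target.data.update(source.data)
    target.offset = len(source.log)
    target.intact = True
    return len(source.data)  # number of entries copied


def incremental_sync(source, target):
    """Replay only the writes the target has not applied yet."""
    missing = source.log[target.offset:]
    for key, value in missing:
        target.data[key] = value
    target.offset = len(source.log)
    return len(missing)  # number of writes replayed


def synchronise(source, target):
    """Use a full synchronisation only when the target's integrity is lost."""
    if target.intact:
        return "incremental", incremental_sync(source, target)
    return "full", full_sync(source, target)
```

In this model, an intact replica that is one write behind replays a single log entry, while a full synchronisation recopies every entry in the database. That gap is why an unnecessary full synchronisation is expensive.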

Impact

This incident affected the performance of regional databases on AWS eu-west-1 only. The full synchronisation caused very high CPU load and degraded performance on some of the databases that have a replica in this region. In addition, our system throttles databases that are undergoing a full synchronisation so that more CPU is available to the synchronisation and it finishes sooner.
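
As a rough illustration of this kind of throttling, the sketch below models a database that admits fewer client requests while a synchronisation is running. It is not Upstash's mechanism: the token-bucket approach, the `ThrottledDatabase` class, and the rate values are assumptions chosen only to show the idea.

```python
class ThrottledDatabase:
    """Toy token bucket whose refill rate drops while a full sync is running."""

    def __init__(self, normal_rate, sync_rate):
        self.normal_rate = normal_rate  # requests per second outside a sync
        self.sync_rate = sync_rate      # hypothetical reduced rate during a sync
        self.syncing = False
        self.tokens = 0.0

    def tick(self, seconds):
        """Advance time and refill tokens at the rate for the current state."""
        rate = self.sync_rate if self.syncing else self.normal_rate
        # Cap at one second's worth so idle time cannot build an unbounded burst.
        self.tokens = min(self.tokens + rate * seconds, rate)

    def try_request(self):
        """Admit a request if a token is available, otherwise reject it."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `ThrottledDatabase(normal_rate=100, sync_rate=10)`, setting `syncing = True` cuts the requests admitted per one-second tick from 100 to 10, and setting it back to `False` restores the normal rate on the next tick.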

No data was lost, and data consistency was not affected.

Resolution

As a quick remediation, we unthrottled the affected databases at 15:06 UTC to allow more throughput. However, high latency was still observed until the full synchronisation completed at 21:23 UTC.

A fix has been prepared to avoid this unnecessary full synchronisation on regional databases, and will be deployed shortly.
This issue is not present on Upstash Global databases, which are our new-generation infrastructure and now our default offering. We will reach out to our regional users about how to migrate to Upstash Global going forward.

Posted Feb 26, 2025 - 12:37 UTC

Resolved

This incident has been resolved.

Posted Feb 25, 2025 - 21:23 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 25, 2025 - 20:54 UTC

Identified

The regional AWS eu-west-1 cluster is experiencing performance degradation, and we are adding more resources to the cluster.

Posted Feb 25, 2025 - 14:03 UTC

This incident affected: Redis Regional (AWS EU-WEST-1).