QStash: Degraded performance in request processing and event logs

Incident Report for Upstash Status

Postmortem

Product: QStash

Incident Summary

Due to high load, the volume of QStash event logs reached a point that caused latency in the underlying data store operations.
Event log creation slowed down, which led to performance degradation in QStash request processing.
To resolve the performance degradation in QStash requests, the event logging module was temporarily turned off.
After deploying a hotfix and configuration changes, we turned event logging back on and the system returned to a stable state.

Root Cause

At 07:15 UTC, we received alerts about the performance degradation and started the investigation.
We discovered long-running queries for syncing event logs from the main QStash servers to the QStash event server.
To resolve the performance degradation, we turned off event logging as an immediate action.
This action restored QStash request performance to normal levels but left event log processing disabled.
We deployed a hotfix during the day to remove some redundant calls and alleviate the impact.
Around 16:20 UTC, we observed another performance degradation in QStash requests due to a load increase, and disabled event log processing again.
In the following hours, we deployed a configuration change to relax the job intervals for event log tasks and turned event logging back on.
This configuration change resolved the performance issues, and no further problems were observed.

Impact

During the periods when event log processing was slow, QStash requests experienced high latency, which caused timeouts for customers.
No events were lost. Duplicate event deliveries were observed due to a number of restarts during the incident.
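
Because duplicate deliveries can occur after restarts, receiving endpoints may benefit from idempotent handling. The following is a minimal sketch of consumer-side deduplication, not a description of how QStash works internally. It assumes each delivery carries a stable message identifier in an `Upstash-Message-Id` header; the `IdempotentHandler` class and the in-memory set are hypothetical and for illustration only, and a real deployment would likely need a durable, shared store.

```python
from typing import Callable, Mapping, Set


class IdempotentHandler:
    """Runs `process` at most once per message ID (hypothetical helper)."""

    # Assumption: the delivery request carries its message ID in this header.
    ID_HEADER = "upstash-message-id"

    def __init__(self, process: Callable[[bytes], None]) -> None:
        self._process = process
        # Assumption: in-memory set for illustration; it is lost on restart.
        self._seen: Set[str] = set()

    def handle(self, headers: Mapping[str, str], body: bytes) -> bool:
        """Return True if the body was processed, False if it was a duplicate."""
        # HTTP header names are case-insensitive, so normalize before lookup.
        normalized = {name.lower(): value for name, value in headers.items()}
        message_id = normalized.get(self.ID_HEADER)

        if message_id is None:
            # Without an ID there is nothing to deduplicate on; process as-is.
            self._process(body)
            return True

        if message_id in self._seen:
            return False

        # Record the ID only after processing succeeds, so a failed attempt
        # can still be retried by a later delivery.
        self._process(body)
        self._seen.add(message_id)
        return True
```

With this sketch, a second delivery carrying the same message ID is acknowledged without repeating the side effect.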

Resolution

Improvements have been applied to the event logging module to prevent the same issue from happening again.
We also plan to upgrade the underlying disks to higher-performance models.

Posted Apr 03, 2025 - 15:57 UTC

Resolved

The QStash service and event logs are fully functional without any remaining issues.
Posted Apr 02, 2025 - 21:12 UTC

Update

Monitoring:
The main QStash service is back to normal.
The event logs service is back online, but events will lag by a few minutes.
Posted Apr 02, 2025 - 20:20 UTC

Monitoring

The main QStash service is back to normal.
Event logs are still temporarily unavailable.
Posted Apr 02, 2025 - 17:33 UTC

Update

We are continuing to investigate this issue.
Posted Apr 02, 2025 - 16:33 UTC

Update

We are continuing to investigate this issue.
Posted Apr 02, 2025 - 16:29 UTC

Investigating

We are currently investigating this issue.
Posted Apr 02, 2025 - 16:21 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 02, 2025 - 08:15 UTC

Investigating

We are currently investigating this issue.
Posted Apr 02, 2025 - 07:15 UTC
This incident affected: QStash.