Post-mortem: Service Degradation in OBS
Between February 15th and 18th, the Open Build Service (OBS) experienced intermittent service degradations.
Users experienced sporadic latency across various workflows, especially those backed by delayed jobs, and for some periods the service became entirely unavailable for most users.
These were the consequences of a huge traffic spike. We want to give you some insight into what happened and what we plan to do to handle similar situations more efficiently in the future.
Detection
The dashboards of our BuildOps monitoring system showed delayed jobs piling up over the weekend.
Some users reported that the changes tab on the request pages was taking too long to load.
Our monitoring system also alerted us about the increased traffic and the resulting drop in performance.
Root Cause
After investigating, we concluded that our servers received a high volume of HTTP requests from a certain range of IP addresses during those days.
At first, the only symptom was long-running delayed jobs, especially BsRequestActionWebuiInfosJob on the quick queue. At some point, however, the traffic got so high that it exhausted the available Passenger instances and the whole service became unavailable.
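For context, this is roughly how such a backlog can be inspected from a Rails console. It is a minimal sketch, assuming a stock delayed_job setup and its standard delayed_jobs table; it is not taken from the OBS code base.

```ruby
# Sketch: inspect the delayed job backlog per queue from a Rails console.
# Assumes the standard delayed_jobs columns (queue, locked_at, failed_at, created_at).
pending = Delayed::Job.where(locked_at: nil, failed_at: nil)

# Jobs waiting per queue, e.g. {"quick" => 30000, "default" => 12}
puts pending.group(:queue).count

# How long the oldest unclaimed job on the quick queue has been waiting
oldest = pending.where(queue: "quick").minimum(:created_at)
puts "oldest quick job enqueued #{((Time.now - oldest) / 60).round} minutes ago" if oldest
```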
Resolution
We blocked multiple IP address ranges in one of our firewalls.
Action Items
- Improve the time it takes to identify and block such sources by introducing alerts for “abnormal” (not only “fatal”) amounts of traffic, as well as for the delayed job wait time and queue size.
- Introduce traffic prioritization. For example, give the BsRequestActionWebuiInfosJob runs scheduled by authenticated users a higher priority in the queue. They should take precedence over legitimate bots (search engines, AI crawlers, etc.) and, of course, over illegitimate bots trying to fill the queue (see the sketch after this list).
- Move SyncUpstreamPackageVersionJob out of the quick queue.
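To illustrate the last two items, this is roughly how queue and priority assignments look with ActiveJob on top of delayed_job, where lower priority numbers are worked off first. The class bodies and the authenticated flag below are hypothetical and only sketch the idea; they are not the actual OBS implementation.

```ruby
# Hypothetical sketch of per-job queue and priority tuning.
class BsRequestActionWebuiInfosJob < ApplicationJob
  queue_as :quick

  # Illustrative helper: jobs triggered by authenticated users jump ahead
  # of jobs triggered by anonymous or bot traffic (lower number = sooner).
  def self.enqueue_for(bs_request_action, authenticated:)
    set(priority: authenticated ? 0 : 10).perform_later(bs_request_action)
  end

  def perform(bs_request_action)
    # ... compute the infos shown on the request's changes tab ...
  end
end

# Keeping long-running work off the quick queue so it cannot starve it:
class SyncUpstreamPackageVersionJob < ApplicationJob
  queue_as :default # previously :quick
end
```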
Lessons Learned
Early detection and action are key to avoiding service disruption.
Timeline (All Times in UTC)
- 2026-02-15 10:24 BuildOps informed us about delayed jobs piling up
- 2026-02-16 11:10 Started investigating to identify the delayed jobs involved
- 2026-02-16 14:10 Scaled up to 3 extra quick queues
- 2026-02-16 14:21 Decided to wait until the next day to evaluate the impact
- 2026-02-17 10:02 Scaled up to 9 quick queues
- 2026-02-17 12:42 Received alerts from our monitoring system
- 2026-02-17 13:08 Investigation led us to a bot that sent a high volume of HTTP requests from hundreds of different IP addresses in a short period of time, i.e. a DDoS attack
- 2026-02-17 13:13 Gathered the responsible IP addresses in order to block them
- 2026-02-17 13:27 MR opened for our Salt repository with the IP ranges to block
- 2026-02-17 13:38 MR merged and new firewall rules deployed
- 2026-02-17 14:06 The volume of traffic started to settle
- 2026-02-17 14:35 Volume of traffic went back to normal levels. Incident resolved
- 2026-02-18 10:36 Tracked up to 30K BsRequestActionWebuiInfosJob entries in the quick queue. Those jobs were cleared (see the sketch below)
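For reference, clearing such a backlog from a Rails console could look roughly like the following. This is a sketch under the same delayed_job assumptions as above, not the exact commands we ran; counting before deleting is the important part.

```ruby
# Sketch: drop the piled-up, not-yet-running jobs of one class from the
# quick queue. With ActiveJob on delayed_job, the job class name appears
# in the serialized handler column.
stale = Delayed::Job.where(queue: "quick", locked_at: nil, failed_at: nil)
                    .where("handler LIKE ?", "%BsRequestActionWebuiInfosJob%")

puts "about to delete #{stale.count} jobs" # sanity check before deleting
stale.delete_all
```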