Post-mortem: Extended Deployment time on July 19, 2017
We did it again! Yesterday, on the 19th of July 2017, we had an extended deployment time because of an issue during the deployment. Though this time it "only" took 15 minutes ;-)
This sucks and that's why we want to give you some insight into what happened.
Problems/Timeline
19-07-2017
12:37 UTC – We installed the newest OBS packages from our Unstable project and ran the migrations.
12:40 UTC – Installation and migration finished. We checked the UI and everything looked normal.
12:41 UTC – A user reported that the stylesheets for their request page looked broken.
12:50 UTC – We inspected the assets stored in public/assets and noticed that there was an additional manifest file from 2015. Sprockets uses these files to know which assets (CSS, JavaScript, images) it should serve.
Knowing that having multiple manifests can cause issues with Sprockets, we moved the additional file to a temporary directory and tried to precompile the assets again.
But precompiling the assets failed unexpectedly with permission errors, and we were not sure whether the additional manifest was really what broke the stylesheets.
One of the other theories we had was that a recent change in application.css.erb caused the failure when compiling the assets.
12:55 UTC – Since we were not able to pinpoint the problem and didn't want to delay the deployment any longer, we decided to downgrade the packages.
12:57 UTC – The package downgrade finished, OBS was running again and everything was back to normal.
Analyzing what went wrong
We used our test instance to verify our theory that the additional manifest file was causing the breakage. It turned out that it was. Apparently we just have to restart the server so that Rails picks up the right manifest file.
This morning we upgraded the packages again. The broken CSS appeared again, but this time we knew that restarting the server resolves the faulty state.
Improvements
We need to find a way to prevent additional asset manifests from ending up in our app. Since our deployments are package-based, we might have to improve our package spec. To find a good solution for that, we've created a card in our product backlog.
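Until the package spec is improved, a small check run after installation could at least warn about leftover manifests. The following is only a rough sketch, not what we actually deployed: the helper names are hypothetical, the filename patterns are an assumption based on how Sprockets commonly names its manifests, and treating the most recently modified file as the current one is also an assumption that may not hold for files installed from a package.

```ruby
# Hypothetical helpers to spot leftover Sprockets manifests in an assets directory.
# Assumption: manifests are named manifest-<digest>.json or
# .sprockets-manifest-<digest>.json, depending on the Sprockets version.
MANIFEST_PATTERNS = ["manifest-*.json", ".sprockets-manifest-*.json"].freeze

# Returns all manifest files in assets_dir, oldest first (by modification time).
def manifest_files(assets_dir)
  MANIFEST_PATTERNS
    .flat_map { |pattern| Dir.glob(File.join(assets_dir, pattern)) }
    .sort_by { |path| File.mtime(path) }
end

# Returns every manifest except the most recently modified one.
# Assumption: the newest manifest is the one that belongs to the current release.
def stale_manifests(assets_dir)
  files = manifest_files(assets_dir)
  files.size > 1 ? files[0...-1] : []
end

if __FILE__ == $PROGRAM_NAME
  # The directory defaults to public/assets; pass another path as first argument.
  stale = stale_manifests(ARGV.fetch(0, "public/assets"))
  if stale.empty?
    puts "No additional manifests found."
  else
    warn "Additional manifests found:"
    stale.each { |path| warn "  #{path}" }
    exit 1
  end
end
```

The script only reports the files and exits with a non-zero status. It deliberately does not delete anything, since deciding which manifest is the right one should stay a manual step until we have a proper fix in the package.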
Resolution
We apologize for the downtime we have caused. The faulty manifest has been deleted and we look forward to finding a way to prevent this from happening again. And of course we aim to improve our deployments over time.
Please bear with us!