Post-mortem: Extended Deployment time on June 30, 2017
On June 30, 2017 we had an extended deployment time of roughly 45 minutes for our reference server because of a couple of problems with one of the data migrations. We implemented a new feature, user notifications via RSS, that included a migration of data in our database. This migration was broken, causing this deployment to go terribly wrong.
The frontend team met afterward for a post-mortem to identify the problems, solutions and possible takeaways for the future. This is the first post-mortem meeting we held, hopefully but not likely the last. Here goes the report.
We deployed the first bits of the user-RSS feature, creating the model in our database. The model uses single table inheritance, which stores the model name in a `type` attribute in the database.
Wednesday to Friday
A recurring job created a couple of hundred entries in the database for this model.
On Friday we deployed another code change which, among other things, renamed the class to Notification::RssFeedItem (from plural to singular). However, the included migration did not update the `type` column to match the new class name, which made the migration fail during the deployment.
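To illustrate why a stale `type` value breaks loading: with single table inheritance, the string stored in `type` is resolved to a class of that name. The following is a minimal plain-Ruby simulation of that lookup, not the actual migration. The sample rows and the helpers `resolve_class` and `rename_type` are hypothetical. In a real Rails migration the fix would more likely be a single bulk update of the `type` column (for example with `update_all`).

```ruby
# Plain-Ruby simulation of a single table inheritance (STI) lookup.
# The rows and helper names below are hypothetical, for illustration only.

module Notification
  class RssFeedItem; end # the class after the rename (singular namespace)
end

# Rows as they might look after the recurring job ran with the old class name.
rows = [
  { id: 1, type: 'Notifications::RssFeedItem' },
  { id: 2, type: 'Notifications::RssFeedItem' }
]

# STI resolves the stored string to a constant; a stale name raises NameError.
def resolve_class(row)
  Object.const_get(row[:type])
end

# The step the migration was missing: rewrite the stored class name.
def rename_type(rows, from:, to:)
  rows.each { |row| row[:type] = to if row[:type] == from }
end

begin
  resolve_class(rows.first)
rescue NameError => e
  puts "lookup failed: #{e.message}"
end

rename_type(rows, from: 'Notifications::RssFeedItem', to: 'Notification::RssFeedItem')
puts resolve_class(rows.first) # now resolves to Notification::RssFeedItem
```

The point of the sketch is that renaming a class is also a data change: every row written under the old name has to be updated in the same deployment.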
We immediately investigated the cause and:
- temporarily renamed the class back to `Notifications::RssFeedItem` to be able to update the `type` attribute in the Rails console
- after the update succeeded we moved the class back to its current name and ran the migration again
The migration failed again because another change had happened to the Notifications::RssFeedItem class. The class previously had a direct Group association, which was refactored to be a polymorphic association. However, the direct associations were still used in the migration.
- we fixed the migration to use the new association type and then re-ran it.
The migration still failed, because the code had changed the event_payload attribute to be a serialized hash. We had already stored data as plain strings in event_payload, so the save operation in the migration failed with a type mismatch.
- we removed the hash type for the attribute and ran the migration again.
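The type mismatch can be reproduced outside of Rails. The sketch below mimics a column that is declared as a serialized hash but still holds a plain string written by older code. `TypeMismatch`, `load_payload` and both sample values are hypothetical stand-ins; the actual payload format in our database is not shown here.

```ruby
require 'yaml'

# Hypothetical stand-in for the error raised when a serialized column
# contains a value of an unexpected class.
class TypeMismatch < StandardError; end

# Mimics loading a column that is declared as a serialized Hash.
def load_payload(raw)
  value = YAML.safe_load(raw)
  raise TypeMismatch, "expected Hash, got #{value.class}" unless value.is_a?(Hash)

  value
end

new_row    = { 'project' => 'home:foo' }.to_yaml # written by the new code
legacy_row = 'project=home:foo'                  # hypothetical legacy string

p load_payload(new_row)

begin
  load_payload(legacy_row)
rescue TypeMismatch => e
  puts "legacy row rejected: #{e.message}"
end
```

Running a check like this over existing rows before the deployment would have shown that the stored data did not match the new declaration.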
The migration ran through successfully but after booting up the service we
noticed we had many SQL lock timeouts related to the
- we restarted the obsapidelayed service, to no avail.
- we disabled the generation of Notifications::RssFeedItem in the recurring job responsible for that, to no avail.
- we then noticed that a recurring job was running that had previously been disabled by a monkey patch. This patch was overwritten by the deployment. After applying the patch again and restarting the recurring jobs, the timeouts stopped.
After those changes build.opensuse.org was fully operational again after around 40 minutes.
Obviously we are in need of more automated testing of migrations with data. A couple of things we are going to try are:
- if a deployment includes data migrations we run them on a production DB clone. To make this easier we want to write a rake task to import data from production into development.
- add a [migration] label to pull requests that include migrations. The submitter has to confirm that they tested the migration before the change is merged.
- research how we can write migration tests with RSpec. There is no clear path for how to do this in Ruby on Rails but there are many ideas floating around (migration tests, separating data migrations from database changes etc.)
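The idea of rehearsing a data migration on a copy of the data can be sketched in miniature. The example below runs a migration step against a deep copy of some rows and collects the failures instead of aborting on the first one. `dry_run` and the sample rows are hypothetical, and an in-memory copy only stands in for a real database clone.

```ruby
# Runs a migration step against a deep copy of the data and collects
# failures instead of aborting, leaving the original rows untouched.
# `dry_run` and the sample rows are hypothetical names for illustration.
def dry_run(rows)
  copy = Marshal.load(Marshal.dump(rows)) # deep copy, stands in for a DB clone
  failures = []
  copy.each do |row|
    begin
      yield row
    rescue StandardError => e
      failures << { id: row[:id], error: e.message }
    end
  end
  failures
end

rows = [
  { id: 1, event_payload: { 'project' => 'home:foo' } },
  { id: 2, event_payload: 'a legacy string' }
]

failures = dry_run(rows) do |row|
  payload = row[:event_payload]
  raise TypeError, "payload is #{payload.class}, not Hash" unless payload.is_a?(Hash)

  row[:migrated] = true
end

failures.each { |f| puts "row #{f[:id]} would fail: #{f[:error]}" }
```

Collecting all failures in one pass matters here: in this incident each fix only revealed the next problem, which is what stretched the deployment.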
Another area where we have to get better is tracking the monkey patches that are applied to production, to make sure they are included in the deployment or re-applied afterwards. Here are some ideas how to do that:
- always apply the [Monkey Patched] label to the pull request if you have applied it in some way to production
- track monkey patches in /etc/motd of the server
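Tracking could also be partly automated. As a rough sketch, assuming monkey patches are applied by editing files on the server, a small manifest of checksums would let us detect after a deployment which patched files were replaced. `record_patches`, `overwritten_patches` and the manifest format are hypothetical, not an existing tool.

```ruby
require 'digest'
require 'tempfile'

# Records a checksum for every file that was patched by hand on the server.
# Hypothetical helper; the manifest format is an assumption.
def record_patches(paths)
  paths.to_h { |path| [path, Digest::SHA256.file(path).hexdigest] }
end

# Returns the patched files whose content no longer matches the manifest,
# e.g. because a deployment replaced them.
def overwritten_patches(manifest)
  manifest.reject do |path, checksum|
    File.exist?(path) && Digest::SHA256.file(path).hexdigest == checksum
  end.keys
end

# Demonstration with a temporary file standing in for a patched source file.
Tempfile.create('patched_job') do |file|
  file.write("# job disabled by monkey patch\n")
  file.flush
  manifest = record_patches([file.path])

  puts overwritten_patches(manifest).empty? # patch still in place

  File.write(file.path, "# job enabled again by deployment\n")
  puts overwritten_patches(manifest).inspect # lists the overwritten file
end
```

Run right after a deployment, a check like this would have flagged the overwritten patch before the re-enabled job caused lock timeouts.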
We also had a couple of general ideas.
- all people having developed a migration in the deployment have to be present during it
- only submit the minimum of code changes needed for a migration in a pull request so it is easier to review
We are sorry for the downtime this caused for our users, but we hope you trust us to do better in the future. We have already implemented a couple of the proposed improvements and we are working on more to come!