On June 30, 2017, we had an extended deployment time of roughly 45 minutes for our reference server because of a couple of problems with one of the data migrations. We had implemented a new feature, user notifications via RSS, that included a migration of data in our database. This migration was broken, causing the deployment to go terribly wrong.
Afterwards, the frontend team met for a post-mortem to identify the problems, solutions and possible takeaways for the future. This is the first post-mortem meeting we have held, hopefully but not likely the last. Here is the report.
We deployed the first bits of the user RSS feature, creating the notification model in our database. The model uses single table inheritance, which stores the class name in a type attribute in the database. A recurring job then created a couple of hundred entries for this model.
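For context, this is roughly what such a single table inheritance setup looks like in Rails; the class names follow the ones mentioned below, everything else in this sketch is illustrative and not the exact OBS code.

```ruby
# Illustrative sketch of the STI setup, not the actual OBS code.
class Notification < ApplicationRecord
  # The notifications table has a string "type" column. Rails uses it
  # for single table inheritance and stores the subclass name in it.
end

module Notifications
  class RssFeedItem < Notification
    # Every row created through this class gets
    # "Notifications::RssFeedItem" written into its type column.
  end
end
```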
On Friday we deployed another code change which, among other things, renamed the class to Notification::RssFeedItem (from plural to singular). However, the included migration did not update the type column to match the new class name, which made the migration fail during the deployment.
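A rename like that needs a data migration that rewrites the stored class names. As a hypothetical sketch (not the actual OBS migration), it could look like this; going through raw SQL instead of the models avoids coupling the migration to the current class definitions:

```ruby
# Hypothetical sketch, not the actual OBS migration.
class RenameRssFeedItemType < ActiveRecord::Migration[5.1]
  def up
    # Rewrite the STI class names stored in the type column.
    execute <<-SQL
      UPDATE notifications
         SET type = 'Notification::RssFeedItem'
       WHERE type = 'Notifications::RssFeedItem'
    SQL
  end

  def down
    execute <<-SQL
      UPDATE notifications
         SET type = 'Notifications::RssFeedItem'
       WHERE type = 'Notification::RssFeedItem'
    SQL
  end
end
```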
We immediately investigated the cause, fixed the stored type values and ran the migration again.
The migration failed again because another change had happened to the Notification::RssFeedItem class: its direct association to Group had been refactored into a polymorphic association. However, the direct association was still in use, so the migration broke once more. We fixed that as well and started the migration again.
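For illustration, the shape of that refactoring; the association name `subscriber` is our stand-in here, not necessarily the real one:

```ruby
# Illustrative before/after of the association change.

# Before: the notification pointed directly at a Group.
class Notification < ApplicationRecord
  belongs_to :group
end

# After: one polymorphic association that can point at a Group, a User
# or anything else that should receive notifications. Code (or a data
# migration) that still uses notification.group or group_id now breaks.
class Notification < ApplicationRecord
  belongs_to :subscriber, polymorphic: true
end
```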
The migration still failed, this time because the code had changed the event_payload attribute to be a serialized hash. The save operations in the migration ran into a type mismatch, because the event_payload attribute already contained data stored as strings. We fixed the stored data to match the hash type for the attribute and ran the migration again.
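In Rails terms the change looks roughly like this; with old rows still containing plain strings, loading or saving such a record raises a serialization type mismatch, which is presumably the error the migration ran into:

```ruby
class Notification < ApplicationRecord
  # event_payload is now read and written as a YAML-serialized Hash.
  # Rows that still contain a plain String trigger
  # ActiveRecord::SerializationTypeMismatch when loaded or saved.
  serialize :event_payload, Hash
end
```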
The migration ran through successfully, but after booting up the service we noticed many SQL lock timeouts related to Notification::RssFeedItem. We tried to track down the cause in the recurring job responsible for those entries, to no avail.
After those changes, build.opensuse.org was back to being fully operational after around 40 minutes.
Obviously we are in need of more automated testing of migrations that touch data. We have a couple of things in mind that we are going to try.
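For example, a test along these lines could seed rows the way the old code wrote them and run the data migration's up step against them. This is only a sketch with RSpec and made-up migration, file and column names:

```ruby
require 'rails_helper'
# Made-up file and class names, only to show the idea.
require Rails.root.join('db/migrate/20170630120000_rename_rss_feed_item_type').to_s

RSpec.describe RenameRssFeedItemType do
  it 'rewrites the STI type column of existing rows' do
    # Insert the row with raw SQL so the current model classes cannot
    # hide problems with data that older code left behind.
    ActiveRecord::Base.connection.execute(
      "INSERT INTO notifications (type, created_at, updated_at) " \
      "VALUES ('Notifications::RssFeedItem', NOW(), NOW())"
    )

    described_class.new.migrate(:up)

    types = ActiveRecord::Base.connection
                              .select_values('SELECT DISTINCT type FROM notifications')
    expect(types).to eq(['Notification::RssFeedItem'])
  end
end
```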
Another area we have to get better at is tracking the monkey patches that are applied to production, to make sure they are included in the next deployment or re-applied afterwards. We have some ideas for how to do that.
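One simple idea, sketched here rather than something we already have in place: keep every production monkey patch as a file in the repository and load them all from one initializer, so a hot patch only goes live through a commit and is automatically re-applied by the next deployment until it is deleted again.

```ruby
# config/initializers/zz_monkey_patches.rb -- illustrative sketch.
# All monkey patches that are live on production sit in
# lib/monkey_patches/, version controlled and reviewed like any code.
Dir[Rails.root.join('lib/monkey_patches/*.rb').to_s].sort.each do |patch_file|
  require patch_file
end
```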
We also had a couple of general ideas.
We are sorry for the downtime this has caused for our users, but we hope you trust us to do better in the future. We have already implemented a couple of the proposed improvements and are working on more to come!