Reorganizing a data model

09 Jun 2014

About six years ago I started a military reading list web site. It still only has around a dozen models, so it's a small system. But some of the fundamental data model ideas I started with turned out to not work well, and I had to change things after all this time.

A military reading list is a bunch of books that the organization thinks is relevant to folks doing the job. Some organizations come out with new revisions once a year, others release an initial revision and never update it, and other update their list once every few years. When I started the site, I modeled a reading list as having books through a "selections" model since some books appear on many different lists. When a new revision would come out, I would figure out what books had been added and what had been removed, add a "revision detail" record for each change, and then calculate the current list of selections from that list of revision details. I thought of it kind of like a source control system where there's some initial thing and then a series of diffs.

The problem is that this data model is inconvenient when displaying what people what to see. If a list has had five revisions over the past five years, people are not really interested in what changed between revisions two and three; they just want to see what was on the 2011 revision of the list. To show that on a page I had to apply the series of intervening diffs. This could be cached, of course, but it was complicated and felt wrong. I had stored the data in a way that made it hard to work with. Furthermore, even the original idea wasn't much good because most of the lists have all new selections. So most revisions consisted of removals for everything that had been on the list and additions for all the new selections rather than just a couple of removals and additions as I had expected. It also meant that backfilling old revisions was really tricky. If I had the 2010 revision and the 2012 revision but had missed the 2011 revision, adding that would be a real brain-bender.

So I changed the data model around so that each revision, rather than each reading list, has many selections. With this change, displaying a particular revision was trivial, and displaying the current reading list is just a matter of showing the most recent revision. And if I wanted to add an old revision I could just add it without relying on diffing from other revisions. That's all nice, but it meant that I had to move a core association from one model to another. So I had to add a new foreign key column, migrate the data, remove the old one, and change all the code to work as the change was getting rolled out.

I can't think of a clear narrative for lessons learned, so I will fall back to that mainstay of disorganized writers, a list:

When doing data migrations, or Rake tasks that munge data, wrap the outer loop in SomeModel.transaction do .. end. If there's an exception halfway through, everything will get rolled back and you can see what happened.
Have an easy and automated way of dumping the production database and loading it in an environment where you can experiment. In this way I discovered things like "not all lists have sections" and "some selections don't belong to a reading list" and other occurrences of bad data. This also let me test out data migration tasks before running them in production. In my case the database is small enough that I can do a dump and load with a cap task and it finishes in a reasonable amount of time.
Think through data migration rollbacks. If I create a few hundred selections, and it turns out my algorithm is wrong, how will I remove them? I didn't do this and so had to figure out something on the fly. My data set is small enough that I was able to query even on unindexed columns and delete the new data, but it would have been less stressful if I had had something in mind beforehand.
Data constraints are nice. They help ensure that bad data doesn't get persisted. They are a pain when doing a data model change, because I had to do migrations to drop nonnull constraints and whatnot, but those migrations aren't hard to write. And the very act of typing out a migration to remove a constraint helped me think through the data model changes.
I was afraid to get started on the change because I thought changing the app I use to add new revisions would be a tremendous pain. As it turned out, it wasn't bad at all. I think this is partially due to it now being an Angular app; there was just not a lot of code to deal with. And fortunately I hadn't written many tests for the admin app, so that saved some time :trollface:.
At one point I sort of lost the thread and stopped working on this change for a week or so because I wasn't sure what to do next. That worked out ok, though, because I was doing things in small steps and making sure everything worked along the way. This was more effort than doing everything in one fell swoop, but it was nice to be able to stop work for a bit and know that I had already made some progress and nothing on the site was broken.
Before I got started I wrote up a paragraph or two explaining to my future self why I was making this change. I felt a little silly when writing this initially, but I referred to it and updated it several times as I was going through the process. It helped me keep the end in mind and re-reading it helped me validate that I was on the right track.

I had lunch with Allan Bunch the other day and we were talking about this change. He does a lot of NoSQL work, so we were trying to figure out if this change was harder because the data is in PostgreSQL tables rather than Mongo or Couch or some such. That might be true... certainly some of the pain involved was around constraints and adding/removing columns and such. I've been using relational databases for so long - and haven't done any NoSQL thing of note - so it's hard for me to imagine what this change would have involved in something other than tables and columns. But there were certainly moments when I felt like I was breaking a structure which had really crystalized and that fiddling with it would make it all fall to pieces.

At my day job we occasionally do data model changes and run into all the problems listed here. And we run into a bunch of other problems due to the amount of data and the rate at which new data is being added. Fortunately we have a pretty solid infrastructure and processes set up so we can still make changes; just takes a little more thought.