Reorganizing a data model

09 Jun 2014

About six years ago I started a military reading list web site. It still only has around a dozen models, so it's a small system. But some of the fundamental data model ideas I started with turned out to not work well, and I had to change things after all this time.

A military reading list is a bunch of books that the organization thinks is relevant to folks doing the job. Some organizations come out with new revisions once a year, others release an initial revision and never update it, and other update their list once every few years. When I started the site, I modeled a reading list as having books through a "selections" model since some books appear on many different lists. When a new revision would come out, I would figure out what books had been added and what had been removed, add a "revision detail" record for each change, and then calculate the current list of selections from that list of revision details. I thought of it kind of like a source control system where there's some initial thing and then a series of diffs.

The problem is that this data model is inconvenient when displaying what people what to see. If a list has had five revisions over the past five years, people are not really interested in what changed between revisions two and three; they just want to see what was on the 2011 revision of the list. To show that on a page I had to apply the series of intervening diffs. This could be cached, of course, but it was complicated and felt wrong. I had stored the data in a way that made it hard to work with. Furthermore, even the original idea wasn't much good because most of the lists have all new selections. So most revisions consisted of removals for everything that had been on the list and additions for all the new selections rather than just a couple of removals and additions as I had expected. It also meant that backfilling old revisions was really tricky. If I had the 2010 revision and the 2012 revision but had missed the 2011 revision, adding that would be a real brain-bender.

So I changed the data model around so that each revision, rather than each reading list, has many selections. With this change, displaying a particular revision was trivial, and displaying the current reading list is just a matter of showing the most recent revision. And if I wanted to add an old revision I could just add it without relying on diffing from other revisions. That's all nice, but it meant that I had to move a core association from one model to another. So I had to add a new foreign key column, migrate the data, remove the old one, and change all the code to work as the change was getting rolled out.

I can't think of a clear narrative for lessons learned, so I will fall back to that mainstay of disorganized writers, a list:

I had lunch with Allan Bunch the other day and we were talking about this change. He does a lot of NoSQL work, so we were trying to figure out if this change was harder because the data is in PostgreSQL tables rather than Mongo or Couch or some such. That might be true... certainly some of the pain involved was around constraints and adding/removing columns and such. I've been using relational databases for so long - and haven't done any NoSQL thing of note - so it's hard for me to imagine what this change would have involved in something other than tables and columns. But there were certainly moments when I felt like I was breaking a structure which had really crystalized and that fiddling with it would make it all fall to pieces.

At my day job we occasionally do data model changes and run into all the problems listed here. And we run into a bunch of other problems due to the amount of data and the rate at which new data is being added. Fortunately we have a pretty solid infrastructure and processes set up so we can still make changes; just takes a little more thought.