Safer JSON munging

09 Aug 2018

At work we have an ETL process where we get a CSV from a partner and put it in a database as JSON. Sometimes the column names are a little off and we have to rename things. Also, sometimes we need to derive new values from the data we've imported, and sometimes we delete some of the unnecessary key/value pairs. So basically there's a good bit of munging going on there.

Initially I was writing little one-off scripts to do this stuff. For each record, grab it, munge the JSON, save it. But that was kind of stressful since I was always one typo away from messing up the data and having to re-import or restore from the audit records. Unit tests help, of course, but having to write a new script each time was tedious too.

Thinking more about the problem, it felt less like an imperative "if key x is present then do y" and more like a series of instructions, or transformations, that we were running on the data. This felt nicer, more functional; send in a hash, get back another hash with a slight modification. Also, why spend 15 minutes writing a script when you can automate it in a day? This line of thinking resulted in a little flurry of classes like this:

  # h = {a: 2, b: 3}
  # CopyInst.new(:a, :z).call(h)
  # {:a => 2, :b=>3, :z=>2} # new h
  class CopyInst
    attr_reader :from, :to
    def initialize(from, to)
      @from = from
      @to = to
    end
    def call(hash)
      hash[to] = hash[from]
    end
  end

It's not quite functional because it modifies the hash argument rather than dup'ing. But each little instruction (CopyInst, RenameInst, and a couple of other business-specific ones) was easy to understand and test. And after instantiating a series of instructions you just iterate over them and call each one in turn:

def munge(hash)
  instructions.each do |inst|
    inst.call(hash)
  end
  hash
end

And out the other side comes a cleaned up hash suitable for to_json-ing.

Expressing these transformations as a series of instructions - a 'program' - suggests some interesting possibilities for the thing that is executing them - the 'virtual machine'. We could validate those instructions so that each one verifies it can be run on the target hash; check for missing keys for RenameInst, checks for already existing keys for the CopyInst, and so on. We could coalesce instructions to reduce the number of method calls. We could detect no-op sequences (copy 'a' to 'b' followed by copy 'b' to 'a') or common subexpressions. If there were non-technical users you could imagine defining a little user interface to let end users express whatever changes they wanted in terms of these instructions, and then save off that 'program' and run it. And there's probably an undo stack in there somewhere as well. All kinds of nifty compiler-ish things!

This experiment turned what was drudgery into an interesting and fun exercise. Always a win!