Picking the right data structure

20 Oct 2015

Suppose you came across some code like this:

def foo
  teams = ["Yankees", "Cubs"]
  puts "There are #{teams.size} teams"
  puts "Mariners are included? #{teams.include?('Mariners')}"
end

Probably you would think that it was odd that it was using an Array for teams; order doesn't seem to be important. All the operations are supported by Set which is much more efficient if it does what's needed. So this could be replaced with:

def foo
  teams = Set.new(["Yankees", "Cubs"])
  puts "There are #{teams.size} teams"
  puts "Mariners are included? #{teams.include?('Mariners')}"
end

Now we're using constant rather than linear time operations. This doesn't matter here, but if we had a million items in the collection it would make a big difference.

Here's another example of using an Array when another type would work better:

def foo
  Team = Struct.new(:name, :city)
  teams = []
  teams << Team.new("Yankees", "New York")
  teams << Team.new("Mets", "New York")
  teams << Team.new("Cubs", "Chicago")
  teams << Team.new("White Sox", "Chicago")
  teams.select {|t| t.city == "New York" }
end

In this case order is important because clearly the Yankees are the best team of the bunch, ever, in the world. So a Set wouldn't work. But we can avoid scanning the array with select by using a Hash with the city name as the key:

def foo
  Team = Struct.new(:name, :city)
  teams = {
    "New York" => [Team.new("Mets", "New York"), Team.new("Yankees", "New York")],
    "Chicago" => [Team.new("Cubs", "Chicago"), Team.new("Cubs", "White Sox")],
  }
  teams["New York"]
end

We've replaced another linear time operation with a constant time operation.

This is more than a performance enhancement, though. It also communicates something to the next developer who looks at this code. If we use a Set, we're communicating that order doesn't matter, and that we'll never expect to be able to index into this collection. If we use a Hash, we're communicating that we expect lookups to happen a particular way.

So these are worthwhile changes. But in a largish codebase - or even a small one - how do we find these refactoring / performance opportunities? In the case of "replace Array with Set", for a given Array, we need to ensure that we're not using any methods that aren't also available on Set. So if we have an Array that's using any of these methods, we can't swap them out:

irb(main):004:0> Array.new.methods.sort - Set.new.methods
=> [:*, :[], :[]=, :assoc, :at, :bsearch, :combination, :compact, :compact!, :concat,
:delete_at, :each_index, :fetch, :fill, :index, :insert, :join, :last, :pack, :permutation,
 :pop, :product, :push, :rassoc, :repeated_combination, :repeated_permutation, :reverse,
 :reverse!, :rindex, :rotate, :rotate!, :sample, :shift, :shuffle, :shuffle!, :slice,
 :slice!, :sort!, :sort_by!, :to_ary, :transpose, :uniq, :uniq!, :unshift, :values_at]

It's not quite that simple though. Consider this code:

def foo
  # teams best to worst
  teams = ["Yankees", "Cubs", "Red Sox"]
  teams.each {|t| puts t }
end

That usage of an Array has an implicit order - best to worst - but we can't tell that by looking at the methods invoked on the object. There aren't any giveaway usages of shuffle or at or whatever.

For a real-world example of this I poked around the sass gem (v3.4.13) that I'm using in a project. I looked at a bunch of different array usages, but for most of them it seemed like they were relying on the array being an ordered sequence, not just a collection. The one instance that seemed like it might be a candidate was the watched_paths array, since only unique directories are desirable there (per the remove_redundant_directories method) and order doesn't appear to be important. But even if so, what are we talking about here - maybe a dozen or so directories? So not a big savings.

When I first started thinking about this I considered writing a runtime utility to watch usage of Array instances and suggest replacing with other types. It'd be something like pippi except focused on type usage, not method call sequences. I'm still noodling on it, but as things stand I think it would result in too many false positives.

Maybe it would be more useful on gems than Rails apps? I feel like there's not a lot of upside there, though. When I think about the gems I use day to day - aasm, declarative_authorization, nokogiri, etc - most of the time I'm using small collections where iterating over a few dozen items just isn't a big deal. Or I'm using some API wrapper client where 99% of the time is spent waiting for data on a socket. The code might be more clear if a specific type were used, but for the most part, no big deal.

In conclusion, I don't think there's a useful tool to be written here. But comments welcome on the twitters.