Suppose you came across some code like this:
Probably you would think that it was odd that it was using an Array
for teams
; order doesn't seem to be important. All the operations are supported by Set
which is much more efficient if it does what's needed. So this could be replaced with:
Now we're using constant rather than linear time operations. This doesn't matter here, but if we had a million items in the collection it would make a big difference.
Here's another example of using an Array
when another type would work better:
In this case order is important because clearly the Yankees are the best team of the bunch, ever, in the world. So a Set
wouldn't work. But we can avoid scanning the array with select
by using a Hash
with the city name as the key:
We've replaced another linear time operation with a constant time operation.
This is more than a performance enhancement, though. It also communicates something to the next developer who looks at this code. If we use a Set
, we're communicating that order doesn't matter, and that we'll never expect to be able to index into this collection. If we use a Hash
, we're communicating that we expect lookups to happen a particular way.
So these are worthwhile changes. But in a largish codebase - or even a small one - how do we find these refactoring / performance opportunities? In the case of "replace Array
with Set
", for a given Array
, we need to ensure that we're not using any methods that aren't also available on Set
. So if we have an Array
that's using any of these methods, we can't swap them out:
It's not quite that simple though. Consider this code:
That usage of an Array
has an implicit order - best to worst - but we can't tell that by looking at the methods invoked on the object. There aren't any giveaway usages of shuffle
or at
or whatever.
For a real-world example of this I poked around the sass gem (v3.4.13) that I'm using in a project. I looked at a bunch of different array usages, but for most of them it seemed like they were relying on the array being an ordered sequence, not just a collection. The one instance that seemed like it might be a candidate was the watched_paths
array, since only unique directories are desirable there (per the remove_redundant_directories
method) and order doesn't appear to be important. But even if so, what are we talking about here - maybe a dozen or so directories? So not a big savings.
When I first started thinking about this I considered writing a runtime utility to watch usage of Array
instances and suggest replacing with other types. It'd be something like pippi except focused on type usage, not method call sequences. I'm still noodling on it, but as things stand I think it would result in too many false positives.
Maybe it would be more useful on gems than Rails apps? I feel like there's not a lot of upside there, though. When I think about the gems I use day to day - aasm
, declarative_authorization
, nokogiri
, etc - most of the time I'm using small collections where iterating over a few dozen items just isn't a big deal. Or I'm using some API wrapper client where 99% of the time is spent waiting for data on a socket. The code might be more clear if a specific type were used, but for the most part, no big deal.
In conclusion, I don't think there's a useful tool to be written here. But comments welcome on the twitters.