Surveying gemspec file specifiers

08 Jun 2017

Recently I was fiddling with a gem and noticed that I was including some extra files when building it. That was easy enough to clean up using the Gem::Specification#files attribute, but poking around there made me wonder how other folks were specifying their file lists. There are a bunch of different ways to do it, and I bet it's the sort of thing that gets cargo culted from gem to gem; I know I've done that.

Here's a common usage pattern:

s.files = `git ls-files -z`.split("\x0").reject {|p| p.match(%r{^(test/|docs)}) }

Using git ls-files ensures you only get files that git is tracking, and -z terminates each filename with a null byte rather than a newline. The idea behind using a null byte as the delimiter is that it can never be part of a filename, whereas a newline can (confirmed by tkannelid, thanks!). After the file list is generated it's just a matter of splitting the list and rejecting any filenames that match the regular expression, which here means anything under test/ or starting with docs.

Here's another one, this time from rack-mini-profiler's gemspec:

s.files = [
  'rack-mini-profiler.gemspec',
].concat( Dir.glob('lib/**/*').reject {|f| File.directory?(f) || f =~ /~$/ } )

This starts with a "seed" array and concatenates the files under lib, after filtering out directories and editor backup files (names ending in ~). The glob could be rewritten a little more concisely as Dir['lib/**/*'].
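Here's a small sketch to check that the two glob spellings behave the same. It builds a throwaway directory tree (the file names are hypothetical, not taken from rack-mini-profiler) and applies the same reject block to both:

```ruby
require 'tmpdir'
require 'fileutils'

via_glob, via_brackets = Dir.mktmpdir do |dir|
  Dir.chdir(dir) do
    # Hypothetical layout: two real files, one backup file, one subdirectory.
    FileUtils.mkdir_p('lib/foo')
    FileUtils.touch(%w[lib/foo.rb lib/foo.rb~ lib/foo/version.rb])

    # Same filter as the gemspec: drop directories and names ending in ~.
    filter = ->(list) { list.reject { |f| File.directory?(f) || f =~ /~$/ }.sort }

    [filter.call(Dir.glob('lib/**/*')), filter.call(Dir['lib/**/*'])]
  end
end

p via_glob
p via_brackets
```

Both calls return the same list here, with lib/foo (a directory) and lib/foo.rb~ (a backup file) filtered out. The sort is only there to make the comparison independent of the order the glob happens to return.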

Here's a variant on the git ls-files method. It doesn't use -z and relies on Ruby's String#split default behavior, which is to split on any whitespace. Note that ls-files accepts both individual files and directories as arguments:

s.files = `git ls-files README.md CHANGELOG.md ext lib support`.split

The downside there is that it would break if any of the filenames contained spaces.
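A quick sketch of that failure mode, using hard-coded strings in place of real git output. The file name is hypothetical, and this assumes git prints a name containing a space verbatim when -z isn't given:

```ruby
# Hypothetical output for a repo that tracks a file called "lib/my file.rb".
newline_output = "README.md\nlib/my file.rb\n"       # without -z
nul_output     = "README.md\x0lib/my file.rb\x0"     # with -z

# Default split breaks on any whitespace, so the name is cut in two.
broken = newline_output.split

# Splitting on the NUL terminator keeps the name in one piece.
intact = nul_output.split("\x0")

p broken
p intact
```

The whitespace split yields three entries, two of which (lib/my and file.rb) aren't real files, while the NUL split yields the two actual paths.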

Here's another variant. The interesting part of this one is that the regular expression in the select block uses ?: which is a non-capturing group:

s.files = `git ls-files -z`.split("\x0").select do |f|
  %r{^(?:bin|lib)\/} =~ f
end + %w(CHANGELOG.md CONTRIBUTING.md LICENSE.txt README.md)

I'm not sure why it does that. Benchmarking shows non-capturing groups giving a minor improvement, a few percent over ten million iterations:

>> require 'benchmark'
>> count = 10000000 ; Benchmark.bm(1) {|x|
  x.report("a"){ count.times {"heyo" =~ /(foo|bar)/ } } ;
  x.report("b"){ count.times {"heyo" =~ /(?:foo|bar)/ } }
}
        user     system      total        real
a   3.500000   0.000000   3.500000 (  3.519035)
b   3.380000   0.010000   3.390000 (  3.388738)

But in this context it seems unnecessary since the array is only going to have a few dozen files. And I would think there'd be a tiny performance hit up front for Ruby to compute the NFA for the more complicated expression. Still, from a readability perspective, it does signal that the group isn't going to be referenced later on, so maybe that's a win.
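For what it's worth, here's a small sketch of what the two kinds of group actually do differently. The path is just an example value; the patterns are the one from the gemspec and its capturing equivalent:

```ruby
path = "lib/foo.rb"

# A plain group remembers the text it matched.
capturing = %r{^(bin|lib)/}.match(path)

# A (?:...) group only groups the alternation; nothing is remembered.
non_capturing = %r{^(?:bin|lib)/}.match(path)

p capturing.captures
p non_capturing.captures
```

Both patterns match the same paths, so the select block behaves identically either way. The only difference is that the capturing version stores "lib" as its first capture while the non-capturing one stores nothing, and nothing in the gemspec ever reads that capture.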

No real takeaways on this one, just messing around. Always neat to see how different projects do things. Enjoy!