Tracking down a Rails app memory leak

20 Sep 2007

Recently on thenewsroom we've noticed a memory leak; the process size would just grow and grow up over 1 GB per Mongrel. Restarting them regularly kept things more or less under control, but that's silly. Much better to find the actual problem.

I tried out various memory profilers, like mongrel -B and Scott Laird's MemoryProfiler. But the results just didn't jump out at me; I spent quite a while poking around Objectspace dump files and String diffs and such without really making much progress.

Then I tried Evan Weaver's BleakHouse. This was nicer; the Gruff graphs are cool and it showed that the leak was in the way we were using Ruby and Rails. So we didn't have a class Hash in a controller or anything like that. But I still couldn't pin it down; I was just seeing charts with upwardly sloping numbers of String objects.

Finally I backed down to just poking around the app while watching the output of:

watch -n 1 "ps -o rss,vsz -p 12345"

Nothing interesting surfaced at first. OK, what if I look at the differences between my dev and our production database? Ah ha! A particular table in my DB has 200 rows, the production one has 10K rows, and a particular page hit by a particular user causes 1500 of them to be fetched. And that corresponds with a huge leap in memory size.

Here was the problem. Our architecture involves various Rails apps passing data around - and we were using Marshal'd OpenStruct objects as the transfer mechanism. This is a bad idea. Here's a demo of why. First, let's write 10K Marshal'd OpenStruct objects to a file:

require 'ostruct'
structs = []
10000.times do |x|
  structs << OpenStruct.new(
   :first_name => "Fred#{x}", 
   :last_name => "Fredson#{x}", 
   :hatsize => 12+x, 
   :country => "FooLand#{x}")
end
File.open("data.dat", "w") { |f| f.syswrite(Marshal.dump(structs)) }

This creates a data.dat file of about 500K. Now let's read them back in, but check the process size at various points. Here's a read.rb:

require 'ostruct'
puts "Hit enter to read data"
gets
puts "Reading data..."
data = Marshal.load(File.read("data.dat"))
puts "Hit enter to set to nil"
gets
data = nil
GC.start
puts "It's nil, back from GC.start"
sleep 500

When we run read.rb, the process is about 3 MB:

$ ps -o rss,vsz -p 30079
RSS    VSZ
1516   3048

Now, after we read in the data:

$ ps -o rss,vsz -p 30079
RSS    VSZ
80600  82104

Great scott, 82 MB! But surely that GC.start will clean them up, right?

$ ps -o rss,vsz -p 30079
RSS    VSZ
80620  82104

And, no. Booooo. But replacing that OpenStruct with a Hash causes the memory size to only go from 3 MB to 7 MB, which is a bit of an improvement over rising from 3 MB to 80 MB.

What would have been nice in this situation would be to have some sort of heap analyzer that showed the size of memory occupied by each object graph subtree. It'd be worth looking at Evan's patch to Ruby 1.8.6 and his profiler to see if we could somehow gather and dump enough information to create something like that. What a cool project that would be for someone much smarter than I :-)