Episode #042: Streaming


Episode Script

Sometimes, performance matters. And sometimes what kills performance is not CPU use or I/O, but memory size. A program that uses massive amounts of RAM, forcing the OS to dig heavily into virtual memory, can bring a system to a thrashing standstill.

Ruby's standard I/O libraries have APIs which are optimized for programmer convenience. Unfortunately, this can sometimes mean that it's all too easy to make memory use go through the roof without even realizing it.

For instance: say I want to dig into some web analytics data from my blog. I have a file called visitors.csv which has a sample of 5,000 site hits. I want to find out how many of the visitors in that group were from San Francisco.

Ruby's CSV library makes this very easy. I require the library, then in a single line of code I slurp up the whole file and convert it into a collection of row objects. (Because headers are enabled, this is a CSV::Table rather than a plain Array, but it is Enumerable just the same.) Then I use the #count method with a block to count the number of rows with a “Geolocation” field matching San Francisco.

require 'csv'

visitors = CSV.read('visitors.csv', headers: true)
visitors.count{|v| v["Geolocation"] =~ /San Francisco/} # => 168

Let's look at some stats on this operation. To help us, we'll write a little method called memstats which simply uses the ps shell command to get a rough idea of the program's current memory size. Then we'll call memstats both before and after we load up the file and count San Francisco visitors.

def memstats
  size = `ps -o size= #{$$}`.strip.to_i
  "Size: #{size}"
end

require 'csv'
memstats                        # => "Size: 5152"
visitors = CSV.read('visitors.csv', headers: true)
visitors.count{|v| v["Geolocation"] =~ /San Francisco/} # => 168
memstats                        # => "Size: 20728"

Wow! The process roughly quadrupled in size! And the file we've used is just a test file with a small sampling of data. It's a safe bet that as the input files get larger, so will the memory usage. This isn't going to tax my development workstation any time soon, but if I used this code on a memory-constrained web server there's a good chance it would start showing the strain in slow response times or even crashes as the process ran out of memory.
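A side note on the measurement itself: the size column that memstats asks ps for may not be available in every ps implementation. As a rough alternative, here is a sketch that reports resident set size instead. The rss_kilobytes name is made up for this example, and the numbers it reports will not match the ones shown in this episode, since it measures something slightly different.

```ruby
# Hypothetical alternative to memstats: resident set size in kilobytes.
# Reads /proc on Linux when available, otherwise falls back to `ps -o rss=`.
def rss_kilobytes(pid = Process.pid)
  status = "/proc/#{pid}/status"
  if File.readable?(status)
    line = File.foreach(status).find { |l| l.start_with?("VmRSS:") }
    return line.split[1].to_i if line
  end
  `ps -o rss= -p #{pid}`.strip.to_i
end

puts "RSS: #{rss_kilobytes} KB"
```

Either helper only gives a coarse picture, but that is enough to compare the before and after numbers of a single run.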

There is a direct relationship between the size of this program's input data set and its resident memory size. This is because it creates and retains an object in memory for every single row. If we could somehow avoid this, we could keep the memory usage more manageable.
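To see that retention concretely, here is a rough sketch that builds a throwaway CSV file and then counts the CSV::Row objects still alive after CSV.read. The sample_csv and live_row_objects helpers, the Id column, and the generated data are all made up for illustration. ObjectSpace counts are specific to MRI, so treat the numbers as approximate.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: writes a throwaway CSV with `rows` data rows and
# returns the Tempfile (hold on to it so the file is not deleted early).
def sample_csv(rows)
  file = Tempfile.new(['visitors', '.csv'])
  file.puts 'Id,Geolocation'
  rows.times do |i|
    city = (i % 10).zero? ? 'San Francisco' : 'Elsewhere'
    file.puts "#{i},#{city}"
  end
  file.flush
  file
end

# Hypothetical helper: number of CSV::Row objects currently alive (MRI only).
def live_row_objects
  GC.start
  ObjectSpace.each_object(CSV::Row).count
end

file  = sample_csv(1_000)
table = CSV.read(file.path, headers: true) # holds every parsed row at once
puts "rows in table:  #{table.size}"
puts "live CSV::Rows: #{live_row_objects}"
```

As long as the table is referenced, none of its rows can be garbage collected, so the live row count is at least as large as the table.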

The answer to our resource usage problem is to process the data in a streaming fashion, tallying up records as they are parsed and then releasing them to the garbage collector, rather than holding them all in memory. To do this, we rearrange our code. We create a count variable that starts out at zero. Then we use the CSV.foreach method to look at each row in turn, incrementing the count if the visitor was from San Francisco.

count = 0
CSV.foreach('visitors.csv', headers: true) do |v|
  count += 1 if v["Geolocation"] =~ /San Francisco/
end
count                           # => 168
memstats                        # => "Size: 5152"

Looking at the stats, this version is a lot more promising. The ps command sees no change in process memory size after the process is finished going through the entire data set.

Of course, to accomplish this we had to discard the syntactic elegance of Ruby's Enumerable#count method, instead counting up matching records manually using a local variable. It's a shame we have to trade expressiveness for efficiency.

…or do we?

Here's another version of the same program. This time, we open the CSV file and then call the #each method, which iterates over rows. But, surprisingly, we don't pass any block to #each! [Editor's note: The bang is on the sentence, NOT on the #each.]

The return value of #each when given no block is an Enumerator object. This object is an external iterator over a collection. You can think of it as a lazy version of the visitors collection we created in the very first version of this program. It's an Enumerable object which is all prepped to read rows from the visitors.csv file, but which won't actually do any reading or parsing until the moment it is asked for data. And even then, it'll only hand back rows one at a time, as they are requested.
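A small sketch can make the "external iterator" idea more tangible. It parses an in-memory string rather than visitors.csv, and the sample_data and row_enumerator helpers (and the Id column) are invented for this example. Calling #next on the Enumerator pulls rows through one at a time, on demand.

```ruby
require 'csv'

# Made-up sample data standing in for visitors.csv.
def sample_data
  <<~ROWS
    Id,Geolocation
    1,San Francisco
    2,Portland
    3,San Francisco
  ROWS
end

# Hypothetical helper: an Enumerator over the rows of a CSV string.
# No block is passed to #each, so no rows are yielded at this point.
def row_enumerator(data)
  CSV.new(data, headers: true).each
end

rows = row_enumerator(sample_data)
puts rows.class               # Enumerator
puts rows.next['Geolocation'] # first row, requested explicitly
puts rows.next['Geolocation'] # second row, again only when asked for

# A fresh Enumerator still supports all of Enumerable, including #count.
puts row_enumerator(sample_data).count { |v| v['Geolocation'] =~ /San Francisco/ }
```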

We can see, when we look at the memory stats after it is created, that instantiating this Enumerator of visitors hasn't increased the process size at all.

Now we call #count on the visitors collection, passing it a predicate block, exactly as we did in the original code. We get the same count as before, but this time when we check the stats, we see that the process' memory usage has not grown measurably.

memstats                        # => "Size: 5168"
CSV.open('visitors.csv', headers: true) do |csv|
  visitors = csv.each           
  # => #<Enumerator: <#CSV io_type:File io_path:"visitors.csv" encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>:each>
  memstats                      # => "Size: 5168"
  visitors.count{|v| v["Geolocation"] =~ /San Francisco/} # => 168
end
memstats                        # => "Size: 5168"

By using streaming processing, we've changed this program so its memory usage stays stable no matter how much input data we throw at it. By taking advantage of Ruby's pervasive use of Enumerator when an iteration method is called without a block, we were able to switch to a streaming-style program without losing any of the expressiveness of the in-memory version. This is the kind of win-win scenario that leaves us as happy hackers indeed.
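The same trick also applies to CSV.foreach, which likewise returns an Enumerator when called without a block. Here is a minimal sketch under that assumption. The count_visitors_from helper and the sample data are invented for illustration, and a temporary file stands in for visitors.csv so the example runs on its own.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: counts rows whose Geolocation matches `pattern`,
# streaming the file through the Enumerator returned by a block-less
# CSV.foreach call.
def count_visitors_from(path, pattern)
  CSV.foreach(path, headers: true).count { |v| v['Geolocation'] =~ pattern }
end

# Throwaway sample file; it is removed when the block exits.
Tempfile.create(['visitors', '.csv']) do |file|
  file.write("Id,Geolocation\n1,San Francisco\n2,Portland\n3,San Francisco\n")
  file.flush
  puts count_visitors_from(file.path, /San Francisco/)
end
```

This keeps the one-line feel of the original CSV.read version while still processing one row at a time.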