Episode #066: Caching an API

Repeatedly hitting an HTTP API for new data can slow down your programs, and doing it too frequently can get you banned from making further requests.

A solution is to cache the data you receive. But how do you add caching in a well-tested way? How do you keep the caching layer from disrupting existing tests? And what are best practices for caching API response data?

In this episode, you'll learn answers to all these questions.

Episode Script

Let's say we have a class that wraps a remote weather API. When we call the #report method with a location query, it makes a request to the service, parses the response, fills in a Weather::Report object with the returned values, and returns the object.

require 'open-uri'
require 'json'

class Weather
  Report = Struct.new(:temperature)

  def report(query)
    key  = ENV['WUNDERGROUND_KEY']
    url  = "http://api.wunderground.com/api/#{key}/conditions/q/#{query}.json"
    body = open(url).read
    data = JSON.parse(body)
    Report.new(data['current_observation']['temp_f'])
  end
end
Weather.new.report(17361)
# => #<struct Weather::Report temperature=34.4>

In a production application, hitting an external service every time the program needs data is often a bad idea. Weather reports don't change on a second-by-second basis. Making a request every time the #report method is called could slow the program down, and if requests are made often enough they might exceed service-imposed limits, causing future requests to fail.

For all these reasons, we'd like to cache the weather reports we get back from this service. Let's add a cache to this class, using tests to guide the way. First, we'll write tests for how the class should interact with a cache collaborator. To do this we'll pass in a test version of the cache. But what sort of test double should this be? What interface should it support?

# ...
require 'rspec/autorun'

describe Weather do
  describe '#report' do
    it 'uses a cached value when available' do
      weather = Weather.new(cache: ???)
    end
  end
end

What if it isn't a test double at all? What if we just passed a Hash in as the cache? Let's see where this takes us. We'll pass in a hash with one key, a ZIP code, that maps to a Weather::Report object containing a temperature that the real service is not likely to report.

# ...
require 'rspec/autorun'

describe Weather do
  describe '#report' do
    it 'uses a cached value when available' do
      weather = Weather.new(cache: {'17361' => Weather::Report.new(-60.0) })
      weather.report('17361').temperature.should eq(-60.0)
    end
  end
end

We make this pass with a few modifications to the Weather class. We give it the ability to accept a hash of options on initialization, and have it look for a :cache entry in those options, which it stores in @cache. If it doesn't find one, it uses an empty Hash. We then update the #report method to check the cache for an entry matching the current query before using the data returned from the service.

Note that we aren't saying that the cache must always be a Hash. All this test asserts is that the code can use any object that behaves like a Hash as a cache. For the moment, that just means it has to respond to #fetch.

class Weather
  Report = Struct.new(:temperature)

  def initialize(options={})
    @cache = options.fetch(:cache){ {} }
  end

  def report(query)
    key  = ENV['WUNDERGROUND_KEY']
    url  = "http://api.wunderground.com/api/#{key}/conditions/q/#{query}.json"
    body = open(url).read
    data = JSON.parse(body)
    @cache.fetch(query) { 
      Report.new(data['current_observation']['temp_f'])
    }
  end
end
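
To illustrate the point that the cache only needs to behave like a Hash, here is a minimal sketch with the network call left out. The TinyCache and CacheClient classes are hypothetical stand-ins invented for this example. CacheClient keeps only the cache handling from the Weather class above.

```ruby
# Hypothetical cache object: not a Hash, but it responds to #fetch
# the way Hash#fetch does when given a block.
class TinyCache
  def initialize(entries = {})
    @entries = entries
  end

  # Return the stored value, or the block's result when the key is missing.
  def fetch(key)
    return @entries[key] if @entries.key?(key)
    yield key
  end
end

# Hypothetical stand-in for Weather, reduced to its cache handling.
class CacheClient
  def initialize(options = {})
    @cache = options.fetch(:cache) { {} }
  end

  def lookup(query)
    @cache.fetch(query) { "miss for #{query}" }
  end
end

hash_backed = CacheClient.new(cache: { '17361' => 'cached' })
duck_backed = CacheClient.new(cache: TinyCache.new({ '17361' => 'cached' }))
no_cache    = CacheClient.new

hash_result = hash_backed.lookup('17361')
duck_result = duck_backed.lookup('17361')
miss_result = no_cache.lookup('17361')
```

Both cache objects yield the same result for a known key, and a client built without a cache falls through to the block.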

Next, we test that the #report method refrains from making an HTTP request if it finds an entry in the cache. We do this by requiring the WebMock gem, which fakes out web connections. By default it disables all web connections when it is required, so we don't actually have to add any new test code. Running the test now fails because the code tries to hit the weather service even though there is a cached report.

require 'rspec/autorun'
require 'webmock/rspec'

describe Weather do
  describe '#report' do
    it 'uses a cached value when available' do
      weather = Weather.new(cache: {'17361' => Weather::Report.new(-60.0) })
      weather.report('17361').temperature.should eq(-60.0)
    end
  end
end
# >> F
# >> 
# >> Failures:
# >> 
# >>   1) Weather#report uses a cached value when available
# >>      Failure/Error: Unable to find matching line from backtrace
# >>      WebMock::NetConnectNotAllowedError:
# >>        Real HTTP connections are disabled. Unregistered request: GET http://api.wunderground.com/api/d6aaea598a0e4508/conditions/q/17361.json with headers {'Accept'=>'*/*', 'User-Agent'=>'Ruby'}
# >>        
# >>        You can stub this request with the following snippet:
# >>        
# >>        stub_request(:get, "http://api.wunderground.com/api/d6aaea598a0e4508/conditions/q/17361.json").
# >>          with(:headers => {'Accept'=>'*/*', 'User-Agent'=>'Ruby'}).
# >>          to_return(:status => 200, :body => "", :headers => {})
# >>        
# >>        ============================================================
# >>      # -:14:in `report'
# >>      # -:29:in `block (3 levels) in <main>'
# >> 
# >> Finished in 0.00167 seconds
# >> 1 example, 1 failure
# >> 
# >> Failed examples:
# >> 
# >> rspec -:27 # Weather#report uses a cached value when available

We fix this by moving the entire body of the method inside the fallback block passed to the cache's #fetch, so the request is only made on a cache miss.

def report(query)
  @cache.fetch(query) { 
    key  = ENV['WUNDERGROUND_KEY']
    url  = "http://api.wunderground.com/api/#{key}/conditions/q/#{query}.json"
    body = open(url).read
    data = JSON.parse(body)
    Report.new(data['current_observation']['temp_f'])
  }
end

Now we have code that can use a pre-populated cache, but won't populate the cache itself. Before we move on to cache population though, let's take a look at our design.

Right now we are caching a Report object. There are several potential problems with this:

  1. Right now we're using in-memory hashes, but we'll eventually be serializing cached data in some kind of persistent key-value store. Consider what would happen if we made a change to the Report class, perhaps renaming the temperature field to temp_f to indicate that it's in Fahrenheit. Unless we were careful to flush all caches when rolling out the new code, we'd risk causing crashes when new code tried to load and use old-style Report objects found in the cache.
  2. Even if we simply added a new attribute to the Report class, for instance wind_speed, we'd still have to expire our caches and rebuild them, or risk getting nil values for the added field.
  3. Finally, storing an object means we have to be careful never to make a change to the Report object which renders it non-serializable—for instance, storing a lambda in it.

In my experience it's better to cache raw response bodies—or sometimes even entire responses, including headers—than to cache domain objects. It prevents object version conflicts, since the domain objects are recreated every time. Storing the entire response means that if we start using more of the response, the data will already be available in the cache. And storing the response raw ensures that the data stored is a simple, serialization-friendly String.
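
As a rough sketch of why raw bodies are the safer thing to store, the following uses Ruby's Marshal as a stand-in for whatever serialization a persistent store might apply. That choice is an assumption made for illustration. The NewReport struct and the wind_mph field in the sample JSON are likewise hypothetical.

```ruby
require 'json'

# A struct holding a lambda cannot be serialized with Marshal.
Report = Struct.new(:temperature)
unserializable = Report.new(-> { 34.4 })
marshal_error =
  begin
    Marshal.dump(unserializable)
    nil
  rescue TypeError => e
    e
  end

# A raw response body is a plain String, so it round-trips without trouble.
raw_body = '{ "current_observation": { "temp_f": 34.4, "wind_mph": 5.0 } }'
restored = Marshal.load(Marshal.dump(raw_body))

# A later, differently shaped domain object (hypothetical) can still be
# rebuilt from the same cached entry, including a field we did not use before.
NewReport = Struct.new(:temp_f, :wind_speed)
observation = JSON.parse(restored)['current_observation']
new_report = NewReport.new(observation['temp_f'], observation['wind_mph'])
```

The cached String does not depend on the shape of the Report class, so changing the class does not invalidate existing entries.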

Let's change the code to store the raw response body instead of a Report object. We'll also need to update the test to provide raw JSON in the pre-populated cache.

class Weather
  # ...
  def report(query)
    key  = ENV['WUNDERGROUND_KEY']
    url  = "http://api.wunderground.com/api/#{key}/conditions/q/#{query}.json"
    body = @cache.fetch(query) {
      open(url).read
    }
    data = JSON.parse(body)
    Report.new(data['current_observation']['temp_f'])
  end
end
# ...
describe Weather do
  describe '#report' do
    it 'uses a cached value when available' do
      json = '{ "current_observation": { "temp_f": -60.0 } }'
      weather = Weather.new(cache: {'17361' => json })
      weather.report('17361').temperature.should eq(-60.0)
    end
  end
end

Now let's add an example that shows the code populating the cache when no match is found. This test starts by setting up a fake web response using WebMock. When the code under test tries to make a request for a weather report, WebMock will intercept it and return our snippet of test JSON data as the response body.

We then set up a cache, a Hash that starts out empty. We instantiate a Weather object, passing in the cache, and then request a weather report. After the method returns, we check the contents of the cache to verify that it now contains our fake JSON data, keyed under the given query.

Making this test pass simply requires changing the code so that it updates the cache after making a request.

class Weather
  # ...  
  def report(query)
    key  = ENV['WUNDERGROUND_KEY']
    url  = "http://api.wunderground.com/api/#{key}/conditions/q/#{query}.json"
    body = @cache.fetch(query) {       
      @cache[query] = open(url).read
    }
    data = JSON.parse(body)
    Report.new(data['current_observation']['temp_f'])
  end
end
# ...  
describe Weather do
  describe '#report' do
    # ...
    it 'populates the cache with new values' do
      json = '{ "current_observation": { "temp_f": -60.0 } }'
      expected_url = 
        %r(http://api\.wunderground\.com/api/.*/conditions/q/17361\.json)
      stub_request(:get, expected_url).to_return(body: json)
      cache = {}
      weather = Weather.new(cache: cache)
      weather.report('17361')
      cache['17361'].should eq(json)
    end
  end
end
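
The reason for the assignment inside the block is that Hash#fetch returns the block's value on a miss but does not store it. Here is a small sketch of that behavior in isolation. The loader lambda and its call counter are hypothetical stand-ins for the HTTP request.

```ruby
cache = {}
calls = 0
body  = '{ "current_observation": { "temp_f": -60.0 } }'

# Hypothetical stand-in for the HTTP request; counts how often it runs.
loader = lambda do
  calls += 1
  body
end

# A plain fetch with a block returns the value but leaves the cache empty.
cache.fetch('17361') { loader.call }
stored_after_plain_fetch = cache.key?('17361')

# Assigning inside the block stores the value, and the assignment
# expression also evaluates to that value, so fetch still returns it.
first  = cache.fetch('17361') { cache['17361'] = loader.call }
second = cache.fetch('17361') { cache['17361'] = loader.call }
```

In this sketch the loader runs twice: once for the plain fetch and once for the first storing fetch. The final fetch is served from the cache.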

We now have support for basic caching, using a Ruby Hash as the model for the cache interface. In the next episode we'll look at how to plug in arbitrary key-value stores as the cache implementation, as well as how to expire cache entries. Until then, happy hacking!