Optimizing Cache Performance on a Rapidly Growing Site

Cache performance is essential to site performance, but most folks don’t understand their cache at a deep enough level to make proper engineering decisions. Tools like SimCache can help by predicting cache performance ahead of potentially costly ops decisions.

Introduction

Modern web applications depend heavily on caching to maintain site performance and reduce loads on their primary databases. However, in many cases, caching strategies are deployed in an ad hoc fashion, without much understanding of how underlying usage patterns affect cache performance. In practice, developers tend to spin up a cache (typically Memcache) and continue adding capacity until site performance is “good enough.” However, without deeper understanding, capacity planning and performance tuning will become harder as traffic grows or usage patterns change.

Posterous was no exception; early in our history, we began to use Memcache heavily to quell the increasing load on our MySQL servers, sizing our Memcache cluster with simple heuristics. However, as we began to use Memcache in different ways and as our traffic grew, the cache began to act erratically, leading to site performance issues, despite rapidly increasing the size of our Memcache cluster. We realized that understanding how our cache performed given our observed usage patterns was essential for appropriately sizing our cache. To do so, we developed SimCache, a tool for predicting cache performance based on observed usage patterns, and used it to plan the second version of our cache, based on Redis.

Background

As a consumer-oriented blogging platform, Posterous is an extremely read-heavy app. Moreover, our usage patterns are extremely “long-tail”; at any given moment, we’ll serve thousands of requests for a heavily-visited site like the Gap’s consumer facing blog but just a few for Mrs. Henry’s sixth grade class blog.

To serve our normal stream of requests, we had been using a fairly large Memcache cluster to store formatted blog posts. However, cache performance began to act erratically, with wild swings in the speed of some requests that we couldn’t really understand. Moreover, the cache was being asked to serve a growing number of requests:

traffic

Understanding that this was unacceptable, I began working closely with Chris Burnett, another engineer at Posterous, to assess the degradation in cache performance.

Assessing the Situation

The importance of proper logging and measurement cannot be overstated when assessing the performance of a given caching strategy. Without collecting statistics on your cache performance, you’re essentially blind, with no understanding of how well your cache is working, or how it could be improved.

For a typical key => value cache, collecting statistics is pretty easy to implement. Anytime a key is accessed, simply log whether or not the cache request resulted in a hit or a miss:

Feb 28 01:14:54 hit!  key = posts/1432
Feb 28 01:14:54 hit!  key = posts/2442
Feb 28 01:14:55 miss! key = posts/2970
Feb 28 01:14:55 hit!  key = posts/6917
Feb 28 01:14:57 miss! key = posts/9363
Feb 28 01:14:57 hit!  key = posts/2969

Such simple data can reveal a wealth of insights. Most important is the cache’s miss rate: how frequently do we need to regenerate data? It is the miss rate that ultimately impacts site performance. Using such data, we were shocked to discover that we were caching a lot less than we thought, and that our cache actually behaved quite erratically, with a greater than 2x difference between peak and trough miss rates (1 = baseline):

plot1

Using SimCache to Choose a Caching Strategy

Given our initial assessment, it was clear that we would need to increase the size of our cache. But by how much? Could we expect much improvement if we increased the cache size by a third? What about doubling the cache? Would that sufficiently improve site performance? To answer these questions, I wrote a tool called SimCache which would replay our observed cache access patterns against a simulated cache of a given size, measuring how cache size would affect cache miss rates and other important metrics of caching performance. Using SimCache, we tested how cache performance varied if we increased our existing cache size (red) by:

  • 33% (green)
  • 66% (blue)
  • 100% (magenta):

plot2

The data indicated that our cache was too small by a factor of almost 2x. Moreover, the undersized cache was resonsible for the wild swings in miss rate. Keys were evicted from the cache far too soon; as the cache size was steadily increased, the variation in miss rate went down dramatically, leading to better consistency in hit rates from our cache.

Using the simulation results, we increased our cache to the appropriate size. Of course, it is important to collect statistics afterwards to verify if the change had its intended effect. In our case, the results were pretty good. At time=0, the newer cache was inserted, resulting in a spike in miss rate. However, as the larger cache began the fill, the measured cache performance (green points) matched the predicted cache performance (blue line) very well:

plot3

Conclusion

Using the data from SimCache allowed us to understand why our cache performance was degraded and how to improve it. Moroever, by predicting the required cache size ahead of time, we avoided costly “ops iteration” —– i.e., we did not have to add servers, wait to see if site performance improved, add more cache, rinse and repeat. Instead we were able to size our cache appropriatey from the beginning.

Interested in working on problems like this? We’re hiring.

Thanks to J. Hui, C. Burnett, R. Pearson, D. Meredith, and G. Tan for reading and commenting on different drafts of this post.