Why you should outsource ops

Your site going down is pretty much the worst thing that can happen to a startup. But if you’re successful and your user base is growing fast, it’s bound to happen.

It sucks that your users can’t access your site and you can’t sign up new users. But the problem goes beyond that. At a startup, you’re trying to build features as fast as possible. You’re iterating and launching features every single day.

So your site breaking or having other ops-related issues is a distraction. It means you stop building product, and it probably means you have to go learn the gory details of MySQL, nginx, or Delayed Job.

A couple years ago when Posterous started growing like a weed, we decided we needed a sys ops person. Our architecture was complex, issues kept coming up, and we needed to plan for scale.

We looked for someone for a while, but it was hard to find a good ops lead (and it’s probably harder now). You are going to trust your baby with this person!

After doing some research, we decided to outsource ops to a 3rd party called Bitpusher. We found this to be way more effective than adding a dedicated ops engineer to our small team.

Some of the advantages of using Bitpusher vs. doing ops in-house:

  1. Bitpusher costs us less per month than a full time hire. Plus we don’t have to pay benefits.
  2. Instead of a single person, we get an ops team. This means we have 24/7 on call support.
  3. Bitpusher has a lot of experience running ops. They have seen all kinds of architectures, and have used all sorts of technologies. They not only run our boxes, but gave us guidance on how to architect for scale.
  4. We had to move datacenters (from Slicehost to Rackspace) and they were able to take on this entire process. They architected the system, transferred the data over, handled monitoring and backups, and more. This was something we weren’t looking forward to doing ourselves.

Ops isn’t our core competency. Where Posterous innovates and excels is helping normal people share photos from their mobile devices with the people they care about. Using Bitpusher allowed us to focus more on our own product and worry less about site operations.

If you’re interested in working with Bitpusher, let me know and I’ll gladly put you in touch. Full disclosure: we get a referral fee if we send them customers!

dynamic preseed file for ubuntu using sinatra

To build physical Ubuntu servers we use Ubuntu preseed.

This works great but if you use a static preseed file you end up building a host that doesn’t have its hostname or static ip address set. This means that you have to manually set it afterward and we decided to automate it.

BTW it took us a while to figure out how to set a static ip in a preseed file. We blogged about it here: network-preseeding-debianubuntu-with-a-static

To do this we wrote a small sinatra app that dynamically generates the preseed file with the hostname and static ip address.

This is done by looking up the mac address of the requested host from the arp table and comparing it to a pipe delimited file that contains the mac address, what the static ip should be and its hostname.

The list is stored in a file named ip2mac.txt and was populated by a script.

The ip2mac.txt file looks like this:

172.28.0.71|a4:ba:db:35:e6:09|chi-devops11a
172.28.0.72|78:2b:cb:03:c5:44|chi-devops11b
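
As a rough illustration, a file in this format can be read into a lookup table keyed by mac address. This is only a sketch and not the code we run (that appears further down); parse_ip2mac is a hypothetical helper name, and downcasing the mac address is an assumption made for convenience.

```ruby
# Hypothetical helper: parse ip2mac.txt-style text into a hash keyed by mac address.
def parse_ip2mac(text)
  hosts = {}
  text.each_line do |line|
    ip, mac, name = line.strip.split('|')
    next if mac.nil? || name.nil? # skip blank or malformed lines
    hosts[mac.downcase] = { :ip => ip, :hostname => name }
  end
  hosts
end

sample = [
  '172.28.0.71|a4:ba:db:35:e6:09|chi-devops11a',
  '172.28.0.72|78:2b:cb:03:c5:44|chi-devops11b'
].join("\n")

hosts = parse_ip2mac(sample)
puts hosts['a4:ba:db:35:e6:09'][:hostname]
```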

Instead of pointing at a static preseed file from the pxelinux.cfg/default file, we make a request to the sinatra app, which generates the file dynamically. The line we use in the default file looks like this:

append console=tty0 console=ttyS1,115200n8 initrd=ubuntu-10.04-server-amd64-initrd.gz auto=true priority=critical preseed/url=http://172.27.0.115:4567/lucid-preseed-noraid interface=eth0 netcfg/dhcp_timeout=60 console-setup/ask_detect=false console-setup/layoutcode=us console-keymaps-at/keymap=us locale=en_US --

When the request is made the sinatra app does the following:

  1. looks up the mac address of the requesting host in the arp table
  2. finds the line in ip2mac.txt that matches that mac address
  3. uses the ip and hostname to populate the hostname and ip variables in the preseed file
  4. returns the preseed file to the host making the request

The code:

require 'rubygems' # skip this line in Ruby 1.9
require 'sinatra'
require 'erb'
require 'logger'

# Append a message to foo.log.
def log(message)
  flog = Logger.new('foo.log')
  flog.info(message)
end

# Return the ip listed in ip2mac.txt for the given mac address (nil if none).
def lookup_mac(mac)
  rr = Array.new
  # The block form closes the file when we are done with it.
  File.open("ip2mac.txt", "r") do |hostfile|
    hostfile.readline # skip the first line of the file
    hostfile.each do |line|
      list_ip, list_mac, name = line.strip.split('|')
      rr.push(list_ip) if mac.match(list_mac)
    end
  end
  return rr[0]
end

# Look up the mac address of the requesting host in the arp table.
def get_mac_address()
  ip = @env['REMOTE_ADDR']
  cmd = "arp -n " + ip.chomp + " | grep -v Address | awk '{print \$3}'"
  mac = `#{cmd}`
  return mac
end

# Reverse-resolve the ip to a hostname using the host command.
def rev_lookup(ip)
  cmd = "host " + ip + " | awk '{print \$5}'"
  hostname = `#{cmd}`
  fqdn = hostname.chop.chop # drop the trailing newline and the trailing dot
  return fqdn
end

get '/lucid-preseed-noraid' do
  mac = get_mac_address()
  log(mac)
  ips = lookup_mac(mac)
  log(ips)
  fqdns = rev_lookup(ips)
  @ip = ips
  @fqdn = fqdns
  log(fqdns)
  erb :lucid_preseed_noraid
end

get '/lucid-preseed-nosrv' do
  mac = get_mac_address()
  log(mac)
  ips = lookup_mac(mac)
  log(ips)
  fqdns = rev_lookup(ips)
  @ip = ips
  @fqdn = fqdns
  log(fqdns)
  erb :lucid_preseed_nosrv
end

get '/' do
  "ops11"
end

To start the sinatra app just run the following:

ruby preseeder.rb
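
The routes render ERB templates (lucid_preseed_noraid and lucid_preseed_nosrv, which Sinatra looks for in the views directory) that aren't reproduced in this post. As a minimal sketch of the idea, assuming the template only needs the @ip and @fqdn variables set by the route, the network part of such a template might look something like the following. The netmask, gateway and nameserver values are placeholders, and PreseedContext and render_preseed are hypothetical names used so the example runs outside of Sinatra.

```ruby
require 'erb'

# Hypothetical template fragment; the real templates are not shown in this post.
# Netmask, gateway and nameserver values are placeholders.
PRESEED_TEMPLATE = <<-'TEMPLATE'
d-i netcfg/get_hostname string <%= @fqdn %>
d-i netcfg/disable_dhcp boolean true
d-i netcfg/get_ipaddress string <%= @ip %>
d-i netcfg/get_netmask string 255.255.255.0
d-i netcfg/get_gateway string 172.28.0.1
d-i netcfg/get_nameservers string 172.28.0.1
d-i netcfg/confirm_static boolean true
TEMPLATE

# Holds the instance variables the template reads, the way a route would.
class PreseedContext
  def initialize(ip, fqdn)
    @ip = ip
    @fqdn = fqdn
  end

  # Render the template with this object's instance variables in scope.
  def render(template)
    ERB.new(template).result(binding)
  end
end

def render_preseed(ip, fqdn)
  PreseedContext.new(ip, fqdn).render(PRESEED_TEMPLATE)
end

puts render_preseed('172.28.0.71', 'chi-devops11a')
```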

Network preseeding Debian/Ubuntu with a static ip address & dhcp address for initial config

I had a hard time figuring this out and there seems to be lots of conflicting information out there so I thought I'd write down my thoughts about this immediately after I figured it out.

I was trying to get a fully automated network install of Ubuntu 10.04 working with a static ip address that gets set up in the preseed file.  My entire preseed was working but the static ip address was not.  The example preseed has the following entries:

# If you prefer to configure the network manually, uncomment this line and
# the static network configuration below.
#d-i netcfg/disable_dhcp boolean true

# If you want the preconfiguration file to work on systems both with and
# without a dhcp server, uncomment these lines and the static network
# configuration below.
#d-i netcfg/dhcp_failed note
#d-i netcfg/dhcp_options select Configure network manually

# Static network configuration.
#d-i netcfg/get_nameservers string 192.168.1.1
#d-i netcfg/get_ipaddress string 192.168.1.42
#d-i netcfg/get_netmask string 255.255.255.0
#d-i netcfg/get_gateway string 192.168.1.1
#d-i netcfg/confirm_static boolean true

Great...so I should just be able to uncomment the disable_dhcp and the Static network configuration entries and it should work right? WRONG. I went through various permutations and could not get it to work. Finally, I broke down and decided to RTFM and found this:

http://d-i.alioth.debian.org/manual/en.amd64/apbs04.html#preseed-network

It reads:

Although preseeding the network configuration is normally not possible when using network preseeding (using “preseed/url”), you can use the following hack to work around that, for example if you'd like to set a static address for the network interface. The hack is to force the network configuration to run again after the preconfiguration file has been loaded by creating a “preseed/run” script containing the following commands:

killall.sh; netcfg

So....I added the following to my preseed.cfg file

d-i preseed/early_command string /bin/killall.sh; /bin/netcfg

and boom the static network config works. Yay!

Posterous Spaces is built on Backbone.js

Man, it's been a while since we've updated this Space. Well, no more! I'm here to tell you a little bit about the technologies behind the Posterous Spaces redesign.

As you may have noticed, Posterous has taken on a new name, and a new look. To go along with those cosmetic changes, we've rewritten our app from the ground up to take advantage of cutting edge technologies. Using these technologies has allowed our team to work at a feverish pace to deliver you a Posterous that is faster, more engaging, and more fun!

At the core of our new stack is our very own Posterous API. This has effectively made us the biggest consumer of our own API. That's pretty nifty, but I'm not going to talk about the API today. 

To interact with the API, we've used some awesome front-end technologies. Among all the awesome stuff we've been able to work with while creating Spaces, the most notable are Backbone, CoffeeScript, Haml.js, Sass, and Compass. Today I will touch on our use of Backbone.

 

For those unfamiliar, Backbone is an MVC-esque framework for JavaScript. It separates large JavaScript applications into models, views, collections, and routers.

Backbone provides some basic structure to a large JavaScript codebase. This has allowed us to create readable, and most importantly reusable, classes that separate functionality from presentation, which is a constant struggle in front-end programming.

Lest this turn into a primer on MVC 101, I'll just outline how we're using Backbone classes in our application:

Models & Collections: These serve as an interface to our API. For each model we want to interact with in the front-end—for example, a post—we create a subclass of Posterous.Model (a subclass itself of Backbone.Model). If we want to deal with lists of a particular model, as we do with lists of posts, we must also create a subclass of Posterous.Collection (a subclass of Backbone.Collection). With both a Posterous.Model and Posterous.Collection, we now have a link between the front-end and our RESTful API.

Routers: For those familiar with Rails development, a Backbone.Router is very similar to your routes.rb file. For each URL on Posterous Spaces, a router fires and tells our app to render a view (or sometimes two views in the case of multi-column layouts).

Views: The meat of our business logic occurs here. The term "View" is a bit misleading to us; we tend to use Backbone views in a manner similar to UIViewControllers in the iOS world. Views observe events, and fire responses. Views also render templates that we have built in Haml.js. 

A typical page in Posterous Spaces is actually a tree of views and subviews, each observing behavior within its outermost DOM elements. For example: when you click on the "Reader" tab, we are actually instantiating a ReaderListView, which in turn contains many PostListItemView instances. Within the PostListItemView, we instantiate a LikeButtonView, among other things.

I'll leave it at that, for now. I know this is just a birds-eye view of our architecture, so please feel free to ask any questions you may have about our use of Backbone (or other front-end technologies) in the comments.

 

We're hiring!

If you're interested in using cutting-edge technologies to build user interfaces that delight millions of people, definitely check out our open job listings page!

 

Join us at the first-ever Posterous Hack Day: July 16th in San Francisco and on IRC

Fresh off the launch of our new API, we're hosting a Hack Day on Saturday July 16th to provide anyone using the Posterous API with direct access to our development team.

Who: Any developer interested in using the Posterous API to build something cool.  So far, we know of mobile apps and a few web services built on top of Posterous. Whether you need ideas on what to build, are looking to team up with someone else or are already working on an app, the Posterous dev team will be here to help.
 
What: API overview sessions, office hours for anyone with questions, and end-of-day demos. Plenty of food, Red Bull and beer. After we're done, we'll take everyone out for drinks.
 
When: Saturday July 16th 10am - 6pm PDT
 
Where: Posterous HQ at 2973 16th Street in San Francisco.   Also available via IRC ( #posterous-dev on freenode).

Why: Learn the best ways to use our technology stack. You'll meet other Rails experts. You could win a prize.

How:  Tell us that you are coming.  Give  us your API feature requests in the comments so we can add them before the 16th.  Then just show up with a laptop, your brain and your appetite.

(photo by Jessica Lum)

July 12th update: The first 50 in-person attendees will receive the official Posterous Hack Day shirt.  


Announcing our new API - developers now have full access to the Posterous technology stack.

Today, we're happy to announce a new API that allows third-party developers to access the full Posterous technology stack.

The new API gives developers unprecedented access to methods and actions that were formerly available only to the core Posterous engineering team, including the ability to create sites, add users to those sites, and assign custom themes for each user. Additionally, we've added API methods for retrieving and manipulating data around sites, users, posts, comments, and a number of other Posterous data types.

Aside from just adding new endpoints, we've also designed the API to be super easy to use. The new API is RESTful and presents a clean and concise set of URLs whose intent is easy to parse and understand. Moreover, we've designed the API site to be a powerful developer tool for working with the new API. Not only does this site document every single available method, it also allows developers to interface directly with the API from their browsers. Using this new tool, developers can experiment by dynamically changing the parameters and inspecting the response.

The use cases for the new API are impressive: whether you are empowering your users to be editors like Pulse did or distributing your content in real time like Turner Broadcasting did for March Madness, our API can power it. Another great example is Oxfam, who are using the Posterous API to drive awareness and participation in their recently announced campaign to grow a better future. Anyone visiting the Oxfam grow site will soon be able to sign up and create a blog, hosted on the grow.gd domain (e.g. chris.grow.gd) with a custom theme developed by Obox. All of the Grow blogs will contain an embedded grow widget to educate consumers on the growing food crisis and solicit their ideas for creating a different future. And, of course, anyone signing up through Oxfam will have a full Posterous account and access to all our features.

We can't wait to see what you do with it and welcome your feedback on how we can make it better. Start by checking out our new API site.

Webkit Hardware acceleration bleeding into subsequent elements, and how to fix it

I am reposting this from my personal blog because the solution I described is in use at Posterous, and I just think it was a really interesting problem to solve.

The problem

I'm really sensitive to stuff like anti-aliasing, so when elements on one of the pages I was styling ceased to be anti-aliased in Safari and Chrome, part of me died a little.

I suspected this was somehow related to the fact that I was using a 3D CSS transform on a small part of the page (for a little extra zazz—I like zazz), so I tried to narrow it down to a minimal reproduction. Sure enough, the 3D CSS transform was causing the anti-aliasing to disappear. 

Doing a little reading, I found that Webkit switches hardware acceleration on and off in parts of a page depending on a number of factors. If your element has any kind of 3D transformation applied to it, it gets hardware acceleration. Even if you change the opacity of an element, it gets hardware acceleration. When an element gets hardware accelerated in Webkit, sub-pixel anti-aliasing no longer works.

That's all well and good, but there were no 3D transformations applied to most of the page! Yet I was still seeing the aliasing. At this point, I realized that this must be some sort of browser quirk, so I set out to fix the problem. 

For a minimal test case, I created this HTML

The HTML resulted in this (notice how the second article's heading is aliased):
[Screenshot: the second article's heading rendered without sub-pixel anti-aliasing]
To be fair, the bottom heading *was* anti-aliased, but it wasn't sub-pixel anti-aliased, and the anti-aliasing still left the text looking jagged.

The solution(s)

My old foe, position:relative, was the culprit (you sneaky, sneaky property). Due to other elements on the page, the article element had position:relative. Removing this property got rid of the aliasing problem! I'm not sure why this was the case, but it worked for me.
[Screenshot: the same page after removing position:relative]
Another solution is to explicitly set -webkit-font-smoothing:antialiased. The downside to this is that your anti-aliasing will not match the rest of the page (since the rest of the page is sub-pixel anti-aliased). I prefer sub-pixel anti-aliasing, but that's just me.

Anyway, I know this is a pretty obscure problem to have, but I'm sure there's at least one of you out there scratching your head about it.

N.B.: It looks like the latest Webkit nightly fixes this problem, so look forward to better looking text.

We're Hiring

If problems like this interest you, or you share my hatred for anti-aliasing bugs, Posterous is hiring UI Engineers (among other positions).

 

Optimizing Cache Performance on a Rapidly Growing Site

Cache performance is essential to site performance, but most folks don’t understand their cache at a deep enough level to make proper engineering decisions. Tools like SimCache can help by predicting cache performance ahead of potentially costly ops decisions.

Introduction

Modern web applications depend heavily on caching to maintain site performance and reduce loads on their primary databases. However, in many cases, caching strategies are deployed in an ad hoc fashion, without much understanding of how underlying usage patterns affect cache performance. In practice, developers tend to spin up a cache (typically Memcache) and continue adding capacity until site performance is “good enough.” However, without deeper understanding, capacity planning and performance tuning will become harder as traffic grows or usage patterns change.

Posterous was no exception; early in our history, we began to use Memcache heavily to quell the increasing load on our MySQL servers, sizing our Memcache cluster with simple heuristics. However, as we began to use Memcache in different ways and as our traffic grew, the cache began to act erratically, leading to site performance issues, despite rapidly increasing the size of our Memcache cluster. We realized that understanding how our cache performed given our observed usage patterns was essential for appropriately sizing our cache. To do so, we developed SimCache, a tool for predicting cache performance based on observed usage patterns, and used it to plan the second version of our cache, based on Redis.

Background

As a consumer-oriented blogging platform, Posterous is an extremely read-heavy app. Moreover, our usage patterns are extremely “long-tail”; at any given moment, we’ll serve thousands of requests for a heavily-visited site like the Gap’s consumer facing blog but just a few for Mrs. Henry’s sixth grade class blog.

To serve our normal stream of requests, we had been using a fairly large Memcache cluster to store formatted blog posts. However, cache performance began to act erratically, with wild swings in the speed of some requests that we couldn’t really understand. Moreover, the cache was being asked to serve a growing number of requests:

[Figure: cache request traffic over time]

Understanding that this was unacceptable, I began working closely with Chris Burnett, another engineer at Posterous, to assess the degradation in cache performance.

Assessing the Situation

The importance of proper logging and measurement cannot be overstated when assessing the performance of a given caching strategy. Without collecting statistics on your cache performance, you’re essentially blind, with no understanding of how well your cache is working, or how it could be improved.

For a typical key => value cache, collecting statistics is pretty easy to implement. Anytime a key is accessed, simply log whether or not the cache request resulted in a hit or a miss:

Feb 28 01:14:54 hit!  key = posts/1432
Feb 28 01:14:54 hit!  key = posts/2442
Feb 28 01:14:55 miss! key = posts/2970
Feb 28 01:14:55 hit!  key = posts/6917
Feb 28 01:14:57 miss! key = posts/9363
Feb 28 01:14:57 hit!  key = posts/2969
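
One way to produce log lines like these is a thin wrapper around the cache client. The following is a minimal sketch rather than our production code: LoggedCache is a hypothetical class, a plain Hash stands in for Memcache, and a nil value is treated as a miss (an assumption that would misreport keys whose cached value really is nil).

```ruby
require 'logger'

# Hypothetical wrapper around any store that responds to [] and []=.
class LoggedCache
  def initialize(store, logger)
    @store = store
    @logger = logger
  end

  # Return the cached value, or compute it with the block on a miss.
  def fetch(key)
    value = @store[key]
    if value.nil?
      @logger.info("miss! key = #{key}")
      value = yield
      @store[key] = value
    else
      @logger.info("hit!  key = #{key}")
    end
    value
  end
end

cache = LoggedCache.new({}, Logger.new($stdout))
cache.fetch('posts/1432') { 'rendered post' } # first access is logged as a miss
cache.fetch('posts/1432') { 'rendered post' } # second access is logged as a hit
```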

Such simple data can reveal a wealth of insights. Most important is the cache’s miss rate: how frequently do we need to regenerate data? It is the miss rate that ultimately impacts site performance. Using such data, we were shocked to discover that we were caching a lot less than we thought, and that our cache actually behaved quite erratically, with a greater than 2x difference between peak and trough miss rates (1 = baseline):

[Figure: cache miss rate over time, normalized to baseline]
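
Computing the miss rate from log lines in the format shown earlier takes only a few lines of Ruby. This is a sketch, not the script we used; miss_rate and miss_rate_by_hour are hypothetical names, and grouping on the first nine characters assumes the "Mon DD HH" timestamp prefix from the sample.

```ruby
# Fraction of logged cache accesses that were misses.
def miss_rate(lines)
  hits = 0
  misses = 0
  lines.each do |line|
    if line =~ /\bmiss! key = /
      misses += 1
    elsif line =~ /\bhit!\s+key = /
      hits += 1
    end
  end
  total = hits + misses
  total.zero? ? 0.0 : misses.to_f / total
end

# Miss rate per hour, so peaks and troughs can be compared.
def miss_rate_by_hour(lines)
  grouped = lines.group_by { |line| line[0, 9] } # e.g. "Feb 28 01"
  grouped.map { |hour, group| [hour, miss_rate(group)] }
end

sample = [
  'Feb 28 01:14:54 hit!  key = posts/1432',
  'Feb 28 01:14:54 hit!  key = posts/2442',
  'Feb 28 01:14:55 miss! key = posts/2970',
  'Feb 28 01:14:55 hit!  key = posts/6917',
  'Feb 28 01:14:57 miss! key = posts/9363',
  'Feb 28 01:14:57 hit!  key = posts/2969'
]

puts miss_rate(sample) # 2 misses out of 6 accesses
```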

Using SimCache to Choose a Caching Strategy

Given our initial assessment, it was clear that we would need to increase the size of our cache. But by how much? Could we expect much improvement if we increased the cache size by a third? What about doubling the cache? Would that sufficiently improve site performance? To answer these questions, I wrote a tool called SimCache which would replay our observed cache access patterns against a simulated cache of a given size, measuring how cache size would affect cache miss rates and other important metrics of caching performance. Using SimCache, we tested how cache performance varied if we increased our existing cache size (red) by:

  • 33% (green)
  • 66% (blue)
  • 100% (magenta):

[Figure: simulated miss rate for each cache size]
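
SimCache itself isn't reproduced in this post. As a rough sketch of the replay idea, assuming least-recently-used eviction and a capacity counted in entries rather than bytes (both simplifications), a simulator might look like this. SimulatedLruCache and simulate are hypothetical names, and the short trace stands in for a real access log.

```ruby
# Minimal LRU cache simulator. Capacity is a number of entries (at least 1).
class SimulatedLruCache
  attr_reader :hits, :misses

  def initialize(capacity)
    @capacity = capacity
    @entries = {} # Ruby hashes keep insertion order, oldest first (1.9+)
    @hits = 0
    @misses = 0
  end

  # Replay one access. On a miss the key is inserted, evicting the
  # least recently used key if the cache is already full.
  def access(key)
    if @entries.delete(key)
      @hits += 1
    else
      @misses += 1
      @entries.shift if @entries.size >= @capacity
    end
    @entries[key] = true # (re)insert as the most recently used key
  end

  def miss_rate
    total = @hits + @misses
    total.zero? ? 0.0 : @misses.to_f / total
  end
end

# Replay a list of accessed keys against a cache of the given size.
def simulate(keys, capacity)
  cache = SimulatedLruCache.new(capacity)
  keys.each { |key| cache.access(key) }
  cache.miss_rate
end

trace = %w[a b c a b d a b c d]
[2, 3, 4].each do |size|
  puts "capacity=#{size} miss_rate=#{simulate(trace, size)}"
end
```

Replaying the same trace against several capacities is what makes the comparison possible: only the cache size changes between runs, so any difference in miss rate comes from the size alone.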

The data indicated that our cache was too small by a factor of almost 2x. Moreover, the undersized cache was responsible for the wild swings in miss rate. Keys were evicted from the cache far too soon; as the cache size was steadily increased, the variation in miss rate went down dramatically, leading to better consistency in hit rates from our cache.

Using the simulation results, we increased our cache to the appropriate size. Of course, it is important to collect statistics afterwards to verify that the change had its intended effect. In our case, the results were pretty good. At time=0, the newer cache was inserted, resulting in a spike in miss rate. However, as the larger cache began to fill, the measured cache performance (green points) matched the predicted cache performance (blue line) very well:

[Figure: measured vs. predicted miss rate after the cache resize]

Conclusion

Using the data from SimCache allowed us to understand why our cache performance was degraded and how to improve it. Moreover, by predicting the required cache size ahead of time, we avoided costly “ops iteration” (i.e., we did not have to add servers, wait to see if site performance improved, add more cache, rinse and repeat). Instead we were able to size our cache appropriately from the beginning.

Interested in working on problems like this? We’re hiring.

Thanks to J. Hui, C. Burnett, R. Pearson, D. Meredith, and G. Tan for reading and commenting on different drafts of this post.