Performance and Reliability
(tl;dr) No other company in our market is treating its growing pains as professionally as we are. Check out the new graph on http://status.polleverywhere.com/
Growing big suddenly is painful. I hope to convince you it’s also an advantage to us in the future.
We really struggled through September and October. We saw a 700% increase in usage in just 6 weeks. It was like doing 50MPH on the freeway while hanging out the window so you can bolt in a new engine that can go 400MPH.
Engineers have a decision to make when building systems for sudden “spiky” workloads:
- Bend: Prevent the site from going down. Keep accepting everything and work through the backlog as fast as you can, but people will experience slowness. Consider, though: if poll responses are running on a 90-second delay, we just screwed you on stage. You put up your poll, the audience responded, and now you’re stupefied, staring at a dead, empty chart in front of 300 of your would-be fans, who are now tweeting about your epic fail.
- Break: When the load gets really heavy, just drop whatever can’t be handled within 5 seconds. Respond to some people quickly by sending errors to the rest. Nobody is left staring at an unresponsive slide, but now the service is utterly unusable for attendance, grading, or accurate counts. (A rough sketch of both approaches follows.)
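For the technically curious, here’s a minimal sketch of the two approaches in Python. It’s purely illustrative: the names, the queue, and the 5-second cutoff are stand-ins, not our production code.

```python
import queue

# Illustrative sketch only; made-up names, not Poll Everywhere's real code.

RESPONSE_QUEUE = queue.Queue()   # backlog of poll responses waiting to be counted
BREAK_DEADLINE_SECONDS = 5       # hypothetical cutoff for the "break" strategy


def handle_bend(response):
    """Bend: accept every response. Nothing is ever lost, but during a
    spike the backlog (and the delay on stage) keeps growing."""
    RESPONSE_QUEUE.put(response)


def handle_break(response, estimated_wait_seconds):
    """Break: shed load. Anything we can't serve within the deadline gets
    an error, so whoever gets through sees instant results, but the
    dropped votes are simply gone."""
    if estimated_wait_seconds > BREAK_DEADLINE_SECONDS:
        raise RuntimeError("503 Service Unavailable")
    RESPONSE_QUEUE.put(response)
```

With bend, every vote eventually lands on the chart, just late; with break, the chart stays snappy but the totals are wrong.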
We chose bend, and some presenters paid the price. For those who contacted us, we stammered out our most sincere apology. Instead of being the $140 hero, you catch your boss’s gaze in the audience and wish you had recommended the $11,000 ARS clicker rental. You know what’s sad? A full refund will never come close to making up for the embarrassment, and it’s not like you’re in the mood to appreciate complimentary service from us in the future!
Can’t you just buy more servers?
Oh, how badly we wish. Wherever throwing money at the problem could help, we did. It turns out that “elastic scalability” is still a hard problem in computing. It’s a little like telling a packed room full of people to exit faster: buying them all scooters won’t help, and nothing improves until you take the weeks needed to add more doors.
So what are you doing about it?
A lot of things.
- The site is faster than ever. We’re now handling all of October’s workload at twice the per-user speed of our fastest previous month.
- We communicate publicly and transparently during problems. Our status site and Twitter stream show this.
- Boring geek things, including much bigger, faster servers. We also have two of everything, so if a component in our stack fails, the standby seamlessly takes over.
- We built complex tools to see our problems clearly, and we’re sharing them with you to hold ourselves accountable. The new real-time graph on the status page shows you exactly what you care about: when the system is bending, is it still fast? (A rough sketch of that kind of measurement follows below.)
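If you’re curious what a graph like that plots, here’s a rough sketch of one way to compute such a number. The names and the window size are hypothetical; it’s an illustration of the idea, not a description of our real pipeline.

```python
from collections import deque

# Illustrative only: one way to answer "when we're bending, are we still fast?"

RECENT_DELAYS = deque(maxlen=1000)   # seconds of delay for the last 1,000 responses


def record_response(received_at, displayed_at):
    """Remember how long a poll response took to reach the presenter's chart."""
    RECENT_DELAYS.append(displayed_at - received_at)


def delay_95th_percentile():
    """The delay that 95% of recent responses beat; one point on the graph."""
    if not RECENT_DELAYS:
        return 0.0
    ordered = sorted(RECENT_DELAYS)
    return ordered[int(0.95 * (len(ordered) - 1))]
```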
I’m proud that during our toughest growth pains, we performed much better than Several Other Companies Racing Against The Influx of Various Educators. It’s uncomfortable to own your downtime and communicate openly, but it’s part of being professional and earning people’s trust.
Here’s a final example: On Friday from 10:00 PM to 1:00 AM, we needed a three-hour maintenance window to move to our new servers. We communicated this scheduled downtime days in advance. We emailed customers. We tweeted. We put up live site announcements on all pages.
On Saturday, a competitor had over 4 hours of downtime without a visible peep.
We’re obsessed with speed, reliability, and simplicity because we know it’s something all of our customers absolutely demand. It hints at a bigger discussion: our strategy of serving stadiums, corporate meetings, nonprofits, marketers, and educators all at the same time will make critical aspects of our service better than anyone who tries to specialize in just one. Think about it: is Gmail for educators much different than Gmail for anyone else? No, you just want an email application that works. But that’s a topic for a future post.