The Coca-Cola Super Bowl Poll Blackout
During the Super Bowl, Coca-Cola ran a campaign at cokechase.com where people could vote on the ending of its commercial. Unfortunately, when the page loaded, it looked like this:
The Coke Chase website appeared as a blank desert for many people during the Super Bowl because of technical difficulties.
Live events at this scale are really tough. We’ve done our fair share of big live events, like the Man vs Food Super Bowl special, as well as hundreds of high-stakes corporate events, and we got through them with the following design philosophy:
- Design for failure
- Minimize the number of moving parts on the server
- Cache aggressively
- Queue aggressively
Design for failure
“Assume the worst and hope for the best” is a great strategy for handling spikes for huge events, especially the Super Bowl.
The worst-case scenario we design for at Poll Everywhere is that the user’s Internet connection dies completely. As you can imagine, this is an extreme event, since web applications need an Internet connection to work, but we handle it anyway. It can happen if the user is on an unreliable cellphone network, or if something terrible happened to our servers and they went down.
When the connection comes back, we inform the user that they can continue using the application.
These notifications keep the user in the know so that they can start figuring out if something is wrong with their Internet connection and take steps to correct it.
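As an illustration (not our production code), here’s a minimal sketch of how a browser client can watch for a dropped connection using the standard online/offline events; the showBanner helper and the connection-status element are hypothetical stand-ins for a real UI:

```javascript
// Minimal sketch: keep the user informed about their connection using the
// browser's standard online/offline events. `showBanner` and the
// `connection-status` element are hypothetical stand-ins for a real UI.
function showBanner(message) {
  document.getElementById('connection-status').textContent = message;
}

window.addEventListener('offline', function () {
  // The browser lost its network connection (flaky phone network, server outage, etc.).
  showBanner('Connection lost. Hang tight while we try to reconnect.');
});

window.addEventListener('online', function () {
  // The connection came back; tell the user they can keep going.
  showBanner('You are back online. Carry on!');
});
```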
We look at all pieces of our application this way, and when all of that fails, we give our users a heads-up at status.polleverywhere.com that something bad is happening.
Minimize the number of moving parts on the server
The best way to scale an application is to have it do very little work so that it can handle more requests. Our current application development strategy at Poll Everywhere is to shift all the work that needs to be done for a web page from the server to the user’s web browser. Our mobile application at PollEv.com has zero moving parts on the server and consists of just three files: HTML, CSS, and JavaScript. The only thing our server has to do is serve up these three files, and it’s done. We shift even more load off of our servers by using a Content Distribution Network to handle the delivery of these files, which results in a faster, more reliable experience for our customers.
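To sketch the idea, the browser can do all of the rendering work while the server only hands out static files plus a small blob of JSON. The /api/poll.json endpoint, the JSON shape, and the poll-options element below are hypothetical, not our actual API:

```javascript
// Sketch of a "zero moving parts" client: the server only serves static files
// and a small JSON endpoint, and the browser builds the page itself.
// The /api/poll.json URL and the poll-options element are hypothetical.
fetch('/api/poll.json')
  .then(function (response) { return response.json(); })
  .then(function (poll) {
    var list = document.getElementById('poll-options');
    poll.options.forEach(function (option) {
      var item = document.createElement('li');
      item.textContent = option.title;
      list.appendChild(item);
    });
  });
```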
Cache aggressively
When you request a web page, the browser has to connect to a server far, far away through the Interwebs and download the content to your computer for display. If the server is configured properly, it can tell the browser, “Hey, the files you downloaded are valid for the next 30 years.” The next time the browser requests this page, it checks for a copy on your computer and uses that if it’s valid.
Why is this so important? Well, if the server goes down because of an insanely high volume of traffic, people tend to refresh their browsers to “fix” the problem. These refreshes trigger requests to the server, which creates even more load and compounds the problem. Things can get out of hand pretty quickly. If users download these files to their cache and then refresh the page, a substantial amount of load is taken off the servers.
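For example, here’s a minimal Node.js sketch (not our actual setup) that serves a single JavaScript bundle with a long-lived Cache-Control header, so refreshes during a spike hit the browser’s local copy instead of the server. The file path and port are made up for illustration:

```javascript
// Sketch: serve one static asset with a long cache lifetime so that browser
// refreshes during a traffic spike come from the local cache, not the server.
// The file path and port are hypothetical.
var http = require('http');
var fs = require('fs');

http.createServer(function (request, response) {
  fs.readFile('public/app.js', function (error, contents) {
    if (error) {
      response.writeHead(500);
      response.end('Something went wrong');
      return;
    }
    response.writeHead(200, {
      'Content-Type': 'application/javascript',
      // Tell the browser this file is good for a year; it won't ask again until then.
      'Cache-Control': 'public, max-age=31536000'
    });
    response.end(contents);
  });
}).listen(8080);
```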
Imagine if every time you got hungry at home you had to drive to the grocery store to pick up whatever food you desired. Three trips per day for breakfast, lunch, and dinner? That’s crazy! Fortunately we have refrigerators, which can be thought of as a food cache.
When you’re hungry in front of the television, you can walk to the fridge, eat some food, and sit back down in a few minutes. When the fridge runs low or the food spoils, you jump in the car, fill up a cart, and bring it home. This trip might take an hour on a good day or a few hours when the grocery store is a madhouse, but you replenish the food cache and make trips from the television to the fridge take a fraction of the time of a full-blown trip to the store.
Queue responses aggressively
So far we’ve put a bunch of strategies in place to deal with users requesting the web page where votes are cast, but what about handling huge volumes of incoming votes?
It’s a lot like the real world, actually. Imagine if a Super Bowl poll were written on a chalkboard with a column per poll option. For people to vote, they have to walk by and check their choice under a column. If 100,000 people stormed the chalkboard at the same time to mark their responses, it would be total chaos.
The trick is to have people stand in a single-file line and check the option they’d like as they walk by in an orderly fashion; that keeps things under control.
A line with 100,000 people? That seems daunting! You bet it is, but if each person takes a few milliseconds to mark their response, the entire queue can be processed in a few minutes. Split the single chalkboard into one chalkboard per poll option, and you can split the line for faster, parallel processing.
We employ queueing aggressively: we receive responses from our users quickly, they stand in line on our server for a few milliseconds, and then they’re processed.
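Here’s a rough sketch of the chalkboard line in JavaScript. Everything is in-memory and the names are hypothetical; a real deployment would use a durable queue and separate worker processes, but the shape of the idea is the same:

```javascript
// Sketch of the chalkboard line: votes are accepted instantly and pushed onto a
// queue, then a worker drains the line in batches and updates the tally.
// In-memory for illustration only; names are hypothetical.
var voteQueue = [];
var tally = {};

function receiveVote(option) {
  // Fast path: enqueue and return immediately so the voter never waits.
  voteQueue.push(option);
}

function processVotes(batchSize) {
  // Worker: pull the next batch off the front of the line and count it.
  voteQueue.splice(0, batchSize).forEach(function (option) {
    tally[option] = (tally[option] || 0) + 1;
  });
}

// Drain the queue continuously in small batches.
setInterval(function () { processVotes(1000); }, 100);

// Example usage:
receiveVote('Option A');
receiveVote('Option B');
```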
Scaling requires years of experience in operations and an understanding of a particular problem domain
On the surface, polling applications seem very simple, but the devil is in the details, especially for large events like the Super Bowl. There are plenty of other interesting design decisions I didn’t cover above that help us scale for large events, but the techniques and approaches above will get you pretty far.
Of course, Poll Everywhere is here for you if you have a big event that you want to host without sweating the details.