Root Cause Analysis for This Morning’s Downtime
…and we’re back. What happened?
A non-critical service running on our database server was leaking memory, which steadily chipped away at the memory available to the database. Eventually the database stopped responding. When that happens, our system is supposed to fail over to a backup database. Unfortunately, the fail-over didn't work correctly and left our database cluster in a state that took far too long to recover from.
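To make that fail-over step concrete, here is a minimal sketch of the kind of watchdog that automatic fail-over relies on: if the primary stops answering health checks for long enough, the standby is promoted. This is an illustration under assumed names, not our actual tooling; the host names, port, and `promote_standby()` hook are hypothetical.

```python
# Minimal fail-over watchdog sketch (hypothetical hosts and promotion hook).
import socket
import time

PRIMARY = ("db-primary.internal", 5432)   # hypothetical primary address
STANDBY = ("db-standby.internal", 5432)   # hypothetical standby address
CHECK_INTERVAL = 5                        # seconds between health checks
FAILURES_BEFORE_FAILOVER = 3              # tolerate brief network blips

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to the database port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    """Placeholder for the real promotion step in a primary/standby setup."""
    print("Primary unresponsive: promoting standby", STANDBY[0])

def watch():
    failures = 0
    while True:
        if is_reachable(*PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                promote_standby()
                break
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    watch()
```

The hard part, and what failed this morning, is everything hidden behind that `promote_standby()` placeholder: promoting cleanly and leaving the cluster in a consistent state, which is exactly what our changes below are aimed at.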
What do we plan on doing about it?
1. We’re going to move all non-critical services off of the database servers so that instability in those services can’t impact the database (see the sketch after this list).
2. We will change our fail-over strategy to make the process more reliable, and we’ll thoroughly test the procedure to reduce how long recovery takes when a problem does happen.
3. We will break Poll Everywhere into several smaller pieces so that when a component, sub-system, or server fails, it affects only a small part of the application rather than the entire website.
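To illustrate the isolation idea behind step 1, here is a minimal sketch, assuming a Unix host and a hypothetical background worker (none of these names come from our codebase): the worker caps its own memory so that a leak fails that one worker instead of slowly starving everything else on the host, such as the database.

```python
# Hypothetical non-critical worker that caps its own memory (Unix-only sketch).
import resource

SOFT_LIMIT_BYTES = 512 * 1024 * 1024   # cap this worker at roughly 512 MB

# Lower the soft address-space limit for this process only; keep the hard limit.
hard = resource.getrlimit(resource.RLIMIT_AS)[1]
resource.setrlimit(resource.RLIMIT_AS, (SOFT_LIMIT_BYTES, hard))

def run_non_critical_job():
    """Stand-in for a reporting/cleanup task that might leak memory."""
    leaked = []
    try:
        while True:
            leaked.append(bytearray(10 * 1024 * 1024))  # simulate a leak
    except MemoryError:
        # The leak is contained: only this worker hits its cap and stops.
        leaked.clear()
        print("Worker hit its memory cap and stopped; the database is unaffected.")

if __name__ == "__main__":
    run_non_critical_job()
```

In practice this kind of cap is usually enforced from the outside (for example by the operating system or process supervisor) rather than by the worker itself, and moving these services onto separate machines removes the risk entirely; the point is simply that a misbehaving non-critical process should never be able to take memory away from the database.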
Outages like this are absolutely inexcusable, and we are sorry.
Over the last year, we have maintained over 99.9% uptime. But we know that you and your customers use Poll Everywhere in real-time, high-pressure situations, and events like this chip away at the trust you have in us. We work hard to earn and maintain that trust, and we fully understand how easy it is to lose. When situations like this do occur, we know the best thing we can do is be transparent about what's going on and what we plan to do to fix the problem and make sure it never happens again.
Thank you for your patience, and for your continued trust in Poll Everywhere.