Everyone's Talking About AWS Being Down

Posted 22 Apr, 2011 by Kord Campbell in Business and Startup

Everyone seems to be blogging about how their service has been impacted by Amazon’s AWS outage, or whining about how Amazon sucks, or explaining why their service was architected so well that it wasn’t impacted, or why you suck if you didn’t plan for this.

As Loggly is based entirely on AWS and was only minimally impacted during the first few hours of the outage, I figured I’d share exactly how we managed to do what we did:

We run across multiple availability zones and don’t rely on EBS for anything other than backups of a few simple databases. Everything else is file based on the EC2 instances, whose drives are set up in a RAID-1 configuration for speed and slightly more reliability. Our log streams are backed up to S3 every few minutes as they come into our proxies. We rely on RDS for the user login database, which really ended up being the only thing affected.
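To make the "backed up to S3 every few minutes" idea concrete, here is a minimal sketch of how incoming log data could be grouped into fixed time windows, each mapped to one backup object. This is not Loggly's actual code: the 5-minute window, the key layout, and the names `backup_key` and `stream_id` are all assumptions for illustration, and the upload itself is left out.

```python
from datetime import datetime, timezone

# Assumption: "every few minutes" is modeled here as a fixed 5-minute window.
WINDOW_SECONDS = 300


def backup_key(stream_id: str, ts: datetime,
               window_seconds: int = WINDOW_SECONDS) -> str:
    """Return a hypothetical object key for the backup window containing ts.

    Every event whose timestamp falls in the same window maps to the same
    key, so one object per stream per window is all that must be uploaded.
    """
    epoch = int(ts.timestamp())
    # Round down to the start of the window.
    window_start = epoch - (epoch % window_seconds)
    start = datetime.fromtimestamp(window_start, tz=timezone.utc)
    return f"{stream_id}/{start:%Y/%m/%d/%H%M}.log"


if __name__ == "__main__":
    event_time = datetime(2011, 4, 21, 8, 7, 30, tzinfo=timezone.utc)
    print(backup_key("stream-a", event_time))
```

With a layout like this, the most data a failed instance could lose is whatever arrived since the last completed window, which is the property that makes local, non-EBS storage tolerable.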

I asked Jordan Sissel, Head of Ops and Senior Developer here at Loggly, to describe exactly what happened when Nagios/Pagerduty went off the night before last. Here’s what he said:

I saw RDS (prod db) problems in the early morning just as the problems started, but by the time I started debugging it the problem went away. I was notified by pagerduty because beaveroil and some other checks were failing.

Otherwise we weren’t really impacted. We got lucky, I think. I kept my eye on service but it stayed happy during the AWS outages.

Worst case, it’s easy for us (assuming rightscale is functioning, which it wasn’t for some of the day) to migrate to different parts of EC2 due to our use of puppet and our lack of EBS usage (only our RDS uses EBS).

Planning for Failure

Jordan is right: we can pretty much do a Loggly deployment in any AWS region within 20-30 minutes. Because we use Zerigo for DNS, and because we keep short TTLs, we can switch out records, have them updated quickly, and redirect ALL our inbound and outbound traffic to the new deployment.
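The decision behind that record switch can be sketched in a few lines. The example below is only an illustration under stated assumptions: the `Deployment` type, the `pick_target` and `plan_record` functions, the 60-second TTL, and the hostname and addresses are all hypothetical, and the result is a plain dictionary rather than a call to Zerigo's (or any other provider's) API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Deployment:
    region: str    # e.g. "us-west-1"
    address: str   # public address the DNS record should point to
    healthy: bool  # outcome of whatever health check you trust


def pick_target(deployments: List[Deployment],
                current_region: str) -> Optional[Deployment]:
    """Stay in the current region if it is healthy; otherwise fail over
    to the first healthy deployment. Returns None if nothing is healthy."""
    healthy = [d for d in deployments if d.healthy]
    for d in healthy:
        if d.region == current_region:
            return d
    return healthy[0] if healthy else None


def plan_record(name: str, deployments: List[Deployment],
                current_region: str, ttl_seconds: int = 60) -> Optional[dict]:
    """Describe the A record we would want published (hypothetical format).

    A short TTL bounds how long resolvers may keep serving the old
    address after the record changes.
    """
    target = pick_target(deployments, current_region)
    if target is None:
        return None
    return {"name": name, "type": "A", "value": target.address,
            "ttl": ttl_seconds, "region": target.region}


if __name__ == "__main__":
    fleet = [
        Deployment("us-east-1", "192.0.2.10", healthy=False),
        Deployment("us-west-1", "192.0.2.20", healthy=True),
    ]
    print(plan_record("logs.example.com", fleet, current_region="us-east-1"))
```

The short TTL is what makes the plan useful: once the new record is published, resolvers that honor the TTL stop handing out the old address within roughly that many seconds.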

Of course that leaves the question about migrating data on our existing or failed indexers. Thankfully that’s not a huge issue for us because we can rebuild them using EMR from our S3 backups.

Before we launched Loggly’s public service, I mandated that Loggly must be able to rebuild failed indexers at will. The work required to support this ended up delaying our public launch by at least 45 days. Now if we lose a box or have to move to another region, we can rebuild any (or all) of our indexers in at most a few hours. In theory, we could keep indexing new data coming into the system while that happens, and the impact on customers’ historical searches would be minimal.

As my friend Clay Loveless put it so elegantly, “We may rip the rug out from under your feet at any moment.” If you haven’t planned for disaster striking, then you should go back and reassess your infrastructure. Hopefully we’ve planned well for just such a disaster.

<knocks on wood>
