Loggly's Outage for December 19th

Posted 19 Dec, 2011 by Kord Campbell in Business and Startup

Sometimes there's no other way to say "we're down" than admitting you screwed up and are down. We're coming back up now, and in theory by the time this is read, we'll be serving the app normally again. It will be a good while before we can rebuild the indexes for the historic data of our paid customers. This is our largest outage to date, and I'm not at all proud of it.

So What Happened?

Sometime yesterday afternoon ALL of our machines in Amazon's East region, availability zone 1d, were rebooted by AWS staff. Originally we stated we had not received reboot notices from Amazon, but the truth is that four of the staff here, myself included, received two separate vague notices, one about 10 days ago and another 3 days ago, which stated 'some or all' of our instances were scheduled to be rebooted. These notices were found in our spam folders on Gmail, placed there with a very large red notice reading: "Warning: This message may not be from whom it claims to be. Beware of following any links in it or of providing the sender with any personal information." Meh.

Loggly uses a variety of monitoring mechanisms to ensure our services are healthy. These include, but are not limited to, extensive monitoring with Nagios, external monitors like Zerigo, and a slew of our own API calls that watch for errors in our logs. When the mass reboot occurred we failed to alert because a) our monitoring server was rebooted and failed to complete the boot cycle, b) the external monitors were only set to test for pings and established connections to syslog and http (more about that in a moment), and c) the custom API calls, which run against our own service, were no longer running because we were down.

Combined, these failures effectively prevented us from noticing we were down. This in and of itself was the cause of at least half our downtime, and to me, the most unacceptable part of this whole situation.
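All three monitors failed silently, which is the case a "dead man's switch" is meant to catch: alert when check-ins stop arriving, not only when an error is reported. The sketch below is illustrative and is not what Loggly runs; the class name, the source names, and the 60-second threshold are assumptions, and it would have to run somewhere outside the infrastructure it watches.

```python
import time


class DeadMansSwitch:
    """Flags monitoring sources that have gone quiet (hypothetical helper)."""

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = {}         # source name -> time of last check-in

    def beat(self, source):
        """Record that `source` (e.g. the Nagios host) just checked in."""
        self._last_beat[source] = self._clock()

    def silent_sources(self):
        """Return the sources whose last check-in is older than the threshold."""
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_beat.items()
            if now - seen > self.max_silence_seconds
        )
```

A monitoring server that fails to finish booting stops calling `beat()`, so it shows up in `silent_sources()` even though it never reported a failure.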

The Human Element

The other cause of our failures is what some of you on Twitter are calling "a failure to architect for the cloud". I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes". A reboot of all boxes has never been tested at Loggly before. It's a test we've failed completely as of today. We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.

While some might go on a rant about how 'normal' failures don't affect 100% of your boxes, the truth is that anything and everything (including an army of reboot monkeys) can be expected to happen to your servers if you wait around long enough. The trick to being good at running a reliable service is to architect around any number of everythings that could happen to your service and build for it.

In this case we didn't ever build the workaround simply because the system we run - a combination of 0MQ+Solr+Zookeeper+Loggly Special Sauce - makes it extremely challenging to survive a complete failure with more than 1/2 of the cluster missing.  With other challenges facing us, we decided to live with the risk. Now we're dealing with the fallout of our decision.

So, How Do We Make This Right?

Single instances of Loggly's search cluster can't be spread across multiple availability zones or regions due to the amount of data we push around, latencies between the search nodes, and the lack of support in our system for redundant indexes. We've been OK with those limitations in the past simply because we normally archive data to S3 when we catch it, and we are capable of rebuilding indexes on the fly if we lose one or more indexers.

The first step in addressing this is to start sharding our customers across multiple Loggly deployments.  This will prevent further outages to the entire customer base.  The second step is to start doing Loggly deployments on dedicated hardware.  Because we keep large amounts of data on our boxes (the indexes) this is pretty much a requirement for fast recovery times when a deployment goes down.  While S3 is AWESOME for backups, it sucks big-time for rebuilding a large amount of search index data.
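As a rough illustration of sharding customers across deployments, the sketch below maps an account subdomain to one of several deployments with a stable hash. The deployment names and the function are hypothetical, and this is not a description of how Loggly assigns customers.

```python
import hashlib


def deployment_for(customer_subdomain, deployments):
    """Pick a deployment for a customer, deterministically.

    `deployments` is a list of deployment names (assumed for illustration).
    The same subdomain always maps to the same entry while the list is unchanged.
    """
    if not deployments:
        raise ValueError("at least one deployment is required")
    digest = hashlib.sha256(customer_subdomain.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(deployments)
    return deployments[index]


# Hypothetical deployment names, for illustration only.
DEPLOYMENTS = ["deploy-a", "deploy-b", "deploy-c"]
home = deployment_for("acme", DEPLOYMENTS)
```

Plain modulo hashing moves many customers when a deployment is added or removed, so an explicit lookup table or consistent hashing is likely a better fit when each customer's index data lives on specific boxes.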

The third step is to ensure more robust external monitoring. With multiple deployments, this becomes less of an issue, but clearly we need more reliable checks than what we rely on with Zerigo or other services. Sorry, but simple HTTP checks, pings and established connections to a box do not guarantee it's up!
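One way to go beyond pings and open connections is an end-to-end check: send a uniquely tagged event through the front door and confirm it becomes searchable within a deadline. The sketch below assumes two caller-supplied functions, `send_event` and `search_events`, which are hypothetical stand-ins for whatever ingestion and search calls a service exposes; they are not Loggly's API.

```python
import time
import uuid


def end_to_end_check(send_event, search_events, timeout_seconds=30.0,
                     poll_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Return True only if a freshly sent event becomes searchable in time.

    send_event(marker): submits one event containing `marker` (assumed callable).
    search_events(marker): returns a truthy value once `marker` can be found.
    """
    marker = "healthcheck-" + uuid.uuid4().hex  # unique per run
    send_event(marker)
    deadline = clock() + timeout_seconds
    while True:
        if search_events(marker):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A box that answers pings and accepts syslog connections but never indexes anything fails this check, which is the gap the simpler probes left open.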

Finally, we accept full responsibility for the impact to our customers.  We will be in touch with our paid customers sometime over the next week to address compensation for this outage.

We welcome feedback below, and encourage useful criticism of our architectural choices.  All I would ask is that you consider Loggly's infrastructure isn't the same as yours, and I've greatly simplified the reasons for not being more redundant in our deployments.  We can, and will, endeavor to do better in the future.

Kord Campbell, CEO

  • gba

    gba 19 Dec, 2011 06:21pm

    We too had to suffer through the great AWS rebootpocalypse of 2011, and it wasn’t pleasant.

    As a former sysadmin of 10+ years, I feel your pain. I always think of MacGyver when stuff like this happens. That is, no matter how many tools you’ve got in your toolbox at home, odds are you’ll be taken hostage in a 3rd world prison camp when sh*t goes down, with nothing more than a stick of gum and a bobby pin to rely on.

    Good luck!

  • Jon

    Jon 19 Dec, 2011 09:28pm

    “We accept full responsibility” while calling the AWS staff monkeys and complaining about your mail client settings? That’s just weak and doesn’t impart confidence. It’s your app, your choice that it lives on the cloud, so take ownership and truly accept responsibility. This was a poor apology.

  • donavan

    donavan 19 Dec, 2011 10:41pm

    Whose “fault” would it be if PG&E/Equinix/Level 3 dropped your joules/bits on the floor? How is AWS bouncing DomUs any different? Is this not just another classic example of SPOF and redundancy when it comes to DC operations?

  • Jon

    Jon 20 Dec, 2011 01:14am

    You take full responsibility at the very end of this post, while the first para is basically explaining why you’re all pretty much justified in not having been aware of the reboots. Weak.

    A proper post mortem shouldn’t be justifying why really, this outage couldn’t be helped, honest guys, it should be a statement of what caused the outage and what you’re doing to make sure it doesn’t happen again.

    Going on about how the downtime notification was stuck in your spam folder just makes you look like amateurs.

  • Radek

    Radek 20 Dec, 2011 05:19am

    You can find out about reboots via API
    http://www.elastician.com/2011/12/dont-reboot-me-bro.html

  • Greg Arnette

    Greg Arnette 20 Dec, 2011 06:23am

    Wow… this is astounding that your service was affected so severely by the EC2 Fleet Upgrade. Here is @sonian take on this. http://goo.gl/A83bC

  • Brian

    Brian 20 Dec, 2011 10:54am

    Wow. Looks like it’s armchair quarterbacking time for guys named Jon in here. How about rather than complaining and accusing people of not doing the right thing, you sit back and think about the fact that NOBODY can ever plan for EVERYTHING. It’s impossible and the costs involved in trying to predict any and every problem would make services like Loggly impossible to run or afford. Live and learn. Deal with it. I’m sure the guys at Loggly are modding their mail filters. I’m glad they included that information as it may help others to avoid the same mistake. Oh…and if you don’t like the way Loggly handled it, you’re free to go use something else…which likely hasn’t planned for everything either.

  • Kord Campbell

    Kord Campbell 20 Dec, 2011 12:10pm

    If I’ve learned anything about running web services for others, it’s this: always tell the truth. It may hurt a bit when you do, and it may give people excuses to immediately bash you, but it really does set you free. Free to speak your mind. Free to help others learn from your mistakes.

    Those of you claiming we’re not taking a decent amount of responsibility can go juggle chainsaws. I owned that we were 100% at fault several times here: at fault for picking AWS as our infrastructure, at fault for not monitoring ourselves better, at fault for not being able to recover faster, at fault for not being more distributed, and lastly at fault for building a beast that is difficult to run in the cloud. I also clearly stated we were going to make changes to improve ourselves. I will share these as we do them.

    As for calling the Amazon admins a bunch of monkeys, if that’s the way you want to take it, so be it. Frankly, I thought it was a moderate jab at an asinine decision on their part to reboot everyone on 1d with less than a week’s notice, around the holidays, with a new non-whitelisted mail server, lack of confirming receipt of said notice, failing to make a phone call to us, and lastly failing to provide integrated alerting capabilities for those notices through something like CloudWatch. I chose not to rant about it in the post because I’d rather take action and SOLVE the problem moving forward. Bitching doesn’t help, but it does accelerate decisions.

    Frankly this talk of how all us guys downstream have to work around and own these ‘cloud’ issues is just ridiculous. In all my years doing this I’ve NEVER been in a data center where 100% of my infrastructure was rebooted intentionally.

    Loggly spends over $15K a month on AWS for boxes that are equivalent in capabilities to bare metal ones I could buy 4 years ago (for less) and rack myself. If AWS is really a ‘cloud’ service, why shouldn’t its customers be immune to all the old school failures, including mass reboots or drive failures?

    Expectations. They are what you make them.

  • Graham Bleach

    Graham Bleach 20 Dec, 2011 02:51am

    I’ve seen mails from Amazon get marked as spam when there’s recipient rewriting before it passes through gmail. Worth checking the mail headers closely.

    Note that they also email you about EBS failures, so I’d suggest a thorough check of your spam folders.

  • Alex

    Alex 19 Dec, 2011 05:54pm

    Hi Kord,

    Thanks for your honest post. Appreciate it.

    I hate to kick somebody who is already down, and it’s certainly not my intention to belittle your excellent service or team, but I submitted several tickets on Sunday which I was extremely surprised were not responded to – do you guys not even monitor ticket subject lines for critical issues out of hours?

    The first of these tickets noticed that only one of the IPs in the RRDNS pool for logs.loggly.com was actually up (which was causing me pain testing a new deployment), and was submitted at about 5pm Central on Sunday. I would be amazed if this turns out to be coincidence, but I guess it is possible.

    The second of these tickets, submitted 11pm Central, reported a total outage on my subdomain.loggly.com. I went to bed, got up and an hour after I had started working I noticed you tweet something saying ‘somethings up with our subdomains’.

    This did kind of surprise me, as it was over 9 hours after I had put a support ticket in, as a paying customer, with subject “Error connecting to xxx.loggly.com”.

    I even sat on Sunday evening F5’ing alertbirds (which was 503ing), hoping that I would push the number of 503s over the critical threshold for your monitoring systems to go off :)

    Hope you guys get some sleep and sort everything out, but I’m interested to hear, as part of your post mortem, about the level of support that you are offering going forward.

    —Alex

    PS – you can more than make up for this (to me!) by implementing parsing of logs as you receive them, so I can search on field x of syslogs (as you can if I send you JSON logs – which I really don’t want to do when all I need is to split a given input on a separator).

  • David Lanstein

    David Lanstein 20 Dec, 2011 01:13pm

    Alex,

    Good news first – we committed JSON-over-Syslog, and it should be live late this week or early next week. You’ll probably need syslog-ng 3.3.0 to take advantage of the native JSON templating, but it will let you query on facility, severity, etc.

    So, we should have sent out a mass email as soon as we realized how widespread the issue was. To clarify on our email support, we rely very heavily on internal and external monitoring, and because I’m the only one handling support full-time, we aren’t able to offer 24/7 support by email yet. As part of what we’re changing after this outage, we’re setting up a phone number that will go directly to our on-call system that paying customers can reach 24/7/365 in critical situations. That said, support tickets currently also go to the engineering team, who, although they didn’t reply directly, did post on @logglyops and start putting out the fire.

    End of story: we need to send out a mass email if there is ever a widespread outage again. We could have kept you in the loop with a two-minute email. I’m sorry.

    Sincerely,

    David Lanstein
    Chief Evangelist, Loggly

  • Blake Irvin

    Blake Irvin 20 Dec, 2011 06:29pm

    As a former AWS user, I’ve been pretty pleased with performance and support from Joyent. Better for business-critical apps or intense workloads than AWS, even though the learning curve is a wee bit steeper.

  • James Sivis

    James Sivis 21 Dec, 2011 11:12am

    You might want to try externally processed monitoring, so that when your systems go down, your monitoring doesn’t go down with it. Plus, log files monitoring isn’t the best way to go ….and Nagios is really ’90’s technology – you might want to check out something like Circonus. It’s by the OmniTI people.

  • Mike Horwath

    Mike Horwath 22 Dec, 2011 12:32pm

    Wow – an outage within the cloud…

    I have been using LogicMonitor for our external monitoring and measurement service for the last 3 1/2 months monitoring almost 200 hosts. Only outages so far have been either planned for or for movement of our services onto different hardware.

    Loggly had been on my list but LM eventually won my heart as I have a single pane of glass to look at response time, SNMP measurements, and API compatibility with many of the vendors I use for hardware including NetApp and VMware.

    Good luck Loggly folks!

    Outages suck and everything fails, sometime. Your recovery will speak volumes to your customers.

    (BTW: I don’t work for LogicMonitor – just a satisfied customer. If you talk to them, tell them I said HI. I receive no kickback or fees from LM in my posting, nor will I accept any payment from them for same.)

  • JakePeacock

    JakePeacock 3 Feb, 2012 12:50pm

    Use Cassandra, it is built for multiple datacenter replication.
