Loggly


Getting Your Product Sticky

Posted 13 Jan, 2012 by Kord Campbell in Log Management and Startup

A few months ago I made an off-the-cuff remark about Loggly. "We're like one of those shitty solar powered calculators.  When it gets dark, we forget everything you've typed into us."

That comment wasn't far off the mark.  Historically, we haven't provided a whole hell of a lot of features that make it easy to jump back in where you left off in your last search session.  Basically, when you logged out of Loggly, or even closed the shell in your browser, we'd forget everything you had searched for up to that point.  That made it extremely difficult to get back to something meaningful the next time you logged in.

We shouldn't be here if we aren't meaningful.  We should deliver users a 'punch in the gut' feature that makes a lasting impact.  One they don't want to avoid.

Saving Time with Sticky Features

Scaling search for a massive amount of log file data being sent in from thousands of machines has been an overwhelming, non-trivial problem for us to solve over the past year, and it's been our top priority.  Unfortunately, the work of solving scale issues isn't readily obvious to users.  Users always expect things on the web to be fast.  They couldn't care less how hard a problem it was to solve.  They don't think to themselves, "Wow, that's fucking fast!".  No.  Instead they sit around and mutter things to themselves like "Why the hell doesn't feature X do Y?  This thing is wasting my time!".

And there it is laid bare: Don't waste your users' time.  It's the most valuable resource they have.  Get to the point quickly, make it easy for them to get back to what they were doing next time, and do it all with little fuss and muss.

I say give them sticky features!

Saved Search and More

And so, without further ado, I'm officially announcing one of many-to-come new sticky features: saved search.  Saved search provides users a way to write a search query and then preserve the search to run again later.  Saved searches can generate facet graphs or they can simply run a regular search across a given time range.

The shell has been reworked to provide context changes to use with the saved search feature.  You can now change the date context, or limit the context to certain inputs, and then rerun the search or graph using the red rerun button at the top.

Here's a quick screencast running through some of our new sticky features:

 

Coming Up Soon

We're continuing to add features that increase stickiness to the product.  Next week we'll be releasing a revamped history feature for the shell page, where what you've typed in before in a session will be preserved in your command history, just like it would in a normal shell prompt.  We're also adding customized graph selection on the main dashboard, which will allow you to start viewing events that matter most to you by default when you first log in.

All these features are leading up to a major revamp of the way we provide value from our users' events.  Expect completely customized dashboards for server monitoring, website performance, user analytics, and more soon!  If you have a feature you'd like to see us implement, please do drop us a line.  We're keen on not wasting your time!


Loggly's Outage for December 19th

Posted 19 Dec, 2011 by Kord Campbell in Business and Startup

Sometimes there's just no other way to say "we're down" than admitting you screwed up and are down.  We're coming back up now, and in theory by the time this is read, we'll be serving the app again normally.  It will be a good amount of time before we can rebuild the indexes for historic data of our paid customers.  This is our largest outage to date, and I'm not at all proud of it.

So What Happened?

Sometime yesterday afternoon ALL of our machines in Amazon's East region, availability zone 1d, were rebooted by AWS staff.  Originally we stated we had not received reboot notices from Amazon, but the truth is that four of the staff here, myself included, received two separate vague notices, one about 10 days ago and another 3 days ago, which stated 'some or all' of our instances were scheduled to be rebooted.  These notices were found in our spam folders on Gmail, placed there with a very large red notice reading: "Warning: This message may not be from whom it claims to be. Beware of following any links in it or of providing the sender with any personal information."  Meh.

Loggly uses a variety of monitoring mechanisms to ensure our services are healthy.  These include, but are not limited to, extensive monitoring with Nagios, external monitors like Zerigo, and using a slew of our own API calls for monitoring for errors in our logs.  When the mass reboot occurred we failed to alert because a) our monitoring server was rebooted and failed to complete the boot cycle, b) the external monitors were only set to test for pings and established connections to syslog and http (more about that in a moment), and c) the custom API calls using us were no longer running because we were down.

Combined, these failures effectively prevented us from noticing we were down.  This in and of itself was the cause of at least half our downtime and, to me, is the most unacceptable part of this whole situation.

The Human Element

The other cause of our failures is what some of you on Twitter are calling "a failure to architect for the cloud".  I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes".  A reboot of all boxes has never been tested at Loggly before.  It's a test we've failed completely as of today.  We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.

While some might go on a rant about how 'normal' failures don't affect 100% of your boxes, the truth is that anything and everything (including an army of reboot monkeys) can be expected to happen to your servers if you wait around long enough.  The trick to running a reliable service is to anticipate as many of those everythings as you can and architect around them.

In this case we didn't ever build the workaround simply because the system we run - a combination of 0MQ+Solr+Zookeeper+Loggly Special Sauce - makes it extremely challenging to survive a complete failure with more than 1/2 of the cluster missing.  With other challenges facing us, we decided to live with the risk. Now we're dealing with the fallout of our decision.

So, How Do We Make This Right?

Single instances of Loggly's search cluster can't be spread across multiple availability zones or regions due to the amount of data we push around, latencies between the search nodes, and the lack of support in our system for redundant indexes.  We've been OK with those limitations in the past simply because we normally archive data to S3 when we catch it, and we are capable of rebuilding indexes on the fly if we lose one or more indexers.

The first step in addressing this is to start sharding our customers across multiple Loggly deployments.  This will prevent further outages to the entire customer base.  The second step is to start doing Loggly deployments on dedicated hardware.  Because we keep large amounts of data on our boxes (the indexes) this is pretty much a requirement for fast recovery times when a deployment goes down.  While S3 is AWESOME for backups, it sucks big-time for rebuilding a large amount of search index data.

The third step is to ensure more robust external monitoring.  With multiple deployments this becomes less of an issue, but clearly we need more reliable checks than what we rely on with Zerigo or other services.  Sorry, but simple HTTP checks, pings and established connections to a box do not guarantee it's up!
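One way to go beyond pings and open connections is a round-trip check: send a uniquely tagged event in through the front door and confirm it comes back out of search. The sketch below is only an illustration of that idea, not Loggly's monitoring code. The names round_trip_check, send_event and search are hypothetical; the two callables stand in for whatever intake and search clients a deployment actually has.

```python
import time
import uuid
from typing import Callable


def round_trip_check(
    send_event: Callable[[str], None],   # hypothetical: pushes one log line into intake
    search: Callable[[str], bool],       # hypothetical: True if the text is searchable
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if a freshly sent marker event becomes searchable."""
    marker = "healthcheck-" + uuid.uuid4().hex
    try:
        send_event(marker)
    except Exception:
        return False  # intake itself is failing

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if search(marker):
                return True
        except Exception:
            pass  # treat a search error as "not found yet" and keep polling
        if time.monotonic() >= deadline:
            return False  # accepted connections, but the pipeline never delivered
        sleep(poll_interval_s)
```

A box that answers pings and accepts syslog connections but never indexes anything fails this check, which is exactly the case the simpler probes miss. The check should run from somewhere outside the deployment it is watching.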

Finally, we accept full responsibility for the impact to our customers.  We will be in touch with our paid customers sometime over the next week to address compensation for this outage.

We welcome feedback below, and encourage useful criticism of our architectural choices.  All I would ask is that you consider Loggly's infrastructure isn't the same as yours, and I've greatly simplified the reasons for not being more redundant in our deployments.  We can, and will, endeavor to do better in the future.

Kord Campbell, CEO


Enabling CORS in Django Piston

Posted 5 Dec, 2011 by Ivan Tam in Code

Here at Loggly, one of our goals is to make our API accessible and easy to integrate. By enabling CORS (Cross Origin Resource Sharing) on our API endpoints, we hope more Javascript developers can take advantage of what our product has to offer.
 
CORS is an addition to the browser security model that allows XHR requests to be made from one domain to another. CORS allows Javascript applications to access resources on domains other than the original document's domain, working around the same-origin policy. While Javascript application developers have crafted techniques like JSONP, Flash proxies, XHR receivers, and server-side proxies to circumvent the same-origin policy, CORS makes these hacks unnecessary.
 
To take advantage of CORS both the server and the browser need to support the standard. The browser needs to initiate a negotiation with the server and the server must signal to the browser which domains are allowed to make cross-domain requests. Our current API is implemented in Django Piston, an open-source project that enabled us to quickly build a RESTful API on top of Django. Piston does not support CORS out-of-the-box, but it wasn't hard to write some code to enable it and we'd like to show how it was done.
 
A full explanation of CORS is beyond the scope of this post, but the central idea behind CORS is a negotiation between the browser and server of allowed and disallowed actions. This negotiation is done via HTTP headers. The essential headers are the following:
 
  • Origin: Sent by the browser signifying the originating domain.
  • Access-Control-Allow-Origin: Sent by the server, naming the origin allowed to make requests to the server's domain. The value is either a single origin or "*" to allow requests from all domains.
  • Access-Control-Allow-Methods: Sent by the server, listing the HTTP methods the browser is allowed to use in requests to the server.
  • Access-Control-Allow-Headers: Sent by the server, listing the HTTP request headers the server is willing to accept from the browser.

Essentially, to enable CORS we need to have Django Piston respond to an OPTIONS request with the server-sent headers and send the requisite headers along with responses.
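To make the negotiation concrete, here is a minimal, framework-free sketch of how a server might pick the three response headers for a given Origin. It is an illustration only: the function name cors_headers, its parameters and the default method and header lists are assumptions, not part of Piston or of Loggly's API. Since Access-Control-Allow-Origin carries a single origin or "*", a server with several permitted domains typically echoes back the requesting origin when it is on the allow list.

```python
from typing import Dict, Iterable, Optional


def cors_headers(
    origin: Optional[str],
    allowed_origins: Iterable[str],
    allowed_methods: Iterable[str] = ("GET", "POST", "OPTIONS"),  # assumed defaults
    allowed_headers: Iterable[str] = ("Content-Type",),           # assumed defaults
) -> Dict[str, str]:
    """Build the server-sent CORS headers for a request's Origin value."""
    allowed = set(allowed_origins)
    if "*" in allowed:
        allow_origin = "*"            # any domain may call us
    elif origin is not None and origin in allowed:
        allow_origin = origin         # echo back the one matching origin
    else:
        return {}                     # no CORS headers: the browser blocks the call
    return {
        "Access-Control-Allow-Origin": allow_origin,
        "Access-Control-Allow-Methods": ", ".join(allowed_methods),
        "Access-Control-Allow-Headers": ", ".join(allowed_headers),
    }
```

The same dictionary can be attached both to the reply to an OPTIONS pre-flight request and to the normal responses that follow it.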
 
The Resource class is the heart of a Django Piston-built API. The code that injects the headers into responses lives in a subclass of the base Resource class. We've called this class CORSResource:
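What follows is a minimal sketch of the approach and not Loggly's actual implementation. So that it runs without Django or Piston installed, HttpResponse, Request and Resource here are small stand-ins that only mimic the general shape of the real classes (a Resource wraps a handler and is called with the request). The header values and events_handler are hypothetical. In a real project the stand-ins would be replaced by Django's HttpResponse and Piston's Resource, checked against the Piston version in use.

```python
# Stand-ins so the sketch runs on its own; NOT the real Django/Piston classes.
class HttpResponse:
    def __init__(self, content="", status=200):
        self.content = content
        self.status_code = status
        self.headers = {}

    def __setitem__(self, name, value):   # response["Header"] = "value"
        self.headers[name] = value

    def __getitem__(self, name):
        return self.headers[name]


class Request:
    def __init__(self, method):
        self.method = method


class Resource:
    """Stand-in for the base Resource: a callable wrapping a handler."""

    def __init__(self, handler):
        self.handler = handler

    def __call__(self, request, *args, **kwargs):
        return HttpResponse(self.handler(request, *args, **kwargs))


# Assumed header values, chosen only for illustration.
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "GET, OPTIONS",
    "Access-Control-Allow-Headers": "Authorization, Content-Type",
}


class CORSResource(Resource):
    """Answer pre-flight OPTIONS requests and decorate every response."""

    def __call__(self, request, *args, **kwargs):
        if request.method == "OPTIONS":
            # Pre-flight: empty body, headers only; the handler is not called.
            response = HttpResponse()
        else:
            # Normal request: let the base class produce the response.
            response = super().__call__(request, *args, **kwargs)
        for name, value in CORS_HEADERS.items():
            response[name] = value
        return response


def events_handler(request):              # hypothetical handler
    return '{"events": []}'


# Endpoints are built with CORSResource where Resource was used before.
events_resource = CORSResource(events_handler)
```

Because the subclass only overrides the call path, existing handlers need no changes; swapping the class used to construct each endpoint is enough.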
 
 
The CORSResource performs two simple tasks. First, it intercepts any OPTIONS method requests to handle the pre-flight negotiation between the browser and the server. Since responses to OPTIONS requests do not need a body, an empty HttpResponse() is returned along with the requisite headers. Second, CORSResource intercepts responses from the Django Piston handlers (where responses are generated) and decorates them with the CORS headers.
 
To use CORSResource, we simply instantiated our endpoints with the CORSResource subclass instead of the base Resource class. The change to our API's urls.py file looks like this:
 
 
We hope this post helps other Django Piston API implementors enable CORS in their own APIs. We're planning to release this implementation in the coming weeks and we're looking forward to seeing what Javascript developers are going to do with direct access to our API.
 
Happy hacking!
 
(image from http://blogs.bournemouth.ac.uk/research/2011/09/01/sharing-your-research-data/)

