Big Data Gets Bigger
Edited on October 14th to fix math that was off by two orders of magnitude.
Big data is big news. Big data is a big problem, and big solutions for it can drive big revenues. Because big money is involved, more and more people are writing about what pack-rats we’ve become. There’s just one fact everyone seems to be missing: big is relative, after all.
Big Data in the Past
Back in the 70s when I was a kid, my family’s oil business had one of those old, clunky Burroughs machines, which my mom not-so-fondly called Maribel. Whenever you wanted to invoice someone, you would load Maribel up with the customer’s account history from paper tape and then manually enter the new invoices. When the existing tape got full, you started a new one. The tapes were yellow, about an inch across and maybe 20 feet long.
We stored these tapes in envelopes, and the envelopes were in turn stored in vertical file cabinets. The hall outside my mom’s office was lined with these file cabinets, and the cabinets were literally overflowing into the kitchen because there was no more room in the hall for them. If you estimate 5 bits per line, 72 lines per foot, and 20 feet of tape, that gives you roughly 1KB of storage on a single tape. Multiply that by thousands of these tapes, and I figure we had a total of 1-2MB of data stored in about 100-200 square feet of space.
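The tape estimate is easy to spell out. Here’s a quick sketch of the arithmetic; the per-tape figures are the rough estimates from above, and the 1,500-tape count is just an illustrative value inside the “thousands of tapes” range:

```python
# Back-of-the-envelope capacity of one yellow paper tape,
# using the rough figures from the text (estimates, not measurements).
BITS_PER_LINE = 5
LINES_PER_FOOT = 72
FEET_PER_TAPE = 20

bits_per_tape = BITS_PER_LINE * LINES_PER_FOOT * FEET_PER_TAPE  # 7,200 bits
bytes_per_tape = bits_per_tape / 8                              # 900 bytes, roughly 1KB

TAPE_COUNT = 1500            # illustrative value for "thousands of tapes"
total_mb = TAPE_COUNT * bytes_per_tape / 1e6                    # lands in the 1-2MB range
```

At 900 bytes a tape, it takes well over a thousand tapes to reach a single megabyte, which is why the hall ran out of room before the data did.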
Lots of customers, lots of tape, lots of work, and lots and lots of data. At least lots for 1976.
Your Future Arrived Yesterday
In 1996 my future had arrived. I was running a moderate-sized ISP and found myself buying a full-height 5.25" 8GB drive from Seagate for my news server. It cost me just over $2,000. With that one drive alone, I could have stored nearly 300 football fields’ worth of Maribel’s yellow-tape data.
Just last weekend at Lucene Revolution I gave some company my email address in exchange for an 8GB USB drive. I promptly tore it apart and extracted from its guts a sliver of a microSD card. I could easily fit a few thousand of those cards in the space of that old clunky Seagate drive.
Earlier this year an article in Wired quoted IDC as saying the size of the information universe in 2009 was 800 Exabytes. IDC went on to say that 2020’s information universe was expected to be a staggering 35 Zettabytes; nearly 44 times as much data as exists today.
For reference, one Zettabyte = one thousand Exabytes, one Exabyte = one thousand Petabytes, one Petabyte = one thousand Terabytes, and one Terabyte = one thousand Gigabytes. That means a Zettabyte = a million million Gigabytes!
That’s around 3 × 10^16 times as much data as we had in our office in 1976! If we decided to store it in file cabinets filled with yellow tape, our dystopian future’s 35ZB of data would take up the surface area of 546 earths. Say what?
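Those numbers can be sanity-checked with a quick sketch. The office figures are the rough estimates from above, and Earth’s surface area is an approximation, so the results are order-of-magnitude only:

```python
# Back-of-the-envelope check of the 1976-to-2020 comparison.
# Office figures are rough estimates from the text, not measurements.
OFFICE_BYTES = 1.5e6      # ~1-2MB of data stored in the office
OFFICE_SQFT = 150         # ~100-200 square feet of file cabinets
ZETTABYTE = 1e21          # bytes
EARTH_SQFT = 5.5e15       # Earth's surface area (~5.1e8 km^2), approximate

idc_2020 = 35 * ZETTABYTE
ratio = idc_2020 / OFFICE_BYTES              # a few times 10^16
earths = ratio * OFFICE_SQFT / EARTH_SQFT    # hundreds of Earths' surface area
```

Nudging the office estimates within their 1-2MB and 100-200 square foot ranges moves the final count around, which is why any figure in the hundreds of Earths is about as precise as this gets.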
It reminds me of something you’d see in a Douglas Adams novel, where thousands of small, slightly cranky robots named Maribel are forced to shovel and store yellow tape rolls until they collapse into a pile of rust several million years later.
Smell the Data Exhaust
Data exhaust can be defined as the machine events generated when a user accesses data stored on a system connected to the Internet, such as when a user views their photos on Flickr. Hadoop Karma indicates Flickr was storing 4 billion photos by the end of 2009. In aggregate, those photos are stored on thousands of servers and are being viewed by millions of users across the globe every day.
In a simple scenario where every photo on Flickr was viewed once by a single user, the logs would weigh in at just over 2TB! In reality, Flickr’s log volume probably exceeds a Petabyte a year for the views of the lightbox pages alone. Facebook’s numbers are even scarier. In one month they’ll store 2.5 billion photos on their system. In turn, all the people viewing those photos will generate an order of magnitude more log data than Flickr has in all the photos it has ever stored.
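The 2TB figure falls out of a one-line estimate. The photo count comes from the text above; the size of a single access-log line is my assumption (a typical web server log entry runs a few hundred bytes):

```python
# Rough sketch of the "one view per photo" log-volume estimate.
PHOTOS = 4_000_000_000     # Flickr photos by end of 2009 (from the text)
BYTES_PER_LOG_LINE = 550   # assumed size of one access-log entry

total_log_bytes = PHOTOS * BYTES_PER_LOG_LINE
terabytes = total_log_bytes / 1e12    # "just over 2TB"
```

Since real users view popular photos many times over, actual log volume dwarfs this single-view floor, which is how a Petabyte a year stops sounding far-fetched.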
Even though we’re in private beta at the moment, we’re already seeing combined log volumes of around 3GB a day from 15 customers. A few of our customers, including About.me and Server Density, are sending us near the maximum of what we allow on the private beta right now. We expect those volumes to go up considerably when we launch the public beta in December, when an average customer could be sending us anywhere from 1 to 5GB a day. It won’t take long before we start referring to our data in units of Petabytes stored.
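To see how quickly that adds up, here’s a purely hypothetical projection; the 3GB/day average comes from the 1-5GB range above, but the customer count is an invented round number, not a forecast:

```python
# Hypothetical growth sketch -- illustrative numbers only.
GB = 1e9
PB = 1e15
avg_daily_per_customer = 3 * GB   # middle of the 1-5GB/day range from the text
customers = 1000                  # assumed customer count, for illustration

daily_intake = customers * avg_daily_per_customer   # 3TB per day
days_to_petabyte = PB / daily_intake                # under a year
```

At that hypothetical scale, a thousand average customers fill a Petabyte in under a year of retained logs, and every added customer shortens the clock.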
While demand for storing all those logs is accelerating along with all the data being generated, the technology behind storing and processing that data continues to accelerate as well. Within a few months’ time, the technology we are developing at Loggly will give companies a way to peek into these large volumes of log data, where they couldn’t before, and see exactly what their users are doing with all that big data.
Loggly’s features for search, reporting, and map-reducing will make dealing with these huge volumes as trivial as stuffing a yellow punch tape into an envelope, except we don’t need a robot named Maribel to do it.