I’LL HAVE THE DATA SOUP PLEASE
We had data yesterday, we have data today, and we’ll have data tomorrow. In fact, we’ll have a lot more data tomorrow. We’re adding data to our lives at what seems to be an exponential rate.
I belong to the Low Latency, Big Data group on LinkedIn. There are some heavy hitters out there and we’ve been working together on a definition for big data.
WHAT EXACTLY IS BIG DATA?
There are a couple of working definitions in progress in the group, but the one I like that seems to be emerging really doesn’t have that much to do with data at all. It’s about architecture.
THE SCALES OF BIG DATA
How about scaling to handle all this data? I think one of the core tenets of big data is that it doesn’t fit on one machine. Or maybe a better way to say that is that it won’t scale on one machine. A seemingly natural progression of this logic is that if you want to play with big data, your architecture has to scale.
GRAVITY IS HEAVY MAN
Data has gravity. Hadoop, Yahoo’s slow implementation of Google’s MapReduce, typically runs on top of HDFS, the Hadoop Distributed File System. And when you run Hadoop jobs, the code is shipped out to the nodes in a Hadoop cluster and run against the data. That scales. It scales because if you want to store more data and run more processes against it, you just add nodes. For the most part, this means that Hadoop is a shared-nothing architecture. That’s important if you want to scale. (Go read about it elsewhere, then see the error I made above, then continue reading.) Running code on the node with the data means you don’t have to ship the data around the network. See the gravity?
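Here’s a toy sketch of that idea in plain Python. It isn’t Hadoop and it doesn’t use any Hadoop API; the `cluster` dictionary, the node names, and the `count_words` and `run_job` functions are all made up for illustration. The point is just that the job travels to each node’s data, and only a small summary travels back.

```python
from collections import Counter

# Hypothetical cluster: each node holds its own slice of the data.
cluster = {
    "node-1": ["big data big", "data soup"],
    "node-2": ["soup please", "big soup"],
}

def count_words(lines):
    """The 'job': runs against one node's local lines, returns a small summary."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def run_job(cluster, job):
    """Ship the job to every node, then merge the per-node summaries."""
    total = Counter()
    for node, local_lines in cluster.items():
        # Only the summary crosses the pretend network, never the raw lines.
        total += job(local_lines)
    return total

totals = run_job(cluster, count_words)
```

Want more capacity? In this toy model you add another entry to `cluster` and `run_job` doesn’t change, which is roughly the appeal of scaling out.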
BIG DATA DOESN’T REWARD SHARING
It would seem that by sharing nothing between nodes (things like state), we can run processes in parallel. Running things in parallel against the same amount of data, instead of running them serially, means we should get done sooner. That’s fairly obvious, right?
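If you want to see the shape of that, here’s a minimal sketch using Python’s standard `concurrent.futures` module. The `partial_sum` and `parallel_sum` names are mine, not from any framework. Each worker gets its own chunk and shares no state with the others, and the partial results are merged at the end. One caveat: threads in CPython won’t actually speed up CPU-bound work like this, so treat it as an illustration of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Work on one chunk only; nothing is shared with other workers."""
    return sum(chunk)

def parallel_sum(values, workers=4):
    """Split values into chunks, sum each chunk independently, merge the results."""
    # Ceiling division so every value lands in some chunk.
    size = max(1, -(-len(values) // workers))
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)

result = parallel_sum(list(range(1, 101)))
```

Because the workers never touch each other’s chunks, there’s nothing to lock and nothing to coordinate until the final merge.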
WHAT’S NOT OBVIOUS THEN?
Big data has nothing to do with data. Big data is the advent of grid and parallel processing. Big data represents the democratization of tools and processing power (cloud) made available to anyone with a credit card. Grid and parallel processing have been around for a while. Elastic resources are relatively new. Free software for VCs to pimp is almost entirely new. (We call that ‘open source’.)
VOLUME, VARIETY, VELOCITY
Whatever.
THE TAKE AWAY
The takeaway here is that if you’re a vendor and your solution only runs on one machine, you don’t scale well enough to handle big data, since big data is really about scaling out. If you haven’t solved this not-so-easy problem, then using the phrase big data in your marketing is nothing more than a big lie.
AS ALWAYS
Thanks for reading. And Happy Holidays!
I disagree with you on a couple of points:
1. “There are a couple of working definitions in progress in the group, but the one I like that seems to be emerging really doesn’t have that much to do with data at all. It’s about architecture.”
Big Data has always been about huge amounts of data, especially data that is unstructured or semi-structured by nature. The challenge to date has been how to consume that huge volume of data, as most valuable enterprise content lies in that basket (almost 80-90%). Big Data processing is about the architectural nuances and how to apply the different architectural options to process that data in a small amount of time.
2. “How about scaling to handle all this data. I think one of the core tenants of big data is that it doesn’t fit on one machine.”
Mainframes and supercomputers have always been there; only the term was coined recently. I can have a mainframe system process a few terabytes of data. I need not scale out unless I am short on dollars. Big Data processing is all about processing huge amounts of data using commodity hardware. Grid computing, as you mentioned, was the first step towards that, and with the advent of the cloud (infrastructure elasticity), we talk more about processing data in the cloud. It’s cheap.
The cloud has basically helped us innovate and find not-so-traditional approaches to some classic problems that we have traditionally faced. Cloud-based MapReduce infrastructure using a distributed framework like Hadoop or GFS is just one of the solutions.