Google started using MapReduce about 10 years ago. Somewhere between then and now, Doug Cutting decided that he could copy it while at Yahoo, and Hadoop was born. Doug now works at a company named Cloudera, which bills itself as providing the “only solution that manages Apache Hadoop across the enterprise.” Hadoop has been around for so long that even leading analyst firms are covering it, claiming that if your organization is an early adopter, you need to be looking at Hadoop. Hear that, Luddites? Time to get moving.
MAYBE THERE’S A REASON FOR THAT
Recently, Google announced its move away from batch-based MapReduce to something a little more real-time. Seems like it was taking days to update search results with something that you might be interested in. Google never open sourced its implementation of MapReduce, which is said to be at least one or two orders of magnitude faster than Hadoop. But still not fast enough.
EVEN YAHOO IS GETTING INTO THE ACT
Yahoo used to have a substantial relationship with Cloudera, at least according to Cloudera. But now even Yahoo has started a company to distribute and support Hadoop. Yahoo calls its company Hortonworks.
WHAT THIS MEANS TO YOU
Without getting into things like how much data and corresponding analysis you need before Hadoop makes any sense to use at all (most companies are not going to see any benefit), let’s recognize something. These recent shifts show that companies like Google, Yahoo, and others no longer see a competitive advantage in batch-based MapReduce. The future has arrived. Let’s look at some evidence.
REAL TIME HADOOP
There have been more than a handful of releases in this space, like S4 from Yahoo, HStreaming, and Storm, and several NoSQL databases now support this too. It means that for competitive advantage, you’d best be getting some real-time. And getting it soon.
WHAT IS REAL-TIME?
Database vendors like DataStax, who support Cassandra, claim to be real-time. They’re not. They say that they’re real-time because as soon as you commit data to the database, it’s available for query. That’s supported by just about every database and is hardly a new and exciting feature of NoSQL. Even one of their big shots left to start a real-time company named Platfora.
CONTINUOUS QUERY OR EVENT-DRIVEN
Rather than thinking about what real-time is or is not, let’s worry about event-driven. Let’s use an example:
I’m a manager, and I want to know when the average time on my website dips below 2 minutes. Going by the ‘my database is real-time because the data I send to it can be queried after I write it’ definition, I would have to run this query repeatedly at regular intervals to catch this mounting exodus from my web properties.
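To make the pain concrete, here is a minimal sketch of that polling approach. It assumes a hypothetical `sessions` table with a `duration_seconds` column, and it uses SQLite only as a stand-in for whatever database you actually run; the function names are made up for illustration.

```python
import sqlite3
import time

THRESHOLD_SECONDS = 120  # the 2-minute mark from the example


def average_session_seconds(conn):
    """Ask the manager's question once against a hypothetical `sessions` table."""
    row = conn.execute("SELECT AVG(duration_seconds) FROM sessions").fetchone()
    return row[0]  # None when the table is empty


def dipped_below_threshold(conn):
    """True when there is data and its average is under the threshold."""
    average = average_session_seconds(conn)
    return average is not None and average < THRESHOLD_SECONDS


def poll(conn, interval_seconds=60, max_checks=None):
    """Re-ask the same question on a timer until the average dips."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if dipped_below_threshold(conn):
            return True
        checks += 1
        time.sleep(interval_seconds)
    return False
```

Notice that the answer is only as fresh as `interval_seconds`, and every check rescans the data whether or not anything changed.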
THERE’S GOT TO BE A BETTER WAY
And there is. It’s called continuous query. I ask the same question as above, and there’s some process somewhere that’s sessionizing data from my web logs and injecting it into the server, the same server that I sent the query above to. And when that process finds a web session that lasted less than 2 minutes, it sends another ‘row’ to the program that submitted that query.
ABRACADABRA
And then I’ve got it on my dashboard, and can switch out the really badly designed page the marketing department A/B’d this morning. That’s continuous query, or event-driven. The term real-time didn’t even need to be mentioned. If I were running batch-based Hadoop, that notification could have taken hours, or days. How much money would your company lose if that happened to you?
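Here is a rough sketch of the event-driven idea, not any vendor's actual API. It answers the manager's question from the example (average time on site below 2 minutes) over a sliding window of recent sessions. The class name, the window size, and the callback are all assumptions made up for illustration.

```python
from collections import deque


class ContinuousAverageQuery:
    """A standing query: register it once, then get called back when it matches."""

    def __init__(self, threshold_seconds, window_size, on_alert):
        self.threshold_seconds = threshold_seconds
        self.window = deque(maxlen=window_size)  # most recent session durations
        self.on_alert = on_alert                 # e.g. push a row to a dashboard
        self.below = False                       # are we currently under the threshold?

    def on_session(self, duration_seconds):
        """Called by the sessionizing process for every finished web session."""
        self.window.append(duration_seconds)
        average = sum(self.window) / len(self.window)
        dipped = average < self.threshold_seconds
        # Notify only on the transition, not on every session while below.
        if dipped and not self.below:
            self.on_alert(average)
        self.below = dipped


alerts = []
query = ContinuousAverageQuery(threshold_seconds=120, window_size=3,
                               on_alert=alerts.append)
for duration in [300, 200, 100, 30, 20]:
    query.on_session(duration)
print(alerts)  # [110.0]
```

Nobody re-runs anything here. The work happens as each session arrives, and the dashboard hears about the dip the moment it occurs.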
BACK TO MAPREDUCE
So if I can do the above, why do I need MapReduce? MapReduce is a programming model for splitting work up, distributing it out to the nodes where the data that needs to be analyzed lives, and then gathering the results. If your problem is big enough, MapReduce might help you get it done faster than using just one machine.
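The split, distribute, and gather steps can be sketched in a single process. This is only a toy illustration of the model, not Hadoop's API: each inner list stands in for the data living on one node, and the record layout (page, seconds on page) is an assumption made up for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one raw record into (key, value) pairs."""
    page, seconds = record
    yield page, seconds


def shuffle(pairs):
    """Shuffle: group every value under its key, across all partitions."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values) / len(values)


def run_job(partitions):
    """Run map over each partition, then shuffle and reduce the output."""
    mapped = [pair
              for partition in partitions   # on a cluster, one node per partition
              for record in partition
              for pair in map_phase(record)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())


partitions = [
    [("/home", 100), ("/pricing", 30)],  # data on "node 1"
    [("/home", 200)],                    # data on "node 2"
]
print(run_job(partitions))  # {'/home': 150.0, '/pricing': 30.0}
```

On a real cluster the map calls run in parallel next to the data and the shuffle moves pairs over the network, which is where the batch latency comes from.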
BUT EITHER WAY
If you’re running batch processes, like some well-known web properties are, and think that Hadoop holds an answer to your ever-dwindling ad revenue, you’re mistaken. And if you’re that CIO, the other thing you need to be working on is most likely your resume.
GET YOURSELF SOME CONTINUOUS QUERY, AND GET COMPETITIVE!
And thanks for reading!








