It’s Time to Kill the Elephant

Google started using MapReduce about 10 years ago.  Somewhere between there and now, Doug Cutting decided that he could copy it while at Yahoo and Hadoop was born.  Doug now works at a company named Cloudera who bills themselves as providing the “only solution that manages Apache Hadoop across the enterprise.”  Hadoop has been around for so long that even leading analyst firms are covering it, claiming that if your organization is an early adopter, you need to be looking at Hadoop.  Hear that Luddites?  Time to get moving.

MAYBE THERE’S A REASON FOR THAT

Recently, Google announced their move away from batch based MapReduce to something a little more real time.  Seems like it was taking days to update search results with something that you might be interested in.  Google never open sourced their implementation of MapReduce, which is said to be at least one or two orders of magnitude faster than Hadoop.  But still not fast enough.

EVEN YAHOO IS GETTING INTO THE ACT

Yahoo used to have a substantial relationship with Cloudera, at least according to Cloudera.  But now even Yahoo has started a company to distribute and support Hadoop.  Yahoo calls its company Hortonworks.

WHAT THIS MEANS TO YOU

Without getting into things like how much data and corresponding analysis you need to do before Hadoop makes any sense to use at all (most companies are not going to see any benefit at all), let’s recognize something.  All of these recent shifts show that companies like Google, Yahoo, and others no longer see a competitive advantage in batch based MapReduce.  The future has arrived.  Let’s look at some evidence.

REAL TIME HADOOP

There have been more than a handful of releases in this space, like S4 from Yahoo, HStreaming, and Storm, and several NoSQL databases now support this too.  It means that for competitive advantage, you’d best be getting some real-time.  And getting it soon.

WHAT IS REAL-TIME?

Database vendors like DataStax, who support Cassandra, claim to be real-time.  They’re not.  They say that they’re real time because as soon as you commit data to the database, it’s available for query.  That’s supported by just about every database and hardly a new and exciting feature of NoSQL.  Even one of their big shots left to start a real time company named Platfora.

CONTINUOUS QUERY OR EVENT-DRIVEN

Rather than thinking about what real-time is or is not, let’s worry about event-driven.  Let’s use an example:

I’m a manager, and I want to know when the average time on my website dips below 2 minutes.  Using the ‘my database is real time because the data I send to it can be queried after I write it’ means that I would have to run this query repeatedly at regular intervals to catch this mounting exodus from my web properties.

THERE’S GOT TO BE A BETTER WAY

And there is, it’s called continuous query.  I ask the same question as above, and there’s some process somewhere that’s sessionizing data from my web logs and injecting that into that server – the same server that I sent the query above to.  And when that process finds a web session that lasted less than 2 minutes, it sends another ‘row’ to the program that submitted that query.

ABRACADABRA

And then I’ve got it on my dashboard, and can switch out the really badly designed page the marketing department A/B’d this morning.  That’s continuous query, or event-driven.  The term real-time didn’t even need to be mentioned.  If I was running batch based Hadoop, that notification could have taken hours, or days.  How much money would your company lose if that happened to you?
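To make the push-versus-poll difference concrete, here is a minimal Python sketch.  It is not any vendor's API: the `ContinuousQuery` class, its `on_event` method, and the 120 second threshold are all names and values I've assumed for illustration.  The query is registered once, and the subscriber only hears about it when the running average session time drops below the threshold.

```python
from typing import Callable, List


class ContinuousQuery:
    """Hypothetical standing query: notifies a subscriber whenever the
    running average session length drops below a threshold."""

    def __init__(self, threshold_seconds: float,
                 notify: Callable[[float], None]) -> None:
        self.threshold_seconds = threshold_seconds
        self.notify = notify
        self.total = 0.0
        self.count = 0

    def on_event(self, session_seconds: float) -> None:
        # Called by the sessionizing process each time a session ends.
        self.total += session_seconds
        self.count += 1
        average = self.total / self.count
        if average < self.threshold_seconds:
            # Push the result to the subscriber; nobody has to poll.
            self.notify(average)


alerts: List[float] = []
query = ContinuousQuery(threshold_seconds=120.0, notify=alerts.append)

# Simulated session lengths arriving from the web logs, in seconds.
for seconds in [300.0, 200.0, 40.0, 20.0, 10.0]:
    query.on_event(seconds)
```

In this run the subscriber stays quiet for the first four sessions and gets exactly one notification after the fifth, when the average falls to 114 seconds.  A real engine would sit behind a network connection and handle many queries at once, but the shape is the same: submit once, get told when it matters.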

BACK TO MAP/REDUCE

So if I can do the above, why do I need MapReduce?  MapReduce is an algorithm for splitting work up, distributing the work out to nodes where the data lives that needs to be analyzed, and then gathering the results.  If your problem is big enough, MapReduce might help you get it done faster than using just one machine.
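For the curious, the split, distribute, and gather idea can be sketched in a few lines of Python.  This is only an illustration running in a single process, not Hadoop: the function names `map_phase` and `reduce_phase` and the sample log lines are assumptions of mine, and a real framework would run each map call on the node that holds that chunk of data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: Iterable[str]) -> List[Tuple[str, int]]:
    # Would run on the node that holds this chunk: emit (page, 1) per log line.
    return [(line.split()[0], 1) for line in chunk if line.strip()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    # Gather step: sum the counts emitted for each key.
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical data: each inner list stands in for the log lines on one node.
node_chunks = [
    ["/home user1", "/about user2"],
    ["/home user3", "/home user4"],
]

mapped = [pair for chunk in node_chunks for pair in map_phase(chunk)]
page_hits = reduce_phase(mapped)
```

Here `page_hits` ends up with three hits for `/home` and one for `/about`.  With two tiny chunks there is obviously no speedup; the approach only pays off when the data is too big for one machine to chew through in reasonable time.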

BUT EITHER WAY

If you’re running batch processes, like some well known web properties are, and think that Hadoop holds an answer to your ever dwindling ad revenue, you’re mistaken.  And if you’re that CIO, the other thing you need to be working on is most likely your resume.

GET YOURSELF SOME CONTINUOUS QUERY, AND GET COMPETITIVE!

and thanks for reading!

Predictions for 2011

Some predictions for 2011.  In no particular order or importance.

1. CEP – The Feature

There are a couple of things going on here.  The most important being that Mark Palmer is writing blog posts about Richard Tibbetts writing blog posts on the Tabb Group’s site about writing better software on Wall Street (because software startups write better code and deal with bigger problems than firms on Wall Street, especially exchanges, do, right?…).  No customer win.  No ‘yet another use for CEP.’  Just good old-fashioned buzzword copy designed to remind people that there’s still one stand-alone CEP vendor. (click click click, is this link working?)

CEP will become a feature of larger, more established, horizontal offerings.  Because once the opportunities in Canada and South America dry up (you know those hotbeds for financial engineering, right? Canada and South America?  You don’t?) the reality will sink in.  What’s that reality?  That no one in NYC is buying CEP engines for HFT trading anymore.  Why?  The CEP vendors don’t know why.  Even Apama has seen the light.  You can now process events one at a time, or within the context of their CEP engine.  Stunning.  They’re pushing the ‘platform.’  Good.  It’s about time Tibco got some competition.

So this coming year won’t be the year of CEP, it will be the yesteryear of CEP, like, “Remember yesteryear, when we all thought CEP was going to be really hot?”  CEP will become a feature found in Event Processing Platforms.  And we’ll finally start to see adoption of those platforms in large, household names.

2. Hadoop & Analytical Databases

People are going to begin realizing that databases that incorporate map/reduce into their architecture will be *no faster* than Hadoop.  Why?  Go buy a book about Hadoop and then sit down with a piece of paper and pencil.  Databases are designed to support multitudes of users all asking different questions.  And vendors would like to have us believe that they’re also all running long lasting jobs taking advantage of their shiny map/reduce implementations.  Long lasting.  As in, not interactive.  As in, batch jobs running on a map/reduce framework.  The bigger the job, the less increase in throughput the analytical database will be able to offer.  It’s physics.  Save your money.  So this coming year, we’ll probably see further consolidation and some unexpected exits in this area.

3.  And Speaking About Hadoop

Batch is dead as competitive advantage.  As Jeff Jonas loves to point out, and points out really well, data velocity is growing.  And the rate at which data velocity is growing is increasing.  And companies can’t process the data they have today.  And many companies are actually making bad decisions with more data, not better decisions.  Why?  Because the data has lost most of its value once it’s been crunched.  You can only take a batch system like Hadoop so far.  But right now (at least for the near future) you actually still need to incorporate some ideas from the batch world for everything to come together.  So this coming year, we’ll see more people start treating Hadoop as either a must-have to compete (minimal cost of entry) or “How f(*)(* much does that cluster cost to run?  There’s got to be a better way!”  It’s time to outsource your Hadoop cluster.

4. Real Time & Batch

I’m not saying that Hadoop or Map/Reduce is a waste of time and money like some vendors who make outrageous claims like, “Google has stopped using map/reduce.”  That’s idiotic.  Please put the kool-aid down, you’ve had enough to drink.  What’s important is the ability to analyze data in flight, to make decisions while there’s still the opportunity to have an impact.  How does one accomplish this?  By having a context in which events are analyzed.  How is context built?  Via the constant processing of events in flight, constructing and augmenting context, and supplementing that context with the result of monster jobs run on gazilla-bytes of data (like that? gazilla- it’s mine).  So this coming year, we’ll see a focus on moving analytics to real time.  (deeper analytics than VWAP-please!)

5. The Big Picture

There have been some really neato-keeno entries in the visualization space.  Things like Tableau, Spotfire, etc.  But they’re great for analysis of relatively static data sources.  They’re not for real time stuff.  Even offerings from vendors like Panopticon, which can provide some insight into multidimensional data sources updating in real time, really offer quite limited analysis tools for big data.  So in the coming year, we should see more focus on real time data mining.

6.  Did He Say, “Big Data?”

Yup.  I said it.  I’ve heard people define big data as more data than will fit on one machine.  Those people haven’t worked on the machines I’ve worked on.  I define big data as “when you can’t turn your data into actionable intelligence fast enough to have an impact during the window of opportunity.”  Or something like that.  I’m not in marketing.  Common patterns for analyzing big data are emerging that tie as-it-happens analysis with context and historical data.  The lines are blurring.  It’s all just becoming data.  And business wants it all.  Even my father-in-law, in Germany, said, “Ja, SAP wants me to store my data in ze cloud.”  Everyone knows about big data and 2011 is going to be all about it.  Getting it.  Storing it.  Analyzing it.  Visualizing it.  And then we’ll see the real emergence of privacy issues, like what happens when our respective governments start using simple tools like the ‘People You May Know’ from Linkedin during child pornography investigations?  It’s going to happen.  Our government isn’t ready.  The legal infrastructure isn’t there.  There will be the formation of chaos in this regard in 2011.

7.  Computer Network Attack Platforms (CNA)

In 2011, something important is going to happen.  The general populace will be made aware that, in addition to all the traditional, ground-based, we can’t win engagements in places like Afghanistan and Iraq, that we’re also involved in a different kind of war.  One that rages on every day, one that runs 24×7 and involves facets of technology present both on earth and in space and in the Internet.  And that’s cyber warfare.  World War III is here already – China has already ‘stolen’ the Internet from the United States for about 15 minutes, diverting the majority of our most important Internet based traffic through their country for storage and analysis.  Doesn’t that raise an eyebrow?  It should.  One of the biggest things that happened in 2010 went off without a hitch and without a great deal of coverage.  Iran’s nuclear power plant, the one everyone was afraid of, was rendered inoperable by a virus.  Because that plant is inoperable, Russia is going to continue to make big money.  And Israel is going to sleep easier.  Odd bedfellows, wouldn’t you say?  We’re going to learn more about CNA’s in 2011.

8.  What about NoSQL?

Yawn.  It’s going to be a TWO BILLION DOLLAR market.  Just like CEP was.  Really.

Happy New Year!

What's Wrong With Complex Event Processing?

I spend a significant amount of my time keeping up with advances in processing high velocity big data.  Over the last year, I’ve watched the NoSQL camp grow a lot.  And now, some folks are even forecasting a market approaching $2 Billion USD by 2015. The last time I saw that kind of trajectory for a new software category was for Complex Event Processing.  So without casting any undue aspersion on the NoSQL camp, let’s talk about why CEP has so dramatically failed to generate the returns venture capital firms were so sure they were going to achieve.

WHAT IS EVENT PROCESSING?

Event Processing, or Event Driven Architectures, means nothing more than processing events one at a time, preferably shortly after they occur.  The opposite of this is Batch Processing, which means batching events, or messages, or what most of the world would call a row of data, and processing them together.  In batches.  Sounds simple enough, right?  All of you reading this blog post have used an Event Driven Architecture.  In fact, you’re using one now – it’s in your browser.  Can you imagine what the user experience would be if your browser ‘batched’ up all of your mouse clicks and submitted them every 30 seconds?  Event Driven Architectures promise the same type of agility and increased user experience for line of business and consumer applications that you’re experiencing right now.  In fact, it’s probably hard to think about using the web in batch mode – it just doesn’t make sense.
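If you want to put numbers on the browser thought experiment, here is a small Python sketch.  The click times and the two helper functions are made up for illustration, and processing cost is ignored: the point is only how long each click sits around before anything looks at it.

```python
import math
from typing import List


def batch_delays(arrival_times: List[float], interval: float) -> List[float]:
    # Each event waits until the next batch boundary after it arrives.
    return [math.ceil(t / interval) * interval - t for t in arrival_times]


def event_delays(arrival_times: List[float]) -> List[float]:
    # Event-at-a-time: each event is handled as it arrives (cost ignored).
    return [0.0 for _ in arrival_times]


clicks = [1.0, 12.0, 29.0, 31.0]  # hypothetical click times, in seconds
waits_batch = batch_delays(clicks, interval=30.0)
waits_event = event_delays(clicks)
```

With a 30 second batch interval, the click at second 1 waits 29 seconds and the click at second 31 waits another 29.  Handled one at a time, none of them wait at all.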

WHAT IS COMPLEX EVENT PROCESSING?

For the most part, a marketing phrase.  That’s right – and again, for the most part, it’s completely meaningless.  As an early and continuing contributor to this particular area of technology, I remember when StreamBase, Apama, myself, and others called this field Event Stream Processing.  Then one of those firms’ marketing departments decided to differentiate.  I’ll leave the specific firm to your intuition.  So, what is Event Stream Processing?  That’s much easier to answer.  Event Stream Processing is Event Processing with four additional key components:

1. Continuous Query

Rather than having to poll a server for an event, using ESP, the user of the system issues a query and is subsequently informed with events, aggregations, or patterns that satisfy the specifics of the query.  This happens continuously, until you stop the query.

2. Windows (Time and/or Length)

Using ESP, the user can ask, as an example, for an average value of some key over either a time or length window.  Something like, ‘Give me the average amount of time people have spent on the homepage in the last 10 minutes.’  This query would provide an updated average either continuously, or perhaps at regular intervals.

3. Pattern Matching

With Pattern Matching, I’m able to define a series of events that fit a pattern, and then be notified when that pattern is observed.  Usually within some Time or Length Window.  So, I might ask, “How many users are going from the ‘Home Page’ to the ‘About Company Page’ and then clicking on ‘My Profile’ during a rolling 10 minute window?”

4. A Language

Tying all of the above together in a neat little language is a cool idea – it makes using these features easier. At least, that’s the theory.  And this is one place where CEP has gone wrong and is not the general computing revolution that myself and others have hoped for.  I’ll expound upon this after a brief distraction in the next 2 paragraphs.  Please bear with me.
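As a rough illustration of component 2, here is what a time window boils down to underneath the language.  This is a minimal Python sketch, not how any particular engine implements it: the `TimeWindowAverage` class is a name I've invented, timestamps are plain seconds, and events are assumed to arrive in time order.

```python
from collections import deque
from typing import Deque, Tuple


class TimeWindowAverage:
    """Average of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.events: Deque[Tuple[float, float]] = deque()
        self.total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        # Admit the new event, then expire everything that left the window.
        self.events.append((timestamp, value))
        self.total += value
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


window = TimeWindowAverage(window_seconds=600)  # "the last 10 minutes"
averages = [
    window.add(0, 60.0),     # first visit: 60 seconds on the homepage
    window.add(300, 120.0),  # five minutes later
    window.add(700, 30.0),   # the first visit has now aged out
]
```

Every call to `add` returns the updated average, which is the "continuously" flavor of the query above.  Emitting at regular intervals instead would just mean reading the average on a timer rather than on every event.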

WHERE IS COMPLEX EVENT PROCESSING USED?

Even Mark Palmer, who is usually extremely bullish about CEP and probably sprinkles it on his breakfast cereal, has recently admitted that CEP is only hot in Capital Markets.  While I might disagree a bit with Mark, which is nothing new, I think we can all agree that the CEP market, at a total size of $200M, is far smaller than we had all hoped for.  Frankly, it reminds me of the FIX engine vendor battles – I was an early provider there too – and we all ended up fighting over an ever shrinking market place.

WHAT IS THAT MARKET ANYWAY?

The current vendor set of CEP tends to focus on Capital Markets.  But not really.  It focuses on an even smaller slice of Capital Markets called High Frequency Trading.  Seems more people know more about High Frequency Trading today than CEP.  The important thing here is that what all the smart analysts are calling the “CEP Market” really isn’t the “CEP Market” at all.  It’s the HFT software market.  And again, looking at the impressively long list of clients that Mr. Palmer has cited in his recent blog post, many of those firms aren’t actually using CEP for HFT, but an even smaller subset of functionality.  That’s why the market is so small – if any VC firm thought the total addressable market for this technology was going to be $200M in 2010, no CEP startup would have received funding.  And when HFT finds the Next Big Thing, the CEP market, as defined today, will evaporate.  And along with it, any CEP vendor who has concentrated solely upon that market.

SO WHAT HAPPENED?

The idea was that we were ushering in a New Way To Compute Things.  Like all technologists who spend way too much time thinking about this stuff, we thought everyone would immediately see how smart we were, run out and buy one of the CEP based products, and join us in revolutionizing how data is turned into information and used by business folk to make money and pay our salaries.  The only problem is, we forgot 2 things: 1) who would be using our software to do this work, and 2) who would subsequently be using the applications developed by 1.

DEVELOPERS – A FINICKY BREED

I used to be a Real Developer – I wrote in C++.  Then Sun decided that the Internet was the Computer and we all started to learn Java.  Java is cool – Java makes it easy for anyone to write bad code whereas C++ really took some effort to mess things up.  More and more people started using Java for everything: servers, clients, web stuff, etc.  And now, I’m not sure what people use anymore – perhaps coders are using NoJava for all of their new shiny NoSQL apps.  I still use Java.  And I’m loath to learn another language.  See #4 above in ‘What’s CEP?’  I don’t want to learn another language.  And I certainly don’t want to move all of my work (servers, clients, webapps, etc.) to a new and unproven language.  And no matter which vendor you don’t choose for your CEP application because you write it all yourself anyway, none of their languages can claim to be broadly or generally adopted.  Proof?  Try to buy a book on one of them.  There are umpteen books out on NoSQL in less time than it took some CEP vendors to go out of business.  CEP vendors have failed to appeal to core IT departments.  Period.  And core IT departments are the folks who have to assemble all the crap they buy from vendors into something that business users get to complain about when it doesn’t work.

BUSINESS USERS – “SHOW ME SOMETHING!”

Business users want to see information.  They want to see information presented crisply, ready for decision making.  And today, more than ever, they want to see it on their web browser, iPad, iPhone, Droid, Apple TV, disconnected laptop, on flat panels on the front of their refrigerator in the kitchen and on the heads up display in their car while commuting to work.  In short, they want information any time they want information so that they can function in what has become, and will continue to become, an ever faster and more connected world.  Even Progress Apama, who I think is doing really well, uses Flash based instrumentation.  No iPad for you!  There is no CEP environment that lets the IT folks build a complete application for the business user.  So the business user never ‘SEES’ CEP.  So they’re not impressed.  They don’t get it.  And they don’t provide budget for stuff they don’t get.

AND IN CONCLUSION

CEP has failed to achieve the multi-billion dollar market forecasts that we all went out and raised money based upon because most vendors have failed to provide the education and tools necessary to create the complete user experience.  Most of the CEP vendors don’t even have their own visualization products – they partner with other vendors to provide things like Tree Maps or Dashboards.  How they expect to revolutionize the world by outsourcing their interaction with the Business User is beyond me.

THE LESSON?

If the NoSQL camp would like to come anywhere close to realizing the crack smoking analysts’ estimates of a $2B market, they should 1) make the technology readily accessible to the IT department (which they’re doing) and 2) make sure that the business user knows why it’s making a difference.  If they can reach out and touch the business user, or consumer, all the better.

AND THANKS FOR READING

Flash Crash – HFT Not To Blame so What Next?

Much to the SEC’s consternation, the recent report detailing the causes of the May 6th, 2010 Flash Crash has failed to indict High Frequency Trading as the cause.

BUT WE WANT THE MONEY

How does all of this tie in with the SEC’s bid to build a huge consolidated audit trail (CAT)?  Well, after a lot of thought, I don’t really know. And I haven’t read a lot from anyone that purports to know either. All I know is the SEC wants a lot of money to build it because, well, we need it to watch all of those evil HFT firms out there disrupting the market.

REG NMS IS BROKEN

So here’s a relevant question: if HFT wasn’t to blame for the Flash Crash, then what was? I thought Reg NMS was supposed to ensure that, no matter how many exchanges enter the fray, we were all supposed to get the best price – regardless of where the order executed. And that all of the exchanges would cooperate and there would be peace and harmony in the valley. Guess what, that’s not working.  It would seem that the ‘structure’ of the market needs a little attention.

IF IT’S BROKEN, WHAT ARE THEY REGULATING?

So, if HFT isn’t to blame for the crash, and it boils down to some idiot shorting billions of dollars of futures in one fell swoop, then something else must be broken. And the SEC’s CAT will fix it, right? Wrong. All the SEC’s CAT proposal will do, once it’s live in 4 years and after billions of dollars, is yet again confirm that HFT isn’t responsible for any other flash crashes either. So why spend the money?

BECAUSE I WANT AN EMPIRE

Could it be that the SEC, who didn’t even have the expertise required to answer the basic question, “WTF happened to the market please,” could be looking to build an empire? Is this the same organization who seems to dispense different levels of disciplinary action based upon how deep the offending firm’s pockets are? Say it isn’t so!  Asking for billions of dollars represents a multiple over the SEC’s current operating budget.  Just what exactly do they plan on doing with all that money?  And how many firms have offered to build the capabilities for a lot less?  If the SEC doesn’t even know what the problem is, how do they know how much money to ask for and what system needs to be built?  Madness.

I AM NOT IMPRESSED

I’d really like to hear how the SEC’s going to fix the market’s structure and prevent this from happening again.  Then I’d like to see actual enforcement of the laws on the books and see people who violate these laws go to jail instead of write checks.  In short, what I’d like to see is the SEC start doing their job before asking for billions of dollars to do something that wouldn’t have prevented what happened in the first place.  Please.  Do the SEC and Mary Schapiro think we’re all just that stupid?

THAT’S RIGHT, I SAID IT

So not only am I going to go on record with the above (go back and read it again – the SEC isn’t doing their job) but I’m also going to say that event processing technology can’t prevent future occurrences of Flash Crash (and CEP vendors who think they can really don’t know what they’re talking about).  Several vendors (YOU KNOW WHO YOU ARE) are all too happy to not point out that the emperor’s new clothes, well, need some additional tailoring as they jump on the SEC and CFTC’s bandwagon, hoping for additional, regulation sourced revenue.

THE SEC COULD BE A HERO

I believe that if the SEC started doing their job again, and investor confidence returned to the equity markets, maybe we’d see a return to previous volumes.  Because right now, at the volume levels out there today, there’s going to be blood in the streets as firms collapse under the lack of trading.  All the SEC has to do is their job.  And be 1/10th as vocal about it as their request for billions of hard earned taxpayers’ dollars.

Thanks for reading!

It's Deja Vu – All Over Again

I was recently asked “What problems does CEP solve that cannot be solved with smart coding, a columnar database or a whopping great grid?” via a Linkedin group for Complex Event Processing.  Here’s the link.  I think I understand the question, and if I do, it’s really the wrong question.  Which means of course that my answer is probably going to cause a bit of a ruckus.

WHAT IS CEP ANYWAY?

There are a couple of things that a system needs to say that it incorporates CEP – the ability to continuously query an event stream, pattern matching semantics, and sliding windows over data.  For example:

SELECT PHASORID, AVG(FREQUENCY)  FROM SMARTGRID.WIN:TIME(30 MINUTES) GROUP BY PHASORID;

This sets up a continuous query – it listens for events that are named ‘SMARTGRID’ and calculates the average frequency by phasorid.  This is an example of a time based window.  The window always refers to the last 30 minutes.  Another example:

SELECT A.* FROM PATTERN(A=SMARTGRID(FREQUENCY>130) -> (TIME.INTERVAL=30 MINUTES) AND B=SMARTGRID(FREQUENCY<110 AND A.PHASORID=B.PHASORID));

This statement says, “I want to know when a phasor measures a high frequency followed by a lower frequency.”  This is an example of pattern matching.  Both examples demonstrate a continuous query – those queries just keep running, returning a continuous stream of results.
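For readers who think better in a general purpose language, here is roughly what the second statement asks for, written out by hand in Python.  Treat it as one reading of the query under stated assumptions, not as the semantics of any particular engine: I'm assuming the 30 minute interval bounds how long after the high reading the low one may arrive, that events arrive in time order, and that readings are `(timestamp_seconds, phasor_id, frequency)` tuples.  The function name `match_high_then_low` is mine.

```python
from typing import Dict, List, Tuple

WINDOW_SECONDS = 30 * 60  # the 30 minute interval from the query above


def match_high_then_low(
    events: List[Tuple[float, str, float]],
    high: float = 130.0,
    low: float = 110.0,
) -> List[Tuple[str, float, float]]:
    """Return (phasor_id, high_time, low_time) for each high reading that is
    followed by a low reading from the same phasor within the window."""
    pending_high: Dict[str, float] = {}  # phasor_id -> time of last high reading
    matches: List[Tuple[str, float, float]] = []
    for timestamp, phasor_id, frequency in events:  # assumed time-ordered
        if frequency > high:
            pending_high[phasor_id] = timestamp
        elif frequency < low and phasor_id in pending_high:
            high_time = pending_high.pop(phasor_id)
            if timestamp - high_time <= WINDOW_SECONDS:
                matches.append((phasor_id, high_time, timestamp))
    return matches


readings = [
    (0.0, "p1", 135.0),    # high on p1
    (60.0, "p2", 105.0),   # low on p2, but no earlier high: ignored
    (600.0, "p1", 100.0),  # low on p1 within 30 minutes: match
    (700.0, "p2", 140.0),  # high on p2
    (4000.0, "p2", 90.0),  # low on p2, but more than 30 minutes later
]
found = match_high_then_low(readings)
```

Only phasor `p1` produces a match here.  Notice how much bookkeeping even this toy version needs, and it handles exactly one pattern; that's the kind of code the one-line query saves you from writing.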

THAT’S A LOT OF CEP

I’ve been focused primarily on solving problems using CEP for the last 6 years, and using event processing for the last 20+.  So, I’ve seen lots of examples of the right use of CEP and some wrong uses of CEP as well.  We’ll talk about the right uses here.  Mark Piper asks about, “smart coding.”  Let’s answer that one first.

SMART CODERS

Sure, smart coders can get along without CEP.  Just like they can get along without a database.  Or a messaging bus.  And if you’re a really smart coder, you don’t even need an operating system.  You’ll just write all of those using your copious amount of spare time that your employer provides you because you don’t have any deliverables that are time sensitive.  And since you don’t have any customers, you can do whatever you want anyway.  Right?  Wrong!

There’s a difference between an implementation of CEP and CEP itself – maybe you want the whole development environment à la StreamBase, Apama or Sybase.  Maybe you want an API.  That’s most likely up to your individual/team coding style and whether or not you actually buy into the, “Use our platform for everything – it will save you time and money!” argument.

But the gist of this argument is that if you want to have a CEP source of functionality to use when solving problems by writing code, you probably want a general library, or system, that you can use to do so instead of writing it yourself.  And I’ve had the privilege of working with some very smart coders over the years and it took even them slightly longer than a month or two to build something that could be used in a number of different scenarios.

CHEAPER CODE

So that’s the tech side.  What about the business side?  That’s even easier – I don’t want to hire a bunch of C++ coders to write infrastructure – unless that’s my business.  And most businesses don’t fall into that category.  A lot of big financial firms do get a bigger bang for buck by writing their own stuff – they’ve got the staff (because they can afford the really smart coders it takes to build this stuff), they’ve got performance requirements that are extremely stringent and it’s actually cheaper for them to build vs buy given scope and breadth of deployment.  So, unless it’s your business to build infrastructure by using very expensive coders, it’s cheaper and just as effective to use third party libraries or systems.  And that’s not just for CEP but generally applicable across the board.

COLUMNAR DATABASES AND GRIDS?

Not sure if I understand the last part of Mark’s question, but I’ll take a stab anyway.  Mark mentions grid – and grid and stream processing typically find themselves used in different parts of the same problem.  For example, if I’m running a big credit derivatives trading operation, I’ve probably got a grid running some fairly heavy compute and updating a shared cache of stuff that the bank needs for a variety of applications.  Then let’s imagine that I’m receiving client orders and want to do something based upon data in that grid.  I might just use a system that incorporates components of CEP (notice that I did not say a CEP system here – that’s important) to streamline the event stream processing characteristics of that application, using the grid cache to grab parameters for me as they change.  So the use of grid and CEP here is complementary – not diametrically opposed.  Two different compute problems, two different technologies applied to a very common pattern.  There are similar patterns involving the use of CEP for data-in-flight and columnar, or any database really, for historical data.

SO IN CONCLUSION

I recommend the use of existing code over writing new code just because you’ve got smart coders if it makes sense and is economically attractive.  And, just like in the 90s when we were selling EAI applications, all the C++ guys whined back then, complaining that, “We can do it better and faster.”  While that may have been true (very few ever actually proved that to me), it was usually far more expensive and very brittle.  Tools exist for a reason – it’s what separates us from the animals.

AND AS ALWAYS

Thanks for reading!

Analytical Platforms, Databases, & CEP

This from the book, “Hadoop: The Definitive Guide”:

“This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system.”

BUT I THOUGHT IT WAS ABOUT BIG DATA?

It is, but Hadoop is not designed, at least today, for anything other than write once (and now maybe append), and then analyze many times over and over again.  Databases are traditionally better suited for applications that need to read, write, and update data.  That is, until columnar databases arrived on the scene.  Thank you Michael Stonebraker, et al.

STONEBRAKER, AGAIN?

Yup, a couple of years ago, Mr. Stonebraker said that Map/Reduce was  a major step backwards.  That’s when he was actively involved with Vertica.  Just recently, he said that data warehouses are too big for in-memory approaches.  It would appear that his opinions have evolved along with technology.  Hmm.  Can you have it both ways?  I think not. So?

AGAIN, WHAT’S YOUR POINT?

My point is that analytical platforms are different than databases, and different than data warehouses.  And if we look at databases today, we can make a broad distinction – those databases with in-db map/reduce and those without.  Aster Data, Greenplum, and Sybase (planned) on the commercial side fit that bill as do Riak and MongoDB on the free-for-now front.  Analytical platforms need to analyze big data, the faster the better, with deep & complex analytical capabilities, and that lends itself to an in-db map/reduce ‘brother can you par a dime’ (paradigm).

THIS JUST IN FROM THE PROSPECTIVE CUSTOMER (what we call, “The Market”)

Everyone that I’ve been speaking with in the analytical platform market doesn’t just want to store their data and analyze it later, they want to analyze it *now* and store it and analyze it later.  And then again.  Lather.  Rinse.  Repeat.  But unless you’re a vendor like Sybase who has CEP, or Oracle who has CEP, or Vertica who has StreamBase, how are you going to offer your customer those capabilities?  (and yes, I’ve mixed traditional database offerings just now with event processing add-ons in a mixed metaphor worthy of Bush himself)

CEP JUST ISN’T THAT MOTIVATING ANYMORE

Amid all the hubbub around SAP’s acquisition of Sybase, there was nary a mention of CEP.  No, our darling CEP platforms had taken a backseat to the mobile love child that had been in the making for years.

BUT IT’S STILL A TICK MARK

Which means that if you’ve got a database, especially an analytical database, then you’re going to want to combine the loading ability with some type of event processing functionality.  This is just common sense. Or your customer might go buy from someone else.

THE DRUMROLL PLEASE

Seems like a lot of software companies are buying hardware companies.  And hardware companies are buying software companies.  Like HP, Greenplum, EMC, 3PAR, etc.  There aren’t too many analytical platforms out there.  There are fewer columnar guys.  But there’s only one pure play CEP vendor left, and that’s StreamBase.  So the question isn’t who’s going to buy which analytical platform or columnar datastore, it’s who’s going to buy StreamBase.  And my bet is HP.

AND AS ALWAYS

Thanks for reading.

Event Processing in the Cloud – DataSift is a Big Proof Point

In the past year or so, I’ve heard from many skeptics – people who didn’t believe that Event Processing could be successfully deployed in the cloud.  Granted, most of these folks represented firms actively engaged in providing the High Frequency Trading (Algo Trading) industry with tools.  And in that arena, cloud deployment probably doesn’t make sense.  Yet.

CLOSER TO HOME THOUGH

Ask people in Capital Markets about Twitter and the most common response you’ll get is, “What do people use it for?”  This is because most of the people in Capital Markets can’t use things like Twitter, instant messaging, or Facebook at work, and if they can, it’s heavily regulated.  But the point is that they mostly don’t get it – I myself was included in this camp until a friend of mine explained it to me.  Since then, I’ve taken to Twitter like a fish to water.  My point here is that there are a lot more people in the world who know something about Twitter than about High Frequency Trading.

SO WHAT?

I’ll tell you.  This past week we saw the announcement of DataSift; there’s a great video of DataSift in operation here.  I’m really impressed with DataSift’s capabilities.  I don’t think their filtering capabilities incorporate CEP as some have claimed (I didn’t see pattern matching or windows in the demo), and I strongly disagree with Nick Halstead’s claims that their offering is the only one out there with these capabilities.  Even so, I think DataSift proves a very interesting point.

AND THAT IS?

Twitter is probably the closest thing that we have that embodies the event cloud (via a single source anyway).  Ask 20 people what Twitter is and you’ll get 20 different answers.  My answer would be, “a living, breathing, consciousness – what’s the world thinking about and doing right now?” And in that regard, Twitter is an “event cloud.”  And DataSift is querying and filtering that “event cloud” in real time – providing relevance, or extracting items of interest.  To you, right now – not tomorrow in a report on your desk or in a daily email digest.  Right now.  And they’re using an event driven architecture to do it.

THE CONCLUSION

DataSift is processing the entire Twitter fire hose, and although statistics are hard to come by in terms of what that means, Mr. Halstead is readying the platform for release at web scale.  So, that’s big data in, and ‘sifted’ data out, to potentially millions of users, all simultaneously.  I think those who claimed, “The cloud is too slow for event processing,” might be eating some crow soon.

THE SINGULARITY IS APPROACHING

And you either get it, or you don’t.  And if you do, the question isn’t whether or not you’ll use tools like DataSift, but how soon.

AND AS ALWAYS

Thanks for reading.

Data Mining in Streaming Data – CEP & SAX

In the last couple of posts, I’ve outlined a method for both reducing the dimensionality of continuous data and also reducing it to symbols to make further analysis easier. The method we’ve been using is referred to as Symbolic Aggregate Approximation, or SAX.

STREAMING SAX

The examples that I’ve shown so far have been illustrated using Excel. But if we were serious about using SAX in a real world scenario, we’d most probably be processing some type of streaming data. SAX has application anywhere there’s a bunch of highly dimensional, continuous data being generated. But we’ll stick to stock market trade data for now.

I went out and purchased a month’s worth of IBM trade prices & volumes from the NYSE – it’s very easy to do, and you can do it here. Once I did that, I loaded the data into a MySQL database and prepared to process it within DarkStar, our distributed event processing system that uses components of streaming map/reduce and complex event processing.

BEFORE WE GET STARTED

In the examples I outlined earlier, I took an entire day’s worth of data, normalized it, and then applied piecewise aggregate approximation to it, dividing a trading day up into 7 roughly equal samples. Now that we’re going to process the data as it streams out of the exchange, how should we break things up? The answer depends upon the question you’re asking. If there’s a pattern you think shows itself every 10 minutes and consists of 10 discrete values, then we should sample 10 minutes’ worth of data and break it down into intervals of 1 minute using the techniques shown earlier. So, the first thing we’re going to do is create a named window. The named window provides the data we need as a 10 minute sliding window.  We describe the window like this:

CREATE WINDOW winTradeData.win:time(10 minutes) as select * from tradeEvent;
INSERT INTO winTradeData select * from tradeEvent;

These two statements 1) create a sliding window that contains the last 10 minutes of tradeEvent events, and 2) insert tradeEvent events into that window as they arrive.  The first statement creates a named window that has all of the fields from the tradeEvent event.  The second statement populates the window.  So far so good.
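For readers without an event processing engine handy, here is a minimal Python sketch of what a 10 minute sliding window does.  This is not how DarkStar implements it; the `TradeEvent` and `SlidingTimeWindow` names are made up for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TradeEvent:
    symbol: str
    price: float
    timestamp: float  # seconds since the start of the session (assumed)


class SlidingTimeWindow:
    """Keeps only the events that arrived within the last `span_seconds`."""

    def __init__(self, span_seconds: float):
        self.span_seconds = span_seconds
        self.events = deque()

    def insert(self, event: TradeEvent) -> None:
        # Add the new event, then expire anything that has aged out.
        self.events.append(event)
        cutoff = event.timestamp - self.span_seconds
        while self.events and self.events[0].timestamp <= cutoff:
            self.events.popleft()


# Hypothetical trades, for illustration only.
window = SlidingTimeWindow(span_seconds=10 * 60)
for t, price in [(0, 125.10), (300, 125.40), (599, 125.35), (660, 125.60)]:
    window.insert(TradeEvent("IBM", price, t))

print([e.timestamp for e in window.events])
```

By the time the trade at second 660 arrives, the trade at second 0 is more than 10 minutes old, so it has been dropped from the window.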

WE’VE GOT A WINDOW FULL OF EVENTS, NOW WHAT?

Well, we’d like to break down the window into 10 equal segments of 1 minute each.  And then we’ll average and classify the 1 minute segments.  But before we can classify the data, we need to normalize it.  We want to do this every minute; we don’t want to wait and do this every 10 minutes, do we?  If we did, we might miss a whole bunch of patterns that started in the previous window and ended in the current window.  So we’re going to pull data from the window and normalize it every minute with this statement (I call this a ‘tumbling window’):

SELECT symbol, (price-avg(price))/std(price) as normalized_price FROM winTradeData output every 1 minute;
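The normalization in that statement is an ordinary z-score: subtract the window’s mean price and divide by its standard deviation.  A rough Python equivalent is sketched below.  The function name is hypothetical, and the sketch assumes the population standard deviation, which may not match what the engine computes.

```python
from statistics import mean, pstdev


def z_normalize(prices):
    """Return (price - mean) / stddev for each price in the window."""
    mu = mean(prices)
    sigma = pstdev(prices)  # assumption: population standard deviation
    if sigma == 0:
        # A flat window has no variation, so every point sits at the mean.
        return [0.0 for _ in prices]
    return [(p - mu) / sigma for p in prices]


# Hypothetical window contents, for illustration only.
window_prices = [125.0, 126.0, 127.0, 128.0, 129.0]
normalized = z_normalize(window_prices)
print(normalized)
```

After normalization the values have a mean of zero and a standard deviation of one, which is what lets us compare windows that trade at very different price levels.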

PICK A LETTER, ANY LETTER

We’re going to apply PAA to this resulting data set (see earlier post).  PAA will give us an average value for each time slice within the interval that we’re analyzing.  In this case, each slice is 1 minute long.  So we want to average all the trades for a 1 minute period and then look up the corresponding SAX letter.  We could write another query to accomplish this or perhaps modify the one above.  Once we have the averages, we can assign a letter to each one, and the letters together form a SAX word.
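To make the PAA and letter lookup steps concrete, here is a small Python sketch.  It makes a few assumptions that are not from the queries above: it slices by equal counts of values rather than by clock minutes, it uses a four letter alphabet, and it places the breakpoints at equal-probability cut points of a standard normal distribution, which is the usual SAX convention.  All function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean


def paa(values, segments):
    """Average `values` over `segments` contiguous, roughly equal slices."""
    n = len(values)
    if n < segments:
        raise ValueError("need at least one value per segment")
    bounds = [i * n // segments for i in range(segments + 1)]
    return [mean(values[bounds[i]:bounds[i + 1]]) for i in range(segments)]


def sax_breakpoints(alphabet_size):
    """Cut points that split a standard normal into equal-probability bins."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def sax_word(normalized_values, segments, alphabet="abcd"):
    """Turn already-normalized values into a SAX word, one letter per slice."""
    cuts = sax_breakpoints(len(alphabet))
    averages = paa(normalized_values, segments)
    return "".join(alphabet[bisect_right(cuts, avg)] for avg in averages)


# Hypothetical normalized prices, for illustration only.
values = [-1.5, -1.3, -0.4, -0.2, 0.1, 0.3, 1.2, 1.6]
print(sax_word(values, segments=4))  # prints "abcd"
```

A steadily rising window like this one walks up the alphabet; a falling one would read the other way round.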

SO WE’VE GOT A SAX WORD, NOW WHAT?

Now that we’re able to describe streaming data in a discrete way, with a lower bounding function, we’re ready to do some more things.  In an earlier post, I said that SAX could be used for clustering, classification, anomaly detection, and search.  We’re going to focus on search in the next post.

UNTIL THEN

Think about how this algorithm lends itself to a map/reduce (via Hadoop or via in-database map/reduce) implementation and how we’d use SAX then to correlate streaming data to historical data – there’s a lot in this blog that talks about this, perhaps not in terms of SAX, but there’s work in map/reduce, inverted indexes, etc.  We’ll need all of that, and a little more, to string it all together.
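As a thought starter only, here is one way the map/reduce and inverted index pieces might fit together, sketched in plain Python rather than Hadoop.  The record layout, the function names, and the sample data are all assumptions made up for this illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (sax_word, (symbol, window_start)) for one stored window."""
    symbol, window_start, word = record
    yield word, (symbol, window_start)


def reduce_phase(pairs):
    """Group postings by SAX word to form an inverted index."""
    index = defaultdict(list)
    for word, posting in pairs:
        index[word].append(posting)
    return dict(index)


# Hypothetical historical windows, for illustration only.
historical = [
    ("IBM", "09:30", "abcd"),
    ("IBM", "09:40", "ddca"),
    ("IBM", "09:50", "abcd"),
]
pairs = [pair for record in historical for pair in map_phase(record)]
index = reduce_phase(pairs)

# A SAX word coming off the stream can now be looked up against history.
print(index.get("abcd", []))
```

With an index like this, correlating a live window to history becomes a dictionary lookup on its SAX word.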

THE NEXT POST

Will happen sooner than the last, I promise.

THANKS FOR READING!


Why I Love the Cloud Today – Up & Running

How many times have you thought to yourself, “Self, I’d really like to take a look at that wonderful, does-everything-that-I-need, server-based product,” only to realize that you don’t have a machine, and if you did have a machine, you don’t have the OS because the product likes to run on RHEL and you only have Ubuntu lying around?  Sure, you could download an ISO, burn a CD, find a piece of hardware that has enough memory and disk on it (oddly enough, all of those machines in a development shop seem to be occupied...) and get going.

BUT WAIT, OPERATORS ARE STANDING BY

I ran into exactly this situation today.  I was cruising the ‘net – looking at analytical databases in conjunction with a project I’m working on and came across Greenplum’s offering.  I’d heard a lot about it, but I thought, “Oh, there will be endless meetings with pull-the-string sales guys spouting pre-recorded messages wanting to know why I wanted to use their database before they’d let me look at their software; and only after filling out a number of legal forms.  All of which will require legal review.”  But no, Greenplum had a Single Node Engine available for download.

EUREKA!

The website maintained my interest, something that’s harder and harder to do these days, and abracadabra, I received an email with links to download their database.  I excitedly clicked through, only to find that the database ran on an OS that I didn’t have handy.

PRACTICING WHAT YOU PREACH

I spend all day either talking, tweeting, or writing about elastic resource.  And when I’m not doing that, I’m probably writing code.  And then it came to me, in a flash – “Hey stupid, maybe Rackspace has a CentOS 5.5 image ready and willing?”  Well, Rackspace did – and I was up and running with Greenplum’s database humming along contentedly, waiting to do my bidding.

SURE, BUT HOW LONG DID THAT TAKE?

About an hour.  Seriously.  Oh, and I spent about a dollar.  Really.

SO IF YOU DON’T THINK THE CLOUD CHANGES EVERYTHING

You’ve either been living under a rock for the last 2 years or, like me, you spend all day talking about it and forget that this stuff actually works!

AND AS USUAL

Thanks for reading! I appreciate your time.

Why I Love the Cloud Today – Easy Peasy FIX in the Cloud

We’re working with a customer who’d like to send us information using the FIX protocol.  FIX is used in electronic trading for sending orders and receiving executions from brokers, ECNs, and exchanges.
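If you’ve never seen FIX on the wire, a message is a run of tag=value fields separated by the SOH character (0x01), ending with a three digit checksum in tag 10.  The sketch below builds and parses a stripped-down order in plain Python.  It is for illustration only: it does not use QuickFIX, it leaves out header fields a real session requires (sender, target, sequence number, sending time), and the helper names are made up.

```python
SOH = "\x01"  # FIX field delimiter


def checksum(data: str) -> str:
    """Sum of all bytes modulo 256, formatted as three digits (tag 10)."""
    return f"{sum(data.encode('ascii')) % 256:03d}"


def build_message(body_fields, begin_string="FIX.4.2"):
    """Wrap body fields with BeginString (8), BodyLength (9), CheckSum (10)."""
    body = "".join(f"{tag}={value}{SOH}" for tag, value in body_fields)
    head = f"8={begin_string}{SOH}9={len(body)}{SOH}"
    return f"{head}{body}10={checksum(head + body)}{SOH}"


def parse_message(raw):
    """Split a raw message into {tag: value}; repeating groups would collide."""
    fields = [f.split("=", 1) for f in raw.split(SOH) if f]
    return {int(tag): value for tag, value in fields}


def is_valid(raw):
    """Recompute the checksum over everything before the CheckSum field."""
    body, sep, trailer = raw.rpartition(f"{SOH}10=")
    return bool(sep) and trailer.rstrip(SOH) == checksum(body + SOH)


# 35=D is a new order; 55 symbol, 54 side (1 = buy), 38 quantity,
# 40 order type (2 = limit), 44 price.  Values are hypothetical.
order = build_message(
    [(35, "D"), (55, "IBM"), (54, "1"), (38, "100"), (40, "2"), (44, "125.50")]
)
parsed = parse_message(order)
print(parsed[35], parsed[55], parsed[38], is_valid(order))
```

Once a message is broken into fields like this, each one can be treated as an event with named attributes, which is what makes it queryable.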

DARKSTAR SPEAKS FIX

DarkStar, our cloud based, distributed event processing engine that incorporates streaming map/reduce and complex event processing, speaks FIX.  We use the QuickFIX open source FIX engine.  You can find it here.  It’s free.  We include this as a standard OnRamp (OnRamps are used to inject information into DarkStar) and we don’t charge for it.  We’re the only CEP vendor that includes FIX support for free.

DEPLOYMENT (GENTLEMEN, START YOUR STOPWATCHES...)

We have a standard OnRamp image.  One simply logs into our cloud, and deploys another virtual machine using the OnRamp image and our customer gets a dedicated VM to handle their FIX connection to their DarkStar cluster.  The OnRamp image already knows how to inject events into DarkStar, so we set a couple of configuration settings for the FIX engine, and we’re ready to start testing.  Really, that’s it.  The client’s FIX messages (events) are now ready for dynamic, CEP based query and streaming map/reduce style analysis.  Total time?  Less than 5 minutes.

CLOUD ISN’T NECESSARILY SaaS

If your SaaS doesn’t leverage elastic resource (like just spooling up a VM and instant-presto-change-o it’s available to do work), then it’s not really cloud based.  So while you can certainly make applications available via the cloud, taking the necessary steps to utilize elastic resource can have a fantastic ROI.  Like I pointed out above – a new FIX connection, running on a dedicated VM in less than 5 minutes.

AND AS ALWAYS

Thanks for reading!