Standalone CEP is Dead-Long Live the Database

In days of old, when CEP didn’t exist, and we called it ESP, or Event Stream Processing, the whole value proposition that most vendors in the space espoused was, “We don’t have to write stuff to the database to process it.  And that makes us really fast!”

THE VALUE PROPOSITION PLEASE

What made me start thinking about this was all the stirring lately in the in-memory database area.  Hasso Plattner’s (SAP’s chairman) been working on this for quite some time and last week announced some fairly startling news.  SAP is planning on using both row/column, in-memory store for everything.  Because the underlying database is so fast, a lot of stuff that had to be pre-calculated previously, doesn’t have to be any more.  And that can cut down a database size by an order of magnitude.  Small enough to fit in memory. How?  Let me explain.

OTP & OLAP

A company’s OTP (online transaction processing) environment is where the money is made.  It’s typically a row-based, transaction oriented, ACID compliant store.  Vendors like Oracle, Sybase, & Microsoft dominate here with a growing segment of PostGreSQL and MySQL use.  A business needs speed and they need transactions here – you want to know if someone has bought something or not and it needs to happen quickly so that customers don’t get upset and go somewhere else.  The data models are normalized, key-based, and sparse.

When a company wants to analyze sales, or costs, or whatever, they typically extract all the data from their OTP environment, transform and load it into their data warehouse and may update their OLAP environment at the same time.  Updating the OLAP environment involves taking all of the transaction data from the OTP environment and exploding it into huge fact tables with corresponding dimension tables.  And lots and lots of pre-aggregated results. This is all so that end user OLAP tools can spin the data to provide analysts with way to analyze all the OTP data.  Questions like “let me see sales by product by region by quarter,” or planning questions like “what happens if we raise salaries by 5%?”  This creates a lot of data.  And all that data takes more space.  All because companies don’t want to mess with the OTP environment – and because disk is slow.

PHYSICAL CONSTRAINTS

Most of the database technology so-called innovations in the relational world exist because disk is slow.  Normalized tables, indexes, etc. all exist so that data can be moved around as fast as possible because disk is slow.  And when relational databases were invented, disk was really slow.  But if an in-memory database was fast enough to not only handle the needs of an OTP environment, but also produce exploded fact tables and compute dimensional analysis on the fly, we don’t need all that extra disk space.  Sure, we still need transactions because we need to know when something happened.  But that can be done with memory now too.

ON A CLEAR DISK, YOU CAN SEEK FOREVER

But if in-memory databases are that fast now, what happens to CEP’s value proposition?  Every meaningful CEP based system I’ve worked on in the last 5 years has involved some form of persistence somewhere.  Seems like you can’t really separate the two.  So why not combine them if in-memory databases are fast enough now to support this type of behavior?

MPP DATABASES & MAP/REDUCE

CEP is fine for solving some problems, but typically not those problems involving either a lot of data or a lot of compute.  Most CEP systems involve taking a high velocity data stream, decorating it with some fairly simple calculations, and then doing something when a trigger fires.  A perfect example of this is algo, or High Frequency Trading.  We haven’t seen CEP in sophisticated derivative or fixed income environments because of the compute required – that doesn’t fit the CEP model of yester-year.

Using massively parallel processing databases & map/reduce solves a very big problem in this area – data affinity.  In a grid or cloud based compute model, it’s easy to saturate the network getting data from where it lives to where it needs to be processed.  If a compute process is broken down using a map/reduce foundation, and the compute is run where the data lives, and results are then bubbled up, not only does the compute get done faster, but there’s less of a chance of saturating the network as well.

SO WHY NOT EMBED CEP IN YOUR MPP IN-MEMORY DATABASE?

That seems like a good idea to me.  Maybe the resulting latency isn’t low enough for things like algo trading, but then most applications don’t require that kind of latency.  On the plus side, you get to take advantage of virtual resource, cloud based things like compute, storage, and network.  That way, you can dynamically add more compute when you need to; as your business grows or to accommodate spikes.  And my bet is that the latency problem isn’t that far from being solved; even for algo trading applications.

MY PREDICTION

So if you’re CEP solution doesn’t also have a good persistence story, you’re toast.  And if your database solution doesn’t have a good CEP story, you’re toast.  I know vendors in both spaces who are not being considered for opportunities because they’re missing one of those critical components.  Vendors like Oracle, Sybase, and now SAP agree with me.  And customers do too.  Customer’s are always right, right?

AND AS ALWAYS

Thanks for reading.

Let's Blame Everything on CEP & HFT

In a recent blog post, Tim Bass blames CEP for much of the world’s problems.  You can read his post here.

You can read my response here:

Tim,

Lot’s of misinformation and knee jerk reactions are out there regarding HFT. Unfortunately, this is another one of them. Your posts are usually a bit inflammatory; that’s why I like reading your blog. But you usually provide at least a shred or two of evidence to support your thesis.

I’d really like to see some examples of how HFT is damaging individual investors – I’m personally not aware of any and I’ve spent the last 20+ years in the field – even before CEP became a popular buzz word. If you’ve got some examples, I’d love to see them.

We went though this same controversy back in 1987 and many people blamed that crash on portfolio insurance and automated trading then – but they were wrong then, as they are now in blaming the recent flash crash on HFT. When there are no buyers, prices plunge. When there are no buyers, HFT stops – HFT firms aren’t usually sending market orders blindly in a falling market. It’s just stupid to do so.

The recent flash crash has to do more with market structure and too much global debt – NYSE halted (temporarily, I know they don’t like that term) for what’s referred to as a LRP, or liquidity replenishment point. The goal of an LRP is to put a symbol into auction mode when a bunch of orders come in on one side, resulting in an imbalance. When those symbols were halted, orders for those symbols went to other exchanges. Where there was no liquidity and those orders ended up crossing at what are referred to as stub quotes. Stub quotes are required by some crossing engines to open the symbol for trading. No one is putting real bids in at a penny. When those orders crossed at a penny, the index calculations picked up the crosses and the ‘market’ plunged when the index re-priced. If those sell orders that went off at a penny had been limit orders, this probably wouldn’t have happened. This may be an example of SOR – Stupid Order Routing.

When NYSE re-opened those symbols, trading resumed and, magically, at the prices they were at before the LRP was activated. So that worked for NYSE. What didn’t work was other exchanges not honoring NYSE’s action. Knowing the rules is important; sending orders to off-primary exchanges can incur additional risk. We saw that during the flash crash.

So, this would have happened with or without HFT; with or without CEP; and with or without the vendors that sell all the hardware and software to make HFT possible. In fact, it could have happened with even 1 large sell order in an affected symbol. And HFT doesn’t usually involved large orders.

In fact, in this day of lessened or no return for those who use to act as intermediaries in the market (specialists or market makers), it could be argued that those engaging in HFT are actually helping the individual trader (or investor) by providing more liquidity.

It’s easy to blame HFT for the machinations we see in the market today – it’s harder to blame your neighbor for taking on more debt than they can afford to buy more and more goods than they can afford. And, as we saw during the flash crash, consumerism and greed is no longer an American only obsession.

Some day, you’ve got to pay.  And thanks for reading.

Search, Themes, Entities, and Sentiment Analysis

I’ve been working on a project lately involving looking for and processing any and all data for a subset of the OTC equities market here in the US.  Listening to a description of what needed to be accomplished, someone piped up and said, “Well, this should be pretty easy.  All we have to do is search for the stock symbols, company names, etc, persist the data, and start calculating our baseline statistics.”

NOT SO PRONTO, TONTO

Try searching the web for a stock symbol, or for 10,000 stock symbols, and you’ll get a jumble of information.  For instance, searching for the symbol “ABLE” can provide hours of entertainment; and completely meaningless results.  So how do we narrow things down a bit.

ENTITY EXTRACTION

By using computational linguistics, we can scan source documents, be they Tweets, Blogs, News, whatever really and look for what are referred to as entities.  Using such approach let’s us not only look for “ABLE” as above, but only where “ABLE” is a noun, as opposed to a verb.  This gets us further down the line.  Combining these types of techniques along with company name, etc. narrows the results down quite a bit.

SUMMARIES GET YOU EXACTLY THAT

If you can’t extract the search entity from the source document, how do you know if the sentiment you’ve calculated for the document actually has anything to do with what you where searching for.  That’s why having the ability to calculate sentiment for an entity identified within the document is as important, if not more, than summarizing the sentiment score at the document level.

CONSTRUCTING INVERTED INDICES

Using the correct approach, we should be able to calculate a document’s summary, any relevant themes, extract referenced entities and calculate sentiment for each.  This is a great problem to solve using elastic resource.  Just keep adding VM’s running the right stack until you can keep up with the inflow.  This is also a great application for Map/Reduce, as I’ve discussed previously.  In this fashion, as elemental information is extracted and calculated for the source document, those keys are either created or updated with the document link id in addition to storing the source document somewhere.

NOSQL = NOTOOLS?

There is a potential need for using a NoSQL database for storing some of the information above.  Several come to mind – MongoDB & CouchDB for document storage, Cassandra for inverted indices, etc.  We’ve even got some of these running with DarkStar right now ,consuming raw information from the web and running streaming Map/Reduce and CEP based analytics and queries against it.  But while some of these open source databases are great for what they’re intended for, there is a severe lack of tools available – tools that we typically take for granted when building out very large datasets that we actually want to do something with after we’ve stored all that data.  If you decide to go down the NoSQL route, be aware of the trade-off’s you’re making, either consciously or not – assumptions can kill a project.

AND AS USUAL

Thanks for reading!

Twitter, DarkStar & Telescope – Delivering Sentiment to an Equities Broker

Cloud Event Processing has moved off the white board and into the real world of, ‘hey, we’ve got customers!’

Twitter

When building a system like DarkStar, one can always run into difficulties – how does one demo a cloud based, distributed event processing system incorporating streaming map/reduce, complex event processing, and event driven pattern matching agents to prospective customers?  Our background includes very high velocity feeds, like equities and option market data feeds, but using those for demo purposes can be difficult.  So we decided on using Twitter.  We decided to use Twitter because we could access tweets via a feed, Twitter is relevant, and using DarkStar to analyze the relatively unstructured contents of a tweet seemed like a good way to strut our stuff.

Little Did We Know

Our demo has turned into a lot more than just a demo.  In addition to analyzing the twitter stream for things like most popular links, #hash tags, mentions, etc, we thought it might be cool to analyze the sentiment expressed in tweets relative to certain topics.  So we went ahead and incorporated that into DarkStar.  Our first public display of this was calculating the sentiment, as expressed within Twitter, regarding VMware’s acquisition of RabbitMQ.  DarkStar gave the acquisition a ‘thumbs-up’ with a streaming sentiment rating of ~.45 – which is pretty darn good.  We’d been doing a lot of work in Ze Bunkah (our lab), but had yet to go public with any results or announcement.

What’s Going On In There?

While DarkStar is really good at consuming both structured and unstructured data and doing neat stuff with it, it’s not too good at displaying the results of its parallel ruminations in any other way than a streaming data feed.  So to help our customers find out what’s going on inside of DarkStar, we began work on Telescope.  Telescope is our rich internet application, written predominantly in Flex.  Using DarkStar, customer’s can submit queries to our cloud, have DarkStar compute the results, and submit those results back to Telescope for further manipulation and visualization.

Another First for CEP and Cloud Event Processing

Although other firms certainly have adapters for submitting queries and obtaining results from Twitter, this is the first application of complex event processing to sentiment analysis that uses Twitter to deliver results.  At least as far as we know of.  And we’re excited about it.  Almost as excited as the customers we’re dealing with.  I’m sure that once we’re up and running that we’ll have some customer specific results to share.

Until then, thanks for reading!

Building Inverted Indexes on Tweets Using Map/Reduce

In our last post, we looked at how to make bad map/reduce code better map/reduce code.  A natural fallout from breaking tweets down into words is the ability to build an inverted index to facilitate searching tweets by key words.

It’s All in the Tweet

Given the tweets,

  1. “@eventcloudpro I like the idea of tree maps, if you combine these with metrics trees in #PM Strategy Management tools u could align the two,” and
  2. “RT @jakewk RT @eventcloudpro: did streambase find a buyer? http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep”
  3. “I really like #erlang – it’s rocking technology!”

How do we compute an inverted index? First, let’s assign a unique ID to each tweet above – for our example, that’s tweet #1, tweet #2, and tweet #3.  Now we want to find which words are used where so we’ll construct a table consisting of words and the list of tweet id’s that contain those words.

Word                    Tweet

@eventcloudpro     1,2

metrics                  1

technology            2, 3

and so on…

Using the Index

To find tweets with the words we’re interested in, we just query the inverted index.  For example, if I’m interested in finding tweets with the word ‘metrics’, using the index I see that tweet #1 has that word, so I look up tweet #1.  If I’m interested in the words ‘@eventcloudpro’ and ‘technology’, I get the sets (1,2) and (2,3) – and they’re intersection is tweet #2.

I’d Like Extra Sauce, Please

So now that we’ve figured out how to look up specific tweets based upon content, how could we look up tweets based upon other stuff; stuff like categorization, entity extract, root words (stemming), or even sentiment?  By running the tweets through a system that calculates these value added goodies, we could construct additional inverted indexes to further organize our tweets.

Map/Reduce

Using map/reduce in the process of calculating inverted indices is a natural fit.  And there’s a great introduction on how to do this using Erlang in Joe Armstrong’s book, “Programming Erlang.”

Writing Bad Code in Map/Reduce

In the Twitter project we’ve been working on, one of the map’s we’re running breaks the text of a Tweet down into words.  Because we can’t assume that any data will be available for access via a database, etc, we attach a couple of values that we’re interested for later analysis to the word, attach a 1, and emit the tuple.  This is an example of what the tuple looks like:

"TimeZone", "Location", "ScreenName", "Word", 1

This map is produced from a tweet that contains many words, so, crunching the Tweet down will result in many of the above values being duplicated.

"TimeZone", "Location", "ScreenName", "anotherWord", 1
"TimeZone", "Location", "ScreenName", "yetAnotherWord", 1
"TimeZone", "Location", "ScreenName", "Word", 1
"TimeZone", "Location", "ScreenName", "Word", 1

But if you notice, there’s a way to compress this a bit. On of the problems with map/reduce (maybe not a problem, but a caveat) is that it’s quite easy to swamp the network. In the example above, we’ll be emitting a tuple (the tuple is slightly longer in production) and repeating information for each word.

What if instead, we used an associative array when crunching down the tweet. For example, after crunching the data above, our data structure would look something like:

"TimeZone", "Location", "ScreenName", {word, 3, anotherWord, 1, yetAnotherWord, 1}

By attaching an associative array of words and their count to the result in the map we dramatically decrease the amount of data being moved around. This reduces both the storage required (if you’re doing old fashioned Hadoop) but more importantly, the network traffic if you’re doing streaming map/reduce.

Of course, this means that the reduce function needs to be aware that it’s getting more than the simple string emitted in our first example, but we’ve saved a lot of bandwidth in the process.

Building a Private Cloud (IaaS)

Cloud Defined (IaaS or Infrastructure as a Service)

I was speaking with a company last week that would like to get into cloud event processing.  For the sake of discussion, I’m going to define ‘cloud’ as the ability to dynamically, and upon demand, use more or less compute, storage, or network capacity.  This is often referred to as ‘elasticity.’  I’m also going to add a few other requirements; this cloud must be able to:

  1. Host operations for a number of customers, all accessing the same data, but in different ways,
  2. Provide access to a lot of big data (see #3 – it’s all stored too),
  3. Able to handle messages rates approach 1MM per second (yes, that’s a million messages per second),
  4. Do something with those messages before storing them (in fact, we’re not even worried about storing them),
  5. And while we’re doing something with those messages, relate them to previously stored messages.

Cloud Event Processing Defined

And I’m going to define cloud event processing as, ‘using many varieties of event processing; including complex event processing (CEP), map/reduce, and event, or message, driven agents to process events.’  Using DarkStar, we’re able to deploy these various forms of event processing without regard for underlying storage, compute or network – the event processing services are simply deployed and begin processing; if a particular service needs more resources, it asks for them and when it receives them, it uses them.

Private Cloud Defined

It’s a private cloud when you deploy the above (Iaas) internally.  It’s a public cloud when it’s deployed publicly, and it’s a hybrid cloud when it’s deployed both privately and publicly.

Data Affinity

The large amount of inbound network bandwidth that would be required to get this particular cloud off the ground and into the sky already makes hosting with Amazon or Rackspace more than a ‘sign up with a credit card’ process.  Having the elastic resources available to each customer of our customer makes things even more difficult.  See, our customer has 100′s of customers who want to do something with this data.  And each one of those customers will need access to both the old data, and the streaming 1MM messages per second.  Without creating a separate physical footprint in a cloud provider, it would be very difficult to solve this problem.

Buy vs Build

The more things change, the more they stay the same.  So, now we’re involved in the old, ‘how much to build vs buy?’ process.  The thing is, this company has the skills to run their own cloud so we’ll most likely start with a proof of concept and then grow it.  We’re cloud seeding at Cloud Event Processing!

The Test Platform

Our recent work with Twitter will be our first deployment in this cloud infant.  Using DarkStar, we’re able to analyze the Twitter feed to sense and respond to opportunities and threats.  This relatively simple project includes components of event driven agents, NoSQL storage, CEP sliding windows & aggregation, visualization, map/reduce and analytics.  (For those of you who’ve been reading all along, you’re asking, ‘analytics?’-we didn’t see any of those Colin! – and you’re right, you didn’t – we’ll share those when we’re ready.)

Kickin’ Tires & Lightin’ Fires

We’ll be starting off with some bog standard Intel hardware, Eucalyptus, and the Ubuntu Enterprise Cloud.  There’s a white paper available on Eucalyptus’s site that details our initial approach.

Thanks for reading!

Cloud Aware or Cloud Built

So You’re Renting Condo’s in the Cloud, Big Deal, What’s Your Next Act?

Here’s another quote from the article, “21 Experts Define Cloud Computing

“The ‘cloud’ model initially has focused on making the hardware layer consumable as on-demand compute and storage capacity. This is an important first step, but for companies to harness the power of the cloud, complete application infrastructure needs to be easily configured, deployed, dynamically-scaled and managed in these virtualized hardware environments.”

Kirill Sheynkman

I like this quote too – but for different reasons than may be first apparent.  We’re watching something evolve at light speed; the cloud.  And the definition above is now taken for granted.  I should be able to dynamically, upon request, add stuff to my compute environment.  Big deal.  How do we take advantage of it?  There’s a big difference between a ‘cloud aware’ application and an application built for the cloud from the ground up.

Horizontal vs Vertical

For an application to take advantage of the cloud, as loosely defined above, it must not only scale vertically, but horizontally as well.  The advent of cloud computing doesn’t obviate the need to write well written systems – systems that take advantage of local compute resources.  Horizontal scalability introduces another axis; “How do we make sure that our application scales just by throwing more machines, disk or network at it?”  The answer is just as complex, if not more, than all of the science required to build a well behaving, vertical scaling application.  It’s more complex because a product has to take into account things it usually depends upon the client to provide via 3rd party products.  Products like cache, storage, bus, management, etc.

Being ‘Cloud Aware’ is Not The Same Thing

So you can run your app in a data center – does this make your app cloud aware?  I guess so but it certainly doesn’t take advantage of elastic resources.  And chances are, that ‘cloud aware’ app you’ve got running in your private, public, or hybrid cloud doesn’t have a pricing model that’s cloud friendly.  There’s another evolution that’s required to fully support cloud deployments, and that’s a cloud aware pricing model.  Try scaling to 1,000′s of cores with a traditional, crossing the chasm, we’ll sell you a license vendor, and the second brick wall you’ll run into after not being able to scale horizontally will be the license fee.  Closely followed by the ever increasing support and maintenance.

OMG Event Processing Symposium 2010

I’ll be speaking at the OMG’s Event Processing Symposium 2010 this coming May.  We’ll be demonstrating streaming map/reduce, complex event processing, and advanced visualization (DarkStar & Telescope) with at least one specific use case to demonstrate how to use Event Processing for Continuous Intelligence.  More details about the conference can be found here.

Dynamic Cloud Resources and Accountability

I reread this article from time to time just to make sure that I stay within some boundaries – 21 Experts Define Cloud Computing.  Among the 21, there are a couple that I really like; I’m going to cite a few of them over the next few days, and tell you what I like and don’t like about them (Also – remember, the title reads “21 Experts…” – it didn’t say “21 Experts in Cloud Computing…” – details matter).  This article was brought to my attention in a blog post, “Rumblings in the Cloud,” by Louis Lovas at Progress Apama.

Dynamic Cloud Resources – While You Wait (or “I’ll have an EC2 grid, monster that, please”)

“What is cloud computing all about? Amazon has coined the word “elasticity” which gives a good idea about the key features: you can scale your infrastructure on demand within minutes or even seconds, instead of days or weeks, thereby avoiding under-utilization (idle servers) and over-utilization (blue screen) of in-house resources. With monitoring and increasing automation of resource provisioning we might one day wake up in a world where we don’t have to care about scaling our Web applications because they can do it alone.”

Markus Klems

I like this quote for a couple of reasons – but there’s something missing here, and it’s missing in the rest of quotes in the article as well.  There’s no mention of accountability.  If an organization is making use of cloud resources, where is the demand coming from?  There are a couple of broad categories that this could fall into, 1) outsourced apps like email, project management, analytics, google apps, etc. (for a great list, see Bessemer’s Rules for the Cloud),  2) operational stuff – development, testing, storage, and 3) REVENUE producing activities.

About 15 years ago, I went on a rampage through one of the companies I was managing.  The exercise was simple, if an activity, report, spreadsheet, fax, etc. couldn’t be tied to a client deliverable (read that ‘REVENUE’) then it was chucked in the bin.  This seemingly simple process elicited sometimes visceral responses; yelling, kicking, and even screaming on occasion.  There was more to this than met the eye.

Nice story, what’s this got to do with dynamic cloud resource allocation.  Easy answer – if resources are being requested in the cloud, what are they for?  If resources are being dynamically scaled, what incremental revenue was produced?  Notice that I make no distinction amongst compute, storage, or network here – it doesn’t matter.  The big question is how do dynamic cloud resources help my company make money?

And that is Accountability – if a resource can’t be tied to a necessary, sponsored, or revenue producing activity then it is cut.  And the cloud makes it all that more easy to ‘chuck stuff in the bin.”  Or, just because your web site can now scale automagically, so what?