Category Archives: Cloud Event Processing

Writing Bad Code in Map/Reduce

In the Twitter project we’ve been working on, one of the map’s we’re running breaks the text of a Tweet down into words.  Because we can’t assume that any data will be available for access via a database, etc, we attach a couple of values that we’re interested for later analysis to the word, attach a 1, and emit the tuple.  This is an example of what the tuple looks like:

"TimeZone", "Location", "ScreenName", "Word", 1

This map is produced from a tweet that contains many words, so, crunching the Tweet down will result in many of the above values being duplicated.

"TimeZone", "Location", "ScreenName", "anotherWord", 1
"TimeZone", "Location", "ScreenName", "yetAnotherWord", 1
"TimeZone", "Location", "ScreenName", "Word", 1
"TimeZone", "Location", "ScreenName", "Word", 1

But if you notice, there’s a way to compress this a bit. On of the problems with map/reduce (maybe not a problem, but a caveat) is that it’s quite easy to swamp the network. In the example above, we’ll be emitting a tuple (the tuple is slightly longer in production) and repeating information for each word.

What if instead, we used an associative array when crunching down the tweet. For example, after crunching the data above, our data structure would look something like:

"TimeZone", "Location", "ScreenName", {word, 3, anotherWord, 1, yetAnotherWord, 1}

By attaching an associative array of words and their count to the result in the map we dramatically decrease the amount of data being moved around. This reduces both the storage required (if you’re doing old fashioned Hadoop) but more importantly, the network traffic if you’re doing streaming map/reduce.

Of course, this means that the reduce function needs to be aware that it’s getting more than the simple string emitted in our first example, but we’ve saved a lot of bandwidth in the process.

Building a Private Cloud (IaaS)

Cloud Defined (IaaS or Infrastructure as a Service)

I was speaking with a company last week that would like to get into cloud event processing.  For the sake of discussion, I’m going to define ‘cloud’ as the ability to dynamically, and upon demand, use more or less compute, storage, or network capacity.  This is often referred to as ‘elasticity.’  I’m also going to add a few other requirements; this cloud must be able to:

  1. Host operations for a number of customers, all accessing the same data, but in different ways,
  2. Provide access to a lot of big data (see #3 – it’s all stored too),
  3. Able to handle messages rates approach 1MM per second (yes, that’s a million messages per second),
  4. Do something with those messages before storing them (in fact, we’re not even worried about storing them),
  5. And while we’re doing something with those messages, relate them to previously stored messages.

Cloud Event Processing Defined

And I’m going to define cloud event processing as, ‘using many varieties of event processing; including complex event processing (CEP), map/reduce, and event, or message, driven agents to process events.’  Using DarkStar, we’re able to deploy these various forms of event processing without regard for underlying storage, compute or network – the event processing services are simply deployed and begin processing; if a particular service needs more resources, it asks for them and when it receives them, it uses them.

Private Cloud Defined

It’s a private cloud when you deploy the above (Iaas) internally.  It’s a public cloud when it’s deployed publicly, and it’s a hybrid cloud when it’s deployed both privately and publicly.

Data Affinity

The large amount of inbound network bandwidth that would be required to get this particular cloud off the ground and into the sky already makes hosting with Amazon or Rackspace more than a ‘sign up with a credit card’ process.  Having the elastic resources available to each customer of our customer makes things even more difficult.  See, our customer has 100′s of customers who want to do something with this data.  And each one of those customers will need access to both the old data, and the streaming 1MM messages per second.  Without creating a separate physical footprint in a cloud provider, it would be very difficult to solve this problem.

Buy vs Build

The more things change, the more they stay the same.  So, now we’re involved in the old, ‘how much to build vs buy?’ process.  The thing is, this company has the skills to run their own cloud so we’ll most likely start with a proof of concept and then grow it.  We’re cloud seeding at Cloud Event Processing!

The Test Platform

Our recent work with Twitter will be our first deployment in this cloud infant.  Using DarkStar, we’re able to analyze the Twitter feed to sense and respond to opportunities and threats.  This relatively simple project includes components of event driven agents, NoSQL storage, CEP sliding windows & aggregation, visualization, map/reduce and analytics.  (For those of you who’ve been reading all along, you’re asking, ‘analytics?’-we didn’t see any of those Colin! – and you’re right, you didn’t – we’ll share those when we’re ready.)

Kickin’ Tires & Lightin’ Fires

We’ll be starting off with some bog standard Intel hardware, Eucalyptus, and the Ubuntu Enterprise Cloud.  There’s a white paper available on Eucalyptus’s site that details our initial approach.

Thanks for reading!

Cloud Aware or Cloud Built

So You’re Renting Condo’s in the Cloud, Big Deal, What’s Your Next Act?

Here’s another quote from the article, “21 Experts Define Cloud Computing

“The ‘cloud’ model initially has focused on making the hardware layer consumable as on-demand compute and storage capacity. This is an important first step, but for companies to harness the power of the cloud, complete application infrastructure needs to be easily configured, deployed, dynamically-scaled and managed in these virtualized hardware environments.”

Kirill Sheynkman

I like this quote too – but for different reasons than may be first apparent.  We’re watching something evolve at light speed; the cloud.  And the definition above is now taken for granted.  I should be able to dynamically, upon request, add stuff to my compute environment.  Big deal.  How do we take advantage of it?  There’s a big difference between a ‘cloud aware’ application and an application built for the cloud from the ground up.

Horizontal vs Vertical

For an application to take advantage of the cloud, as loosely defined above, it must not only scale vertically, but horizontally as well.  The advent of cloud computing doesn’t obviate the need to write well written systems – systems that take advantage of local compute resources.  Horizontal scalability introduces another axis; “How do we make sure that our application scales just by throwing more machines, disk or network at it?”  The answer is just as complex, if not more, than all of the science required to build a well behaving, vertical scaling application.  It’s more complex because a product has to take into account things it usually depends upon the client to provide via 3rd party products.  Products like cache, storage, bus, management, etc.

Being ‘Cloud Aware’ is Not The Same Thing

So you can run your app in a data center – does this make your app cloud aware?  I guess so but it certainly doesn’t take advantage of elastic resources.  And chances are, that ‘cloud aware’ app you’ve got running in your private, public, or hybrid cloud doesn’t have a pricing model that’s cloud friendly.  There’s another evolution that’s required to fully support cloud deployments, and that’s a cloud aware pricing model.  Try scaling to 1,000′s of cores with a traditional, crossing the chasm, we’ll sell you a license vendor, and the second brick wall you’ll run into after not being able to scale horizontally will be the license fee.  Closely followed by the ever increasing support and maintenance.

If You Store My Tweets Today, I'll Gladly Look at Them Next Tuesday

Cloud Event Processing – Where’s The Data?

There are several things left to cover in our #TwitYourl proejct.  One of the most glaring absences so far is storage – where do we put all of these Tweets?  What happens if we’d like to make changes to our RuleBots and test them with historical data?  Right now, we can’t.

In keeping with some of the rules of our implementation, an OnRamp is not allowed to update a database directly.  This would tie the OnRamp to the database implementation and make our architecture more brittle than it has to be.  Also, it would mean that the OnRamp had to have specific information about where it was sending a message – this is not allowed.

Remember, we are architecting for the cloud – the idea of baked in, point to point, connections inside of our cloud should have you recoiling in horror by now.  So how can we update databases easily within our Cloud Event Processing framework?

Well, it would be really easy if we had some metadata that described all of the data, or events, that the OnRamps were publishing on the bus.  And it would be even neater if all of that metadata was available in a central database or directory.  This way, when the infrastructure that #TwitYourl is built on top of is running, all one would need to do is go to some management console, bring up the information about how we’d like to handle a particular event, click the ‘Store in a Database” button and be done with it.  Click, click, click and we’re storing events – automagically; somewhere in the cloud!

So, let’s do this for #TwitYourl – I vote for MongoDB.  I know that Cassandra seems to be a fairly popular choice this past week, but we’re going to buck that trend and hook up MongoDB inside of our cloud and we’re going to feed Mongo some yummy Tweets!

Sorry, I couldn’t resist.

But before feeding Mongo, we’re going to work on some visualization this week.  We’re running a little behind schedule here and we need to get caught up!

Cloud Event Processing-Observations & Next Steps

Over the past few weeks, I’ve implemented map/reduce using techniques commonly found in Complex Event Processing.  Here’s a summary of what was involved, and what tools would make such a deployment easier.

Getting the Data

One of the first tasks accomplished was the creation of an OnRamp – we use OnRamps to get data into our cloud for processing.  The specific OnRamp used in this learning exercise subscribed to Twitter and fed the resulting JSON objects onto the service bus, RabbitMQ in this case.  We had to correctly configure RabbitMQ for this, and the OnRamp needed to be specifically aware of and implement semantics required to publish on this bus.  It would be easier and more portable if this were abstracted in some type of OnRamp api; we had abstracted this at Kaskad.  In Korrelera, the bus didn’t matter – we could just as easily use direct sockets, JMS, Tibco or 29West.  The OnRamp didn’t know, and didn’t care.  In our TwitYourl example, there’s no way to monitor or manage the OnRamp other than tailing its output and visually inspecting it.  There is no central management or operations console.

Definition of Services

Although we’ve used Map/Reduce as our first example, the topology doesn’t really matter.  What matters is that we created a number of services and then deployed them.  In our small example, we wrote a RuleBot that performed the Map function in Map/Reduce.  This RuleBot listen for Tweet JSON objects, pulled them apart, found the information we were interested in, chunked it, and then fed it back onto the service bus. Another RuleBot performed the Reduce function – events were pumped into the Esper open source CEP engine where the could then be queried, Again, the RuleBots had to be aware the underlying bus’s semantics and could not be managed or monitored in our TwitYourl example.

Deployment to the Cloud

All of this had to then be deployed to the cloud – there are two main components to this.  First, we assumed that each node in the cloud was configured correctly.  This had to be done by hand – it would have been much easier to have an image that contained everything we needed from an infrastructure, or plumbing, point of view that could have been deployed to any number of servers via point and click.  Secondly, the services themselves needed to be deployed, and as I’ve already pointed out, those services had to be aware of the bus, could not be managed, and could not be monitored.  All of this had to be done by hand.  And log files, or console windows had to be examined both operationally and to examine the fruits of our labors.

How to Make This Easier

First, we need a tool that will configure and provision any number of nodes in our cloud.  There are several vendors that have products in this space and I’m not going to talk about them here (yet).  Secondly, and more importantly, we need an architecture that is layered on top of the hardware/operating system/ESB/etc. that can accept and deploy services dynamically.  An implementation that can be monitored and managed remotely and allow the management of our solution both physically and at some abstracted level.

Another Layer of Abstraction

It would be very handy indeed if we could define what was going in our Event Processing Cloud and then push it out to the cloud.  We need the ability to iteratively develop services, test them with live data and deploy the service to a service pool.  Service pools define some chunk of work that must be done; RuleBots can join service pools and then be automagically managed by our CEP based load balancing tool.  OnRamps can be managed.  And everything going on can be examined, both physically and from a services point of view.  For example, TwitYourl may be running on 100 machines, but the business user really only cares about whether or not the service is available and that the results can be viewed and utilized.

What’s Next?

I’m going to outline the requirements, at a high level, of what this command and control architecture looks like, and we’re going to re-deploy TwitYourl using this new approach.  By doing this, we will be able to compare the ‘old’ way of deploying 1st generation CEP based solutions, which are designed to scale vertically on multiprocessor based single machines, and our new Cloud Event Processing approach which is designed to scale not only vertically, but also horizontally, running on many more machines either in a public, private, or hybrid cloud.  And then we’ll talk about a much better way to look at output than by monitoring a console or tailing a log file!

Thanks for following along!