Category Archives: Erlang

Building Inverted Indexes on Tweets Using Map/Reduce

In our last post, we looked at how to make bad map/reduce code better map/reduce code.  A natural fallout from breaking tweets down into words is the ability to build an inverted index to facilitate searching tweets by key words.

It’s All in the Tweet

Given the tweets,

  1. “@eventcloudpro I like the idea of tree maps, if you combine these with metrics trees in #PM Strategy Management tools u could align the two,” and
  2. “RT @jakewk RT @eventcloudpro: did streambase find a buyer? http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep”
  3. “I really like #erlang – it’s rocking technology!”

How do we compute an inverted index? First, let’s assign a unique ID to each tweet above – for our example, that’s tweet #1, tweet #2, and tweet #3.  Now we want to find which words are used where so we’ll construct a table consisting of words and the list of tweet id’s that contain those words.

Word                    Tweet

@eventcloudpro     1,2

metrics                  1

technology            2, 3

and so on…

Using the Index

To find tweets with the words we’re interested in, we just query the inverted index.  For example, if I’m interested in finding tweets with the word ‘metrics’, using the index I see that tweet #1 has that word, so I look up tweet #1.  If I’m interested in the words ‘@eventcloudpro’ and ‘technology’, I get the sets (1,2) and (2,3) – and they’re intersection is tweet #2.

I’d Like Extra Sauce, Please

So now that we’ve figured out how to look up specific tweets based upon content, how could we look up tweets based upon other stuff; stuff like categorization, entity extract, root words (stemming), or even sentiment?  By running the tweets through a system that calculates these value added goodies, we could construct additional inverted indexes to further organize our tweets.

Map/Reduce

Using map/reduce in the process of calculating inverted indices is a natural fit.  And there’s a great introduction on how to do this using Erlang in Joe Armstrong’s book, “Programming Erlang.”

Sometimes Erlang Just Makes More Sense

I love this excerpt from “Programming Erlang, Software for a Concurrent World” written by Joe Armstrong.

Erlang processes don’t share memory, so there is no need to lock the memory while it is being used.  Where there are locks, there are keys that can get lost.  What happens when you lose your keys?  You panic and don’t know what to do.  That’s what happens in software systems when you lose your keys and your locks go wrong.

Distributed software systems with locks and keys always go wrong.

Erlang has no locks and no keys.

In 2005, I spent a great deal of time explaining to customers and investors (when I was at Kaskad) why our streaming engine was so cool.  There were no locks.  This was not something well understood at the time.  And it was not well received, either by potential customers or investors.  People shook their heads in disbelief. “How was this possible?”  they asked.  I wouldn’t share with them our secret sauce, but I would point them to a couple of research papers.

How I wish I knew Erlang and had Joe’s book back then.

A Picture Is Worth 1,000 Words – Current State of Map/Reduce in CEP Land

Map/Reduce in CEP Land – A Simple Example

I thought I would share just how easily the example I describe in this post could be implemented.  Here’s some code and commentary:

This is from the Erlang shell:

1> {module,darkstar}
we've loaded the DarkStar CEP Module
2> {module,mapwordcount}
and the map function
3> {module,stream}
and we'd like a stream please
4> Stream = stream:start().
let's start up a stream
<0.36.0>
Stream started
5> Supervisor=darkstar:start().
and start up DarkStar
<0.38.0>
Supervisor is now running
6> mapwordcount:map("colin clark loves CEP", Stream, Supervisor).
and call the map function with a Stream and a Supervisor process
<0.31.0>
map is now running
Received:<0.31.0>
when the map is done, it lets the Supervisor know
Enqueing "colin", 1
here, we see "colin clark loves CEP" broken down into
words and appended to the Stream
7> Enqueing "clark", 1
7> Enqueing "loves", 1
7> Enqueing "CEP", 1

It’s that easy.  It’s also very straightforward to create a Window on the Stream and ask it to sum by key over a time frame and updating every minute or so (for example).

Now granted, I’ve left out a couple of things (like the underlying code),  and we’re running in verbose mode so that we can get a little visibility into the Stream.  One thing to note though is that mapwordcount and the stream could be running on any node within the DarkStar fabric.  I could have also started mapwordcount via the Supervisor, which is much more ‘grid like’ than the example I’ve shown above but this should demonstrate the gist of where I’m going.

And yes, DarkStar is written in Erlang.

Hands on Review of the Dynamo Paper

One of the reasons I like Erlang so very much.  And an excellent introduction to the power of Erlang.  Check out this link.