In our last post, we looked at how to turn bad map/reduce code into better map/reduce code. A natural by-product of breaking tweets down into words is the ability to build an inverted index, which lets us search tweets by keyword.
It’s All in the Tweet
Given the tweets,
- “@eventcloudpro I like the idea of tree maps, if you combine these with metrics trees in #PM Strategy Management tools u could align the two,”
- “RT @jakewk RT @eventcloudpro: did streambase find a buyer? http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep” and
- “I really like #erlang – it’s rocking technology!”
How do we compute an inverted index? First, let’s assign a unique ID to each tweet above – for our example, that’s tweet #1, tweet #2, and tweet #3. Now we want to find which words are used where, so we’ll construct a table that maps each word to the list of IDs of the tweets that contain it.
Word            Tweet IDs
@eventcloudpro  1, 2
metrics         1
technology      2, 3
and so on…
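Here is a minimal Python sketch of how a table like this could be built. The tokenizer is an assumption made for illustration: it lowercases, splits on whitespace, and trims surrounding punctuation, and the names `tokenize` and `build_index` are hypothetical.

```python
from collections import defaultdict

# The three example tweets, keyed by the IDs assigned above
# (punctuation in tweet 3 simplified to ASCII).
TWEETS = {
    1: "@eventcloudpro I like the idea of tree maps, if you combine these with "
       "metrics trees in #PM Strategy Management tools u could align the two",
    2: "RT @jakewk RT @eventcloudpro: did streambase find a buyer? "
       "http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep",
    3: "I really like #erlang - it's rocking technology!",
}

def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, trim punctuation."""
    words = (raw.strip(".,:;!?\"'()-") for raw in text.lower().split())
    return [w for w in words if w]

def build_index(tweets):
    """Map each word to the set of tweet IDs that contain it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for word in tokenize(text):
            index[word].add(tweet_id)
    return index

index = build_index(TWEETS)
for word in ("@eventcloudpro", "metrics", "technology"):
    print(word, sorted(index[word]))
```

A real tokenizer would need more care with URLs, hashtags, and contractions; this one does just enough to reproduce the three rows shown in the table.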
Using the Index
To find tweets with the words we’re interested in, we just query the inverted index. For example, if I’m interested in finding tweets with the word ‘metrics’, the index tells me that tweet #1 has that word, so I look up tweet #1. If I’m interested in the words ‘@eventcloudpro’ and ‘technology’, I get the sets (1,2) and (2,3) – and their intersection is tweet #2.
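The lookup can be sketched in a few lines of Python. The index below is typed in by hand from the example table, and `search` is a hypothetical helper name. A multi-word query is treated as an AND, so it reduces to a set intersection.

```python
# Hand-written index matching the example table.
INDEX = {
    "@eventcloudpro": {1, 2},
    "metrics": {1},
    "technology": {2, 3},
}

def search(index, *words):
    """Return the IDs of tweets that contain every one of the given words."""
    if not words:
        return set()
    # A word missing from the index contributes an empty set,
    # which makes the whole intersection empty.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

print(search(INDEX, "metrics"))                       # {1}
print(search(INDEX, "@eventcloudpro", "technology"))  # {2}
```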
I’d Like Extra Sauce, Please
So now that we’ve figured out how to look up specific tweets based upon content, how could we look up tweets based upon other stuff: stuff like categorization, entity extraction, root words (stemming), or even sentiment? By running the tweets through a system that calculates these value-added goodies, we could construct additional inverted indexes to further organize our tweets.
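One way to picture this is to make the "what do we index on" step pluggable. The Python sketch below is only an illustration: `hashtags` and `mentions` are simple regex-based stand-ins for a real categorization, entity-extraction, or sentiment system, and the tweet texts are shortened versions of the examples.

```python
import re
from collections import defaultdict

def build_index(tweets, extract):
    """Build an inverted index keyed by whatever `extract` pulls from a tweet."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for key in extract(text):
            index[key].add(tweet_id)
    return dict(index)

def hashtags(text):
    """Stand-in extractor: hashtags as a crude form of categorization."""
    return [tag.lower() for tag in re.findall(r"#\w+", text)]

def mentions(text):
    """Stand-in extractor: @-mentions as a crude form of entity extraction."""
    return [name.lower() for name in re.findall(r"@\w+", text)]

# Shortened versions of the example tweets.
TWEETS = {
    1: "@eventcloudpro I like tree maps in #PM tools",
    2: "RT @jakewk RT @eventcloudpro: did streambase find a buyer? #cep",
    3: "I really like #erlang",
}

hashtag_index = build_index(TWEETS, hashtags)
mention_index = build_index(TWEETS, mentions)
print(hashtag_index)
print(mention_index)
```

Swapping in a stemmer or a sentiment scorer would only mean passing a different `extract` function; the index-building code stays the same.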
Map/Reduce
Map/reduce is a natural fit for calculating inverted indexes. And there’s a great introduction on how to do this using Erlang in Joe Armstrong’s book, “Programming Erlang.”
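To show the shape of the computation, here is a single-process Python sketch. It is not the Erlang code from the book, and the function names and shortened tweet texts are made up for illustration. The map step emits (word, tweet ID) pairs, a shuffle step groups them by word, and the reduce step collapses each group into a list of IDs.

```python
from collections import defaultdict
from itertools import chain

def map_tweet(tweet_id, text):
    """Map step: emit one (word, tweet_id) pair per word in the tweet."""
    return [(word, tweet_id) for word in text.lower().split()]

def shuffle(pairs):
    """Group emitted tweet IDs by word, as a map/reduce framework would."""
    grouped = defaultdict(list)
    for word, tweet_id in pairs:
        grouped[word].append(tweet_id)
    return grouped

def reduce_word(word, tweet_ids):
    """Reduce step: collapse one word's IDs into a sorted list without repeats."""
    return word, sorted(set(tweet_ids))

def inverted_index(tweets):
    pairs = chain.from_iterable(map_tweet(i, t) for i, t in tweets.items())
    return dict(reduce_word(w, ids) for w, ids in shuffle(pairs).items())

# Hypothetical, shortened tweet texts.
tweets = {
    1: "tree maps and metrics trees",
    2: "buy their technology partners",
    3: "rocking technology",
}
index = inverted_index(tweets)
print(index["technology"])  # [2, 3]
```

In a real map/reduce system the map calls would run in parallel across many workers and the framework would handle the shuffle; the three functions above only mirror that division of labor.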



