## Tuesday, August 30, 2011

### A Tweet is Worth (at least) 140 Words

So, I recently read An Introduction to Information Theory: Symbols, Signals and Noise.

It is a very nice popular introduction to Information Theory, a modern scientific pursuit to quantify information started by Claude Shannon in 1948.

This got me thinking. Increasingly, people try to hold conversations on Twitter, where posts are limited to 140 characters. Just how much information could you convey in 140 characters?

After some coding and investigation, I created this, an experimental twitter English compression algorithm capable of compressing around 140 words into 140 characters.

So, what's the story? Warning: It's a bit of a story, the juicy bits are at the end.

UPDATE: Tomo in the comments below made a chrome extension for the algorithm

### Entropy

Ultimately, we need some way to assess how much information is contained in a signal. What does it mean for a signal to contain information anyway? Is 'this is a test of twitter compression.' more meaningful than '歒堙丁顜善咮旮呂'? The first is understandable by any english speaker, and requires 38 characters. You might think the second is meaningful to a speaker of chinese, but I'm fairly certain it is gibberish, and takes 8 characters. But, the thing is if you put those 8 characters into the bottom form here, you'll recover the first. So, in some sense to the messages are equivalent. They contain the same amount of information.

Shannon tried to quantify how we could estimate just how much information any message contains. Of course it would be very hard to try to track down every intelligent being in the universe and ask them if any particular message had any meaning to them. Instead, Shannon reserved himself to trying to quantify how much information was contained in a message produced by a random source. In this regard, the question of how much information a message contains becomes a more tractable question: How unlike is a particular message from all other messages produced by the same random source?

This question might sound a little familiar. It is similar to a question that comes up a lot in Statistical Physics, where we are interested in just how unlike a particular configuration of a system is from all possible configurations of a system. In Statistical physics, the quantity that helps us answer questions like this is the Entropy, where the entropy is defined as
$S = -\sum_i p_i \log p_i$
where p_i stands for the probability of a particular configuration, and we are supposed to sum over all possible configurations of the system.

Similarly, for our random message source, we can define the entropy in exactly the same way, but for convenience, let's replace the logarithm with the logarithm base 2.

$S = -\sum_i p_i \log_2 p_i$
At this point, the Shannon Entropy, or Information Entropy takes on a real quantitative meaning. It reflects how many bits of information the message source produces per character.

The result of all of this aligns quite well with intuition. If we have a source that outputs two symbols 0 or 1 randomly, each with probability 1/2. The shannon entropy comes out to be 1, meaning each of the symbols of our source is worth one bit, which we already new. If instead of two symbols, our source can output 16 symbols, 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F say, the shannon entropy comes out to be 4 bits per symbol, which again we should have suspected since with four bits we can count up to the number 16 in base 2 (e.g. 0000 - 0, 0001 - 1, 0010 - 2 , etc ).

Where it begins to get interesting is when all of our symbols don't occur with equal probability. To get a sense of this situation, I'll show 5 example outputs:
'000001000100000000010000010000'
'000000000010000000000001000000'
'010100000000000000000000111000'
'010100000000000000000000111000'
'000000000100000000110000000010'

looking at these examples, it begins to become clear that since we have a lot more zeros than ones, each of these messages contain less information than the case when 0 and 1 occur with equal probability.

In fact, in this case, if 0 occurs 90% of the time, and 1 occurs 10% of the time, the shannon entropy comes out to be 0.47. Meaning each symbol is worth just less than half a bit. We should expect our messages in this case to have to be about twice as long to encode the same amount of information.

In an extreme example, imagine you were trying to transmit a message to someone in binary, but for some reason, your device had a sticky 0 key so that every time you pushed 0, it transmitted 0 10 times in a row. It should be clear in this case that as far as the receiver is concerned, this is not a very efficient transmission scheme.

### English

What does this have to do with anything? Well, all of that and I really only wanted to build up a fact you already know. The fact is, the English language is not very efficient on a per symbol basis. For example, I'm sure everyone knows exactly what word will come at the end of this _______. There you go, I was able to express exactly the same thought with at least 8 fewer characters. n fct, w cn d _ lt bttr [in fact, we can do a lot better], using 22 characters to express a thought that normally takes 31 characters.

In fact, Shannon has a nice paper where he attempted to measure the entropy of the english language itself. Using more sophisticated methods, he concludes that english has an information entropy of between 0.6 and 1.3 bits per character, let's call it 1 bit per character. Whereas, if each of the 27 symbols (26 letters + space) we commonly use each showed up equally frequently, we would have 4.75 bits per character possible.

Of course, from a practical communication standpoint, having redundancies in human language can be a useful thing, as it allows us to still understand one another even over noisy phone lines and with very bad handwriting.

But, with modern computers and faithful transmission of information, we really ought to be able to do better.

This brings me back to twitter. If you are unaware, twitter allows users to post short, 140 character messages for the rest of the world to enjoy. 140 characters is not a lot to go on. Assuming 4.5 characters per word, this means that in traditionally written english you're lucky to fit 25 words in a standard tweet.

But, we know now that we can do better. In fact, if we could come up with some kind of crazy scheme to compress english in such a way as to use each of the 27 usual characters so that each of those characters appeared with roughly equal probability, we've seen that we could get 4.75 bits per character, with 140 characters and 5.5 symbols per word, this would allow us to fit not 25 words in a tweet but 120 words. A factor of 4.8 improvement.

Of course we would have to discover this miraculous encryption transformation. Which to my knowledge remains undiscovered.

But, we can do better. It turns out that twitter allows you to use Unicode characters in your tweets. Beyond enabling you to talk about Lagrangians (ℒ) and play cards (♣), it enables international communication, by including foreign alphabets.

So, in fact we don't need to limit ourselves to the 27 commonly used English symbols. We could use a much larger alphabet, Chinese say. I choose Chinese because there are over 20,900 Chinese alphabet symbols in Unicode. Using all of these characters, we could theoretically encode 14.3 bits of information per character, with 140 characters, and 1 bit per English character, and 5.5 symbols per English word, we could theoretically fit over 365 English words in a single tweet. But alas, we would have to discover some magical encoding algorithm that could map typed English to the Chinese alphabet such that each of the Chinese symbols occurred with equal probability.

I wasn't able to do that well, but I did make an attempt.

### My Attempt

So, I tried to compress the English language, an design an effective mapping from written English to the Chinese character set of Unicode. We know that we aim to have each of these Chinese characters occur with equal probability, so my algorithm was quite simple. Let's just look at a bunch of English and see which pair of characters occur with the highest probability, and map these to the first Chinese character in the Unicode set. Replace their occurring in the text, rinse, and repeat. This technique is guaranteed to reduce the probability at which the most common character occurs at every step, by taking some if its occurrences and replacing them, so it at least aims to achieve our ultimate goal. That's it. Of course, I tried to bootstrap the algorithm a little bit by first mapping the most common 1500 words to their own symbols.

For example, consider the first stanza of the Raven by Edger Allen Poe

Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore--
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
"'Tis some visiter," I muttered, "tapping at my chamber door--
Only this and nothing more."


The most common character is ' ' (the space). The most common pair is 'e ' (e followed by space), so let's replace 'e ' with the first Chinese Unicode character '一' we obtain:

Onc一upon a midnight dreary, whil一I pondered, weak and weary,
Over many a quaint and curious volum一of forgotten lore--
Whil一I nodded, nearly napping, suddenly ther一cam一a tapping,
As of som一on一gently rapping, rapping at my chamber door.
"'Tis som一visiter," I muttered, "tapping at my chamber door--
Only this and nothing more.'


So we've reduced the number of spaces a bit. Doing one more step, now the most common pair of characters is 'in', which we replace by '丁' obtaining:

Onc一upon a midnight dreary, whil一I pondered, weak and weary,
Over many a qua丁t and curious volum一of forgotten lore--
Whil一I nodded, nearly napp丁g, suddenly ther一cam一a tapp丁g,
As of som一on一gently rapp丁g, rapp丁g at my chamber door.
"'Tis som一visiter," I muttered, "tapp丁g at my chamber door--
Only this and noth丁g more.'

etc.

The end results of the effort are demo-ed here. Feel free to play around with it. For the most part, typing some standard English, I seem to be able to get compression ratios around 5 or so. Let me know how it does for you.

I'll leave you with this final message:
儌咹乺悃巄格丌凣亥乄叜


Python code that I used to do the heavy lifting is available as a gist.

1. But I never got to find out what word comes at the end of this sandwich!

Also, what were the data sets you read in to find the word occurrence rates? Wasn't one of them something funny like conversations overheard in subways? Or did I make that up?

Anyway,

豳嚌宝

2. Yeah, the corpus I used to do the compression was a compilation of:
- the brown corpus - a standard english language corpus with formal english and a lot of legalese

along with
- overheard in new york - as a source of vulgar english
- the script to Monty Python and the Holy Grail
- the script the The Pirates of the Carribean
- wine reviews from wine.com

and maybe a couple other sources in there. This led to some interesting effects, in fact the longest mapping from English characters to a single Chinese character turned out to be

'government of the united states of america ' -> '襁'

probably because the brown corpus has a lot of legal documents

Fun fact, originally I was kind of hoping to be able to fit the whole declaration of independence in a single tweet, that wasn't the kind of compression I realized, but completely accidentally,

'declaration of independence ' -> '齟'

I got a minor victory

'government of the united states of america ' -> '襁' -> swaddling clothes

'declaration of independence ' -> '齟' -> irregular

4. Also:

'corky' -> '唟崘' = 'Windows Vista, Featured'

So there's that.

5. Too bad you didn't use a bunch of tweets as your corpus. You should get a better compression ratio since it would reflect what you're trying to compress more closely than your legalese-based corpus.

Anyway that's a fun experiment. Next step would be to add a header to tweets compressed this way, and create browser plugins (or a greasemonkey script) to automatically deflate compressed tweets.

6. This is interesting to me as I was just thinking about it a few days ago. Though, why limit yourself to just Chinese chars? I'd say go for the whole unicode set as you'll have a wide key space to play with. With that, and I'm totally just talking out of my rear here, try something like this:

1.) set unicode start to 0x0000 or wherever it really starts; same with end

2.) Find some clever way to represent the [keyspace size] number of most common words, then letter combinations (ideally of longer length, but defaulting to shorter to give ~98% coverage of all english words).

3.) If you can get words "close enough", you can have something like aspell run through and auto-correct them for you. You could also do something clever like removing [ most ! /^[aeiouyAEIOUY].*\$/ ] (vowels) in the middle of words -- something that you were obviously thinking about writing this. This would cut down on the lookup table size and allow you to push the compression further, incorporating word pairings and word strings into your key space based on prevalence.

4.) I'd also suggest that you do a random sampling of tweets; maybe 1 or 10 million, to get a good idea of the data that you're trying to compress. If you generate your lookup tables out of that and support ALL unicode chrs, I'm sure you could fit at least a paragraph or two into a single tweet.

This is a pretty interesting problem to me although I'm no compression expert. Please message me on twitter if you'd like to work on something together ( @ip2k ) or if you have any updates to this idea :)

7. Regarding using a corpus of tweets: I thought about it but there were a few issues.
1. They are relatively hard to come by. I know I could roll my own, but still.
2. The goal was to, through compression enable real english communication on twitter. Tweets are short by design, not exactly a stellar example of written english. Also I would be losing out on some of the long range correlations in English that allow compression in the first place.
3. It is nontrivial to get only English tweets.

That said, a good chunk of the corpus I used was from overheardinny, so if you play around with the compressor, you'll notice colorful english is well represented.

Regarding using more of unicode: you're right I could, it's just that a lot of unicode characters are not printable, and those that are do not form a contiguous block. But it was mostly laziness.

I like the ideas for improving the algorithm and making a greasemonkey type extension. I'll have to think about it.

8. If you can use this inside of a custom Twitter client that does compression upon submission and decompression for each post display, you could have something very successful on your hands.

9. Take a look at huffman

10. Regarding Huffman: I'm aware of Huffman codes, but they are variable length codes, so wouldn't work for this application. In theory one could Huffman encode and then convert that stream to unicode, but there are a bunch of nonprintable and control characters you'd have to deal with.

One thing I like about this encoding is that it is still theoretically human readable, you could learn what each of the symbols mean.

11. 唔 墅借唦口噡叝厞嬝三笞崄姩呀乃叟虞呥

12. 弬傦亡叢偒耄唽咠咍叧史唯咀灅袏憺撮俇噶豥咮旮咬憰但丘唰耒叧史唯諜娽仁呌鷖乳g牿嗋浲乐东叧史唯脤廮1簝噯埖穳呌鷖叧史咼七捬圏儅乃赎忯麏呞仌個丂椆侧咒孙允仰叝咬丌唐乐埢螃歘伩允叝衵巖古嚹憺熛丘罛椭崼垄樋产乑允仰熫朗俫闝乆唐了椭吧咮旮叜

13. there was a contest at stackflow of the best compression method for a picture which is decompressed by a twitter message.

this was interesting. your blog posting, is not. especially when you picked a symbol for spaces while words made of letters were exchanged by complete symbols - duh!

14. You can pull 10s of tweets a second off the spritzer API from Twitter:

15. I made a quick Chrome extension (userscript wasn't going to have enough permissions) for this which is up at https://chrome.google.com/webstore/detail/idcnolgflhcckjdfpfbcehjocggffdjk

It's not integrated into twitter.com because I don't use twitter.com to tweet, and I think many people who would use this don't either...