How hard would it be to use automatic summarization to summarize one’s tweets?
I have spent years working on this, and I have written a good bit about it on Quora already; see below. It would be a LOT easier if people wrote correct English in tweets; however, tweets are by and large gibberish. And, with all the various kinds of retweeting, they require a massive amount of de-duplication. De-duplication is needed both before and after normalization, which quickly turns this into a BIG data problem. (My definition of BIG data: too big for you to do by yourself on your own hardware.) Normalization is, in effect, converting gibberish into proper English, including translating SMS-speak and Twitter-ese.
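To make that pipeline concrete, here is a minimal Python sketch of the de-duplicate → normalize → de-duplicate-again flow. Everything in it is a placeholder of my own: the tiny `SMS_LEXICON`, the regexes, and exact-match de-duplication all stand in for far heavier machinery (a curated or learned lexicon with thousands of entries, and near-duplicate detection such as MinHash over shingles).

```python
import re

# Hypothetical lexicon: a real normalizer needs thousands of SMS-speak
# and Twitter-ese entries, hand-curated or learned from data.
SMS_LEXICON = {"u": "you", "r": "are", "gr8": "great", "2moro": "tomorrow"}
LEXICON_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SMS_LEXICON)) + r")\b", re.IGNORECASE
)

def normalize(tweet: str) -> str:
    """Very rough sketch of turning Twitter-ese into proper English."""
    text = re.sub(r"^RT\s+@\w+:?\s*", "", tweet, flags=re.IGNORECASE)  # retweet prefix
    text = re.sub(r"https?://\S+", "", text)   # URLs carry no summary content
    text = re.sub(r"[@#](\w+)", r"\1", text)   # unwrap mentions and hashtags
    text = LEXICON_RE.sub(lambda m: SMS_LEXICON[m.group(0).lower()], text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(tweets):
    """Exact-match de-duplication; production systems need near-duplicate
    detection, since retweets rarely match character for character."""
    seen, out = set(), []
    for t in tweets:
        if t.lower() not in seen:
            seen.add(t.lower())
            out.append(t)
    return out

raw = [
    "RT @alice: u r gr8! http://t.co/abc",
    "u r gr8! http://t.co/xyz",
    "see u 2moro #nlproc",
]
# De-duplicate both BEFORE and AFTER normalization: normalization collapses
# surface variants (RT prefixes, different short URLs) into fresh duplicates.
tweets = dedupe([normalize(t) for t in dedupe(raw)])
print(tweets)  # ['you are great!', 'see you tomorrow nlproc']
```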
Then you need to decide which summarization algorithms best serve your purpose. In my case, this quickly moved from automatic summarization to Natural language generation; in other words, you need to build something up in order to break it down. Given that anything can be said a thousand different ways on Twitter, not to mention in myriad languages (see also Code-mixing), this Rubik’s Cube on steroids can become a nightmare scenario. IMHO, without a decent-sized team and reasonable funding, this is a moderately hard task (which is not to say impossible). 🙂
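For flavor, here is a toy frequency-based extractive scorer over tweets that have already been normalized and de-duplicated. This is my own simplistic illustration of the extractive starting point, not the NLG approach I describe above; the stop-word list and scoring are deliberately naive, and a real pipeline would first cluster paraphrases, precisely because the same thing is said a thousand different ways.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the example.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}

def summarize(tweets, k=3):
    """Toy extractive summarizer: score each tweet by the average corpus
    frequency of its content words, then keep the top k."""
    docs = [[w for w in t.lower().split() if w not in STOPWORDS] for t in tweets]
    freq = Counter(w for doc in docs for w in doc)  # corpus word frequencies

    def score(doc):
        return sum(freq[w] for w in doc) / len(doc) if doc else 0.0

    ranked = sorted(zip(tweets, docs), key=lambda p: score(p[1]), reverse=True)
    return [t for t, _ in ranked[:k]]

corpus = [
    "the new phone battery lasts two days",
    "battery life on the new phone is impressive",
    "camera quality disappoints reviewers",
]
# Picks the two tweets densest in frequently repeated content words.
print(summarize(corpus, k=2))
```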
See also my Quora answers to:
- What are the current sub problems that I could address in summarization, or what would you like a summarizer to do for you?
- Why is it not a good idea to generate paragraphs from a Twitter stream about a selected topic?
- Is there any NLP library/tool/API for cleansing noisy text (e.g. SMS, Twitter) in English to its correct text?
- What are the best blogs talking about natural language processing?
- Can collection of content and data be curated (altered in a comprehensible way) automatically by programming?
- What’s the best site to find articles about artificial intelligence?