What is a good way to strip language-independent punctuation, like !, ?, and emoticons, from a text before attempting language detection?


This is usually referred to as text normalization [1]. See Vineet Yadav's answer to my Quora question "How would you make an API that converts any tweet into a proper English sentence?" In practice, I use the Regex module in Yahoo! Pipes for this.
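For anyone who would rather do this in code than in Pipes, here is a minimal Python sketch of the same regex idea. It is not the Pipes setup itself: the function name `strip_language_independent` and the emoticon pattern `EMOTICON_RE` are my own illustrative assumptions, and the pattern only covers a handful of common Western ASCII emoticons. Emoji and other symbols are caught by their Unicode category rather than by a hand-written list.

```python
import re
import unicodedata

# Assumed, deliberately small pattern for ASCII emoticons such as :-) ;D =( :P
# eyes [:;=], optional nose [-o^'], one or more mouth characters.
# The lookarounds keep it from firing inside words or URLs (e.g. "http://").
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-o^']?[()\[\]DPp/\\|]+(?!\w)")


def strip_language_independent(text: str) -> str:
    """Remove emoticons, punctuation and symbols, keeping letters and digits."""
    # Remove ASCII emoticons first, while their punctuation is still intact.
    text = EMOTICON_RE.sub(" ", text)
    # Replace every character whose Unicode general category is punctuation
    # (P*) or symbol (S*) with a space; most emoji fall under "So".
    kept = (" " if unicodedata.category(ch)[0] in "PS" else ch for ch in text)
    # Collapse the runs of whitespace left behind.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(strip_language_independent("Wow!!! Is this real? :-) 😀"))
    print(strip_language_independent("¿Qué tal? ;D"))
```

One caveat with this sketch: apostrophes and hyphens count as punctuation, so "don't" becomes "don t". If your language detector relies on such tokens, you may want to exempt those characters when they sit between two letters.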

[1] http://en.wikipedia.org/wiki/Text_normalization