What is a good way to strip a text of language-independent punctuation, like !, ?, and emoticons before trying for language detection?
This is usually referred to as text normalization [1]. See Vineet Yadav's answer to my Quora question, "How would you make an API that converts any tweet into a proper English sentence?" In practice, I use the Regex module in Yahoo! Pipes to do this.
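For anyone who would rather do this in code than in Yahoo! Pipes, here is a minimal Python sketch of the same regex-style idea. It is not what I actually run; the function name `strip_language_independent` and the emoticon pattern are assumptions made for illustration, and the pattern only covers a few common Western emoticons such as :) ;-) and :D. Everything else is handled by dropping characters whose Unicode general category is punctuation (P*) or symbol (S*), which also covers most emoji.

```python
import re
import unicodedata

# Rough, hypothetical pattern for a few ASCII emoticons such as :) ;-) :D =P
# The lookarounds keep it from eating letters inside ordinary words.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-']?[()\[\]DPpOo/\\|](?!\w)")


def strip_language_independent(text):
    """Remove emoticons, punctuation, and symbols; keep letters, digits, marks."""
    # Emoticons go first, because some contain letters (the D in :D).
    text = EMOTICON_RE.sub(" ", text)
    # Replace every punctuation (P*) or symbol (S*) character with a space.
    kept = (" " if unicodedata.category(ch)[0] in "PS" else ch for ch in text)
    # Collapse the runs of whitespace left behind.
    return " ".join("".join(kept).split())


print(strip_language_independent("Wow!!! Is this real? :D 😀"))  # Wow Is this real
```

Note that this also strips apostrophes and hyphens inside words ("don't" becomes "don t"), which may or may not matter for your language detector, so you may want to whitelist a few characters.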
[1] http://en.wikipedia.org/wiki/Text_normalization