What is a good way to strip language-independent punctuation, such as !, ?, and emoticons, from a text before attempting language detection?
This is usually referred to as text normalization. See Vineet Yadav’s answer to my Quora question, “How would you make an API that converts any tweet into a proper English sentence?” I use the Yahoo! Pipes Regex module for this myself.
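For anyone who would rather do this in code, here is a minimal Python sketch of the same regex-based idea. It is not the Yahoo! Pipes module, only a stand-in using the standard library. The emoticon pattern `EMOTICON_RE` and the function name `strip_punctuation` are my own assumptions for illustration, and the pattern covers only a handful of common ASCII emoticons.

```python
import re
import unicodedata

# Hypothetical pattern for a few common ASCII emoticons such as :-) ;P :D =(
# It is an illustrative assumption, not an exhaustive list.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-o*']?[)\](\[dDpP/\\|}{@]+(?!\w)")


def strip_punctuation(text: str) -> str:
    """Drop emoticons, punctuation, and symbols; keep letters and digits."""
    # Remove letter-containing emoticons first, while their punctuation
    # is still present to identify them.
    text = EMOTICON_RE.sub(" ", text)

    kept = []
    for ch in text:
        # Unicode general categories starting with "P" are punctuation and
        # those starting with "S" are symbols (which includes most emoji).
        category = unicodedata.category(ch)
        kept.append(" " if category[0] in "PS" else ch)

    # Collapse the runs of whitespace left behind by the replacements.
    return " ".join("".join(kept).split())


print(strip_punctuation("Hello!!! How are you? :-)"))
print(strip_punctuation("¿Qué tal? :D 😀"))
```

Relying on Unicode categories rather than a hard-coded list of characters keeps the approach language-independent: accented and non-Latin letters pass through untouched. One trade-off is that apostrophes and hyphens inside words are also replaced by spaces, so "don't" becomes "don t"; whether that matters depends on the language detector you feed the result into.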