Corpus linguistics is the systematic study of language through collections of text, known as corpora. A corpus is commonly parsed into and represented as XML, and its text content can then be searched and processed using regular expressions. Wikipedia maintains an extensive listing of natural language processing toolkits in its article Outline of natural language processing.
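As a minimal sketch of that workflow, the Python example below parses a tiny XML corpus with the standard library and runs a regular expression over the text of each sentence. The corpus itself, the `<corpus>`/`<s>` element layout, and the `find_matches` helper are assumptions made for illustration; real corpora use a variety of schemas.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-sentence corpus; the <corpus>/<s> layout is an assumption,
# not a standard corpus schema.
CORPUS_XML = """
<corpus>
  <s id="1">The cat sat on the mat.</s>
  <s id="2">A dog chased the cat.</s>
</corpus>
"""


def find_matches(xml_text, pattern):
    """Return (sentence id, matched string) pairs for every regex hit."""
    root = ET.fromstring(xml_text)
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    # Walk every <s> element and search its text content, not the raw markup.
    for sentence in root.iter("s"):
        text = sentence.text or ""
        for match in regex.finditer(text):
            hits.append((sentence.get("id"), match.group(0)))
    return hits


# Find the word "the" followed by the next word, ignoring case.
matches = find_matches(CORPUS_XML, r"\bthe\s+\w+")
for sentence_id, phrase in matches:
    print(sentence_id, phrase)
```

Note that the regular expression is applied to the text extracted by the XML parser rather than to the raw markup, so tags and attributes cannot produce false matches.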