Python 3 Text Processing with NLTK 3 Cookbook (2014), by @japerk
Table of Contents
Preface
Chapter 1: Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
Training a sentence tokenizer
Filtering stopwords in a tokenized sentence
Looking up Synsets for a word in WordNet
Looking up lemmas and synonyms in WordNet
Calculating WordNet Synset similarity
Discovering word collocations
Chapter 2: Replacing and Correcting Words
Introduction
Stemming words
Lemmatizing words with WordNet
Replacing words matching regular expressions
Removing repeating characters
Spelling correction with Enchant
Replacing synonyms
Replacing negations with antonyms
Chapter 3: Creating Custom Corpora
Introduction
Setting up a custom corpus
Creating a wordlist corpus
Creating a part-of-speech tagged word corpus
Creating a chunked phrase corpus
Creating a categorized text corpus
Creating a categorized chunk corpus reader
Lazy corpus loading
Creating a custom corpus view
Creating a MongoDB-backed corpus reader
Corpus editing with file locking
Chapter 4: Part-of-speech Tagging
Introduction
Default tagging
Training a unigram part-of-speech tagger
Combining taggers with backoff tagging
Training and combining ngram taggers
Creating a model of likely word tags
Tagging with regular expressions
Affix tagging
Training a Brill tagger
Training the TnT tagger
Using WordNet for tagging
Tagging proper names
Classifier-based tagging
Training a tagger with NLTK-Trainer
Chapter 5: Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Merging and splitting chunks with regular expressions
Expanding and removing chunks with regular expressions
Partial parsing with regular expressions
Training a tagger-based chunker
Classification-based chunking
Extracting named entities
Extracting proper noun chunks
Extracting location chunks
Training a named entity chunker
Training a chunker with NLTK-Trainer
Chapter 6: Transforming Chunks and Trees
Introduction
Filtering insignificant words from a sentence
Correcting verb forms
Swapping verb phrases
Swapping noun cardinals
Swapping infinitive phrases
Singularizing plural nouns
Chaining chunk transformations
Converting a chunk tree to text
Flattening a deep tree
Creating a shallow tree
Converting tree labels
Chapter 7: Text Classification
Introduction
Bag of words feature extraction
Training a Naive Bayes classifier
Training a decision tree classifier
Training a maximum entropy classifier
Training scikit-learn classifiers
Measuring precision and recall of a classifier
Calculating high information words
Combining classifiers with voting
Classifying with multiple binary classifiers
Training a classifier with NLTK-Trainer
Chapter 8: Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Distributed chunking with execnet
Parallel list processing with execnet
Storing a frequency distribution in Redis
Storing a conditional frequency distribution in Redis
Storing an ordered dictionary in Redis
Distributed word scoring with Redis and execnet
Chapter 9: Parsing Specific Data Types
Introduction
Parsing dates and times with dateutil
Timezone lookup and conversion
Extracting URLs from HTML with lxml
Cleaning and stripping HTML
Converting HTML entities with BeautifulSoup
Detecting and converting character encodings
Appendix: Penn Treebank Part-of-speech Tags
Index