Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data (2016), by Dipanjan Sarkar
Contents
About the Author … xv
About the Technical Reviewer … xvii
Acknowledgments … xix
Introduction … xxi
Chapter 1: Natural Language Basics … 1
Natural Language … 2
What Is Natural Language? … 2
The Philosophy of Language … 2
Language Acquisition and Usage … 5
Linguistics … 8
Language Syntax and Structure … 10
Words … 11
Phrases … 12
Clauses … 14
Grammar … 15
Word Order Typology … 23
Language Semantics … 25
Lexical Semantic Relations … 25
Semantic Networks and Models … 28
Representation of Semantics … 29
Text Corpora … 37
Corpora Annotation and Utilities … 38
Popular Corpora … 39
Accessing Text Corpora … 40
Natural Language Processing … 46
Machine Translation … 46
Speech Recognition Systems … 47
Question Answering Systems … 47
Contextual Recognition and Resolution … 48
Text Summarization … 48
Text Categorization … 49
Text Analytics … 49
Summary … 50
Chapter 2: Python Refresher … 51
Getting to Know Python … 51
The Zen of Python … 54
Applications: When Should You Use Python? … 55
Drawbacks: When Should You Not Use Python? … 58
Python Implementations and Versions … 59
Installation and Setup … 60
Which Python Version? … 60
Which Operating System? … 61
Integrated Development Environments … 61
Environment Setup … 62
Virtual Environments … 64
Python Syntax and Structure … 66
Data Structures and Types … 69
Numeric Types … 70
Strings … 72
Lists … 73
Sets … 74
Dictionaries … 75
Tuples … 76
Files … 77
Miscellaneous … 78
Controlling Code Flow … 78
Conditional Constructs … 79
Looping Constructs … 80
Handling Exceptions … 82
Functional Programming … 84
Functions … 84
Recursive Functions … 85
Anonymous Functions … 86
Iterators … 87
Comprehensions … 88
Generators … 90
The itertools and functools Modules … 91
Classes … 91
Working with Text … 94
String Literals … 94
String Operations and Methods … 96
Text Analytics Frameworks … 104
Summary … 106
Chapter 3: Processing and Understanding Text … 107
Text Tokenization … 108
Sentence Tokenization … 108
Word Tokenization … 112
Text Normalization … 115
Cleaning Text … 115
Tokenizing Text … 116
Removing Special Characters … 116
Expanding Contractions … 118
Case Conversions … 119
Removing Stopwords … 120
Correcting Words … 121
Stemming … 128
Lemmatization … 131
Understanding Text Syntax and Structure … 132
Installing Necessary Dependencies … 133
Important Machine Learning Concepts … 134
Parts of Speech (POS) Tagging … 135
Shallow Parsing … 143
Dependency-based Parsing … 153
Constituency-based Parsing … 158
Summary … 165
Chapter 4: Text Classification … 167
What Is Text Classification? … 168
Automated Text Classification … 170
Text Classification Blueprint … 172
Text Normalization … 174
Feature Extraction … 177
Bag of Words Model … 179
TF-IDF Model … 181
Advanced Word Vectorization Models … 187
Classification Algorithms … 193
Multinomial Naïve Bayes … 195
Support Vector Machines … 197
Evaluating Classification Models … 199
Building a Multi-Class Classification System … 204
Applications and Uses … 214
Summary … 215
Chapter 5: Text Summarization … 217
Text Summarization and Information Extraction … 218
Important Concepts … 220
Documents … 220
Text Normalization … 220
Feature Extraction … 221
Feature Matrix … 221
Singular Value Decomposition … 221
Text Normalization … 223
Feature Extraction … 224
Keyphrase Extraction … 225
Collocations … 226
Weighted Tag–Based Phrase Extraction … 230
Topic Modeling … 234
Latent Semantic Indexing … 235
Latent Dirichlet Allocation … 241
Non-negative Matrix Factorization … 245
Extracting Topics from Product Reviews … 246
Automated Document Summarization … 250
Latent Semantic Analysis … 253
TextRank … 256
Summarizing a Product Description … 261
Summary … 263
Chapter 6: Text Similarity and Clustering … 265
Important Concepts … 266
Information Retrieval (IR) … 266
Feature Engineering … 267
Similarity Measures … 267
Unsupervised Machine Learning Algorithms … 268
Text Normalization … 268
Feature Extraction … 270
Text Similarity … 271
Analyzing Term Similarity … 271
Hamming Distance … 274
Manhattan Distance … 275
Euclidean Distance … 277
Levenshtein Edit Distance … 278
Cosine Distance and Similarity … 283
Analyzing Document Similarity … 285
Cosine Similarity … 287
Hellinger-Bhattacharyya Distance … 289
Okapi BM25 Ranking … 292
Document Clustering … 296
Clustering Greatest Movies of All Time … 299
K-means Clustering … 301
Affinity Propagation … 308
Ward’s Agglomerative Hierarchical Clustering … 313
Summary … 317
Chapter 7: Semantic and Sentiment Analysis … 319
Semantic Analysis … 320
Exploring WordNet … 321
Understanding Synsets … 321
Analyzing Lexical Semantic Relations … 323
Word Sense Disambiguation … 330
Named Entity Recognition … 332
Analyzing Semantic Representations … 336
Propositional Logic … 336
First Order Logic … 338
Sentiment Analysis … 342
Sentiment Analysis of IMDb Movie Reviews … 343
Setting Up Dependencies … 343
Preparing Datasets … 347
Supervised Machine Learning Technique … 348
Unsupervised Lexicon-based Techniques … 352
Comparing Model Performances … 374
Summary … 376
Index … 377