Text Analytics with Python


Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data (2016), by Dipanjan Sarkar


Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Natural Language Basics

Natural Language

What Is Natural Language?
The Philosophy of Language
Language Acquisition and Usage

Linguistics

Language Syntax and Structure

Words
Phrases
Clauses
Grammar
Word Order Typology

Language Semantics

Lexical Semantic Relations
Semantic Networks and Models
Representation of Semantics

Text Corpora

Corpora Annotation and Utilities
Popular Corpora
Accessing Text Corpora

Natural Language Processing

Machine Translation
Speech Recognition Systems
Question Answering Systems
Contextual Recognition and Resolution
Text Summarization
Text Categorization

Text Analytics

Summary

Chapter 2: Python Refresher

Getting to Know Python

The Zen of Python
Applications: When Should You Use Python?
Drawbacks: When Should You Not Use Python?
Python Implementations and Versions

Installation and Setup

Which Python Version?
Which Operating System?
Integrated Development Environments
Environment Setup
Virtual Environments

Python Syntax and Structure

Data Structures and Types

Numeric Types
Strings
Lists
Sets
Dictionaries
Tuples
Files
Miscellaneous

Controlling Code Flow

Conditional Constructs
Looping Constructs
Handling Exceptions

Functional Programming

Functions
Recursive Functions
Anonymous Functions
Iterators
Comprehensions
Generators
The itertools and functools Modules

Classes

Working with Text

String Literals
String Operations and Methods

Text Analytics Frameworks

Summary

Chapter 3: Processing and Understanding Text

Text Tokenization

Sentence Tokenization
Word Tokenization

Text Normalization

Cleaning Text
Tokenizing Text
Removing Special Characters
Expanding Contractions
Case Conversions
Removing Stopwords
Correcting Words
Stemming
Lemmatization

Understanding Text Syntax and Structure

Installing Necessary Dependencies
Important Machine Learning Concepts
Parts of Speech (POS) Tagging
Shallow Parsing
Dependency-based Parsing
Constituency-based Parsing

Summary

Chapter 4: Text Classification

What Is Text Classification?

Automated Text Classification

Text Classification Blueprint

Text Normalization

Feature Extraction

Bag of Words Model
TF-IDF Model
Advanced Word Vectorization Models

Classification Algorithms

Multinomial Naïve Bayes
Support Vector Machines

Evaluating Classification Models

Building a Multi-Class Classification System

Applications and Uses

Summary

Chapter 5: Text Summarization

Text Summarization and Information Extraction

Important Concepts

Documents
Text Normalization
Feature Extraction
Feature Matrix
Singular Value Decomposition

Text Normalization

Feature Extraction

Keyphrase Extraction

Collocations
Weighted Tag–Based Phrase Extraction

Topic Modeling

Latent Semantic Indexing
Latent Dirichlet Allocation
Non-negative Matrix Factorization
Extracting Topics from Product Reviews

Automated Document Summarization

Latent Semantic Analysis
TextRank
Summarizing a Product Description

Summary

Chapter 6: Text Similarity and Clustering

Important Concepts

Information Retrieval (IR)
Feature Engineering
Similarity Measures
Unsupervised Machine Learning Algorithms

Text Normalization

Feature Extraction

Text Similarity

Analyzing Term Similarity

Hamming Distance
Manhattan Distance
Euclidean Distance
Levenshtein Edit Distance
Cosine Distance and Similarity

Analyzing Document Similarity

Cosine Similarity
Hellinger-Bhattacharyya Distance
Okapi BM25 Ranking

Document Clustering

Clustering Greatest Movies of All Time

K-means Clustering
Affinity Propagation
Ward’s Agglomerative Hierarchical Clustering

Summary

Chapter 7: Semantic and Sentiment Analysis

Semantic Analysis

Exploring WordNet

Understanding Synsets
Analyzing Lexical Semantic Relations

Word Sense Disambiguation

Named Entity Recognition

Analyzing Semantic Representations

Propositional Logic
First Order Logic

Sentiment Analysis

Sentiment Analysis of IMDb Movie Reviews

Setting Up Dependencies
Preparing Datasets
Supervised Machine Learning Technique
Unsupervised Lexicon-based Techniques
Comparing Model Performances

Summary

Index