Natural Language Annotation for Machine Learning

Natural Language Annotation for Machine Learning (2013), by @jamespusto and @amber_stubbs

Table of Contents

Chapter 1 The Basics
The Importance of Language Annotation
A Brief History of Corpus Linguistics
Language Data and Machine Learning
The Annotation Development Cycle

Chapter 2 Defining Your Goal and Dataset
Defining Your Goal
Background Research
Assembling Your Dataset
The Size of Your Corpus

Chapter 3 Corpus Analytics
Basic Probability for Corpus Analytics
Counting Occurrences
Language Models

Chapter 4 Building Your Model and Specification
Some Example Models and Specs
Adopting (or Not Adopting) Existing Models
Different Kinds of Standards

Chapter 5 Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification
Text Extent Annotation: Named Entities
Linked Extent Annotation: Semantic Roles
ISO Standards and You

Chapter 6 Annotation and Adjudication
The Infrastructure of an Annotation Project
Specification Versus Guidelines
Be Prepared to Revise
Preparing Your Data for Annotation
Writing the Annotation Guidelines
Choosing an Annotation Environment
Evaluating the Annotations
Creating the Gold Standard (Adjudication)

Chapter 7 Training: Machine Learning
What Is Learning?
Defining Our Learning Task
Classifier Algorithms
Sequence Induction Algorithms
Clustering and Unsupervised Learning
Semi-Supervised Learning
Matching Annotation to Algorithms

Chapter 8 Testing and Evaluation
Testing Your Algorithm
Evaluating Your Algorithm
Problems That Can Affect Evaluation
Final Testing Scores

Chapter 9 Revising and Reporting
Revising Your Project
Reporting About Your Work

Chapter 10 Annotation: TimeML
The Goal of TimeML
Related Research
Building the Corpus
Model: Preliminary Specifications
Annotation: First Attempts
Model: The TimeML Specification Used in TimeBank
Annotation: The Creation of TimeBank
TimeML Becomes ISO-TimeML
Modeling the Future: Directions for TimeML

Chapter 11 Automatic Annotation: Generating TimeML
The TARSQI Components
Improvements to the TTK
TimeML Challenges: TempEval-2
Future of the TTK

Chapter 12 Afterword: The Future of Annotation
Crowdsourcing Annotation
Handling Big Data
NLP Online and in the Cloud
And Finally…

Appendix List of Available Corpora and Specifications
Specifications, Guidelines, and Other Resources
Representation Standards

Appendix List of Software Resources
Annotation and Adjudication Software
Machine Learning Resources

Appendix MAE User Guide
Installing and Running MAE
Loading Tasks and Files
Saving Files
Defining Your Own Task
Frequently Asked Questions

Appendix MAI User Guide
Installing and Running MAI
Loading Tasks and Files
Saving Files

Appendix Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing