Natural Language Annotation for Machine Learning


Natural Language Annotation for Machine Learning (2013), by @jamespusto and @amber_stubbs


Table of Contents

Chapter 1 The Basics
The Importance of Language Annotation
A Brief History of Corpus Linguistics
Language Data and Machine Learning
The Annotation Development Cycle
Summary

Chapter 2 Defining Your Goal and Dataset
Defining Your Goal
Background Research
Assembling Your Dataset
The Size of Your Corpus
Summary

Chapter 3 Corpus Analytics
Basic Probability for Corpus Analytics
Counting Occurrences
Language Models
Summary

Chapter 4 Building Your Model and Specification
Some Example Models and Specs
Adopting (or Not Adopting) Existing Models
Different Kinds of Standards
Summary

Chapter 5 Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification
Text Extent Annotation: Named Entities
Linked Extent Annotation: Semantic Roles
ISO Standards and You
Summary

Chapter 6 Annotation and Adjudication
The Infrastructure of an Annotation Project
Specification Versus Guidelines
Be Prepared to Revise
Preparing Your Data for Annotation
Writing the Annotation Guidelines
Annotators
Choosing an Annotation Environment
Evaluating the Annotations
Creating the Gold Standard (Adjudication)
Summary

Chapter 7 Training: Machine Learning
What Is Learning?
Defining Our Learning Task
Classifier Algorithms
Sequence Induction Algorithms
Clustering and Unsupervised Learning
Semi-Supervised Learning
Matching Annotation to Algorithms
Summary

Chapter 8 Testing and Evaluation
Testing Your Algorithm
Evaluating Your Algorithm
Problems That Can Affect Evaluation
Final Testing Scores
Summary

Chapter 9 Revising and Reporting
Revising Your Project
Reporting About Your Work
Summary

Chapter 10 Annotation: TimeML
The Goal of TimeML
Related Research
Building the Corpus
Model: Preliminary Specifications
Annotation: First Attempts
Model: The TimeML Specification Used in TimeBank
Annotation: The Creation of TimeBank
TimeML Becomes ISO-TimeML
Modeling the Future: Directions for TimeML
Summary

Chapter 11 Automatic Annotation: Generating TimeML
The TARSQI Components
Improvements to the TTK
TimeML Challenges: TempEval-2
Future of the TTK
Summary

Chapter 12 Afterword: The Future of Annotation
Crowdsourcing Annotation
Handling Big Data
NLP Online and in the Cloud
And Finally…

Appendix List of Available Corpora and Specifications
Corpora
Specifications, Guidelines, and Other Resources
Representation Standards

Appendix List of Software Resources
Annotation and Adjudication Software
Machine Learning Resources

Appendix MAE User Guide
Installing and Running MAE
Loading Tasks and Files
Saving Files
Defining Your Own Task
Frequently Asked Questions

Appendix MAI User Guide
Installing and Running MAI
Loading Tasks and Files
Adjudicating
Saving Files

Appendix Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing

Colophon