Natural Language Annotation for Machine Learning (2013), by @jamespusto & @amber_stubbs
Table of Contents
Chapter 1 The Basics
The Importance of Language Annotation
A Brief History of Corpus Linguistics
Language Data and Machine Learning
The Annotation Development Cycle
Summary
Chapter 2 Defining Your Goal and Dataset
Defining Your Goal
Background Research
Assembling Your Dataset
The Size of Your Corpus
Summary
Chapter 3 Corpus Analytics
Basic Probability for Corpus Analytics
Counting Occurrences
Language Models
Summary
Chapter 4 Building Your Model and Specification
Some Example Models and Specs
Adopting (or Not Adopting) Existing Models
Different Kinds of Standards
Summary
Chapter 5 Applying and Adopting Annotation Standards
Metadata Annotation: Document Classification
Text Extent Annotation: Named Entities
Linked Extent Annotation: Semantic Roles
ISO Standards and You
Summary
Chapter 6 Annotation and Adjudication
The Infrastructure of an Annotation Project
Specification Versus Guidelines
Be Prepared to Revise
Preparing Your Data for Annotation
Writing the Annotation Guidelines
Annotators
Choosing an Annotation Environment
Evaluating the Annotations
Creating the Gold Standard (Adjudication)
Summary
Chapter 7 Training: Machine Learning
What Is Learning?
Defining Our Learning Task
Classifier Algorithms
Sequence Induction Algorithms
Clustering and Unsupervised Learning
Semi-Supervised Learning
Matching Annotation to Algorithms
Summary
Chapter 8 Testing and Evaluation
Testing Your Algorithm
Evaluating Your Algorithm
Problems That Can Affect Evaluation
Final Testing Scores
Summary
Chapter 9 Revising and Reporting
Revising Your Project
Reporting About Your Work
Summary
Chapter 10 Annotation: TimeML
The Goal of TimeML
Related Research
Building the Corpus
Model: Preliminary Specifications
Annotation: First Attempts
Model: The TimeML Specification Used in TimeBank
Annotation: The Creation of TimeBank
TimeML Becomes ISO-TimeML
Modeling the Future: Directions for TimeML
Summary
Chapter 11 Automatic Annotation: Generating TimeML
The TARSQI Components
Improvements to the TTK
TimeML Challenges: TempEval-2
Future of the TTK
Summary
Chapter 12 Afterword: The Future of Annotation
Crowdsourcing Annotation
Handling Big Data
NLP Online and in the Cloud
And Finally…
Appendix List of Available Corpora and Specifications
Corpora
Specifications, Guidelines, and Other Resources
Representation Standards
Appendix List of Software Resources
Annotation and Adjudication Software
Machine Learning Resources
Appendix MAE User Guide
Installing and Running MAE
Loading Tasks and Files
Saving Files
Defining Your Own Task
Frequently Asked Questions
Appendix MAI User Guide
Installing and Running MAI
Loading Tasks and Files
Adjudicating
Saving Files
Appendix Bibliography
References for Using Amazon’s Mechanical Turk/Crowdsourcing
Colophon