Early Language Models Timeline

Notes:

In the 2000s, neural networks shifted from exploratory to influential in NLP. Bengio et al. (2001; JMLR 2003) introduced the neural probabilistic language model, which jointly learned distributed word vectors and a feed-forward next-word predictor. Scalability improved through tree factorizations such as hierarchical softmax (Morin & Bengio, 2005) and through log-bilinear models (Mnih & Hinton, late 2000s), both aimed at lowering the cost of normalizing over large vocabularies. Collobert & Weston (2008) showed that a single deep architecture with shared representations could support tagging, chunking, NER, and SRL. By the decade's end, practical CUDA-class GPUs sped up training, even though datasets and models remained modest. Building on 1990s connectionist work (Elman/Jordan RNNs, RAAM, TDNNs), these advances preceded reusable stand-alone embeddings: learned vectors remained internal, task-tied parameters, bridging toward the embedding-centric era.
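The core of the Bengio-style model is small enough to sketch. The following is a minimal illustration, not the original implementation: a shared embedding table maps the previous context words to vectors, which are concatenated and passed through a tanh hidden layer and a full softmax over the vocabulary. All names and sizes (vocab_size, embed_dim, context, hidden, next_word_distribution) are illustrative assumptions, and the optional direct input-to-output connection from the original paper is omitted.

```python
import numpy as np

# Minimal sketch of a Bengio-style neural probabilistic language model
# (feed-forward, fixed context window). Sizes are illustrative only.
rng = np.random.default_rng(0)

vocab_size, embed_dim, context, hidden = 10_000, 64, 3, 128

C = rng.normal(0, 0.1, (vocab_size, embed_dim))         # shared word embeddings
H = rng.normal(0, 0.1, (context * embed_dim, hidden))   # input -> hidden weights
d = np.zeros(hidden)
U = rng.normal(0, 0.1, (hidden, vocab_size))            # hidden -> output scores
b = np.zeros(vocab_size)

def next_word_distribution(context_ids):
    """P(w_t | previous context words) for one context window."""
    x = C[context_ids].reshape(-1)          # look up and concatenate context embeddings
    h = np.tanh(x @ H + d)                  # tanh hidden layer
    scores = h @ U + b                      # one score per vocabulary word
    scores -= scores.max()                  # numerical stability
    p = np.exp(scores)
    return p / p.sum()                      # full softmax over the vocabulary

probs = next_word_distribution(np.array([12, 7, 431]))
print(probs.shape, probs.sum())             # (10000,) ~1.0
```

Training learns the embedding table C jointly with H, U, and the biases by backpropagating a cross-entropy loss; the expensive step is the final softmax, whose cost grows linearly with vocabulary size, which is exactly the bottleneck the 2005 and 2007–2009 entries below address.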

See also:

LLM Evolution Timeline


[Aug 2025]

  • 2001 (published 2003): Bengio et al. introduce the neural probabilistic language model, jointly learning distributed word vectors and a feed-forward predictor as a smoother alternative to n-grams.
  • 2005: Morin and Bengio propose tree-based factorization (hierarchical softmax) to reduce the computational cost of normalizing over large vocabularies (a toy sketch of the tree walk appears after this list).
  • 2006–2007: Unsupervised pretraining and renewed interest in deep architectures broaden feasibility for neural NLP, though datasets and models remain small by later standards.
  • 2007–2009: Mnih and Hinton develop log-bilinear and hierarchical approaches that further improve scalability and training efficiency for neural language modeling.
  • 2008: Collobert and Weston demonstrate a unified deep architecture with shared representations handling POS tagging, chunking, NER, and SRL within one framework.
  • 2009: Increasing practicality of CUDA-class GPUs accelerates training and experimentation with neural models for language.
  • Late 2000s: Learned vectors are primarily internal, task-tied parameters rather than portable, stand-alone embeddings, setting the stage for the embedding-centric era of the 2010s.
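To make the normalization-cost argument concrete, the sketch below is an illustrative toy in the spirit of hierarchical softmax, not Morin and Bengio's exact formulation: a word's probability is a product of binary sigmoid decisions along its path in a tree over the vocabulary, so scoring one word touches O(log |V|) node vectors instead of all |V| output weights. The complete-binary-tree layout, heap-style indexing, and all sizes here are assumptions for illustration.

```python
import numpy as np

# Illustrative hierarchical softmax: each internal tree node has a parameter
# vector, and P(word | h) is the product of sigmoid left/right decisions
# along the word's leaf path, so one word costs O(log |V|) rather than O(|V|).
rng = np.random.default_rng(0)

vocab_size, hidden = 8, 4                     # tiny sizes, for illustration
depth = int(np.log2(vocab_size))              # complete binary tree over the vocabulary
node_vecs = rng.normal(0, 0.1, (vocab_size - 1, hidden))   # one vector per internal node

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_probability(h, word_id):
    """P(word | hidden state h), walking the word's leaf path (heap layout)."""
    p, node = 1.0, 0
    for bit in format(word_id, f"0{depth}b"):       # bits of the word id encode the turns
        q = sigmoid(h @ node_vecs[node])            # probability of taking the "0" branch
        p *= q if bit == "0" else 1.0 - q
        node = 2 * node + 1 + (bit == "1")          # 0-based heap child index
    return p

h = rng.normal(0, 0.1, hidden)
print(sum(word_probability(h, w) for w in range(vocab_size)))  # ~1.0 over the vocabulary
```

Mnih and Hinton's log-bilinear models address the same scaling concern from another angle, replacing the tanh hidden layer with a linear prediction of the target word's embedding; their later hierarchical variant combines that scoring scheme with a learned tree over the vocabulary.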
