[Sep 2025]
From N-Grams and RNNs to Transformers and Multimodal Successors
Before the 2017 Transformer, machines handled sequences mainly in two waves: first, statistical methods such as n-gram language models and phrase-based translation stitched text together from word and phrase counts, often with hand-engineered features; then neural networks took over with recurrent models (RNNs, later LSTMs and GRUs) that read one token at a time and tried to “remember” what came before. Around 2014–2016, encoder–decoder setups with an add-on called attention let the decoder look back at the most relevant parts of the input, which helped but still required slow, step-by-step processing. In parallel, convolutional models tried to speed things up with stacks of filters that could read many tokens at once, but they struggled to capture very long context without becoming deep and heavy. Memory-augmented ideas and word embeddings improved pieces of the puzzle, yet training remained hard to parallelize and long-range dependencies were fragile. The Transformer arrived by making attention the main engine, with no recurrence, explicit positional cues, and full parallelism, solving the bottlenecks those earlier families couldn’t.
Back in 2017, a group of researchers asked a simple question: instead of reading text one word at a time and trying to remember everything, what if a model could glance at the whole sentence at once and decide which words matter to each other? They built the Transformer to do exactly that. It comes in two halves: an encoder that reads and forms a rich “understanding” of the input, and a decoder that uses that understanding to produce the output. The trick is attention, a scoring system that lets each word focus on the most relevant other words, plus positional encodings so the model knows which words came first, and stabilizers (residual connections and layer normalization) so training doesn’t blow up. Because all words are processed in parallel, it trains fast and handles long passages far better than older step-by-step networks. That core encoder–decoder design from “Attention Is All You Need” became the blueprint: stripped to just the encoder, it powers understanding models like BERT; stripped to just the decoder, it powers generators like GPT; kept as both, it excels at translation and summarization. That is the origin and shape of the Transformer architecture introduced by Vaswani and colleagues.
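To make the “scoring system” concrete, here is a minimal sketch of scaled dot-product attention in NumPy. The sizes, random inputs, and projection matrices are illustrative stand-ins for learned weights, not the configuration from the paper; the point is that each word’s query is compared against every word’s key, and the resulting weights mix the value vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # every query is scored against every key; dividing by sqrt(d_k) keeps
    # the scores from growing with dimension and saturating the softmax
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (num_words, num_words)
    weights = softmax(scores, axis=-1)   # each row: how much one word attends to the others
    return weights @ V, weights          # output: a weighted mix of the value vectors

# toy setup: 4 "words" with 8-dimensional representations (illustrative sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# in a real Transformer the Q, K, V projections are learned; random stand-ins here
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(attn.round(2))  # each row sums to 1: one word's attention over all words
```

A full Transformer layer wraps this step in multiple heads, adds positional encodings, and follows it with a feed-forward block, but the weighted mixing above is the core idea.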
After 2017, the blueprint split into three tracks that matured quickly: encoder-only models specialized in understanding and search (e.g., masked-language pretraining for classification and embeddings), decoder-only models scaled up for fluent generation (next-token prediction enabling chat, coding, and writing), and encoder–decoder models refined for translation and summarization under a unified “text-to-text” view. Training shifted to massive pretraining followed by task adaptation via fine-tuning, instruction tuning, and reinforcement learning from human feedback. Practical systems added retrieval so models could look things up rather than memorize everything, while engineering focused on longer context and lower cost through efficient attention kernels, key–value caching, sparse mixture-of-experts routing, and position schemes that extend sequence length. Transformers then expanded beyond text to vision, audio, and multimodal pipelines, powering captioning, image and video understanding, and generation with paired text–image training. In parallel, researchers explored “post-transformer” or hybrid sequence models such as state-space approaches and RWKV to better handle very long sequences or reduce compute, but the core attention-based transformer remained the standard foundation that most modern large models build upon.
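As a rough illustration of the decoder-only recipe and why key–value caching matters, here is a toy greedy generation loop in NumPy. The vocabulary size, dimensions, and random matrices are hypothetical stand-ins for a trained model’s weights, and the single attention step omits layers, heads, and positional information; the point is only that past keys and values are computed once, cached, and reused at every new step of next-token prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16  # toy vocabulary and model width, purely illustrative

# stand-ins for learned weights: token embeddings, Q/K/V projections, output head
E = rng.normal(size=(vocab, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, vocab))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def generate(prompt, steps=5):
    k_cache, v_cache = [], []          # keys and values for every token seen so far
    tokens = list(prompt)
    for i in range(len(prompt) + steps):
        x = E[tokens[i]]               # embed the token at position i
        k_cache.append(x @ W_k)        # only the new token's key and value are computed;
        v_cache.append(x @ W_v)        # earlier ones are reused from the cache
        q = x @ W_q
        scores = np.array([q @ k for k in k_cache]) / np.sqrt(d)
        ctx = softmax(scores) @ np.array(v_cache)   # attend over the whole cache
        logits = ctx @ W_out
        if i >= len(prompt) - 1 and len(tokens) < len(prompt) + steps:
            tokens.append(int(logits.argmax()))     # greedy next-token prediction
    return tokens

print(generate([3, 7, 11]))  # 3 prompt tokens followed by 5 generated ones
```

Without the cache, each new token would require re-encoding the entire prefix, which is why serving stacks treat key–value caching as a baseline optimization alongside the efficient attention kernels and routing schemes mentioned above.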