AI Engineering
Production AI architecture and engineering practices.
By Yangming Li
Most leading large language models (LLMs) today are built on decoder-only architectures. Encoder-only and encoder-decoder designs were once just as prominent, but the field has since converged on decoder-only models. This article explores the reasons behind that shift.
Encoder-only models like BERT use bidirectional self-attention to process input sequences, so every token can attend to every other token. These models excel at understanding context from both directions but are not designed for autoregressive text generation. Key characteristics include:
Models like T5 and BART implement both encoding and decoding components, making them suitable for transformation tasks. Technical features include:
Modern LLMs predominantly use the decoder-only architecture, characterized by:
Decoder-only architectures implement causal attention, which offers several mathematical advantages:
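A minimal sketch of causal attention weights, in pure Python with toy scores chosen only for illustration (the function names are hypothetical). It shows one property commonly cited for causal attention: after masking and softmax, the attention matrix is lower triangular with strictly positive diagonal entries, so its determinant is positive and the matrix is full rank.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries of -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask future positions in an n x n score matrix, then softmax each row.

    scores[i][j] is the raw query-key score of position i attending to j.
    Entries with j > i (future positions) are replaced by -inf.
    """
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Toy scores for a 4-token sequence (arbitrary illustrative values).
scores = [
    [0.2, 1.0, -0.5, 0.3],
    [0.7, 0.1, 0.4, -1.2],
    [-0.3, 0.8, 0.5, 0.9],
    [1.1, -0.6, 0.0, 0.2],
]
weights = causal_attention_weights(scores)

# The determinant of a triangular matrix is the product of its diagonal.
det = math.prod(weights[i][i] for i in range(len(weights)))
print("determinant > 0:", det > 0)
```

A bidirectional attention matrix carries no such guarantee, since nothing forces it to be triangular.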
The architecture provides significant computational benefits:
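A commonly cited example is key-value (KV) caching. Because causal attention never lets a position see later tokens, the keys and values of earlier positions do not change as new tokens are appended, so they can be stored and reused during generation. The sketch below is a toy single-head version in pure Python; `attend`, `full_recompute`, and `KVCache` are hypothetical names, and the learned query/key/value projections are omitted as a simplifying assumption.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d)]

def full_recompute(tokens):
    """Recompute causal attention for every position from scratch."""
    return [attend(tokens[i], tokens[: i + 1], tokens[: i + 1])
            for i in range(len(tokens))]

class KVCache:
    """Hypothetical helper that stores keys/values of past positions."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        # Assumption: the token vector itself serves as query, key, and value,
        # standing in for the learned projections of a real model.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, -0.2]]

cache = KVCache()
incremental = [cache.step(t) for t in tokens]  # one new token per step
full = full_recompute(tokens)                  # everything recomputed
```

Each incremental step processes only the new token against the cache, yet produces the same outputs as recomputing the whole sequence.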
Technical aspects of context handling include:
Unique training characteristics include:
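One widely documented characteristic of causal language model training is that a single sequence of n tokens yields n - 1 next-token prediction targets in one forward pass, because each position predicts the token that follows it. A minimal sketch, with hypothetical function names and a uniform distribution standing in for a real model's predictions:

```python
import math

def next_token_targets(token_ids):
    """Pair each input position with the token that follows it."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

def causal_lm_loss(probs, targets):
    """Mean negative log-likelihood of the true next tokens.

    probs[t] is the predicted distribution over the vocabulary after
    reading inputs[0..t]; targets[t] is the true next token id.
    """
    nll = [-math.log(probs[t][targets[t]]) for t in range(len(targets))]
    return sum(nll) / len(nll)

token_ids = [3, 1, 4, 1, 5]  # toy sequence of 5 token ids
inputs, targets = next_token_targets(token_ids)

# Assumption: a uniform distribution over a 6-token vocabulary replaces
# the output of a trained model.
vocab_size = 6
probs = [[1.0 / vocab_size] * vocab_size for _ in inputs]
loss = causal_lm_loss(probs, targets)
```

With a uniform stand-in model the loss equals the natural logarithm of the vocabulary size, which is the baseline a trained model should beat.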
Given their advantages in training efficiency, engineering implementation, and theoretical properties, decoder-only architectures have become the mainstream choice for LLM design. For generative tasks in particular, adding bidirectional attention brings no substantial benefit, and where encoder-decoder models do come out ahead, the gain is largely attributable to their larger parameter counts. At equal parameter count and inference cost, the decoder-only architecture stands out as the best choice.
Zhihu.com