Why Have Decoder-Only Architectures Become the Standard in Large Language Models? (Popularized 2020-2023)

Background

Currently, most leading large language models (LLMs) are based on decoder-only architectures. In the past, encoder-only and encoder-decoder structures also held prominence, yet today, the trend has shifted toward decoder-only designs. This article explores the reasons behind this shift.

Core Architectural Designs

1. Encoder-Only Architecture (BERT-style)

Encoder-only models like BERT use bidirectional self-attention to process input sequences. These models excel at understanding context from both directions but are not designed for open-ended, left-to-right text generation. Key characteristics include:

Bidirectional attention allows tokens to attend to both past and future context
Masked Language Modeling (MLM) as the primary pre-training objective
Optimal for classification, token-level tasks, and semantic understanding
Examples: BERT, RoBERTa, DeBERTa
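
As a rough illustration of the MLM objective listed above, the sketch below hides a random subset of tokens and asks the model to recover only those. It is a simplification, not BERT's exact recipe: BERT selects about 15% of tokens and then applies an 80/10/10 split between the mask token, a random token, and the unchanged token, which is omitted here. The names MASK_ID, IGNORE, and mask_for_mlm are hypothetical and chosen for this example only.

```python
import random

MASK_ID = 103   # hypothetical id for the [MASK] token
IGNORE = -100   # hypothetical label meaning "no loss at this position"

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted_inputs, labels) for a simplified MLM objective."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # hide the token from the model
            labels.append(tok)       # the model must recover it
        else:
            inputs.append(tok)       # visible context on both sides
            labels.append(IGNORE)    # contributes nothing to the loss
    return inputs, labels

tokens = list(range(1000, 1020))     # 20 made-up token ids
inputs, labels = mask_for_mlm(tokens)
supervised = sum(label != IGNORE for label in labels)
```

Only the masked positions carry a training signal, so supervised is typically a small fraction of the sequence length.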

2. Encoder-Decoder Architecture (Seq2Seq)

Models like T5 and BART implement both encoding and decoding components, making them suitable for transformation tasks. Technical features include:

Cross-attention mechanism between encoder and decoder
Encoder processes input bidirectionally
Decoder generates output autoregressively
Examples: T5, BART, mT5

3. Decoder-Only Architecture (GPT-style)

Modern LLMs predominantly use this architecture, characterized by:

Causal (unidirectional) self-attention mechanism
Next-token prediction as the primary training objective
Unified approach to both understanding and generation
Examples: GPT-3, LLaMA, Claude
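
The two defining ingredients above, the causal mask and next-token prediction, can be sketched in a few lines. This is a minimal illustration that uses plain Python lists and whole words in place of real token ids; causal_mask and next_token_pairs are hypothetical helper names.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def next_token_pairs(token_ids):
    """Pair each position with the token it is trained to predict."""
    return list(zip(token_ids[:-1], token_ids[1:]))

tokens = ["The", "cat", "sat", "down"]
mask = causal_mask(len(tokens))
pairs = next_token_pairs(tokens)
# pairs == [("The", "cat"), ("cat", "sat"), ("sat", "down")]
```

Every position except the last supplies a training target, and the mask ensures that no position can see the token it is asked to predict.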

Technical Advantages of Decoder-Only Architectures

1. Mathematical Properties of Causal Attention

Decoder-only architectures implement causal attention, which offers several mathematical advantages:

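After causal masking and softmax, the attention matrix A is lower triangular with strictly positive diagonal entries
Such a matrix is always full rank, because det(A) equals the product of its diagonal entries and is therefore non-zero
This property is often argued to support stable gradient flow during training, although full rank alone does not guarantee it
Each token always keeps some attention weight on itself, so its own information is never masked out of a layer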
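
The full-rank claim above can be checked numerically. The sketch below builds a causally masked softmax matrix from random scores (stand-ins for real query-key products) and uses the fact that the determinant of a triangular matrix is the product of its diagonal. The helper name causal_softmax is hypothetical.

```python
import math
import random

def causal_softmax(scores):
    """Row-wise softmax where position i only sees positions j <= i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)                        # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out

rng = random.Random(0)
n = 5
scores = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
A = causal_softmax(scores)

# For a triangular matrix the determinant is the product of the diagonal.
det_A = math.prod(A[i][i] for i in range(n))
```

Because every diagonal entry is a softmax output, it is strictly positive, so det_A is strictly positive as well. In practice the value can be numerically very small, so a non-zero determinant should not be read as a guarantee of good conditioning.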

2. Computational Efficiency

The architecture provides significant computational benefits:

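With a KV cache, each newly generated token costs O(n) attention work over the n cached positions; with bidirectional attention, earlier representations would change whenever a token is added, so the full O(n²) attention would have to be recomputed at every step
Under causal masking the keys and values of past tokens never change, which is what makes the KV cache valid for sequential decoding
A single stack with a single cache keeps inference simple, though the KV cache itself grows linearly with sequence length
During training, all positions are processed in parallel and every position contributes a next-token loss in a single forward pass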
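
The caching argument above can be illustrated with a toy single-head example. The sketch assumes the per-token queries, keys, and values are already given (random vectors stand in for the projections a real model would compute), and it compares full recomputation with incremental decoding that reuses cached keys and values. The names attend and random_vector are hypothetical.

```python
import math
import random

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                             # subtract max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

rng = random.Random(0)
n, dim = 6, 4

def random_vector():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

Q = [random_vector() for _ in range(n)]
K = [random_vector() for _ in range(n)]
V = [random_vector() for _ in range(n)]

# Full recomputation: each position attends over its whole prefix from scratch.
full = [attend(Q[t], K[: t + 1], V[: t + 1]) for t in range(n)]

# Incremental decoding: append one key/value per step and reuse the rest.
k_cache, v_cache, cached = [], [], []
for t in range(n):
    k_cache.append(K[t])
    v_cache.append(V[t])
    cached.append(attend(Q[t], k_cache, v_cache))
```

Both paths produce the same outputs, and each incremental step touches only the t + 1 cached entries. The cache itself holds one key and one value per past token, which is the linear memory cost mentioned above.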

3. Advanced Context Processing

Technical aspects of context handling include:

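In-context learning: the prompt itself conditions the model through attention, with no parameter updates
Gradient flow through the causal attention layers, as noted under the mathematical properties above
Long-range dependencies are handled by attending over the full preceding context, up to the context window
Strong few-shot and zero-shot performance has been reported for large decoder-only models such as GPT-3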

4. Training Dynamics

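Training characteristics commonly attributed to decoder-only models include:

A more stable optimization landscape, which is often credited to the causal structure
Favorable scaling behavior as model size grows
Gradient flow that holds up in deeper networks
Efficient parameter utilization, since a single stack of weights serves both understanding and generation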

Conclusion

Given their advantages in training efficiency, engineering implementation, and theoretical properties, decoder-only architectures have become the mainstream choice for LLM design. For generative tasks in particular, the argument is that bidirectional attention adds little, and that where encoder-decoder models do perform better, the gain often comes from their larger parameter counts. At equivalent parameter counts and inference costs, the decoder-only architecture is therefore generally the preferred choice.

References

Zhihu.com