A living reference table comparing the internal architectural choices of large language models, from the original Transformer (2017) through the latest frontier models. Ordered newest to oldest.
Each model row covers:
| Column | What it captures |
|---|---|
| Norm | LayerNorm vs RMSNorm |
| Parallel Layer | Attention + FFN in parallel vs. serial |
| Pre-norm | Norm before sub-layer (Pre) / after (Post) / both |
| Pos. Embedding | Sine / Absolute / Relative / RoPE / ALiBi / NoPE / iRoPE |
| Activation | ReLU / GeLU / SwiGLU / GeGLU / SqReLU |
| Attn. Type | MHA / MQA / GQA / MLA / SWA |
| Context Len. | Native and extended token limits |
| MoE | Dense or Sparse (total / active params, expert count) |
| Bias Terms | Bias in attention projections and/or norms |
| Tied Emb. | Input embedding tied to output projection |
| QK-Norm | RMSNorm on Q and K inside attention |
| Sliding Window | Local sliding-window attention in some layers |
| Stability Tricks | Z-loss, multi-token prediction (MTP), logit soft-capping, etc. |
| Ref | Primary paper / technical report link |
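As a quick illustration of the "Norm" column, the sketch below contrasts the two options on a single activation vector. This is a minimal, hypothetical implementation (learned scale and bias parameters omitted), not taken from any model's actual code: LayerNorm centers by the mean and divides by the standard deviation, while RMSNorm skips mean subtraction and divides by the root mean square only.

```python
import math

def layer_norm(x, eps=1e-5):
    # LayerNorm: subtract the mean, then divide by the standard deviation.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    # RMSNorm: no mean subtraction; divide by the root mean square alone.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))  # zero-mean, unit-variance output
print(rms_norm(x))    # unit-RMS output, mean left untouched
```

RMSNorm drops one reduction (the mean) per normalization, which is part of why most post-2020 models in the table use it in place of LayerNorm.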
- `table.md` – the main comparison table
- `MAINTENANCE.md` – how to add or update models
- Original figure: Harm de Vries (2017–2024 models)
- Extended coverage: Sebastian Raschka, The Big LLM Architecture Comparison (updated Mar 6, 2026)
- Individual technical reports and HuggingFace model configs for each model