Decoding LLM Model Names

What does "3B" mean? Why is it sometimes "120B" and sometimes "235B-A22B"? Learn how to read modern model names like a pro.

1

What does the "B" in a model name actually mean?

The B stands for billion, and the number in front of it is the model's parameter count — the number of learnable values inside the neural network.

  • 3B = 3,000,000,000 parameters
  • 20B = 20,000,000,000 parameters
  • 120B = 120,000,000,000 parameters
  • 1T = 1,000,000,000,000 parameters

Each parameter is a single floating-point number that the model learned during training. More parameters = more "knobs" the model can use to encode patterns from its training data.

Quick rule: The number before B is the parameter count in billions. M = millions, T = trillions. Names are rounded, so gpt-oss-20b has roughly 20 billion parameters and Llama-3.1-405B has roughly 405 billion.
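To make the quick rule concrete, here is a minimal sketch that turns a size label into a raw parameter count. The function name `parse_size` and the `SUFFIX` table are made up for this illustration, and the result is the rounded figure from the name, not an exact count.

```python
import re

# Multipliers for the size suffixes described above.
SUFFIX = {"M": 10**6, "B": 10**9, "T": 10**12}

def parse_size(label: str) -> int:
    """Convert a size label such as '3B', '405b', or '1T' into a parameter count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MBT])", label.strip().upper())
    if match is None:
        raise ValueError(f"not a size label: {label!r}")
    number, suffix = match.groups()
    return round(float(number) * SUFFIX[suffix])

for label in ["3B", "20B", "120B", "1T"]:
    print(f"{label:>5} -> {parse_size(label):,} parameters")
```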
Each part of a model name has a meaning: family, version, total parameters, active parameters, and variant/structure.
Check your understanding
A model labeled llama-3.2-3b has how many parameters?
2

What is a parameter, really?

A neural network is made of layers of "neurons." Each connection between neurons has a number called a weight. Each neuron also has a bias. Weights + biases = parameters.

When the model processes text, it multiplies your input by these weights, adds biases, and passes the result through the next layer. The values of those weights are what was learned during training — they encode everything the model "knows."
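The multiply-then-add step can be sketched in a few lines of plain Python. The weights, biases, and the ReLU nonlinearity below are made-up values and choices for illustration, not taken from any real model.

```python
def dense_layer(inputs, weights, biases):
    """One layer: each output is a weighted sum of the inputs plus a bias."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    """A common nonlinearity applied between layers (an assumption here)."""
    return [max(0.0, v) for v in values]

# Made-up parameters for a 3-input, 2-output layer: 6 weights and 2 biases.
weights = [[0.5, -1.0, 2.0],
           [1.0, 0.0, -0.5]]
biases = [0.1, -0.2]

hidden = relu(dense_layer([1.0, 2.0, 3.0], weights, biases))
print(hidden)
```

Training would adjust the numbers in `weights` and `biases`; the code that uses them stays the same.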

A tiny network — every line is one parameter
This toy network has 20 weights (3×4 + 4×2 = 20), one per connection line. Counting the 6 biases (4 + 2) as well would bring it to 26 parameters.
A 3B model has 3,000,000,000 of these. A 120B model? 40× more.
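Counting parameters for a small fully connected network is simple arithmetic. The sketch below assumes the toy network has layer sizes 3, 4, and 2; `count_parameters` is a hypothetical helper written for this example.

```python
def count_parameters(layer_sizes, include_biases=True):
    """Count weights (and optionally biases) in a fully connected network."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:]) if include_biases else 0
    return weights + biases

toy = [3, 4, 2]  # 3 inputs, 4 hidden neurons, 2 outputs
print(count_parameters(toy, include_biases=False))  # connection weights only
print(count_parameters(toy))                        # weights plus biases
```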
Why this matters: Every parameter takes memory. At common 16-bit precision (FP16/BF16), each parameter = 2 bytes. So a 7B model needs roughly 14 GB just to hold its weights. A 70B model? 140 GB.
Check your understanding
What does training actually do to these parameters?
3

The "A3B" mystery — Total vs Active parameters

Names like Qwen3-30B-A3B have two numbers. They tell you the model is a Mixture of Experts (MoE).

  • 30B = total parameters (all experts combined)
  • A3B = active parameters per token (only ~3B run for any given token)

An MoE model is split into many small "expert" sub-networks. For each token of input, a tiny router picks just a few experts to run. The rest stay idle.
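Here is a toy sketch of that routing step, assuming a simple "top-k" router. Real MoE layers differ in many details, and the experts, scores, and function names below are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_forward(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs."""
    chosen = route(router_scores, k)
    return sum(weight * experts[i](x) for i, weight in chosen.items())

# Eight toy "experts": each just scales its input by a different factor.
experts = [lambda x, f=f: f * x for f in range(1, 9)]
scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router output

print(route(scores))  # experts 1 and 4 are selected
print(moe_forward(1.0, experts, scores))
```

Only two of the eight expert functions are ever called for this token, which is where the compute savings come from.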

MoE in action: only some experts fire per token
For Qwen3-30B-A3B: 30B total parameters loaded, about 3B active per token, so roughly 10× less compute per token than a dense 30B model (the real-world speedup can be smaller).
The tradeoff: An MoE model like this stores 30B parameters' worth of knowledge while running at roughly the per-token compute of a 3B model, but you still need to load all 30B into VRAM. So it's fast but memory-hungry.
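The two budgets can be estimated separately. The sketch below is a rough back-of-the-envelope helper (`moe_budget` is a made-up name); the "about 2 FLOPs per active parameter per token" figure is a common rule of thumb, not an exact cost model.

```python
def moe_budget(total_b, active_b, bytes_per_param=2.0):
    """Rough budget for a model with total_b / active_b billion parameters."""
    weight_gb = total_b * bytes_per_param   # billions of params * bytes each = GB
    active_share = active_b / total_b       # fraction of weights used per token
    # Rule of thumb (an approximation): ~2 FLOPs per active parameter per token.
    gflops_per_token = 2 * active_b
    return {"weight_gb": weight_gb, "active_share": active_share,
            "gflops_per_token": gflops_per_token}

print(moe_budget(30, 3))    # Qwen3-30B-A3B at 16-bit
print(moe_budget(70, 70))   # a dense 70B model for comparison
```

Memory follows the total count, while per-token compute follows the active count.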

How to read these names:

Name | Total | Active | Type
Qwen3-30B-A3B | 30B | 3B | MoE
Qwen3-235B-A22B | 235B | 22B | MoE
DeepSeek-V3-671B-A37B | 671B | 37B | MoE
Mixtral-8x7B | ~47B | ~13B | MoE (8 experts, 2 active)
Llama-3.1-70B | 70B | 70B | Dense
gpt-oss-20b | ~21B | ~3.6B | MoE (no "A" tag in the name)
Check your understanding
If I run Qwen3-235B-A22B, how much VRAM should I plan for?
4

How much memory does each size actually need?

Memory required ≈ parameter count × bytes-per-parameter. Common formats:

  • FP32: 4 bytes/param
  • FP16 / BF16: 2 bytes/param
  • INT8: 1 byte/param
  • INT4 (quantized): 0.5 bytes/param
Why people quantize: Going from FP16 → INT4 cuts memory by 4× (2 bytes → 0.5 bytes per parameter) with a small quality loss. That's how a 70B model's weights (about 35 GB at INT4) fit on a 48 GB GPU instead of needing 140 GB.
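The formula is easy to turn into a small estimator. This is a weights-only sketch using 1 GB = 10^9 bytes; the names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative, and it ignores the KV cache, activations, and the extra scale data that quantized files carry.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions, fmt):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[fmt]

for size in (7, 70):
    row = ", ".join(
        f"{fmt}: {weight_memory_gb(size, fmt):g} GB" for fmt in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

Treat the result as a lower bound when planning hardware.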
Check your understanding
At INT4 quantization, roughly how much VRAM does a 7B model need?
5

What FP16, BF16, INT8, INT4 actually mean

Each parameter is a single number. How that number is stored — how many bits it uses and what shape those bits take — determines both memory size and precision.

  • FP = Floating Point — like scientific notation. Sign + exponent + mantissa.
  • INT = Integer — whole numbers in a fixed range, no fractions.
  • The number after = how many bits each parameter takes.
How each format stores a parameter:
  • FP32: 32 bits · 4 bytes/param
  • FP16 (half precision): 16 bits · 2 bytes/param
  • BF16 (brain float): 16 bits · 2 bytes/param
  • INT8: 8 bits · 1 byte/param
  • INT4 (4-bit quantized): 4 bits · 0.5 bytes/param
Floating-point formats split their bits into sign (±), exponent (range), and mantissa (precision). Integer formats store a single integer value.

The key insight

Neural network weights are mostly small numbers clustered around zero. You don't need full FP32 precision to represent them — you can store many of them in 4 or 8 bits with surprisingly little quality loss. That's quantization.

FP16 vs BF16: Both are 2 bytes, but they split the bits differently. BF16 keeps FP32's wide range (8 exponent bits) but cuts precision (7 mantissa bits). FP16 keeps more precision (10 mantissa bits) but has a smaller range (5 exponent bits). BF16 is preferred for training because it can hold the wide range of gradient values without overflowing.
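The range-versus-precision difference can be seen with Python's standard `struct` module. This is a sketch under one stated simplification: BF16 is emulated by truncating an FP32 value to its top 16 bits, whereas real hardware typically rounds. The helper names are made up.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding.

    This truncates instead of rounding, which is a simplification.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_fp16(0.1), to_bf16(0.1))  # FP16 lands closer to 0.1 than BF16 does

try:
    to_fp16(70000.0)               # above FP16's largest finite value (65504)
except OverflowError as err:
    print("FP16 overflow:", err)

print(to_bf16(70000.0))            # representable in BF16, but coarsely
```

FP16 wins on small values and fails on large ones, while BF16 handles both at lower precision.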
Why INT4 works: 4 bits = only 16 possible values. But by storing a scale factor per group of weights (e.g., every 32 weights share one FP16 multiplier), you effectively get many more "virtual" values. Modern formats like GPTQ, AWQ, GGUF Q4_K_M use clever grouping to keep quality high.
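A minimal sketch of the grouping idea follows. It uses a plain symmetric scheme (integer codes from -7 to 7 with one scale per group of 32), which is a deliberate simplification and not the actual GPTQ, AWQ, or Q4_K_M algorithm. All names and weights are made up.

```python
def quantize_group(weights):
    """Map one group of floats to signed 4-bit integers plus a shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # use the symmetric range -7..7
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]

def quantize(weights, group_size=32):
    """Quantize a flat weight list group by group, each with its own scale."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g) for g in groups]

# Made-up weights: one group of small values, one group of larger ones.
weights = [0.01 * i for i in range(-16, 16)] + [0.5 * i for i in range(-16, 16)]
restored = [
    w for codes, scale in quantize(weights) for w in dequantize_group(codes, scale)
]
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"largest round-trip error: {worst:.4f}")
```

Because each group gets its own scale, the group of small weights keeps fine resolution even though the other group spans a much wider range.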
Check your understanding
Why is BF16 often preferred over FP16 for training large models?
6

Bigger isn't always better — size vs capability

Larger models generally know more facts and reason better — but they're slower, pricier to run, and not always smarter on specific tasks. Here's the practical view:

  • Small (1B–3B) · Phone: Runs on laptop/mobile. Great for autocomplete, classification, structured tasks.
  • Mid (7B–13B) · Laptop: Runs on consumer GPUs. Solid general chat, code, summarization.
  • Large (30B–70B) · Workstation: Needs 1–2 high-end GPUs. Strong reasoning, complex code.
  • Frontier (100B+) · Data center: Multi-GPU clusters. PhD-level reasoning, multimodal, long context.

The MoE sweet spot

MoE models like Qwen3-30B-A3B hit a unique tradeoff: knowledge of a large dense model, runtime cost of a small one. The catch is VRAM — you still need to load all the experts, even if only a few run per token.

Practical takeaway: Pick the smallest model that does your task well. A well-tuned 7B can match or beat a generic 70B on narrow jobs, while costing roughly 10× less to serve.
Check your understanding
You want a model for autocomplete in a mobile keyboard app. Best pick?
7

Putting it all together — decoding real model names

Real model names cram a lot in. A typical format looks like this:

family - version - totalB - AactiveB - variant

Common variant tags you'll see

Suffix | Meaning
-Instruct / -Chat / -it | Fine-tuned to follow instructions / chat. Use this for assistants.
-Base | Raw pretrained model. Predicts next token but doesn't follow instructions well.
-Code / -Coder | Fine-tuned on programming tasks.
-VL / -Vision | Vision-language: accepts images too.
-Thinking / -Reasoning / -R1 | Optimized for chain-of-thought reasoning.
-Q4_K_M / -GGUF / -AWQ / -GPTQ | Quantization format. Smaller, faster, slight quality loss.
Quick mental model: When you see a new name like SomeModel-X.Y-NNN-AMMb-Instruct:
  • NNN = total billions (memory budget)
  • AMMb = active billions (speed/compute budget) — only on MoE
  • Instruct = it'll respond to questions, not just continue text
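The mental model above can be sketched as a small parser. It assumes names are split by "-" and follow the pattern described in this guide; the function `decode` and its output keys are invented for illustration. Names that break the pattern (for example Mixtral-8x7B) will not be fully decoded, and a missing "A" tag does not prove a model is dense.

```python
import re

SIZE = re.compile(r"^(\d+(?:\.\d+)?)([MBT])$", re.IGNORECASE)
ACTIVE = re.compile(r"^A(\d+(?:\.\d+)?)B$", re.IGNORECASE)

def decode(name: str) -> dict:
    """Split a model name on '-' and label the size-related parts."""
    info = {"total": None, "active": None, "other": []}
    for part in name.split("-"):
        if ACTIVE.match(part):
            info["active"] = part[1:].upper()   # drop the leading 'A'
        elif SIZE.match(part) and info["total"] is None:
            info["total"] = part.upper()
        else:
            info["other"].append(part)          # family, version, variant tags
    info["has_active_tag"] = info["active"] is not None
    return info

print(decode("Qwen3-235B-A22B-Thinking-Q4_K_M"))
print(decode("Llama-3.1-70B"))
```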
Check your understanding
You see Qwen3-235B-A22B-Thinking-Q4_K_M. Which statement is true?

🎓 You've decoded the naming game

Next time you see Qwen3-30B-A3B or DeepSeek-V3-671B-A37B, you'll have a good idea of what hardware it needs and how it'll perform.

Learning Reference · LLM Model Names Decoded: Parameters, Quantization & Formats — StarMorph