Decoding LLM Model Names

1

What does the "B" in a model name actually mean?


The B stands for billion, and the number in front of it is the model's parameter count — the number of learnable values inside the neural network.

3B = 3,000,000,000 parameters
20B = 20,000,000,000 parameters
120B = 120,000,000,000 parameters
1T = 1,000,000,000,000 parameters

Each parameter is a single floating-point number that the model learned during training. More parameters = more "knobs" the model can use to encode patterns from its training data.

Quick rule: The number before B is parameters in billions. M = millions, T = trillions. The figure in a name is usually rounded, so gpt-oss-20b has roughly 20 billion parameters, and Llama-3.1-405B has about 405 billion.
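The quick rule can be sketched in a few lines of Python. This is an illustrative helper, not part of any library: the name `parse_size` and the `SUFFIX` table are assumptions made for this example.

```python
import re

# Multipliers for the size suffixes in the quick rule (M, B, T).
SUFFIX = {"M": 10**6, "B": 10**9, "T": 10**12}

def parse_size(tag: str) -> int:
    """Convert a size tag such as '3B', '20b', or '1T' to a parameter count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MBT])", tag.strip().upper())
    if match is None:
        raise ValueError(f"not a size tag: {tag!r}")
    number, suffix = match.groups()
    return round(float(number) * SUFFIX[suffix])

for tag in ["3B", "20b", "405B", "1T"]:
    print(f"{tag:>5} -> {parse_size(tag):,} parameters")
```

The tag is upper-cased first because real names mix `b` and `B` freely.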


Each chunk of a model name has a meaning. Together they break down a model's identity: family, version, total params, active params, and variant/structure.

Check your understanding

A model labeled llama-3.2-3b has how many parameters?

2

What is a parameter, really?


A neural network is made of layers of "neurons." Each connection between neurons has a number called a weight. Each neuron also has a bias. Weights + biases = parameters.

When the model processes text, it multiplies your input by these weights, adds biases, and passes the result through the next layer. The values of those weights are what was learned during training — they encode everything the model "knows."

A tiny network — every line is one parameter

→

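This toy network: 20 weights (3×4 weights + 4×2 weights = 20). Counting the biases as well (one per neuron in the second and third layers, 4 + 2 = 6) brings the total to 26 parameters.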
A 3B model has 3,000,000,000 of these. A 120B model? 40× more.
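To make the counting concrete, here is a small sketch that tallies weights and biases for a fully connected network. It assumes the toy network has layer sizes 3, 4, and 2 (inferred from the 3×4 and 4×2 terms above) and that every non-input neuron has one bias; `count_parameters` is a name made up for this example.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a fully connected network.

    Assumes every neuron in one layer connects to every neuron in the next,
    and every non-input neuron has exactly one bias.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

weights, biases = count_parameters([3, 4, 2])
print(f"weights={weights}, biases={biases}, total={weights + biases}")
```

The diagram draws only the connections, which is why its count covers the weights alone.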

Why this matters: Every parameter takes memory. At common 16-bit precision (FP16/BF16), each parameter = 2 bytes. So a 7B model needs roughly 14 GB just to hold its weights. A 70B model? 140 GB.

Check your understanding

What does training actually do to these parameters?

3

The "A3B" mystery — Total vs Active parameters


Names like Qwen3-30B-A3B have two numbers. They tell you the model is a Mixture of Experts (MoE).

30B = total parameters (all experts combined)
A3B = active parameters per token (only ~3B run for any given token)

In an MoE model, parts of each layer (typically the feed-forward blocks) are split into many small "expert" sub-networks. For each token of input, a tiny router picks just a few experts to run. The rest stay idle for that token.

Figure: MoE in action. For the token "The", the router decides which experts to use, and only some experts fire. 30B total params are loaded, 3B are active for this token, which is roughly 10× less compute per token than a dense 30B model.

The tradeoff: An MoE model aims for quality closer to a 30B model at roughly the per-token compute of a 3B model, but you still need to load all 30B parameters into VRAM. So it's fast but memory-hungry.

How to read these names:

Name	Total	Active	Type
`Qwen3-30B-A3B`	30B	3B	MoE
`Qwen3-235B-A22B`	235B	22B	MoE
`DeepSeek-V3`	671B	37B	MoE (sizes are not encoded in the name)
`Mixtral-8x7B`	~47B	~13B	MoE (8 experts, 2 active)
`Llama-3.1-70B`	70B	70B	Dense
`gpt-oss-20b`	~21B	~3.6B	MoE (active size is not encoded in the name)
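The split between memory budget and compute budget can be sketched with a few rows from the table. This is a rough estimate under stated assumptions: weights stored at 2 bytes per parameter (FP16/BF16), 1 GB = 10^9 bytes, and no allowance for KV cache or activations. `ModelSize` is a hypothetical helper for this example.

```python
from dataclasses import dataclass

BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored in FP16/BF16

@dataclass
class ModelSize:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    @property
    def weights_gb(self) -> float:
        # Billions of parameters times bytes per parameter gives GB (10^9 bytes).
        return self.total_b * BYTES_PER_PARAM_FP16

    @property
    def compute_ratio(self) -> float:
        # How many times fewer parameters run per token than are loaded.
        return self.total_b / self.active_b

models = [
    ModelSize("Qwen3-30B-A3B", 30, 3),
    ModelSize("Qwen3-235B-A22B", 235, 22),
    ModelSize("Llama-3.1-70B", 70, 70),
]

for m in models:
    print(f"{m.name}: ~{m.weights_gb:g} GB of weights, "
          f"{m.compute_ratio:.1f}x fewer active than total")
```

Memory follows the total count, while per-token compute follows the active count. For a dense model the two are the same.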

Check your understanding

If I run Qwen3-235B-A22B, how much VRAM should I plan for?

4

How much memory does each size actually need?


Memory required ≈ parameter count × bytes-per-parameter. Common formats:

FP32: 4 bytes/param
FP16 / BF16: 2 bytes/param
INT8: 1 byte/param
INT4 (quantized): 0.5 bytes/param
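The formula above is easy to turn into a small calculator. This sketch covers weights only and treats 1 GB as 10^9 bytes; real deployments also need room for the KV cache, activations, and quantization metadata. The names `BYTES_PER_PARAM` and `weight_memory_gb` are made up for this example.

```python
# Bytes per parameter for the formats listed above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, fmt: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[fmt]

for size in (7, 70):
    row = ", ".join(
        f"{fmt}: {weight_memory_gb(size, fmt):g} GB" for fmt in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```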


Why people quantize: Going from FP16 → INT4 cuts weight memory by 4×, usually with a small quality loss. That's how a 70B model, about 140 GB at FP16, shrinks to about 35 GB and fits on a 48 GB GPU.

Check your understanding

At INT4 quantization, roughly how much VRAM does a 7B model need?

5

What FP16, BF16, INT8, INT4 actually mean


Each parameter is a single number. How that number is stored — how many bits it uses and what shape those bits take — determines both memory size and precision.

FP = Floating Point — like scientific notation. Sign + exponent + mantissa.
INT = Integer — whole numbers in a fixed range, no fractions.
The number after = how many bits each parameter takes.


FP32 32 bits · 4 bytes/param

FP16 (half precision) 16 bits · 2 bytes/param

BF16 (brain float) 16 bits · 2 bytes/param

INT8 8 bits · 1 byte/param

INT4 (4-bit quantized) 4 bits · 0.5 bytes/param

Bit fields: sign (±), exponent (range), mantissa (precision). Integer formats store a single integer value instead.

The key insight

Neural network weights are mostly small numbers clustered around zero. You don't need full FP32 precision to represent them — you can store many of them in 4 or 8 bits with surprisingly little quality loss. That's quantization.

FP16 vs BF16: Both are 2 bytes, but they split the bits differently. BF16 keeps FP32's wide range (8 exponent bits) but cuts precision (7 mantissa bits). FP16 keeps more precision (10 mantissa bits) but has a smaller range (5 exponent bits). BF16 is preferred for training because it can hold the wide range of gradient values without overflowing.
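The range-versus-precision tradeoff can be worked out from the bit counts alone. The sketch below assumes IEEE-style encoding, where the all-ones exponent is reserved for infinity and NaN; `max_finite` and `relative_step` are illustrative names, not library functions.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style float with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent encodes inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    max_exponent = bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

def relative_step(mantissa_bits: int) -> float:
    """Spacing between neighboring values just above 1.0."""
    return 2.0 ** -mantissa_bits

for name, exp_bits, man_bits in [("FP16", 5, 10), ("BF16", 8, 7), ("FP32", 8, 23)]:
    print(f"{name}: max ~{max_finite(exp_bits, man_bits):.4g}, "
          f"step near 1.0 = {relative_step(man_bits):.3g}")
```

FP16 tops out at 65504, while BF16 reaches about 3.4 × 10^38, the same order as FP32. In exchange, BF16's steps near 1.0 are eight times coarser than FP16's.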

Why INT4 works: 4 bits = only 16 possible values. But by storing a scale factor per group of weights (e.g., every 32 weights share one FP16 multiplier), you effectively get many more "virtual" values. Modern formats like GPTQ, AWQ, GGUF Q4_K_M use clever grouping to keep quality high.
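The shared-scale idea can be sketched in plain Python. This is a deliberately simple symmetric scheme with one scale per group of 32 weights. It is not the actual GPTQ, AWQ, or Q4_K_M algorithm, which choose scales and groupings more carefully. All function names are made up for this example.

```python
import random

def quantize_group(weights):
    """Map one group of floats to 4-bit signed codes in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate floats from the codes and the shared scale."""
    return [c * scale for c in codes]

def quantize(weights, group_size=32):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]

random.seed(0)
# Assumption: weights are small and clustered around zero.
weights = [random.gauss(0.0, 0.02) for _ in range(128)]

groups = quantize(weights)
restored = [w for codes, scale in groups for w in dequantize_group(codes, scale)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# 4 bits per code plus one 16-bit scale shared by 32 weights.
bits_per_weight = 4 + 16 / 32
print(f"groups={len(groups)}, max error={max_error:.5f}, "
      f"bits/weight={bits_per_weight}")
```

Because each group has its own scale, the 16 code values stretch to fit that group's range. The rounding error stays within half a scale step, at a cost of 4.5 bits per weight instead of 4.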

Check your understanding

Why is BF16 often preferred over FP16 for training large models?

6

Bigger isn't always better — size vs capability


Larger models generally know more facts and reason better — but they're slower, pricier to run, and not always smarter on specific tasks. Here's the practical view:

Tier	Typical hardware	Notes
Small (1B–3B)	Phone	Runs on laptop/mobile. Great for autocomplete, classification, structured tasks.
Mid (7B–13B)	Laptop	Runs on consumer GPUs. Solid general chat, code, summarization.
Large (30B–70B)	Workstation	Needs 1–2 high-end GPUs. Strong reasoning, complex code.
Frontier (100B+)	Data center	Multi-GPU clusters. PhD-level reasoning, multimodal, long context.

The MoE sweet spot

MoE models like Qwen3-30B-A3B hit a unique tradeoff: knowledge of a large dense model, runtime cost of a small one. The catch is VRAM — you still need to load all the experts, even if only a few run per token.

Practical takeaway: Pick the smallest model that does your task well. A well-tuned 7B often beats a generic 70B on narrow jobs, while costing roughly 10× less to serve because it has a tenth of the parameters.

Check your understanding

You want a model for autocomplete in a mobile keyboard app. Best pick?

7

Putting it all together — decoding real model names


Real model names cram a lot in. A typical format looks like this:

`family-version-totalB-AactiveB-variant`

Common variant tags you'll see

Suffix	Meaning
`-Instruct` / `-Chat` / `-it`	Fine-tuned to follow instructions / chat. Use this for assistants.
`-Base`	Raw pretrained model. Predicts next token but doesn't follow instructions well.
`-Code` / `-Coder`	Fine-tuned on programming tasks.
`-VL` / `-Vision`	Vision-language: accepts images too.
`-Thinking` / `-Reasoning` / `-R1`	Optimized for chain-of-thought reasoning.
`-Q4_K_M` / `-GGUF` / `-AWQ` / `-GPTQ`	Quantization format. Smaller, faster, slight quality loss.


Quick mental model: When you see a new name like SomeModel-X.Y-NNNB-AMMB-Instruct:

NNNB = total parameters in billions (memory budget)
AMMB = active parameters in billions (speed/compute budget), present only on MoE models
Instruct = it'll respond to questions, not just continue text
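The mental model can be sketched as a regular expression. This is a minimal sketch that handles only names following the typical family, version, total, active, variant layout. Names that don't fit, such as `gpt-oss-20b` or `Mixtral-8x7B`, return `None`. `NAME_PATTERN` and `decode` are hypothetical names for this example.

```python
import re

# Assumed layout: family[-]version-totalB[-AactiveB][-variant]
NAME_PATTERN = re.compile(
    r"^(?P<family>[A-Za-z]+)"
    r"-?(?P<version>\d+(?:\.\d+)?)"
    r"-(?P<total>\d+(?:\.\d+)?)[Bb]"
    r"(?:-A(?P<active>\d+(?:\.\d+)?)[Bb])?"
    r"(?:-(?P<variant>.+))?$"
)

def decode(name: str):
    """Split a model name into its parts, or return None if it does not fit."""
    m = NAME_PATTERN.match(name)
    if m is None:
        return None
    total = float(m["total"])
    # A dense model has no A-number, so every parameter is active.
    active = float(m["active"]) if m["active"] else total
    return {
        "family": m["family"],
        "version": m["version"],
        "total_b": total,
        "active_b": active,
        "moe": m["active"] is not None,
        "variant": m["variant"],
    }

for name in ["Qwen3-235B-A22B-Thinking-Q4_K_M", "Llama-3.1-70B-Instruct", "llama-3.2-3b"]:
    print(name, "->", decode(name))
```

Everything after the size fields is kept as one `variant` string, since suffixes like `-Thinking-Q4_K_M` can stack in any order.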

Check your understanding

You see Qwen3-235B-A22B-Thinking-Q4_K_M. Which statement is true?


🎓 You've decoded the naming game