What does "3B" mean? Why is it sometimes "120B" and sometimes "235B-A22B"? Learn how to read modern model names like a pro.
The B stands for billion, and the number in front of it
is the model's parameter count — the number of learnable values
inside the neural network.
Each parameter is a single floating-point number that the model learned during training. More parameters = more "knobs" the model can use to encode patterns from its training data.
The same convention covers other sizes:
M = millions, T = trillions. So gpt-oss-20b has roughly 20 billion
parameters, and Llama-3.1-405B has 405 billion.
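To make the suffix rule concrete, here is a minimal sketch that turns a size token into a raw parameter count. The function name `parse_size` and the multiplier table are illustrative choices, not part of any library.

```python
import re

# Multipliers for the size suffixes described above.
SUFFIX_MULTIPLIER = {"M": 10**6, "B": 10**9, "T": 10**12}

def parse_size(token: str) -> int:
    """Convert a size token such as '20b' or '405B' into a parameter count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MBT])", token.strip().upper())
    if match is None:
        raise ValueError(f"not a size token: {token!r}")
    number, suffix = match.groups()
    return round(float(number) * SUFFIX_MULTIPLIER[suffix])

print(parse_size("20b"))   # size token taken from gpt-oss-20b
print(parse_size("405B"))  # size token taken from Llama-3.1-405B
```

Keep in mind that the number in a name is usually rounded, so the true count can differ by a few percent.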
A neural network is made of layers of "neurons." Each connection between neurons has a number called a weight. Each neuron also has a bias. Weights + biases = parameters.
When the model processes text, it multiplies your input by these weights, adds biases, and passes the result through the next layer. The values of those weights are what was learned during training — they encode everything the model "knows."
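The multiply-and-add step can be sketched in a few lines of plain Python. This toy layer has 3 inputs and 2 output neurons, and the weight values are made up for illustration. Real layers also apply a non-linear activation and use optimized tensor libraries, both of which are left out here.

```python
def dense_layer(inputs, weights, biases):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]

def count_parameters(n_inputs, n_outputs):
    """One weight per connection plus one bias per output neuron."""
    return n_inputs * n_outputs + n_outputs

# Illustrative values only: 3 inputs feeding 2 neurons.
weights = [[0.5, -1.0],
           [0.25, 2.0],
           [1.0, 0.0]]
biases = [0.5, -0.5]

print(dense_layer([1.0, 2.0, 3.0], weights, biases))  # [4.5, 2.5]
print(count_parameters(3, 2))                         # 8
```

Scale the same counting rule up to thousands of neurons per layer and dozens of layers, and the totals reach the billions.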
Names like Qwen3-30B-A3B have two numbers. They tell you the model
is a Mixture of Experts (MoE).
An MoE model is split into many small "expert" sub-networks. For each token of input, a tiny router picks just a few experts to run. The rest stay idle.
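The routing idea can be sketched as follows. This is a toy under stated assumptions: the "experts" are simple scaling functions, the router scores are hard-coded rather than computed from the token, and the top-k-then-renormalize scheme is one common variant. Production routers differ in detail.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs."""
    chosen = route(router_scores, k)
    return sum(weight * experts[i](x) for i, weight in chosen.items())

# Eight toy experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, f=f: f * x for f in range(1, 9)]
scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # hypothetical router output

print(sorted(route(scores, k=2)))         # indices of the experts that run
print(moe_forward(1.0, experts, scores))  # the other six experts are never called
```

Only 2 of the 8 experts do any work for this token, which is why the active parameter count is much smaller than the total.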
| Name | Total | Active | Type |
|---|---|---|---|
| Qwen3-30B-A3B | 30B | 3B | MoE |
| Qwen3-235B-A22B | 235B | 22B | MoE |
| DeepSeek-V3-671B-A37B | 671B | 37B | MoE |
| Mixtral-8x7B | ~47B | ~13B | MoE (8 experts, 2 active) |
| Llama-3.1-70B | 70B | 70B | Dense |
| gpt-oss-20b | ~21B | ~3.6B | MoE (name shows only the total) |
Memory required ≈ parameter count × bytes per parameter. Common formats: FP32 (4 bytes), FP16/BF16 (2 bytes), INT8 (1 byte), and 4-bit (0.5 bytes).
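As a rough worked example, the sketch below applies that formula to a 235B-parameter model such as Qwen3-235B-A22B. The byte sizes are the nominal ones for each format, and the result covers weights only. The KV cache, activations, and runtime overhead come on top, and practical 4-bit formats typically store slightly more than 4 bits per weight.

```python
# Nominal storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(total_params_billions, fmt):
    """Approximate weight memory in GB (1 GB = 10**9 bytes).

    Billions of parameters times bytes per parameter gives gigabytes directly.
    """
    return total_params_billions * BYTES_PER_PARAM[fmt]

# Use the TOTAL parameter count (235), not the active count (22).
for fmt in ("fp16", "int8", "int4"):
    print(fmt, weight_memory_gb(235, fmt), "GB")
```

At FP16 the weights alone come to about 470 GB, which is why large models are almost always run quantized or split across several GPUs.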
Each parameter is a single number. How that number is stored — how many bits it uses and what shape those bits take — determines both memory size and precision.
Neural network weights are mostly small numbers clustered around zero. You don't need full FP32 precision to represent them — you can store many of them in 4 or 8 bits with surprisingly little quality loss. That's quantization.
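Here is a minimal sketch of the simplest form of the idea: symmetric 8-bit quantization with a single scale for the whole list of weights. The weight values are invented for illustration. Real schemes (such as the K-quants in GGUF files, AWQ, or GPTQ) use per-block scales and more careful rounding, so treat this as the concept rather than any particular format.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.0213, -0.13, 0.254, -0.0047, 0.0705]  # small values clustered near zero
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)             # each value now fits in one byte
print(scale)                 # one extra float stored alongside the integers
print(max_error <= scale / 2 + 1e-12)  # rounding error is at most half a step
```

Each weight drops from 4 bytes (FP32) to 1 byte, and the worst-case error is bounded by half the step size.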
Larger models generally know more facts and reason better — but they're slower, pricier to run, and not always smarter on specific tasks. Here's the practical view:
MoE models like Qwen3-30B-A3B strike an unusual tradeoff: knowledge capacity closer to a large
dense model, per-token compute closer to a small one. The catch is VRAM: you still need to load
all the experts, even if only a few run per token.
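The tradeoff can be put into numbers with a back-of-the-envelope sketch. It assumes 2 bytes per parameter (FP16/BF16) for memory and the common rule of thumb of about 2 floating-point operations per active parameter per generated token. Both are approximations, and the two dense models are hypothetical comparison points.

```python
def footprint(total_b, active_b, bytes_per_param=2.0):
    """Return (weight memory in GB, rough forward-pass GFLOPs per token)."""
    memory_gb = total_b * bytes_per_param  # every expert must be loaded
    gflops_per_token = 2 * active_b        # only active parameters do work
    return memory_gb, gflops_per_token

models = {
    "Qwen3-30B-A3B (MoE)": (30, 3),
    "30B dense (hypothetical)": (30, 30),
    "3B dense (hypothetical)": (3, 3),
}
for name, (total_b, active_b) in models.items():
    memory_gb, gflops = footprint(total_b, active_b)
    print(f"{name}: ~{memory_gb:.0f} GB weights, ~{gflops} GFLOPs/token")
```

The MoE row matches the 30B dense model on memory and the 3B dense model on compute, which is the tradeoff in a nutshell.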
Real model names cram a lot in. A typical format looks like this:
| Suffix | Meaning |
|---|---|
| -Instruct / -Chat / -it | Fine-tuned to follow instructions / chat. Use this for assistants. |
| -Base | Raw pretrained model. Predicts next token but doesn't follow instructions well. |
| -Code / -Coder | Fine-tuned on programming tasks. |
| -VL / -Vision | Vision-language: accepts images too. |
| -Thinking / -Reasoning / -R1 | Optimized for chain-of-thought reasoning. |
| -Q4_K_M / -GGUF / -AWQ / -GPTQ | Quantization format. Smaller, faster, slight quality loss. |
SomeModel-X.Y-NNN-AMMb-Instruct:
- NNN = total billions (memory budget)
- AMMb = active billions (speed/compute budget), only on MoE
- Instruct = it'll respond to questions, not just continue text

Next time you see Qwen3-30B-A3B or DeepSeek-V3-671B-A37B,
you'll have a good idea of what hardware it needs and how it'll perform.
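To tie the pieces together, here is a small sketch that pulls the total size, active size, variant, and quantization tag out of a hyphen-separated name. The function `parse_model_name` and its regular expressions are hypothetical helpers written for this article. Naming is a convention rather than a standard, so names like Mixtral-8x7B will not parse, and a missing A-part does not prove a model is dense.

```python
import re

SIZE = re.compile(r"^(\d+(?:\.\d+)?)B$", re.IGNORECASE)     # e.g. 235B, 3b
ACTIVE = re.compile(r"^A(\d+(?:\.\d+)?)B$", re.IGNORECASE)  # e.g. A22B
QUANT = re.compile(r"^(Q\d.*|GGUF|AWQ|GPTQ)$", re.IGNORECASE)
VARIANTS = {"instruct", "chat", "it", "base", "code", "coder",
            "vl", "vision", "thinking", "reasoning", "r1"}

def parse_model_name(name):
    """Best-effort split of a model name into the fields described above."""
    info = {"total_b": None, "active_b": None, "variants": [], "quant": None}
    for part in name.split("-"):
        if m := ACTIVE.match(part):
            info["active_b"] = float(m.group(1))
        elif m := SIZE.match(part):
            info["total_b"] = float(m.group(1))
        elif QUANT.match(part):
            info["quant"] = part
        elif part.lower() in VARIANTS:
            info["variants"].append(part)
    # True only when the name itself advertises an active-parameter count.
    info["declares_moe"] = info["active_b"] is not None
    return info

print(parse_model_name("Qwen3-235B-A22B-Thinking-Q4_K_M"))
print(parse_model_name("Llama-3.1-70B-Instruct"))
```

For the first name, the parser reports 235 total billions (the memory budget), 22 active billions (the compute budget), a reasoning variant, and 4-bit quantization.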