<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Understanding Quantization in Depth: FP32 vs FP16 vs BF16 in High Performance Computing</title>
    <link>https://community.hpe.com/t5/high-performance-computing/understanding-quantization-in-depth-fp32-vs-fp16-vs-bf16/m-p/7259287#M357</link>
    <description>&lt;P&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;A beginner-friendly guide to quantization for LLMs: how FP32, FP16, and BF16 represent numbers, a bit-by-bit conversion of 0.1, where data loss actually happens, and how mixed-precision training recovers accuracy.&lt;/FONT&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 01 Dec 2025 16:38:04 GMT</pubDate>
    <dc:creator>Padmaja_V</dc:creator>
    <dc:date>2025-12-01T16:38:04Z</dc:date>
    <item>
      <title>Understanding Quantization in Depth: FP32 vs FP16 vs BF16</title>
      <link>https://community.hpe.com/t5/high-performance-computing/understanding-quantization-in-depth-fp32-vs-fp16-vs-bf16/m-p/7259287#M357</link>
      <description>&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;In the world of LLMs and data science, GPUs are indispensable. They excel at high-throughput tasks, processing large batches of data with minimal transfer overhead. But what if your model exceeds the memory of a single GPU? Distributing the model across multiple GPUs is one solution, but it is not always cost-effective. For those aiming to optimize costs and fit everything on a single GPU, quantization offers a powerful alternative. This technique reduces the model's memory footprint by lowering the precision of its numerical values: essentially, using fewer bits to represent each number. The trade-off is a potential loss of accuracy.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Below is a beginner-friendly technical guide with equations, bit-by-bit conversions, and real data-loss analysis.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Why Do We Need Lower-Precision Formats?&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Modern LLMs (Llama 3 70B, Mixtral 8×22B, etc.) easily exceed 100–400 GB in FP32. A single high-end GPU (e.g., an H100 with 80 GB) cannot hold them: you either buy 4–8 GPUs or quantize.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Quantization = converting weights &amp;amp; activations from high-precision (usually FP32) to lower-precision formats (FP16, BF16, INT8, etc.) to save memory and increase speed.&lt;/FONT&gt;&lt;/P&gt;
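&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;To make the memory pressure concrete, here is a quick back-of-the-envelope sketch in Python (the 70B parameter count is illustrative, and only raw weight storage is counted):&lt;/FONT&gt;&lt;/P&gt;&lt;PRE&gt;
def model_size_gib(n_params, bytes_per_param):
    """Raw weight storage only; activations and optimizer state are extra."""
    return n_params * bytes_per_param / 2**30

for fmt, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"70B params in {fmt}: {model_size_gib(70e9, nbytes):.0f} GiB")

# FP32: ~261 GiB, FP16/BF16: ~130 GiB, INT8: ~65 GiB.
# Of these, only the INT8 copy fits in a single 80 GB GPU.
&lt;/PRE&gt;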
&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Here are the three most important floating-point formats in today's AI world:&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 1474px;"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153130iFB536FEC0EBD38CB/image-size/large?v=v2&amp;amp;px=2000" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;EM&gt;Key insight:&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;FP16 offers more fraction (mantissa) precision than BF16, but its narrow exponent range (maximum ~65,504) leads to frequent overflow during training.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;BF16 sacrifices mantissa precision for the same huge exponent range as FP32, so it almost never overflows: a good fit for deep learning.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;The Universal Floating-Point Equation&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;For normal (non-subnormal, non-special) numbers:&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;value = (-1)^s × 2^(e - bias) × (1.f)&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;where&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;s = sign bit (0 or 1)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;e = stored exponent (an integer)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;f = fraction = mantissa_integer / 2^mantissa_bits&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;The leading 1 is implicit (the hidden bit), so you get mantissa_bits + 1 significant binary digits.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;
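&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;This equation is easy to check in code. Below is a minimal Python sketch that decodes a normal-number bit pattern with exactly this formula; the FP16 parameters (5 exponent bits, 10 fraction bits, bias 15) and the bit pattern for 0.1 are standard IEEE 754 values:&lt;/FONT&gt;&lt;/P&gt;&lt;PRE&gt;
def decode(bits, exp_bits, frac_bits, bias):
    """value = (-1)^s * 2^(e - bias) * (1.f), for normal numbers only."""
    s = bits // 2 ** (exp_bits + frac_bits)       # sign bit
    e = (bits // 2 ** frac_bits) % 2 ** exp_bits  # stored exponent
    f = bits % 2 ** frac_bits                     # mantissa as an integer
    return (-1) ** s * 2.0 ** (e - bias) * (1 + f / 2 ** frac_bits)

# FP16 bit pattern of 0.1: sign 0, exponent 01011, fraction 1001100110
print(decode(0b0_01011_1001100110, exp_bits=5, frac_bits=10, bias=15))
# 0.0999755859375 -- the FP16 rounding error analyzed below
&lt;/PRE&gt;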
&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;10 Real Numbers Converted to FP32, FP16, and BF16&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;(Values computed with exact IEEE 754 rounding: round-to-nearest-even.)&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="Original.png"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153156i6608CD62893D8C20/image-size/large?v=v2&amp;amp;px=2000" role="button" title="Original.png" alt="Original.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Detailed Walk-through of 0.1&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;0.1 in binary is a repeating fraction: 0.0001100110011001100110011… (it never terminates)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Method: repeatedly multiply the fractional part by 2.&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;If the result ≥ 1, the bit is 1; subtract 1 and keep the fractional remainder.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;If the result &amp;lt; 1, the bit is 0; keep the value as is.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Repeat with the new fractional part.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Starting number: 0.1&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="Step.png" style="width: 400px;"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153157i34FF72934A2CE8E0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Step.png" alt="Step.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-center" style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;EM&gt;Binary fraction: 0.000110011001… (base 2), where the block 1001 repeats forever&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Where Data Loss Actually Happens&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="SourceOfError.png" style="width: 966px;"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153158iF00E4699A2712DB0/image-size/large?v=v2&amp;amp;px=2000" role="button" title="SourceOfError.png" alt="SourceOfError.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Mixed Precision Training: The Real-World Trick&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Pure FP16 training often diverged in 2017–2018. NVIDIA's Automatic Mixed Precision (AMP) and Google's BF16 solved it.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Typical combination (PyTorch/TensorFlow), with a minimal sketch after the list:&lt;/FONT&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Store weights and activations in FP16 or BF16: 2× memory saving, 4–8× speed on Tensor Cores.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Keep an FP32 master copy of the weights.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Compute gradients in low precision.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Loss scaling (multiply the loss by 2¹⁶, or use dynamic scaling): prevents gradients from underflowing to zero.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Accumulate gradients and update the master weights in FP32: preserves tiny updates.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Copy the updated FP32 weights back to FP16/BF16 for the next forward pass.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/OL&gt;
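&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;A minimal PyTorch sketch of this recipe using the torch.cuda.amp API; the toy model, data, and step count are placeholders:&lt;/FONT&gt;&lt;/P&gt;&lt;PRE&gt;
import torch
from torch.cuda.amp import GradScaler, autocast

device = "cuda"                                  # AMP targets CUDA Tensor Cores
model = torch.nn.Linear(1024, 1024).to(device)   # toy stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()                            # dynamic loss scaling (step 4)

x = torch.randn(32, 1024, device=device)         # placeholder batch
y = torch.randn(32, 1024, device=device)

for _ in range(10):
    optimizer.zero_grad()
    with autocast():                             # forward pass in FP16 where safe (step 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                # backprop the scaled loss (steps 3-4)
    scaler.step(optimizer)                       # unscale, update FP32 master weights (step 5)
    scaler.update()                              # grow/shrink the scale factor as needed
&lt;/PRE&gt;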
&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Result: near-FP32 accuracy at FP16/BF16 speed and memory.&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Dec 2025 16:38:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/high-performance-computing/understanding-quantization-in-depth-fp32-vs-fp16-vs-bf16/m-p/7259287#M357</guid>
      <dc:creator>Padmaja_V</dc:creator>
      <dc:date>2025-12-01T16:38:04Z</dc:date>
    </item>
  </channel>
</rss>

