Understanding Quantization in Depth: FP32 vs FP16 vs BF16
In the world of LLMs and data science, GPUs are indispensable. They excel at high-throughput tasks, processing large batches of data with minimal transfer overhead. But what if your model's size exceeds the memory of a single GPU? Distributing the model across multiple GPUs is one solution, but it is not always cost-effective. For those aiming to optimize costs and fit everything on a single GPU, quantization offers a powerful alternative. This technique reduces the model's memory footprint by lowering the precision of its numerical values: essentially, using fewer bits to represent numbers. However, this comes with trade-offs, particularly in accuracy.
Below is a beginner-friendly technical guide with equations, bit-by-bit conversions, and real data-loss analysis.
Why Do We Need Lower-Precision Formats?
Modern LLMs (Llama 3 70B, Mixtral 8×22B, etc.) easily reach 100–400 GB or more in FP32. A single high-end GPU (e.g., an H100 with 80 GB) cannot hold them: you either buy 4–8 GPUs or you quantize.
Quantization = converting weights & activations from high-precision (usually FP32) to lower-precision formats (FP16, BF16, INT8, etc.) to save memory and increase speed.
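As a quick sanity check on the memory math: a 70-billion-parameter model at 4 bytes per FP32 weight needs roughly 70 × 10⁹ × 4 B ≈ 280 GB for the weights alone; at 2 bytes per weight (FP16/BF16) that drops to about 140 GB, and at 1 byte (INT8) to about 70 GB.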
Here are the three most important floating-point formats in today’s AI world:
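In brief (these are the standard IEEE 754 and bfloat16 bit layouts):
- FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits; range up to ~3.4 × 10³⁸, about 7 decimal digits of precision.
- FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits; maximum value 65,504, about 3 decimal digits of precision.
- BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits; same range as FP32 (~3.4 × 10³⁸), about 2–3 decimal digits of precision.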
Key insight:
- FP16 offers more mantissa precision, but its small exponent range (maximum value 65,504) leads to frequent overflow in training.
- BF16 sacrifices mantissa precision for the same huge range as FP32, so it almost never overflows, which makes it ideal for deep learning.
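To see the range difference concretely, here is a minimal PyTorch check (the value 70,000 is just an illustration, chosen because it sits slightly above FP16's maximum of 65,504):

```python
import torch

big = torch.tensor(70000.0)      # just above FP16's maximum representable value (65504)
print(big.to(torch.float16))     # overflows to inf in FP16
print(big.to(torch.bfloat16))    # stays finite in BF16 (rounded to a nearby representable value)
```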
The Universal Floating-Point Equation
For normal (non-subnormal, non-special) numbers:
value = (-1)^s × (1 + f) × 2^(e - bias)
where
- s = sign bit (0 or 1)
- e = stored exponent (an integer); the exponent bias is 127 for FP32 and BF16, and 15 for FP16
- f = fraction = mantissa_integer / 2^mantissa_bits
- The leading 1 is implicit (the hidden bit), so you get mantissa_bits + 1 significant binary digits.
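To make the equation concrete, here is a small Python sketch (the helper name decode_fp32 is just for illustration) that pulls the sign, exponent, and fraction bits out of an FP32 value and rebuilds it with the formula above:

```python
import struct

def decode_fp32(x):
    # Reinterpret the 32-bit pattern of x as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = (bits >> 31) & 0x1         # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits (bias = 127 for FP32)
    m = bits & 0x7FFFFF            # 23 fraction bits
    f = m / 2**23
    value = (-1) ** s * (1 + f) * 2.0 ** (e - 127)   # normal numbers only
    return s, e, f, value

print(decode_fp32(0.1))
# -> roughly (0, 123, 0.60000002..., 0.10000000149...)
```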
10 Real Numbers Converted to FP32, FP16, and BF16
(Values computed with exact IEEE 754 rounding: round-to-nearest-even)
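Conversions like these are easy to reproduce with a few lines of PyTorch; the sample values below are illustrative rather than the exact ten from the table:

```python
import torch

samples = [0.1, 0.5, 1.0, 3.14159, 1e-8, 65504.0, 123456.789]

for x in samples:
    fp32 = torch.tensor(x, dtype=torch.float32)
    fp16 = fp32.to(torch.float16)    # may underflow to 0 or overflow to inf
    bf16 = fp32.to(torch.bfloat16)
    print(f"{x:>12} | FP32 {fp32.item():.10g} | "
          f"FP16 {fp16.float().item():.10g} | BF16 {bf16.float().item():.10g}")
```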
Detailed Walk-through of 0.1
0.1 in binary is repeating: 0.0001100110011001100110011… (never ends)
Method: Repeatedly multiply the fractional part by 2.
- If the result is ≥ 1, the bit is 1; subtract 1 and keep the fractional remainder.
- If the result is < 1, the bit is 0; keep it as is.
- Repeat with the new fractional part.
Starting number: 0.1
The binary fraction is 0.000110011001… (base 2), where the pattern 1001 repeats forever.
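The same multiply-by-2 procedure can be scripted; this small sketch (the helper name is hypothetical) prints the first 26 bits of 0.1's binary expansion:

```python
def fraction_to_binary(x, bits=26):
    # Repeated multiply-by-2 method from the steps above
    digits = []
    frac = x
    for _ in range(bits):
        frac *= 2
        if frac >= 1:
            digits.append("1")
            frac -= 1
        else:
            digits.append("0")
    return "0." + "".join(digits)

print(fraction_to_binary(0.1))
# 0.00011001100110011001100110  (the 1001 pattern keeps repeating)
```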
Where Data Loss Actually Happens
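One place the loss shows up is mantissa rounding: the fewer fraction bits a format has, the further the stored value sits from the exact one. A quick PyTorch check for 0.1:

```python
import torch

x32 = torch.tensor(0.1, dtype=torch.float32)
for name, v in [("FP32", x32),
                ("FP16", x32.to(torch.float16)),
                ("BF16", x32.to(torch.bfloat16))]:
    stored = v.double().item()     # the exact value actually stored
    print(f"{name}: stored {stored:.12f}   abs error {abs(stored - 0.1):.2e}")

# FP32: stored 0.100000001490   abs error 1.49e-09
# FP16: stored 0.099975585938   abs error 2.44e-05
# BF16: stored 0.100097656250   abs error 9.77e-05
```

The other place is range: values above a format's maximum overflow to infinity (as in the FP16 example earlier), and values below its smallest representable magnitude underflow to zero.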
Mixed Precision Training: The Real-World Trick
Pure FP16 training often diverged in 2017–2018. NVIDIA’s Automatic Mixed Precision (AMP) and Google’s BF16 solved it.
Typical combination (PyTorch/TensorFlow); a code sketch follows below:
- Store weights and activations in FP16 or BF16: 2× memory saving, 4–8× speed on Tensor Cores.
- Keep a FP32 master copy of weights.
- Compute gradients in low precision.
- Loss scaling (multiply loss by 2¹⁶ or dynamic scaling): prevents gradients from underflowing to zero.
- Accumulate gradients and update master weights in FP32: preserves tiny updates.
- Copy updated FP32 weights back to FP16/BF16 for next forward pass.
Result: near FP32 accuracy at FP16/BF16 speed and memory.
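A minimal PyTorch sketch of this recipe using torch.cuda.amp; the model, data, and hyperparameters are placeholders, and with autocast the FP32 parameters themselves serve as the master copy while the matmuls run in FP16:

```python
import torch
from torch import nn

model = nn.Linear(1024, 10).cuda()                        # parameters stay in FP32 (master copy)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                      # dynamic loss scaling

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):    # forward pass runs in FP16
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()   # scale the loss so tiny gradients don't underflow to zero
    scaler.step(optimizer)          # unscale gradients, then update the FP32 weights
    scaler.update()                 # adjust the scale factor dynamically
```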
Padmaja is a Senior Data Scientist with HPE. She is responsible for projects end to end, from pursuit to delivery, and also develops Go-To-Market solutions for customers with a variety of use cases and requirements.
- Tags:
- memory