
Understanding Quantization in Depth: FP32 vs FP16 vs BF16

 
Padmaja_V
HPE Pro

In the world of LLMs and data science, GPUs are indispensable. They excel at high-throughput tasks, processing large batches of data with minimal transfer overhead. But what if your model's size exceeds the memory of a single GPU? Distributing the model across multiple GPUs is one solution, but it is not always cost-effective. For those aiming to optimize costs and fit everything on a single GPU, quantization offers a powerful alternative. This technique reduces the model's memory footprint by lowering the precision of its numerical values: essentially, using fewer bits to represent each number. The trade-off is accuracy, though in practice the compromise is often minor.

Below is a beginner-friendly technical guide with equations, bit-by-bit conversions, and a real data-loss analysis.

Why Do We Need Lower-Precision Formats?

Modern LLMs (Llama 3 70B, Mixtral 8×22B, etc.) easily reach 100–400 GB in FP32. A single high-end GPU (e.g., an H100 with 80 GB) cannot hold them: you either buy 4–8 GPUs or you quantize.

Quantization = converting weights & activations from high-precision (usually FP32) to lower-precision formats (FP16, BF16, INT8, etc.) to save memory and increase speed.
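To see why the memory math forces this choice, here is a rough back-of-the-envelope sketch (weights only; activations and KV cache add more, and the 70B parameter count is just an illustrative example):

```python
# Rough weight-memory footprint of a 70-billion-parameter model at different precisions.
params = 70e9  # illustrative parameter count (e.g., a Llama-3-70B-class model)

for fmt, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{fmt:10s} -> {gb:6.0f} GB")

# FP32       ->    280 GB   (does not fit on one 80 GB GPU)
# FP16/BF16  ->    140 GB   (still does not fit on one GPU)
# INT8       ->     70 GB   (fits, with a small accuracy trade-off)
```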

Here are the three most important floating-point formats in today’s AI world:

  • FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits, bias 127; range up to ~3.4 × 10³⁸, ~7 decimal digits of precision.
  • FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits, bias 15; range up to 65,504, ~3 decimal digits of precision.
  • BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits, bias 127; range up to ~3.4 × 10³⁸, ~2–3 decimal digits of precision.

Key insight:

  • FP16 offers more mantissa precision, but its tiny range (max ≈ 65,504) leads to frequent overflow in training.
  • BF16 sacrifices mantissa precision to keep the same huge range as FP32, so it almost never overflows: perfect for deep learning (see the quick check below).
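
A quick way to see the range difference in practice (a minimal sketch assuming PyTorch is installed; printed formatting may vary slightly by version):

```python
import torch

x = torch.tensor(70000.0)        # fits comfortably in FP32

print(x.to(torch.float16))       # tensor(inf, dtype=torch.float16)    -> overflow past ~65,504
print(x.to(torch.bfloat16))      # tensor(70144., dtype=torch.bfloat16) -> in range, just rounded
```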

The Universal Floating-Point Equation

For normal (non-subnormal, non-special) numbers:

value = (-1)^s × 2^(e - bias) × (1 + f)

where

  • s = sign bit (0 or 1)
  • e = stored exponent (integer)
  • f = fraction = mantissa_integer / 2^mantissa_bits
  • The leading 1 is implicit (the hidden bit), so you get mantissa_bits + 1 significant binary digits.
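
As a sanity check of the equation, here is a small sketch that decodes a raw FP32 bit pattern by hand (0x3DCCCCCD is the standard FP32 encoding of 0.1, which the walk-through below arrives at):

```python
import struct

def decode_fp32(bits: int) -> float:
    """Decode a normal FP32 bit pattern with value = (-1)^s * 2^(e - bias) * (1 + f)."""
    s = (bits >> 31) & 0x1           # sign bit
    e = (bits >> 23) & 0xFF          # 8-bit stored exponent
    m = bits & 0x7FFFFF              # 23-bit fraction field (mantissa_integer)
    assert 0 < e < 255, "normal numbers only (no subnormals, inf, or NaN)"
    f = m / 2**23                    # f = mantissa_integer / 2^mantissa_bits
    return (-1)**s * 2**(e - 127) * (1 + f)   # bias = 127 for FP32

pattern = 0x3DCCCCCD                 # FP32 encoding of 0.1 (round-to-nearest-even)
print(decode_fp32(pattern))          # 0.10000000149011612
# Cross-check against the IEEE interpretation of the same 4 bytes:
print(struct.unpack(">f", pattern.to_bytes(4, "big"))[0])   # same value
```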

10 Real Numbers Converted to FP32, FP16, and BF16

(Values computed with exact IEEE 754 rounding: round-to-nearest-even)

[Table image: ten sample values and their FP32, FP16, and BF16 representations]
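
The exact ten inputs are in the table image above; the sketch below shows how such a table can be reproduced for any sample values (assuming PyTorch is available; the sample list here is illustrative, not the original ten values):

```python
import torch

samples = [0.1, 0.5, 3.14159, 1000.0, 65504.0, 70000.0, 1e-5, 1e-8]   # illustrative inputs

print(f"{'original':>12} {'FP32':>18} {'FP16':>14} {'BF16':>14}")
for v in samples:
    x = torch.tensor(v, dtype=torch.float64)       # start from a high-precision value
    fp32 = x.to(torch.float32).item()              # round-to-nearest-even, as in the table
    fp16 = x.to(torch.float16).item()
    bf16 = x.to(torch.bfloat16).item()
    print(f"{v:>12g} {fp32:>18.10g} {fp16:>14.6g} {bf16:>14.6g}")
```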

Detailed Walk-through of 0.1

0.1 in binary is repeating: 0.0001100110011001100110011… (never ends)

Method: Repeatedly multiply the fractional part by 2.

  • If the result is ≥ 1, the bit is 1; subtract 1 and keep the fractional remainder.
  • If the result is < 1, the bit is 0; keep the value as is.
  • Repeat with the new fractional part.

Starting number: 0.1

Step 1: 0.1 × 2 = 0.2 → bit 0, keep 0.2
Step 2: 0.2 × 2 = 0.4 → bit 0, keep 0.4
Step 3: 0.4 × 2 = 0.8 → bit 0, keep 0.8
Step 4: 0.8 × 2 = 1.6 → bit 1, keep 0.6
Step 5: 0.6 × 2 = 1.2 → bit 1, keep 0.2
Step 6: 0.2 × 2 = 0.4 → bit 0, keep 0.4
Step 7: 0.4 × 2 = 0.8 → bit 0, keep 0.8
Step 8: 0.8 × 2 = 1.6 → bit 1, keep 0.6
Step 9: 0.6 × 2 = 1.2 → bit 1, keep 0.2
…and the cycle 0.2 → 0.4 → 0.8 → 0.6 repeats forever.

The binary fraction is 0.000110011001… (base 2), where the block 1001 repeats forever.
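
The same multiply-by-2 procedure in a few lines of Python (a sketch; frac_to_binary_bits is just an illustrative helper name):

```python
def frac_to_binary_bits(x: float, n_bits: int = 12) -> str:
    """Convert a fraction in [0, 1) to binary by repeatedly multiplying by 2."""
    bits = []
    for _ in range(n_bits):
        x *= 2
        if x >= 1:              # result >= 1 -> bit is 1, subtract 1, keep the remainder
            bits.append("1")
            x -= 1
        else:                   # result < 1  -> bit is 0, keep it as is
            bits.append("0")
    return "0." + "".join(bits)

print(frac_to_binary_bits(0.1))   # 0.000110011001  (the 1001 block keeps repeating)
```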

 

Where Data Loss Actually Happens

[Image: summary of where conversion error comes from]
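
A minimal demonstration of two of the main loss mechanisms, rounding error and underflow (assuming PyTorch; printed formatting may vary by version):

```python
import torch

# Rounding error: 0.1 cannot be stored exactly, and fewer mantissa bits mean a larger error.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    stored = torch.tensor(0.1, dtype=torch.float64).to(dtype).item()
    print(f"{str(dtype):15s} stores 0.1 as {stored!r:22s} abs error = {abs(stored - 0.1):.2e}")

# Underflow: values below a format's smallest representable magnitude collapse to zero.
tiny = torch.tensor(1e-8)
print(tiny.to(torch.float16))    # underflows to 0 in FP16
print(tiny.to(torch.bfloat16))   # still representable in BF16 (same exponent range as FP32)
```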

 

Mixed Precision Training: The Real-World Trick

Pure FP16 training often diverged in 2017–2018. NVIDIA’s Automatic Mixed Precision (AMP) and Google’s BF16 (introduced for TPUs) solved this.

Typical recipe (PyTorch/TensorFlow), with a code sketch at the end of this section:

  1. Store weights and activations in FP16 or BF16: 2× memory saving, 4–8× speed on Tensor Cores.
  2. Keep a FP32 master copy of weights.
  3. Compute gradients in low precision.
  4. Loss scaling (multiply loss by 2¹⁶ or dynamic scaling): prevents gradients from underflowing to zero.
  5. Accumulate gradients and update master weights in FP32: preserves tiny updates.
  6. Copy the updated FP32 weights back to FP16/BF16 for the next forward pass.

Result: near FP32 accuracy at FP16/BF16 speed and memory.
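
A minimal PyTorch sketch of this recipe using automatic mixed precision (assumes a CUDA GPU; the model, data, and hyperparameters are placeholders, and the exact AMP module paths vary slightly across PyTorch versions):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(1024, 10).cuda()                        # FP32 master weights (step 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()                                     # dynamic loss scaling (step 4)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")              # placeholder batch
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):                   # forward pass runs in FP16 (step 1)
        loss = loss_fn(model(x), y)

    scaler.scale(loss).backward()                         # low-precision gradients, scaled (steps 3-4)
    scaler.step(optimizer)                                # unscale, then update FP32 master weights (step 5)
    scaler.update()                                       # adapt the scale factor for the next step
```

With autocast, the weights stay in FP32 and are cast to FP16 on the fly, which covers steps 1 and 6 together; if you train in BF16 instead (dtype=torch.bfloat16), the GradScaler is usually unnecessary because BF16 shares FP32's exponent range.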

 

Padmaja Vaduguru
Padmaja is a Senior Data Scientist with HPE. She is responsible for projects end to end, from pursuit to delivery, and develops Go-To-Market solutions for customers with a variety of use cases and requirements.