Understanding Quantization in Depth: FP32 vs FP16 vs BF16
In the world of LLMs and data science, GPUs are indispensable. They excel at high-throughput tasks, processing large batches of data with minimal transfer overhead. But what if your model's size exceeds the memory of a single GPU? Distributing the model across multiple GPUs is one solution, but it is not always cost-effective. For those aiming to optimize costs and fit everything on a single GPU, quantization offers a powerful alternative. This technique reduces the model's memory footprint by lowering the precision of its numerical values: essentially, using fewer bits to represent numbers. However, this comes with trade-offs, particularly in accuracy.
Below is a beginner-friendly technical guide with equations, bit-by-bit conversions, and real data-loss analysis.
Why Do We Need Lower-Precision Formats?
Modern LLMs (Llama 3 70B, Mixtral 8×22B, etc.) easily reach 100–400 GB or more in FP32. A single high-end GPU (e.g., an H100 with 80 GB) cannot hold them: you either buy 4–8 GPUs or you quantize.
Quantization = converting weights & activations from high-precision (usually FP32) to lower-precision formats (FP16, BF16, INT8, etc.) to save memory and increase speed.
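As a quick sanity check on the memory math: a 70-billion-parameter model at 4 bytes per FP32 weight needs roughly 70 × 10⁹ × 4 B ≈ 280 GB for the weights alone; at 2 bytes per weight (FP16/BF16) that drops to about 140 GB, and at 1 byte (INT8) to about 70 GB.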
Here are the three most important floating-point formats in today’s AI world:
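In brief (these are the standard IEEE 754 and bfloat16 bit layouts):
- FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits; range up to ~3.4 × 10³⁸, about 7 decimal digits of precision.
- FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits; maximum value 65,504, about 3 decimal digits of precision.
- BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits; same range as FP32 (~3.4 × 10³⁸), about 2–3 decimal digits of precision.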
Key insight:
- FP16 offers more mantissa precision, but its small exponent range (maximum value 65,504) leads to frequent overflow in training.
- BF16 sacrifices mantissa precision for the same huge range as FP32, so it almost never overflows, which makes it ideal for deep learning.
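To see the range difference concretely, here is a minimal PyTorch check (the value 70,000 is just an illustration, chosen because it sits slightly above FP16's maximum of 65,504):

```python
import torch

big = torch.tensor(70000.0)      # just above FP16's maximum representable value (65504)
print(big.to(torch.float16))     # overflows to inf in FP16
print(big.to(torch.bfloat16))    # stays finite in BF16 (rounded to a nearby representable value)
```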
The Universal Floating-Point Equation
For normal (non-subnormal, non-special) numbers:
value = (-1)^s × (1 + f) × 2^(e - bias)
where
- s = sign bit (0 or 1)
- e = stored exponent (an integer); the exponent bias is 127 for FP32 and BF16, and 15 for FP16
- f = fraction = mantissa_integer / 2^mantissa_bits
- The leading 1 is implicit (the hidden bit), so you get mantissa_bits + 1 significant binary digits.
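To make the equation concrete, here is a small Python sketch (the helper name decode_fp32 is just for illustration) that pulls the sign, exponent, and fraction bits out of an FP32 value and rebuilds it with the formula above:

```python
import struct

def decode_fp32(x):
    # Reinterpret the 32-bit pattern of x as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = (bits >> 31) & 0x1         # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits (bias = 127 for FP32)
    m = bits & 0x7FFFFF            # 23 fraction bits
    f = m / 2**23
    value = (-1) ** s * (1 + f) * 2.0 ** (e - 127)   # normal numbers only
    return s, e, f, value

print(decode_fp32(0.1))
# -> roughly (0, 123, 0.60000002..., 0.10000000149...)
```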
10 Real Numbers Converted to FP32, FP16, and BF16
(Values computed with exact IEEE 754 rounding: round-to-nearest-even)
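Conversions like these are easy to reproduce with a few lines of PyTorch; the sample values below are illustrative rather than the exact ten from the table:

```python
import torch

samples = [0.1, 0.5, 1.0, 3.14159, 1e-8, 65504.0, 123456.789]

for x in samples:
    fp32 = torch.tensor(x, dtype=torch.float32)
    fp16 = fp32.to(torch.float16)    # may underflow to 0 or overflow to inf
    bf16 = fp32.to(torch.bfloat16)
    print(f"{x:>12} | FP32 {fp32.item():.10g} | "
          f"FP16 {fp16.float().item():.10g} | BF16 {bf16.float().item():.10g}")
```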
Detailed Walk-through of 0.1
0.1 in binary is repeating: 0.0001100110011001100110011… (never ends)
Method: Repeatedly multiply the fractional part by 2.
- If the result is ≥ 1, the bit is 1; subtract 1 and keep the fractional remainder.
- If the result is < 1, the bit is 0; keep it as is.
- Repeat with the new fractional part.
Starting number: 0.1
The binary fraction is 0.000110011001… (base 2), where the pattern 1001 repeats forever.
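The same multiply-by-2 procedure can be scripted; this small sketch (the helper name is hypothetical) prints the first 26 bits of 0.1's binary expansion:

```python
def fraction_to_binary(x, bits=26):
    # Repeated multiply-by-2 method from the steps above
    digits = []
    frac = x
    for _ in range(bits):
        frac *= 2
        if frac >= 1:
            digits.append("1")
            frac -= 1
        else:
            digits.append("0")
    return "0." + "".join(digits)

print(fraction_to_binary(0.1))
# 0.00011001100110011001100110  (the 1001 pattern keeps repeating)
```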
Where Data Loss Actually Happens
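One place the loss shows up is mantissa rounding: the fewer fraction bits a format has, the further the stored value sits from the exact one. A quick PyTorch check for 0.1:

```python
import torch

x32 = torch.tensor(0.1, dtype=torch.float32)
for name, v in [("FP32", x32),
                ("FP16", x32.to(torch.float16)),
                ("BF16", x32.to(torch.bfloat16))]:
    stored = v.double().item()     # the exact value actually stored
    print(f"{name}: stored {stored:.12f}   abs error {abs(stored - 0.1):.2e}")

# FP32: stored 0.100000001490   abs error 1.49e-09
# FP16: stored 0.099975585938   abs error 2.44e-05
# BF16: stored 0.100097656250   abs error 9.77e-05
```

The other place is range: values above a format's maximum overflow to infinity (as in the FP16 example earlier), and values below its smallest representable magnitude underflow to zero.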
Mixed Precision Training: The Real-World Trick
Pure FP16 training often diverged in 2017–2018. NVIDIA’s Automatic Mixed Precision (AMP) and Google’s BF16 solved it.
Typical combination (PyTorch/TensorFlow); a code sketch follows below:
- Store weights and activations in FP16 or BF16: 2× memory saving, 4–8× speed on Tensor Cores.
- Keep a FP32 master copy of weights.
- Compute gradients in low precision.
- Loss scaling (multiply loss by 2¹⁶ or dynamic scaling): prevents gradients from underflowing to zero.
- Accumulate gradients and update master weights in FP32: preserves tiny updates.
- Copy updated FP32 weights back to FP16/BF16 for next forward pass.
Result: near FP32 accuracy at FP16/BF16 speed and memory.
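A minimal PyTorch sketch of this recipe using torch.cuda.amp; the model, data, and hyperparameters are placeholders, and with autocast the FP32 parameters themselves serve as the master copy while the matmuls run in FP16:

```python
import torch
from torch import nn

model = nn.Linear(1024, 10).cuda()                        # parameters stay in FP32 (master copy)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                      # dynamic loss scaling

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):    # forward pass runs in FP16
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()   # scale the loss so tiny gradients don't underflow to zero
    scaler.step(optimizer)          # unscale gradients, then update the FP32 weights
    scaler.update()                 # adjust the scale factor dynamically
```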
Padmaja is a Senior Data Scientist with HPE. She is responsible for projects end to end, from pursuit to delivery, and also develops Go-To-Market solutions for customers with a variety of use cases and requirements.
- Tags:
- memory