<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Understanding Quantization in Depth: FP32 vs FP16 vs BF16 in High Performance Computing</title>
    <link>https://community.hpe.com/t5/high-performance-computing/understanding-quantization-in-depth-fp32-vs-fp16-vs-bf16/m-p/7259287#M357</link>
    <description>&lt;P&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;A beginner-friendly guide to quantization for LLMs: how FP32, FP16, and BF16 represent numbers, a bit-by-bit conversion of 0.1, where data loss actually happens, and how mixed-precision training recovers accuracy.&lt;/FONT&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 01 Dec 2025 16:38:04 GMT</pubDate>
    <dc:creator>Padmaja_V</dc:creator>
    <dc:date>2025-12-01T16:38:04Z</dc:date>
    <item>
      <title>Understanding Quantization in Depth: FP32 vs FP16 vs BF16</title>
      <link>https://community.hpe.com/t5/high-performance-computing/understanding-quantization-in-depth-fp32-vs-fp16-vs-bf16/m-p/7259287#M357</link>
      <description>&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;In the world of LLMs and data science, GPUs are indispensable. They excel at high-throughput tasks, processing large batches of data with minimal transfer overhead. But what if your model exceeds the memory of a single GPU? Distributing the model across multiple GPUs is one solution, but it is not always cost-effective. For those aiming to optimize costs and fit everything on a single GPU, quantization offers a powerful alternative. This technique reduces the model's memory footprint by lowering the precision of its numerical values: essentially, using fewer bits to represent each number. The trade-off is a potential loss of accuracy.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Below is a beginner-friendly technical guide with equations, bit-by-bit conversions, and real data-loss analysis.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Why Do We Need Lower-Precision Formats?&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Modern LLMs (Llama 3 70B, Mixtral 8×22B, etc.) easily exceed 100–400 GB in FP32. A single high-end GPU (e.g., an H100 with 80 GB) cannot hold them: you either buy 4–8 GPUs or quantize.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Quantization = converting weights &amp;amp; activations from high-precision (usually FP32) to lower-precision formats (FP16, BF16, INT8, etc.) to save memory and increase speed.&lt;/FONT&gt;&lt;/P&gt;
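&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;To make the memory pressure concrete, here is a quick back-of-the-envelope sketch in Python (the 70B parameter count is illustrative, and only raw weight storage is counted):&lt;/FONT&gt;&lt;/P&gt;&lt;PRE&gt;
def model_size_gib(n_params, bytes_per_param):
    """Raw weight storage only; activations and optimizer state are extra."""
    return n_params * bytes_per_param / 2**30

for fmt, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"70B params in {fmt}: {model_size_gib(70e9, nbytes):.0f} GiB")

# FP32: ~261 GiB, FP16/BF16: ~130 GiB, INT8: ~65 GiB.
# Of these, only the INT8 copy fits in a single 80 GB GPU.
&lt;/PRE&gt;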
&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Here are the three most important floating-point formats in today's AI world:&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="image.png" style="width: 1474px;"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153130iFB536FEC0EBD38CB/image-size/large?v=v2&amp;amp;px=2000" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;EM&gt;Key insight:&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;FP16 offers more fraction (mantissa) precision than BF16, but its narrow exponent range (maximum ~65,504) leads to frequent overflow during training.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;BF16 sacrifices mantissa precision for the same huge exponent range as FP32, so it almost never overflows: a good fit for deep learning.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;The Universal Floating-Point Equation&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;For normal (non-subnormal, non-special) numbers:&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;value = (-1)^s × 2^(e - bias) × (1.f)&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;where&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;s = sign bit (0 or 1)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;e = stored exponent (an integer)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;f = fraction = mantissa_integer / 2^mantissa_bits&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;The leading 1 is implicit (the hidden bit), so you get mantissa_bits + 1 significant binary digits.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;
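&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;This equation is easy to check in code. Below is a minimal Python sketch that decodes a normal-number bit pattern with exactly this formula; the FP16 parameters (5 exponent bits, 10 fraction bits, bias 15) and the bit pattern for 0.1 are standard IEEE 754 values:&lt;/FONT&gt;&lt;/P&gt;&lt;PRE&gt;
def decode(bits, exp_bits, frac_bits, bias):
    """value = (-1)^s * 2^(e - bias) * (1.f), for normal numbers only."""
    s = bits // 2 ** (exp_bits + frac_bits)       # sign bit
    e = (bits // 2 ** frac_bits) % 2 ** exp_bits  # stored exponent
    f = bits % 2 ** frac_bits                     # mantissa as an integer
    return (-1) ** s * 2.0 ** (e - bias) * (1 + f / 2 ** frac_bits)

# FP16 bit pattern of 0.1: sign 0, exponent 01011, fraction 1001100110
print(decode(0b0_01011_1001100110, exp_bits=5, frac_bits=10, bias=15))
# 0.0999755859375 -- the FP16 rounding error analyzed below
&lt;/PRE&gt;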
&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;10 Real Numbers Converted to FP32, FP16, and BF16&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;(Values computed with exact IEEE 754 rounding: round-to-nearest-even.)&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="Original.png"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153156i6608CD62893D8C20/image-size/large?v=v2&amp;amp;px=2000" role="button" title="Original.png" alt="Original.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Detailed Walk-through of 0.1&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;0.1 in binary is a repeating fraction: 0.0001100110011001100110011… (it never terminates)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Method: repeatedly multiply the fractional part by 2.&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;If the result ≥ 1, the bit is 1; subtract 1 and keep the fractional remainder.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;If the result &amp;lt; 1, the bit is 0; keep the value as is.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Repeat with the new fractional part.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Starting number: 0.1&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="Step.png" style="width: 400px;"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153157i34FF72934A2CE8E0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Step.png" alt="Step.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-center" style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;EM&gt;Binary fraction: 0.000110011001… (base 2), where the block 1001 repeats forever&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Where Data Loss Actually Happens&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="SourceOfError.png" style="width: 966px;"&gt;&lt;img src="https://community.hpe.com/t5/image/serverpage/image-id/153158iF00E4699A2712DB0/image-size/large?v=v2&amp;amp;px=2000" role="button" title="SourceOfError.png" alt="SourceOfError.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin-top: 0pt; margin-bottom: 8pt; font-family: Tahoma; font-size: 10.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;STRONG&gt;Mixed Precision Training: The Real-World Trick&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Pure FP16 training often diverged in 2017–2018. NVIDIA's Automatic Mixed Precision (AMP) and Google's BF16 solved it.&lt;/FONT&gt;&lt;/P&gt;&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Typical combination (PyTorch/TensorFlow), with a minimal sketch after the list:&lt;/FONT&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Store weights and activations in FP16 or BF16: 2× memory saving, 4–8× speed on Tensor Cores.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Keep an FP32 master copy of the weights.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Compute gradients in low precision.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Loss scaling (multiply the loss by 2¹⁶, or use dynamic scaling): prevents gradients from underflowing to zero.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Accumulate gradients and update the master weights in FP32: preserves tiny updates.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;&lt;SPAN&gt;Copy the updated FP32 weights back to FP16/BF16 for the next forward pass.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/OL&gt;
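&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;A minimal PyTorch sketch of this recipe using the torch.cuda.amp API; the toy model, data, and step count are placeholders:&lt;/FONT&gt;&lt;/P&gt;&lt;PRE&gt;
import torch
from torch.cuda.amp import GradScaler, autocast

device = "cuda"                                  # AMP targets CUDA Tensor Cores
model = torch.nn.Linear(1024, 1024).to(device)   # toy stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()                            # dynamic loss scaling (step 4)

x = torch.randn(32, 1024, device=device)         # placeholder batch
y = torch.randn(32, 1024, device=device)

for _ in range(10):
    optimizer.zero_grad()
    with autocast():                             # forward pass in FP16 where safe (step 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                # backprop the scaled loss (steps 3-4)
    scaler.step(optimizer)                       # unscale, update FP32 master weights (step 5)
    scaler.update()                              # grow/shrink the scale factor as needed
&lt;/PRE&gt;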
&lt;P style="margin: 0in; font-family: Tahoma; font-size: 11.0pt;"&gt;&lt;FONT face="tahoma,arial,helvetica,sans-serif" size="3"&gt;Result: near-FP32 accuracy at FP16/BF16 speed and memory.&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 01 Dec 2025 16:38:04 GMT</pubDate>
      <guid>https://community.hpe.com/t5/high-performance-computing/understanding-quantization-in-depth-fp32-vs-fp16-vs-bf16/m-p/7259287#M357</guid>
      <dc:creator>Padmaja_V</dc:creator>
      <dc:date>2025-12-01T16:38:04Z</dc:date>
    </item>
  </channel>
</rss>

