**Data Format Fundamentals — Single Precision (FP32) vs Half Precision (FP16)**

Now, let’s take a closer look at FP32 and FP16 formats. The FP32 and FP16 are IEEE formats that represent floating numbers using 32-bit binary storage and 16-bit binary storage. Both formats comprise three parts: a) a sign bit, b) exponent bits, and c) mantissa bits. The FP32 and FP16 differ in the **number of bits allocated to exponent and mantissa**, which result in different value ranges and precisions.

How do you convert FP16 and FP32 to real values? According to IEEE-754 standards, the decimal value for FP32 = (-1)^(sign) × 2^(decimal exponent —127 ) × (implicit leading 1 + decimal mantissa), where 127 is the biased exponent value. For FP16, the formula becomes (-1)^(sign) × 2^(decimal exponent — 15) × (implicit leading 1 + decimal mantissa), where 15 is the corresponding biased exponent value. See further details of the biased exponent value here.

In this sense, the value range for FP32 is approximately [-2¹²⁷, 2¹²⁷] ~[-1.7*1e38, 1.7*1e38], and the value range for FP16 is approximately [-2¹⁵, 2¹⁵]=[-32768, 32768]. Note that the decimal exponent for FP32 is between 0 and 255, and we’re excluding the largest value 0xFF as it represents NAN. That’s why the largest decimal exponent is 254–127 = 127. A similar rule applies to FP16.

For the precision, note that both the exponent and mantissa contributes to the precision limits (which is also called **denormalization**, see detailed discussion here), so FP32 can represent precision up to 2^(-23)*2^(-126)=2^(-149), and FP16 can represent precision up to 2^(10)*2^(-14)=2^(-24).

The difference between FP32 and FP16 representations brings the key concerns of mixed precision training, as **different layers/operations of deep learning models are either insensitive or sensitive to value ranges and precision and need to be addressed separately**.

## Be the first to comment