
16, 8, and 4-bit Floating Point Formats – How Does it Work?

Let's go into bits and bytes

Image by Adrien Converse on Unsplash

For 50 years, since the time of Kernighan, Ritchie, and the first edition of their C language book, it has been known that a single-precision "float" type is 32 bits wide and a double-precision type is 64 bits. There was also an 80-bit "long double" type with extended precision, and these types covered almost all the needs of floating-point data processing. However, during the last few years, the advent of large neural network models has required developers to move into another part of the spectrum and to shrink floating-point types as much as possible.

Honestly, I was surprised when I discovered that a 4-bit floating-point format even exists. How on Earth can that be possible? The best way to find out is to test it ourselves. In this article, we will look at the most popular floating-point formats, make a simple neural network, and see how it works.

Let’s get started.

A "Standard" 32-bit Floating point

Before going into "extreme" formats, let’s recall the standard one. The IEEE 754 standard for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical number in the 32-bit float type looks like this:

A 32-bit float example, Source Wikipedia

Here, the first bit is the sign, the next 8 bits represent the exponent, and the remaining 23 bits represent the mantissa (fraction). For normalized numbers, the final value is (-1)^sign × 2^(exponent − 127) × (1 + fraction), calculated using the formula:

Floating point calculation, Source Wikipedia

This simple helper function allows us to print a floating point value in binary form:

import struct

def print_float32(val: float):
    """ Print Float32 in a binary form """
    m = struct.unpack('I', struct.pack('f', val))[0]
    return format(m, 'b').zfill(32)

print_float32(0.15625)

# > 00111110001000000000000000000000 
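
To make the formula concrete, we can decode these bits by hand (a quick check of my own): the sign is 0, the exponent field is 0b01111100 = 124, and the only set mantissa bit contributes 1/4 to the fraction:

sign = 0
exponent_raw = 0b01111100  # 124
fraction = 0.25            # only the "1/4" mantissa bit is set

print((-1) ** sign * 2 ** (exponent_raw - 127) * (1 + fraction))

# > 0.15625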

Let’s also make another helper for backward conversion, which will be useful later:

def ieee_754_conversion(sign, exponent_raw, mantissa, exp_len=8, mant_len=23):
    """ Convert binary data into the floating point value """
    sign_mult = -1 if sign == 1 else 1
    exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
    mant_mult = 1
    for b in range(mant_len - 1, -1, -1):
        if mantissa & (2 ** b):
            mant_mult += 1 / (2 ** (mant_len - b))

    return sign_mult * (2 ** exponent) * mant_mult

ieee_754_conversion(0b0, 0b01111100, 0b01000000000000000000000)

#> 0.15625

I hope every software developer and IT enthusiast knows that floating-point types have limited accuracy:

val = 3.14
print(f"{val:.20f}")

# > 3.14000000000000012434
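
To see how the error grows as the bit width shrinks, here is the same value stored in three numpy types (a quick illustration of my own; we will meet the 16-bit type properly in the next section):

import numpy as np

print(f"{np.float64(3.14):.20f}")
print(f"{np.float32(3.14):.20f}")
print(f"{np.float16(3.14):.20f}")

# > 3.14000000000000012434
# > 3.14000010490417480469
# > 3.14062500000000000000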

In this case, it is not a huge problem, but the fewer bits we have, the less accuracy we get. And as we will see soon, accuracy can be an issue. Now, let’s start our journey into the rabbit hole…

16-bit Floating point

Apparently, there was no big demand for this format earlier, and a 16-bit floating-point type was added to the IEEE 754 standard only in 2008. It has a sign bit, a 5-bit exponent, and a 10-bit mantissa (fraction):

A 16-bit float, Image source Wikipedia

The conversion logic is the same as for the 32-bit float, but the accuracy is obviously lower. Let’s print a 16-bit float in binary form:

import numpy as np

def print_float16(val: float):
    """ Print Float16 in a binary form """
    m = struct.unpack('H', struct.pack('e', np.float16(val)))[0]
    return format(m, 'b').zfill(16)

print_float16(3.14)

# > 0100001001001000

Using the helper we made before, we can do the backward conversion:

ieee_754_conversion(0, 0b10000, 0b1001001000, exp_len=5, mant_len=10)

# > 3.140625
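
As a quick check of where 3.140625 comes from (my own arithmetic): the stored mantissa 0b1001001000 equals 584, and the exponent field 0b10000 equals 16, so the value is (1 + 584/1024) × 2^(16 − 15):

print((1 + 0b1001001000 / 1024) * 2 ** (0b10000 - 15))

# > 3.140625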

We can also find the maximum value that can be represented in Float16:

ieee_754_conversion(0, 0b11110, 0b1111111111, exp_len=5, mant_len=10)

#> 65504.0

I used 0b11110 because, in the IEEE 754 standard, the all-ones exponent 0b11111 is reserved for "infinity" and NaN. We can also find the smallest positive normalized value (subnormal numbers, which our simple helper does not handle, can go even smaller):

ieee_754_conversion(0, 0b00001, 0b0000000000, exp_len=5, mant_len=10)

#> 0.00006104

Types like these are a sort of "uncharted territory" for most developers; for a long time there was not even a standard 16-bit float type in C++ (an optional std::float16_t appeared only in C++23). But the variety of types is even bigger.

16-bit "bfloat" (BF16)

This floating-point format was developed by the Google Brain team, and it is specially designed for machine learning (the "B" in its name also stands for "brain"). The type is a modification of the "standard" 16-bit float: the exponent was enlarged to 8 bits, so the dynamic range of "bfloat16" is actually the same as that of float32. But the mantissa size was reduced to 7 bits:

A 16-bit bfloat16, Image source Wikipedia
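
One handy consequence of sharing the float32 exponent is that a float32 value can be turned into a bfloat16 simply by keeping its upper 16 bits. Below is a minimal sketch of my own ("print_bfloat16" is just an illustrative helper with simplified round-to-nearest-even and no NaN handling, not a library function):

import struct
import numpy as np

def print_bfloat16(val: float):
    """ Print bfloat16 in a binary form (simplified rounding, no NaN handling) """
    bits32 = struct.unpack('I', struct.pack('f', np.float32(val)))[0]
    # Round to nearest even, then keep the upper 16 bits of the float32 pattern
    bits16 = (bits32 + 0x7FFF + ((bits32 >> 16) & 1)) >> 16
    return format(bits16, 'b').zfill(16)

print_bfloat16(3.14)

# > 0100000001001001

The exponent (0b10000000) and mantissa (0b1001001) fields of this pattern are exactly the ones used in the decoding below.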

Let’s do a similar calculation as before:

ieee_754_conversion(0, 0b10000000, 0b1001001, exp_len=8, mant_len=7)

#> 3.140625

As was written before, because of the larger exponent, a bfloat16 format has a much wider range:

ieee_754_conversion(0, 0b11111110, 0b1111111, exp_len=8, mant_len=7)

#> 3.3895313892515355e+38

This is much better compared to 65504.0 in the previous example but, as mentioned, the bfloat16 precision is lower because of the smaller number of mantissa bits. We can test both types in TensorFlow:

import tensorflow as tf

print(f"{tf.constant(1.2, dtype=tf.float16).numpy().item():.12f}")

# > 1.200195312500

print(f"{tf.constant(1.2, dtype=tf.bfloat16).numpy().item():.12f}")

# > 1.203125000000

8-bit float (FP8)

This (relatively new) format was proposed in 2022, and as readers can guess, it was also created for machine learning: models became larger and larger, and it was a challenge to fit them into GPU memory. The FP8 format exists in two variants: E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa):

8-bit floats, Image source Wikipedia

Let’s get the maximum possible values for both formats. For E4M3, I used the mantissa 0b110 because the all-ones combination (exponent 0b1111, mantissa 0b111) is reserved for NaN; unlike E5M2, this format has no "infinity":

ieee_754_conversion(0, 0b1111, 0b110, exp_len=4, mant_len=3)

# > 448.0

ieee_754_conversion(0, 0b11110, 0b11, exp_len=5, mant_len=2)

# > 57344.0
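
Both numbers are easy to verify by hand (my own arithmetic): in each case the mantissa equals 1.75, and the unbiased exponents are 15 − 7 = 8 for E4M3 and 30 − 15 = 15 for E5M2:

print(1.75 * 2 ** 8, 1.75 * 2 ** 15)

# > 448.0 57344.0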

We can also use FP8 in Tensorflow:

import tensorflow as tf
from tensorflow.python.framework import dtypes

a_fp8 = tf.constant(3.14, dtype=dtypes.float8_e4m3fn)
print(a_fp8)

# > 3.25

a_fp8 = tf.constant(3.14, dtype=dtypes.float8_e5m2)
print(a_fp8)

# > 3.0

Let’s draw a sine wave in both types:

import numpy as np
import tensorflow as tf
from tensorflow.python.framework import dtypes
import matplotlib.pyplot as plt

length = np.pi * 4
resolution = 200
xvals = np.arange(0, length, length / resolution)
wave = np.sin(xvals)
wave_fp8_1 = tf.cast(wave, dtypes.float8_e4m3fn)
wave_fp8_2 = tf.cast(wave, dtypes.float8_e5m2)

plt.rcParams["figure.figsize"] = (14, 5)
plt.plot(xvals, wave_fp8_1.numpy())
plt.plot(xvals, wave_fp8_2.numpy())
plt.show()

Surprisingly, the result is not bad:

A sine wave data in FP8 formats, Image by author

We can obviously see some loss of precision, but this image still looks like a sine wave!

4-bit Floating point types

Now let’s go to the most "crazy" stuff: 4-bit floating-point values! Actually, a 4-bit float (FP4) is the smallest possible format that follows the IEEE-style layout, with a 1-bit sign, a 2-bit exponent, and a 1-bit mantissa:

FP4 value, Image by author

The amount of information that can be stored is obviously not large; in fact, all possible values fit in a 16-item array!

The second possible 4-bit implementation is the so-called NormalFloat (NF4) data type. The NF4 values are optimized for storing normally distributed variables. It is hard to do this for other data types, but all possible NF4 values can easily be printed in one list:

[-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453, 
 -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
  0.07958029955625534, 0.16093020141124725, 0.24611230194568634, 0.33791524171829224, 
  0.44070982933044434, 0.5626170039176941, 0.7229568362236023, 1.0]

Both FP4 and NF4 types are implemented in the bitsandbytes Python library. As an example, let’s convert a [1.0, 2.0, 3.0, 4.0] array to FP4:

import torch
from bitsandbytes import functional as bf

def print_uint(val: int, n_digits=8) -> str:
    """ Convert 42 => '00101010' """
    return format(val, 'b').zfill(n_digits)

device = torch.device("cuda")
x = torch.tensor([1.0, 2.0, 3.0, 4.0], device=device)
x_4bit, qstate = bf.quantize_fp4(x, blocksize=64)

print(x_4bit)
# > tensor([[117], [35]], dtype=torch.uint8)

print_uint(x_4bit[0].item())
# > 01110101
print_uint(x_4bit[1].item())
# > 00100011

print(qstate)
# > (tensor([4.]), 
# >  'fp4', 
# >  tensor([ 0.0000,  0.0052,  0.6667,  1.0000,  0.3333,  0.5000,  0.1667,  0.2500,
# >           0.0000, -0.0052, -0.6667, -1.0000, -0.3333, -0.5000, -0.1667, -0.2500])])

The result is interesting. As an output, we get two objects: a tensor of two uint8 values, [117, 35], which packs our four 4-bit numbers two per byte, and a "state" object containing the scaling factor 4.0 and a tensor with all 16 FP4 code values.

As an example, the first 4-bit number is "0111" (= 7), and we can see in the state object that the corresponding code value is 0.25; 0.25 × 4 = 1.0. The second number is "0101" (= 5), and the result is 0.5 × 4 = 2.0. For the 3rd number, "0010" is 2, and 0.6667 × 4 ≈ 2.667, which is close but not equal to 3.0; obviously, with 4-bit values we have some precision loss. For the last value, "0011" is 3, and 1.0 × 4 = 4.0.
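
To double-check this arithmetic, here is a small sketch of my own that repeats the decoding by hand, using the packed bytes, the 16-entry codebook, and the scaling factor printed in the qstate output above (the high nibble of each byte is decoded first):

fp4_code = [0.0000, 0.0052, 0.6667, 1.0000, 0.3333, 0.5000, 0.1667, 0.2500,
            0.0000, -0.0052, -0.6667, -1.0000, -0.3333, -0.5000, -0.1667, -0.2500]
scale = 4.0
packed = [117, 35]

indices = []
for byte in packed:
    indices += [byte >> 4, byte & 0x0F]  # high nibble first, then low nibble

print([round(fp4_code[i] * scale, 4) for i in indices])

# > [1.0, 2.0, 2.6668, 4.0]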

Obviously, we don’t have to do it manually; a backward conversion can also be done using bitsandbytes:

x = bf.dequantize_fp4(x_4bit, qstate)
print(x)

# > tensor([1.000, 2.000, 2.666, 4.000])

A 4-bit format also has a limited dynamic range. For example, the array [1.0, 2.0, 3.0, 64.0] will be converted to [0.333, 0.333, 0.333, 64.0]. But for more or less normalized data, the results are really not bad. As an example, let’s draw a sine wave in FP4 format:

import matplotlib.pyplot as plt
import numpy as np
import torch
from bitsandbytes import functional as bf

device = torch.device("cuda")

length = np.pi * 4
resolution = 256
xvals = np.arange(0, length, length / resolution)
wave = np.sin(xvals)

x_4bit, qstate = bf.quantize_fp4(torch.tensor(wave, dtype=torch.float32, device=device), blocksize=64)
dq = bf.dequantize_fp4(x_4bit, qstate)

plt.rcParams["figure.figsize"] = (14, 5)
plt.title('FP4 Sine Wave')
plt.plot(xvals, wave)
plt.plot(xvals, dq.cpu().numpy())
plt.show()

Obviously, we can see some accuracy loss, but the result is good enough:

A sine wave data in FP4 format, Image by author
A sine wave data in FP4 format, Image by author

As for the NF4 type, readers can try the "quantize_nf4" and "dequantize_nf4" methods on their own; all the code remains the same (see the short sketch below). Alas, at the time of writing this article, 4-bit types worked only with CUDA; CPU calculations were not supported yet.
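
For completeness, here is a minimal NF4 sketch under the same assumptions (a CUDA device and the bitsandbytes functional module imported as bf, as above); I have not listed exact output values, since they depend on how the library snaps the inputs to the NF4 grid:

x = torch.tensor([1.0, 2.0, 3.0, 4.0], device=device)
x_nf4, qstate_nf4 = bf.quantize_nf4(x, blocksize=64)
print(bf.dequantize_nf4(x_nf4, qstate_nf4))

# > values close to [1.0, 2.0, 3.0, 4.0], snapped to the scaled NF4 grid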

Testing

As a final step in this article, let’s make a neural network model and test it. Using the transformers Python library, it is possible to load a pre-trained model with 4-bit precision just by setting the load_in_4bit parameter to True (a minimal sketch is shown below). But let’s be honest, this will not give us any understanding of how it works. Instead, as a toy example, let’s create a small neural network, train it, and use it with 4-bit precision.
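
For reference, here is a minimal sketch of that transformers one-liner (the model name is only an example, and newer library versions prefer passing a BitsAndBytesConfig via the quantization_config parameter instead of the plain flag):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b",
                                             load_in_4bit=True,
                                             device_map="auto")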

First, let’s create a neural network model:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from typing import Any

class NetNormal(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.model = nn.Sequential(
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.model(x)
        return F.log_softmax(x, dim=1)

Now we need to prepare the dataset loaders. I will be using the MNIST dataset, which contains 70,000 28×28 images of handwritten digits (Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset, which is available under the Creative Commons Attribution-Share Alike 3.0 license). The dataset is split into 60,000 training and 10,000 test images; the split is selected with the train=True|False parameter of datasets.MNIST.

from torchvision import datasets, transforms

batch_size = 64  # also set with the other hyperparameters below

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST("data", train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True)

Now, we are ready to train and save the model. The training goes in the "normal" way, using default precision:

import time

device = torch.device("cuda")

batch_size = 64
epochs = 4
log_interval = 500

def train(model: nn.Module, train_loader: torch.utils.data.DataLoader,
          optimizer: Any, epoch: int):
    """ Train the model """
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if batch_idx % log_interval == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}]\tLoss: {loss.item():.5f}')

def test(model: nn.Module, test_loader: torch.utils.data.DataLoader):
    """ Test the model """
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            t_start = time.monotonic()
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    t_diff = time.monotonic() - t_start

    print(f"Test set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset)}%)\n")

def get_size_kb(model: nn.Module):
    """ Get model size in kilobytes """
    size_model = 0
    for param in model.parameters():
        if param.data.is_floating_point():
            size_model += param.numel() * torch.finfo(param.data.dtype).bits
        else:
            size_model += param.numel() * torch.iinfo(param.data.dtype).bits
    print(f"Model size: {size_model / (8*1024)} KB")

# Train
model = NetNormal().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
for epoch in range(1, epochs + 1):
    train(model, train_loader, optimizer, epoch)
    test(model, test_loader)

get_size_kb(model)

# Save
torch.save(model.state_dict(), "mnist_model.pt")

I also created a helper "get_size_kb" method to get the model size in kilobytes.

The training process looks like this:

Train Epoch: 1 [0/60000] Loss: 2.31558
Train Epoch: 1 [32000/60000] Loss: 0.53704
Test set: Average loss: 0.2684, Accuracy: 9225/10000 (92.25%)

Train Epoch: 2 [0/60000] Loss: 0.19791
Train Epoch: 2 [32000/60000] Loss: 0.17268
Test set: Average loss: 0.1998, Accuracy: 9401/10000 (94.01%)

Train Epoch: 3 [0/60000] Loss: 0.30570
Train Epoch: 3 [32000/60000] Loss: 0.33042
Test set: Average loss: 0.1614, Accuracy: 9530/10000 (95.3%)

Train Epoch: 4 [0/60000] Loss: 0.20046
Train Epoch: 4 [32000/60000] Loss: 0.19178
Test set: Average loss: 0.1376, Accuracy: 9601/10000 (96.01%)

Model size: 427.2890625 KB

Our simple model achieved 96% accuracy, and the neural network size is 427 KB.
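
As a quick sanity check of my own, the reported size matches the parameter count exactly: the network has (784×128 + 128) + (128×64 + 64) + (64×10 + 10) = 109,386 parameters, each stored as a 32-bit float:

n_params = (784 * 128 + 128) + (128 * 64 + 64) + (64 * 10 + 10)
print(n_params * 32 / (8 * 1024))

# > 427.2890625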

Time to have some fun! Let’s create and test an 8-bit version of the model. The model description is actually the same; I only replaced the "Linear" layer with "Linear8bitLt".

from bitsandbytes.nn import Linear8bitLt

class Net8Bit(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.model = nn.Sequential(
            Linear8bitLt(784, 128, has_fp16_weights=False),
            nn.ReLU(),
            Linear8bitLt(128, 64, has_fp16_weights=False),
            nn.ReLU(),
            Linear8bitLt(64, 10, has_fp16_weights=False)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.model(x)
        return F.log_softmax(x, dim=1)

device = torch.device("cuda")

# Load
model = Net8Bit()
model.load_state_dict(torch.load("mnist_model.pt"))
get_size_kb(model)
print(model.model[0].weight)

# Convert
model = model.to(device)

get_size_kb(model)
print(model.model[0].weight)

# Run
test(model, test_loader)

The output looks like this:

Model size: 427.2890625 KB
Parameter(Int8Params([[ 0.0071,  0.0059,  0.0146,  ...,  0.0111, -0.0041,  0.0025],
            ...,
            [-0.0131, -0.0093, -0.0016,  ..., -0.0156,  0.0042,  0.0296]]))

Model size: 107.4140625 KB
Parameter(Int8Params([[  9,   7,  19,  ...,  14,  -5,   3],
            ...,
            [-21, -15,  -3,  ..., -25,   7,  47]], device='cuda:0',
           dtype=torch.int8))

Test set: Average loss: 0.1347, Accuracy: 9600/10000 (96.0%)

The original model was loaded in the standard floating-point format; its size is the same, and the weights look like [0.0071, 0.0059, …]. Moving the model to "cuda" is where the "magic" actually happens: the model size becomes 4 times smaller. As we can see, the int8 weights are just the original values multiplied by a scaling factor, so the conversion was straightforward, and during the test run, there was no accuracy loss at all!
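
Roughly speaking, the "magic" is absmax quantization: each row of weights is divided by its maximum absolute value and mapped to the int8 range. Here is a simplified sketch of my own (the real Linear8bitLt implements LLM.int8(), which additionally handles outlier features in higher precision):

w = torch.randn(4, 8)                                    # pretend float weights
scale = w.abs().max(dim=1, keepdim=True).values / 127.0  # per-row absmax scale
w_int8 = torch.round(w / scale).to(torch.int8)           # quantize to int8
w_restored = w_int8.float() * scale                      # approximate reconstruction
print((w - w_restored).abs().max())                      # small quantization error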

Let’s now test the 4-bit version:

from bitsandbytes.nn import LinearFP4, LinearNF4

class Net4Bit(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.model = nn.Sequential(
            LinearFP4(784, 128),
            nn.ReLU(),
            LinearFP4(128, 64),
            nn.ReLU(),
            LinearFP4(64, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        x = self.model(x)
        return F.log_softmax(x, dim=1)

# Load
model = Net4Bit()
model.load_state_dict(torch.load("mnist_model.pt"))
get_size_kb(model)
print(model.model[2].weight)

# Convert
model = model.to(device)

get_size_kb(model)
print(model.model[2].weight)

# Run
test(model, test_loader)

The output looks like this:

Model size: 427.2890625 KB
Parameter(Params4bit([[ 0.0916, -0.0453,  0.0891,  ...,  0.0430, -0.1094, -0.0751],
            ...,
            [-0.0079, -0.1021, -0.0094,  ..., -0.0124,  0.0889,  0.0048]]))

Model size: 54.1015625 KB
Parameter(Params4bit([[ 95], [ 81], [109],
            ...,
            [ 34], [ 46], [ 33]], device='cuda:0', dtype=torch.uint8))

Test set: Average loss: 0.1414, Accuracy: 9579/10000 (95.79%)

The result is interesting. After conversion, the model size was reduced 8 times, from 427 to 54 KB, while the accuracy dropped by less than 1%. How is this possible? Well, at least for this model, the answer is simple:

  • As we can see, the weights are more or less evenly distributed within a small range, so the quantization loss is not too large.
  • The neural network uses a softmax at the output, and the index of the maximum value determines the actual result. It is easy to understand that for finding the index of the maximum, the value itself does not matter. For example, it makes no difference whether the value is 0.8 or 0.9 when the other values are 0.1 or 0.2 (see the short sketch after this list)!
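
A short sketch of that second point (a toy example of my own, not real model output):

logits_a = torch.tensor([0.1, 0.2, 0.8])
logits_b = torch.tensor([0.1, 0.2, 0.9])  # a slightly perturbed "low-precision" version

print(torch.argmax(logits_a).item(), torch.argmax(logits_b).item())

# > 2 2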

I think it’s important to see this in more detail. Let’s load a digit from the test dataset and check the model output.

dataset = datasets.MNIST('data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ]))

np.set_printoptions(precision=3, suppress=True)  # No scientific notation

data_in = dataset[4][0]
for x in range(28):
    for y in range(28):
        print(f"{data_in[0][x][y]: .1f}", end=" ")
    print()

The print output shows the digit we want to predict (in this case, a handwritten "4"):

Let’s see what a "standard" model will return:

# Suppress scientific notation
np.set_printoptions(precision=2, suppress=True)  

# Predict
with torch.no_grad():
    output = model(data_in.to(device))
    print(output[0].cpu().numpy())
    ind = output.argmax(dim=1, keepdim=True)[0].cpu().item()
    print("Result:", ind)

# > [ -8.27 -13.89  -6.89 -11.13  -0.03  -8.09  -7.46  -7.6   -6.43  -3.77]
# > Result: 4

The maximum element is located at the 5th position (elements in the numpy array are numbered from 0), which corresponds to the digit "4".

This is an output of the 8-bit model:

# > [ -9.09 -12.66  -8.42 -12.2   -0.01  -9.25  -8.29  -7.26  -8.36  -4.45]
# > Result: 4

And the 4-bit:

# > [ -8.56 -12.12  -7.52 -12.1   -0.01  -8.94  -7.84  -7.41  -7.31  -4.45]
# > Result: 4

It is easy to see that the actual output values are different, but the maximum index remains the same.

Conclusion

In this article, we tested different encoding schemes for 16-bit, 8-bit, and 4-bit floating-point numbers, created a neural network, and were able to run it with 8-bit and 4-bit precision. It is actually interesting to see how it works: by reducing the precision from a standard 32-bit to a 4-bit float, the memory footprint was reduced 8 times, but the accuracy loss was minimal. Obviously, this is only a toy example, and in really large models, much more sophisticated methods are used (those who are interested are welcome to read this Hugging Face blog post).

I hope this article was helpful in understanding the general ideas behind floating-point calculations. And as we know, "necessity is the mother of invention": reducing the memory footprint 4–8 times is a great achievement, especially considering the price difference between 8, 16, 32, and 64 GB GPU cards 😉 By the way, even 4 bits is not the limit; the GPTQ paper mentions the possibility of quantizing weights into 2 bits or even into a ternary (1.5 bits!) state. Last but not least, it is also interesting to think about the "accuracy" of neurotransmitters in the human brain. Intuitively, it is not that high, and maybe 2- or 4-bit models are actually closer to what we have.

Thanks for reading. If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.

