LLMs and different quantizations

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

An interesting look at different approaches to compression and to reducing memory consumption by storing values in smaller bit widths.

Quantization refers to converting an LLM from its original Float32 representation to something smaller. The goal is not simply to store the values in a smaller type, but to map the larger bit representation onto the smaller one without losing too much information.

In practice, this is often done with a format called 4-bit NormalFloat (NF4). This datatype uses a few special tricks to efficiently represent the larger bit datatype. It consists of three steps:

Normalization: The weights of the model are normalized so that we expect the weights to fall within a certain range. This allows for more efficient representation of more common values.

Quantization: The weights are quantized to 4-bit. In NF4, the 16 quantization levels are placed at quantiles of a normal distribution, so each level covers an equal share of the normalized weights, which lets 4 bits represent the original 32-bit weights efficiently.

Dequantization: Although the weights are stored in 4-bit, they are dequantized during computation which gives a performance boost during inference.
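The three steps above can be sketched in plain NumPy. This is an illustrative toy, not the bitsandbytes implementation: the real NF4 code uses a fixed table of 16 constants and blockwise absmax scaling, while here the levels are derived from normal-distribution quantiles just to show the idea.

```python
import numpy as np
from statistics import NormalDist

# 16 levels placed at quantiles of a standard normal, rescaled to [-1, 1]
# (illustrative; the actual NF4 constants in bitsandbytes differ slightly)
_nd = NormalDist()
_qs = [_nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
LEVELS = np.array(_qs) / max(abs(q) for q in _qs)

def quantize(weights):
    absmax = np.abs(weights).max()          # step 1: normalization constant
    normed = weights / absmax               # weights now fall in [-1, 1]
    # step 2: snap each value to the nearest of the 16 levels
    idx = np.abs(normed[:, None] - LEVELS).argmin(axis=1)
    return idx.astype(np.uint8), absmax     # 4-bit indices plus one scale

def dequantize(idx, absmax):
    return LEVELS[idx] * absmax             # step 3: reconstruct for compute

w = np.random.randn(256).astype(np.float32)
idx, scale = quantize(w)
w_hat = dequantize(idx, scale)
```

Each weight now needs only 4 bits (an index in 0–15) plus a shared scale, which is where the memory savings come from; computation happens on the dequantized values.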

To perform this quantization with HuggingFace, we need to define a configuration for the quantization with bitsandbytes:

from transformers import BitsAndBytesConfig
from torch import bfloat16

# Our 4-bit configuration to load the LLM with less GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,               # 4-bit quantization
    bnb_4bit_quant_type='nf4',       # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)

This configuration allows us to specify which quantization levels we are going for. Generally, we want to represent the weights with 4-bit quantization but do the inference in 16-bit.
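To get a rough sense of the scale involved (my numbers, not from the article), here is the back-of-the-envelope memory arithmetic for a hypothetical 7-billion-parameter model:

```python
params = 7_000_000_000  # hypothetical 7B-parameter model

fp32_gib = params * 4 / 1024**3    # float32 = 4 bytes per weight
nf4_gib = params * 0.5 / 1024**3   # 4 bits = 0.5 bytes per weight
                                   # (ignoring quantization constants/overhead)
print(f"float32: {fp32_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
# → float32: 26.1 GiB, 4-bit: 3.3 GiB
```

An eightfold reduction in weight storage, which is why a model that needs a cluster of GPUs in float32 can fit on a single consumer card in 4-bit.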
