The tokenization process that feeds into ChatGPT

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

At first, this article sounds too cute: implement ChatGPT using only SQL. But in fact it contains a very good, easy-to-read description of the tokenization process:

For instance, we could have separate numbers for “Post”, “greSQL” and “ing”. This way, the words “PostgreSQL” and “Posting” would both have a length of 2 in our representation. And of course, we would still maintain separate code points for shorter sequences and individual bytes. Even if we come across gibberish or a text in a foreign language, it would still be encodable, albeit longer.

GPT2 uses a variation of the algorithm called Byte pair encoding to do precisely that. Its tokenizer uses a dictionary of 50257 code points (in AI parlance, “tokens”) that correspond to different byte sequences in UTF-8 (plus the “end of text” as a separate token).
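If you want to poke at that dictionary yourself, a minimal sketch like the following works, assuming the third-party tiktoken package (which ships the GPT-2 token dictionary; it is not part of the article's SQL implementation):

```python
import tiktoken  # third-party package that bundles the GPT-2 token dictionary

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257, matching the figure quoted above

# Print how GPT-2 actually splits the two example words into token ids
# and the corresponding strings.
for word in ("PostgreSQL", "Posting"):
    ids = enc.encode(word)
    print(word, ids, [enc.decode([i]) for i in ids])
```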

This dictionary was built by statistical analysis performed like this:

1. Start with a simple encoding of 256 tokens: one token per byte.

2. Take a large corpus of texts (preferably the one the model will be trained on).

3. Encode it.

4. Calculate which pair of tokens is the most frequent. Let’s assume it’s 0x20 0x74 (space followed by the lowercase “t”).

5. Assign the next available value (257) to this pair of bytes.

6. Repeat steps 3-5, now paying attention to byte sequences. If a sequence of bytes can be encoded with a complex token, use the complex token. If there are ambiguities (say, “abc” can, at some point, be encoded as “a” + “bc” or as “ab” + “c”), use the token with the lowest number (because it was added earlier and is hence more frequent). Do this recursively until every sequence that can collapse into a single token has collapsed into a single token.

7. Perform the collapse 50000 times over.

The number 50000 was chosen more or less arbitrarily by the developers. Other models keep the number of tokens in a similar range (from 30k to 100k).

At every iteration of this algorithm, a new token that is a concatenation of two previous ones will be added to the dictionary. Ultimately, we will end up with 50256 tokens. Add a fixed-number token for “end-of-text”, and we’re done.
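To make steps 1 through 7 concrete, here is a minimal Python sketch of that training loop, plus the “lowest token number wins” collapsing rule from step 6 applied at encoding time. The function names (train_bpe, collapse, encode) and the tiny corpus are illustrative assumptions, not the article's code; the real GPT-2 dictionary was built over a huge corpus with 50,000 merges.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    # Step 1: start with 256 tokens, one per byte.
    tokens = list(corpus)
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}  # (token_a, token_b) -> new token id

    for _ in range(num_merges):
        # Step 4: find the most frequent adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]

        # Step 5: assign the next available id to that pair.
        new_id = len(vocab)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges[best] = new_id

        # Steps 3 and 6: re-encode the corpus, collapsing every occurrence.
        tokens = collapse(tokens, best, new_id)

    return vocab, merges

def collapse(tokens, pair, new_id):
    # Replace every adjacent occurrence of `pair` with the single token `new_id`.
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(text: bytes, merges: dict) -> list:
    # Step 6 at encoding time: among all pairs that could collapse, always
    # apply the merge with the lowest token number (the earliest-learned one).
    tokens = list(text)
    while True:
        candidates = [(merges[p], p) for p in zip(tokens, tokens[1:]) if p in merges]
        if not candidates:
            return tokens
        new_id, pair = min(candidates)
        tokens = collapse(tokens, pair, new_id)

vocab, merges = train_bpe(b"Posting about PostgreSQL, posting about PostgreSQL again.", 20)
print(encode(b"PostgreSQL", merges))
```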

The GPT2 version of BPE has another layer of encoding: the token dictionary maps tokens to strings, not to arrays of bytes. The mapping from bytes to string characters is defined in this function. We will save the dictionary it produces in the table encoder.
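The function referred to is presumably GPT-2's bytes_to_unicode helper from its published encoder.py. The Python rendering below is a paraphrase from memory, so treat the exact ranges as a sketch rather than a verbatim copy:

```python
# Roughly GPT-2's bytes_to_unicode helper: every one of the 256 byte values
# gets a printable, unambiguous Unicode character, so the token dictionary
# can store plain strings instead of byte arrays.
def bytes_to_unicode():
    # Byte values that are already printable and safe keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Everything else (control bytes, space, etc.) is shifted up past 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # the space byte maps to a visible character
```

The point of the shuffle is that whitespace and control bytes become visible, distinct characters, which is why GPT-2 token strings show a leading “Ġ” where a space used to be.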

Post external references

  1. https://explainextended.com/2023/12/31/happy-new-year-15/