The orthogonal properties of each word are captured in multidimensional embeddings

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at:, or follow me on Twitter.

I am curious how they settled on 768 as the correct number of dimensions for each word (or rather, token).

The tokens represent parts of the human languages (about 0.75 words per token, in general), so any model that is trying to succeed at text completion should somehow encode the relationships between these parts. Even in isolation, the parts of the speech have sets of orthogonal properties.

Let’s take the word “subpoena” (which happens to have a whole token in itself in the GPT2 tokenizer). Is it a noun? Yes, very much so. Is it a verb? Well, sort of. Is it an adjective? Not that much, but it can be if you squint hard enough. Is it legalese? Hell yes. And so on.

All these properties are orthogonal, i.e. independent of each other. A word can be a legalese noun but not an adjective or a verb. In English, any combination thereof can happen.

Things with orthogonal properties are best encoded using vectors. Instead of having a single property (like a token number), we can have many. And it helps if we can wiggle them as we want. For instance, for a word to continue the phrase “A court decision cited by the lawyer mentions the …” we would probably want something that’s heavy on the legalese dimension and at the same time heavy on being a noun. We don’t really care if it has a side hustle being an adjective, a verb, or a flower.

In math, mapping narrower values into wider spaces (such as token IDs to vectors) is called an embedding. This is exactly what we are doing here.

How do we decide which properties these vectors represent? We don’t. We just provide enough vector space for every token and hope that the model during its training phase will populate these dimensions with something meaningful. GPT2 uses 768 dimensions for its vectors. There is no telling in advance (and, actually, even in the retrospective) what property of the word will, say, the dimension 247 encode. Surely it would encode something, but it’s not easy to tell what it is.

What properties of each token do we want to embed in the vector space? Anything that has any bearing on what the next token would be.

Post external references

  1. 1