The word position embedding is the context window of GPT

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at:, or follow me on Twitter.

A clear expiation of GPT architecture:

Among other parameters of GPT2, there are two matrixes called WTE (word token embedding) and WPE (word position embedding). As the names suggest, the former stores embeddings of the tokens, and the latter stores embeddings of the positions. The actual values of these embeddings have been populated (“learned”) during the training of GPT2. As far as we are concerned, they are constants that live in the database tables wte and wpe.

WTE is 50257×768 and WPE is 1024×768. The latter means that the maximum number of tokens that we can use in a prompt to GPT2 is 1024. If we provide more tokens in the prompt, we just won’t be able to pull positional embeddings for them. It’s an architectural aspect (“hyperparameter” in AI parlance) of the model that is set at design time and cannot be changed by training. When people talk about the “context window” of an LLM, they mean this number.

Post external references

  1. 1