from __init__ import install_dependencies, show
await install_dependencies()
import os
from IPython.display import JSON
import transformers as tfm
import torch
Problem Formulation¶
What is a language model?
From the wiki page:
A language model is a probabilistic model of a natural language.
To put it simply, a causal language model completes an input prompt such as
A language model is ...
into a realistic text like the one from the wiki page. More formally:
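One standard formalization (the notation here is assumed for this sketch: $s_1, \dots, s_n$ denotes a sequence of tokens) is that a causal language model assigns probabilities auto-regressively,
$$
p(s_1, \dots, s_n) = \prod_{t=1}^{n} p(s_t \mid s_1, \dots, s_{t-1}),
$$
so completing a prompt amounts to repeatedly sampling the next token from the conditional distribution above.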
The above proof relies on the strict convexity of , i.e.:
YOUR ANSWER HERE
Tokenization¶
Just like we compose a text using words from a vocabulary, a language model also generates a text from a vocabulary consisting of meaningful units called tokens.
The following code creates a tokenizer from the configuration files under model_path using AutoTokenizer.from_pretrained:
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)
show(tokenizer)
The configuration of the tokenizer is specified in JSON format, which is a collection of key/value pairs where the keys are names given as strings:
JSON(filename=os.path.join(model_path, "tokenizer_config.json"))
JSON(filename=os.path.join(model_path, "special_tokens_map.json"))
JSON(filename=os.path.join(model_path, "tokenizer.json"))
To encode and decode a text using the tokenizer:
text = "A language model is a probabilistic model of a natural language."
ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(ids)
assert text == decoded_text
ids
show(tokenizer.encode)
show(tokenizer.decode)
For an efficient implementation of encode and decode, tokens are represented by integers known as token IDs. The mapping from tokens to IDs is provided by the dictionary below:
show(tokenizer.vocab)
To obtain the tokens from IDs, we can use the method convert_ids_to_tokens:
tokens = tokenizer.convert_ids_to_tokens(ids)
tokens
def reverse_dict(d):
# YOUR CODE HERE
raise NotImplementedError
# tests
reversed_vocab = reverse_dict(tokenizer.vocab)
assert tokens == [reversed_vocab[i] for i in ids]
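A minimal sketch of one possible approach (assuming the vocabulary is a one-to-one mapping from tokens to IDs) simply swaps each key/value pair:
# One possible approach (a sketch): swap keys and values, assuming the mapping
# is one-to-one so that no two tokens share the same ID.
def reverse_dict_sketch(d):
    return {v: k for k, v in d.items()}
# Example usage: map a token ID back to its token.
reverse_dict_sketch(tokenizer.vocab)[ids[-1]]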
Note that a token need not be an English word. For instance:
tokens[5], tokens[6], tokens[-1]
The tokens can be punctuation marks such as . and even subwords that carry meaning on their own, such as ▁probabil and istic.
YOUR ANSWER HERE
Generation¶
A language model generates a text one token at a time, just like we speak a text word by word. The model is probabilistic in the sense that each token is generated randomly according to some distribution. The sequence of randomly generated tokens is called a stochastic/random process. If each token is generated based only on some of the previously generated tokens, the process is said to be auto-regressive.
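As a toy illustration (the numbers below are made up and this is not how the model loaded later works internally), the following cell samples a short sequence in which each new token depends only on the one before it, whereas a real language model conditions on all previous tokens in its context:
# Toy auto-regressive process with a made-up next-token distribution.
# A real language model conditions on all previous tokens in the context,
# not just the last one as in this simplified sketch.
vocab_size = 5
torch.manual_seed(0)  # for reproducibility
probs = torch.softmax(torch.randn(vocab_size, vocab_size), dim=-1)
sequence = [0]  # start from token 0
for _ in range(10):
    next_token = torch.multinomial(probs[sequence[-1]], num_samples=1).item()
    sequence.append(next_token)
sequence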
A potential confusion is to think that a token cannot depend on other tokens outside the context.
YOUR ANSWER HERE
If the input is a sequence of tokens shorter (longer) than the context length, it can be left-padded (left-truncated) by the tokenizer:
text = "A language model is a probabilistic model of a natural language."
encoding = tokenizer(text, padding='max_length', truncation=True)
encoding.keys(), len(encoding.input_ids), len(encoding.attention_mask)
show(tokenizer.__call__)
The above call to tokenizer returns a dictionary consisting of two lists, both with the same length as the context length:
tokenizer.model_max_length
input_ids points to the list of token IDs:
show(encoding.input_ids)
Note that input_ids is left-padded with the padding token ID:
tokenizer.pad_token_id, tokenizer.pad_token
Intuitively, the padding tokens should not be used to generate new tokens. To avoid unnecessary computations, the attention mask explicitly gives 0 attention/importance/weight to those special tokens:
show(encoding.attention_mask)
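As a quick sanity check (a sketch using the encoding computed above), the number of positions with 0 attention should match the number of padding tokens in input_ids:
# Count padding tokens and zero-attention positions (they should match here).
num_pad = sum(1 for i in encoding.input_ids if i == tokenizer.pad_token_id)
num_masked = sum(1 for a in encoding.attention_mask if a == 0)
num_pad, num_masked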
There are also other special tokens that should normally be masked off:
tokenizer.special_tokens_map_extended
To load a language model:
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
model_path,
quantization_config=bnb_config,
low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)
A language model is a type of neural network consisting of layers of computational units called neurons. To generate text quickly, the above code attempts to utilize a Graphics Processing Unit (GPU) whenever available. It further quantizes the model to a lower precision, specifically 8-bit instead of the original 16-bit, to reduce the memory footprint.
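The reduction in memory can be checked directly; transformers provides get_memory_footprint for this (the exact number depends on the model and the quantization settings):
# Approximate memory footprint of the quantized model, in gigabytes.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")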
Finally, to generate the text, run the following cell:
# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
encoding = encoding.to("cuda")
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_length=100)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
Note that repeatedly running the above code will generate the same text. This is because, instead of sampling from the distribution in (12), it makes a hard decision:
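In other words, model.generate performs greedy decoding by default: at each step it outputs the most likely next token. With notation assumed for this sketch (it may differ from the equation referenced above), where $\mathcal{V}$ denotes the vocabulary,
$$
\hat{s}_t = \arg\max_{v \in \mathcal{V}} \, p\left(v \mid \hat{s}_1, \dots, \hat{s}_{t-1}\right).
$$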
To perform the sampling, we can pass the keyword argument do_sample=True to model.generate as follows:
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_length=100, do_sample=True)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
Verify that the code generates the tokens randomly by running it repeatedly.
The generated text might have been cut off in the middle of a sentence. Although you can increase max_length to a sufficiently large value to ensure that generation terminates with an end-of-sequence token (eos_token), this can result in excessively long outputs. Fortunately, there are other stopping criteria implemented that can help control the length and content of the generated text more effectively.
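For example, one way to build a custom criterion (a minimal sketch; the class name and stop string below are illustrative choices, not the required solution) is to subclass StoppingCriteria and wrap it in a StoppingCriteriaList:
# Sketch of a custom stopping criterion: stop as soon as a chosen string
# (here a newline, as an example) appears in the newly generated text.
class StopOnString(tfm.StoppingCriteria):
    def __init__(self, tokenizer, prompt_length, stop_string="\n"):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip
        self.stop_string = stop_string

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the newly generated part and look for the stop string.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return self.stop_string in new_text

example_criteria = tfm.StoppingCriteriaList(
    [StopOnString(tokenizer, encoding.input_ids.shape[1])]
)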
# Assign the desired stopping criteria to `stopping_criteria`.
# YOUR CODE HERE
raise NotImplementedError
stopping_criteria
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding,
max_length=2000, # Make this big as the default is 20
stopping_criteria=stopping_criteria
)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat.strip())
# test
assert "A language model is" in shat
assert "\n" not in shat.strip()
Chat Completion¶
A language model can also be trained to complete a chat, following the Chat Completion API popularized by ChatGPT. A chat can be represented as a list of chat messages:
chat = [
{"role": "system", "content": "You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background."},
{"role": "user", "content": "What is a language model?"}
]
Each message is associated with a role:
- The system message sets the behavior for the AI assistant.
- The user message represents the user's query.
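For illustration (the content below is made up), a longer chat would also include assistant messages that hold the model's earlier replies:
# Hypothetical multi-turn chat: "assistant" messages hold the model's earlier replies.
longer_chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A token is a small unit of text, such as a word or a subword."},
    {"role": "user", "content": "And what is a vocabulary?"},
]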
The tokenizer can be used to convert the list of messages into a single text for the language model to complete in the same way as before:
# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("Formatted chat:\n", formatted_chat)
Note that <|system|>, <|user|>, <|assistant|>, and <|end|> are special tokens used to mark the different chat messages. The chat template can be printed as follows:
chat_template = tokenizer.get_chat_template()
print("Chat template:\n", chat_template)
This is a Jinja template, which uses Python-like syntax such as loops and conditionals to render the text from an input list of message dictionaries named messages.
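To get a feel for how such a template works, the following toy template (not the tokenizer's actual template; it assumes the jinja2 package is available) loops over the messages and wraps each one in role markers:
# Toy Jinja template (not the tokenizer's actual chat template).
import jinja2
toy_template = jinja2.Template(
    "{% for m in messages %}<|{{ m.role }}|> {{ m.content }}<|end|>\n{% endfor %}<|assistant|>"
)
print(toy_template.render(messages=chat))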
We can now call the language model to complete the text as before:
# Tokenize input text and generate output
u = formatted_chat
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
encoding = encoding.to("cuda")
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_length=200)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
def decode_chat_messages(ids):
roles = {32006: "system", 32010: "user", 32001: "assistant"}
output = []
# YOUR CODE HERE
raise NotImplementedError
return output
# tests
generated_text = """
<|system|> You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background.<|end|>
<|user|> What is a language model?<|end|>
<|assistant|> A language model is a type of artificial intelligence (AI) system that is designed to understand, interpret, and generate human language. It is a mathematical representation of how words and phrases are likely to occur in a given language. Language models are used in various applications, such as speech recognition, machine translation, text generation, and natural language processing (NLP).
"""
assert decode_chat_messages(tokenizer.encode(generated_text)) == [
{
"role": "system",
"content": "You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background.",
},
{"role": "user", "content": "What is a language model?"},
{
"role": "assistant",
"content": "A language model is a type of artificial intelligence (AI) system that is designed to understand, interpret, and generate human language. It is a mathematical representation of how words and phrases are likely to occur in a given language. Language models are used in various applications, such as speech recognition, machine translation, text generation, and natural language processing (NLP).\n",
},
]
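One possible approach is sketched below (a sketch only: the <|end|> token ID is looked up from the tokenizer rather than hardcoded, and whitespace handling may need adjustment to match the tests above exactly):
# Sketch of one possible approach: open a new message at each role-marker token
# and close it at the <|end|> token.
def decode_chat_messages_sketch(ids):
    roles = {32006: "system", 32010: "user", 32001: "assistant"}
    end_id = tokenizer.convert_tokens_to_ids("<|end|>")
    output = []
    role, content_ids = None, []
    for i in ids:
        if i in roles or i == end_id:
            if role is not None:  # close the current message
                output.append({"role": role, "content": tokenizer.decode(content_ids)})
            role = roles.get(i)   # None if the marker was <|end|>
            content_ids = []
        elif role is not None:    # token belongs to the current message
            content_ids.append(i)
    if role is not None:          # a final message without a closing <|end|>
        output.append({"role": role, "content": tokenizer.decode(content_ids)})
    return output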