from __init__ import install_dependencies, show

await install_dependencies()
import os
from IPython.display import JSON
import ipywidgets as widgets
from collections.abc import Iterable
import transformers as tfm
import torch

The success of large language models (LLMs) can be attributed to

  1. the advancement of computing devices, such as Graphics Processing Units (GPUs),
  2. the availability of a large corpus of data, and
  3. the development of deep learning architectures and techniques

that make it computationally feasible to train sophisticated language models on large amounts of data. In this notebook, we will introduce the basic architecture of LLMs, which can be trained to capture important information for generating text.

Neural Network

Let’s visualize the training process! Click the play button (▶) below to train a neural network that predicts the color of a point $(X_1, X_2)$:

Figure 1: A single neuron for a linearly separable dataset.

The output of the neural network is plotted as a heatmap:

  • The blue region is the decision region for classifying a point as positive/blue.
  • The orange region is the decision region for classifying a point as negative/orange.
  • The white line is a decision boundary separating the decision regions.

The process of computing the output of a neural network is called the forward propagation:
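As a minimal illustration (not the playground’s actual implementation), the forward propagation of a single neuron with a sigmoid activation can be computed as follows; the weights w and bias b below are made-up values:

import torch

# Made-up weights and bias of a single neuron (for illustration only)
w = torch.tensor([1.5, -2.0])
b = torch.tensor(0.5)

def forward(x):
    """Forward propagation: weighted sum of the inputs followed by a sigmoid activation."""
    z = x @ w + b  # pre-activation
    return torch.sigmoid(z)  # output in (0, 1), interpreted as the chance of being blue

# Predict the color of the point (X1, X2) = (1.0, 0.2)
forward(torch.tensor([1.0, 0.2]))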

For accurate prediction, the neural network is trained on examples, called the training set, to minimize an objective function called the loss function:

The optimization is done step by step using a numerical method called gradient descent:
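The following is a minimal sketch of this process, assuming the binary cross-entropy loss and a made-up training set of two points; each iteration of the loop performs one gradient-descent step:

import torch

# Made-up training set: two points labeled 1 (blue) and 0 (orange)
X = torch.tensor([[1.0, 0.2], [-0.5, -1.0]])
y = torch.tensor([1.0, 0.0])

# Neuron parameters, tracked for automatic differentiation
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1  # learning rate, i.e., the step size of gradient descent

for step in range(100):
    y_hat = torch.sigmoid(X @ w + b)  # forward propagation
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y)  # loss function
    loss.backward()  # compute the gradients of the loss w.r.t. the parameters
    with torch.no_grad():  # move the parameters against their gradients
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
print(loss.item())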

Training speed and stability depend on the choice of the activation function. Another common non-linear activation function is ReLU (Rectified Linear Unit), defined as

$$z\mapsto \max(0, z).$$

ReLU makes training faster as its slope does not diminish for large positive inputs. Sigmoid makes training more stable as it is smooth. SiLU (Sigmoid Linear Unit) is another activation function that combines the benefits of both. It can be defined as

$$z \mapsto \frac{z}{1+e^{-z}}.$$
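To compare the activation functions numerically, we can evaluate them on a few sample points; SiLU is available in PyTorch as torch.nn.functional.silu:

import torch

z = torch.linspace(-4, 4, steps=5)
print("z:      ", z)
print("ReLU:   ", torch.relu(z))  # 0 for negative inputs, identity otherwise
print("sigmoid:", torch.sigmoid(z))  # smooth but saturates at both tails
print("SiLU:   ", torch.nn.functional.silu(z))  # z * sigmoid(z), smooth and non-saturating for large z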

To classify more complicated data, a neural network can have more neurons arranged in modules:

Figure 2: A more complex neural network and dataset.

YOUR ANSWER HERE

Transformer

Recall the following code which

  1. creates a tokenizer from the configuration files under model_path using AutoTokenizer.from_pretrained, and
  2. loads the language model, using the GPU whenever possible, and quantizes it to 8 bits per parameter to reduce the memory footprint.
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)

# Load the model
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
    model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)

The model is composed of interconnected modules.

Model loaded on device: cuda:0
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear8bitLt(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear8bitLt(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear8bitLt(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

The source code can be found here. To inspect the input and output of each module, we can register a function, called a hook, to be executed whenever a module computes an output:

# list of handles of registered hook
try:
    handles  # avoid overwriting if defined
except NameError:
    handles = []  # list of handles to the hook

# A hook to run every time after the forward method has computed an output
def hook(module, args, output):
    modules.append({"module": module, "args": args, "output": output})

# Register the forward hook to each module
modules = [*model.modules()]
for module in modules:
    handles.append(module.register_forward_hook(hook))

To generate a new token and record the outputs of the modules:

# Initialize the list that stores the modules and their inputs and outputs
modules = []

# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
    encoding = encoding.to("cuda")

# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, max_new_tokens=1)

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)

To display the modules:

@widgets.interact(i=widgets.Dropdown(
    options=[(str(i)+': '+repr(modules[i]['module']),i) for i in range(len(modules))],
    value=2,
    description='Module:',
    layout=widgets.Layout(width="90%")    
))
def show_module(i):
    def show_values(seq):
        for v in seq:
            print(' '*2 + str(type(v)) + (isinstance(v, torch.Tensor) and f' with shape: {v.shape}' or ''))
            show(v)
    print("Module:")
    show(modules[i]['module'])
    print("Input(s):")
    show_values(modules[i]['args'])
    print("Output(s):")
    outputs = modules[i]['output']
    if not isinstance(outputs, Iterable):
        outputs = [outputs]
    show_values(outputs)

To clean up the hooks so they do not get executed again:

# Clean up: Remove the added hook
for handle in handles:
    handle.remove()
handles = []

Logits in Final Layer

Recall that an auto-regressive language model uses the conditional distribution $p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t})$ to generate the new token $\R{x}_{n+t}$ from an existing sequence $x_{t:n+t}$ of tokens.

The last registered output of the model contains the logits

\begin{align}
l := \left[\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t}) + c\right]_{x_{n+t}\in \mc{X}},
\end{align}

which are the log-probabilities $\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(\cdot|x_{t:n+t})$ shifted by some constant $c\in \mathbb{R}$.

logits = modules[-1]["output"].logits
show(logits.cpu().numpy().tolist())

The logits are the output of the last layer of the neural network:

show_module(-2)

By default, the new token to generate is obtained by hardening the logits directly, i.e., picking the token with the largest logit, as follows. There is no need to compute the probabilities first.

next_token_id = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(next_token_id)[0])
next_token_id, tokenizer.convert_ids_to_tokens(next_token_id)
Solution to Exercise 2

This is because $\log(\cdot) + c$ is strictly increasing.
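As a quick numerical check with made-up logits, applying a strictly increasing map (adding a constant and then exponentiating) does not change which entry is the largest:

import torch

# Made-up logits for a toy vocabulary of 4 tokens
l = torch.tensor([2.0, -1.0, 0.5, 1.5])

# A strictly increasing map does not change the location of the maximum
assert torch.argmax(l) == torch.argmax(torch.exp(l + 3.0))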

# YOUR CODE HERE
raise NotImplementedError
p
show(p.cpu().numpy().tolist())
# tests
# check whether p is stochastic.
assert ((0 <= p) & (p <= 1)).all() and (p.sum(dim=-1) == 1).all()
assert (torch.argmax(p, dim=-1) == torch.argmax(logits, dim=-1)).all()

Embedding in First Layer

The first layer of the model is called the embedding layer:

show_module(0)

The layer uses a pretrained embedding function to embed each token into a high-dimensional vector space:

$g$ is the embedding function that embeds a token into the $d$-dimensional vector space. It is pretrained such that the distances between the embeddings of different tokens capture the differences in the meanings of the tokens. For instance, consider the embeddings of ‘dog’, ‘cat’, and ‘car’:

token_ids = torch.tensor(tokenizer.encode('dog cat car')).to(model.device)  # move to the model's device

with torch.no_grad():
    embeddings = modules[0]['module'].forward(token_ids)

embeddings

‘dog’ is more similar to ‘cat’ than to ‘car’ according to the cosine similarity:

\begin{align}
(a,b)\in \mathbb{R}^{d}\times \mathbb{R}^{d} \mapsto \frac{ab^{\intercal}}{\norm{a}\norm{b}}
&:= \frac{\sum_{i\in [d]} a_i b_i}{\sqrt{\sum_{i\in [d]} a_i^2} \sqrt{\sum_{i\in [d]} b_i^2}}\\
&= \cos \theta_{ab}
\end{align}

where $\theta_{ab}$ is the angle between the vectors $a$ and $b$.

# dog vs cat
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
# dog vs car
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[2], dim=-1)

Note that cosine_similarity broadcasts over its inputs, so we can compute all the similarities in one go:

torch.nn.functional.cosine_similarity(embeddings[0], embeddings, dim=-1)
# YOUR CODE HERE
raise NotImplementedError
similarity_matrix
# tests
assert (
    (
        similarity_matrix.cpu() ** 2
        - torch.tensor(
            [[1.0000, 0.0533, 0.0157], [0.0533, 1.0000, 0.0177], [0.0157, 0.0177, 1.0000]]
        )
    ).abs()
    < 1e-4
).all()

Attention Mechanism

An important component of the transformer architecture is the attention mechanism proposed by Vaswani et al. 2017. It can be trained and computed efficiently, and it allows the model to focus on the most relevant parts of a context that can be much longer than what traditional sequential architectures such as RNNs and LSTMs can handle.

For the current model, there is a stack of 32 decoder modules, each of which implements the attention mechanism.

show_module(5)

The attention function is defined as follows and the source code of the attention module can be found here.

The query $Q$, key $K$, and value $V$ are derived from the input embeddings $X$, as in $\mu_{W}(X, X, X)$. Specifically, $Q$ and $K$ are used to calculate the attention scores, while $V$ holds the actual values to be attended to based on these scores.
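The following is a minimal single-head sketch of the scaled dot-product attention described above, with made-up projection matrices W_q, W_k, and W_v standing in for the learned weights, and no causal mask:

import torch

d, d_k = 8, 4  # embedding and head dimensions (made-up for illustration)
n = 5  # context length
X = torch.randn(n, d)  # made-up input embeddings

# Made-up projection matrices playing the role of the learned weights
W_q, W_k, W_v = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # queries, keys, and values derived from the same X
scores = Q @ K.T / d_k**0.5  # scaled dot-product attention scores
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V  # weighted combination of the values
print(output.shape)  # (n, d_k)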

Note that the different rows of $Q$ go through the same linear transformations, i.e., the position information is immaterial in the calculation of the attention score. In order to weight the importance of a token in the context based on its position relative to the new token to be generated, an additional positional encoding is needed.

show_module(3)

An example is the rotary positional encoding (not the one used above) proposed by Su et al. 2021, defined as follows:

$Z$ or its linear transformations are passed to the attention function as in $\mu_{W}(Z, Z, Z)$.
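The following is a minimal sketch of this idea, assuming an even embedding dimension; it follows the interleaved-pair formulation of the paper rather than the exact implementation used by the model above:

import torch

def rotary_encode(x, base=10000.0):
    """Rotate each consecutive pair of dimensions of x by a position-dependent angle.

    x: tensor of shape (seq_len, d) with d even.
    """
    seq_len, d = x.shape
    # One rotation frequency per pair of dimensions, decreasing geometrically
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]  # the two coordinates of each pair
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation applied to each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

Z = rotary_encode(torch.randn(5, 8))  # made-up embeddings for a context of 5 tokens
print(Z.shape)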

References
  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. 10.48550/ARXIV.1706.03762
  2. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv. 10.48550/ARXIV.2104.09864