from __init__ import install_dependencies, show

await install_dependencies()
import os
from IPython.display import JSON
import ipywidgets as widgets
from collections.abc import Iterable
import transformers as tfm
import torch

The success of large language models (LLMs) can be attributed to

  1. the advancement of computing devices, such as Graphics Processing Units (GPUs),
  2. the availability of a large corpus of data, and
  3. the development of deep learning architectures and techniques

that make it computationally feasible to train sophisticated language models on large amounts of data. In this notebook, we will introduce the basic architecture of LLMs, which can be trained to capture important information for generating text.

Neural Network

Let’s visualize the training process! Click the play button (▶) below to train a neural network that predicts the color of a point $(X_1, X_2)$:

Figure 1: A single neuron for a linearly separable dataset.

The output of the neural network is plotted as a heatmap:

  • The blue region is the decision region for classifying a point as positive/blue.
  • The orange region is the decision region for classifying a point as negative/orange.
  • The white line is a decision boundary separating the decision regions.

The process of computing the output of a neural network is called the forward propagation:
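As a minimal illustration (not the playground’s actual implementation), the forward propagation of a single neuron with a sigmoid activation can be computed as follows; the weights w and bias b below are made-up values:

import torch

# Made-up weights and bias of a single neuron (for illustration only)
w = torch.tensor([1.5, -2.0])
b = torch.tensor(0.5)

def forward(x):
    """Forward propagation: weighted sum of the inputs followed by a sigmoid activation."""
    z = x @ w + b  # pre-activation
    return torch.sigmoid(z)  # output in (0, 1), interpreted as the chance of being blue

# Predict the color of the point (X1, X2) = (1.0, 0.2)
forward(torch.tensor([1.0, 0.2]))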

For accurate prediction, the neural network is trained on examples, called the training set, to minimize an objective function called the loss function:

The optimization is done step by step using a numerical method called gradient descent:
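The following is a minimal sketch of this process, assuming the binary cross-entropy loss and a made-up training set of two points; each iteration of the loop performs one gradient-descent step:

import torch

# Made-up training set: two points labeled 1 (blue) and 0 (orange)
X = torch.tensor([[1.0, 0.2], [-0.5, -1.0]])
y = torch.tensor([1.0, 0.0])

# Neuron parameters, tracked for automatic differentiation
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1  # learning rate, i.e., the step size of gradient descent

for step in range(100):
    y_hat = torch.sigmoid(X @ w + b)  # forward propagation
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y)  # loss function
    loss.backward()  # compute the gradients of the loss w.r.t. the parameters
    with torch.no_grad():  # move the parameters against their gradients
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
print(loss.item())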

Training speed and stability depend on the choice of the activation function. Another common non-linear activation function is ReLU (Rectified Linear Unit), defined as

$$z\mapsto \max(0, z).$$

ReLU makes training faster as its slope does not diminish for large positive inputs. Sigmoid makes training more stable as it is smooth. SiLU (Sigmoid Linear Unit) is another activation function that combines the benefits of both. It can be defined as

$$z \mapsto \frac{z}{1+e^{-z}}.$$
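To compare the activation functions numerically, we can evaluate them on a few sample points; SiLU is available in PyTorch as torch.nn.functional.silu:

import torch

z = torch.linspace(-4, 4, steps=5)
print("z:      ", z)
print("ReLU:   ", torch.relu(z))  # 0 for negative inputs, identity otherwise
print("sigmoid:", torch.sigmoid(z))  # smooth but saturates at both tails
print("SiLU:   ", torch.nn.functional.silu(z))  # z * sigmoid(z), smooth and non-saturating for large z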

To classify more complicated data, a neural network can have more neurons arranged in modules:

Figure 2: A more complex neural network and dataset.

YOUR ANSWER HERE

Transformer

Recall the following code which

  1. creates a tokenizer from the configuration files under model_path using AutoTokenizer.from_pretrained, and
  2. loads the language model, using the GPU whenever possible, and quantizes it to 8 bits per parameter to reduce the memory footprint.
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)

# Load the model
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
    model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)

The model is composed of interconnected modules.

Model loaded on device: cuda:0
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear8bitLt(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear8bitLt(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear8bitLt(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)

The source code can be found here. To inspect the input and output of each module, we can register a function, called a hook, to be executed whenever a module computes an output:

# list of handles of registered hook
try:
    handles  # avoid overwriting if defined
except NameError:
    handles = []  # list of handles to the hook

# A hook to run every time after the forward method has computed an output
def hook(module, args, output):
    modules.append({"module": module, "args": args, "output": output})

# Register the forward hook to each module
modules = [*model.modules()]
for module in modules:
    handles.append(module.register_forward_hook(hook))

To generate a new token and record the outputs of the modules:

# Initialize the list that stores the modules and their inputs and outputs
modules = []

# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
    encoding = encoding.to("cuda")

# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, max_new_tokens=1)

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)

To display the modules:

@widgets.interact(i=widgets.Dropdown(
    options=[(str(i)+': '+repr(modules[i]['module']),i) for i in range(len(modules))],
    value=2,
    description='Module:',
    layout=widgets.Layout(width="90%")    
))
def show_module(i):
    def show_values(seq):
        for v in seq:
            print(' '*2 + str(type(v)) + (isinstance(v, torch.Tensor) and f' with shape: {v.shape}' or ''))
            show(v)
    print("Module:")
    show(modules[i]['module'])
    print("Input(s):")
    show_values(modules[i]['args'])
    print("Output(s):")
    outputs = modules[i]['output']
    if not isinstance(outputs, Iterable):
        outputs = [outputs]
    show_values(outputs)

To clean up the hooks so they do not get executed again:

# Clean up: Remove the added hook
for handle in handles:
    handle.remove()
handles = []

Logits in Final Layer

Recall that an auto-regressive language model uses the conditional distribution $p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t})$ to generate the new token $\R{x}_{n+t}$ from an existing sequence $x_{t:n+t}$ of tokens.

The last registered output of the model contains the logits

\begin{align}
l := \left[\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t}) + c\right]_{x_{n+t}\in \mc{X}},
\end{align}

which are the log-probabilities $\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(\cdot|x_{t:n+t})$ shifted by some constant $c\in \mathbb{R}$.

logits = modules[-1]["output"].logits
show(logits.cpu().numpy().tolist())

The logits are the output of the last layer of the neural network:

show_module(-2)

By default, the new token to generate is obtained by hardening the logits directly, i.e., picking the token with the largest logit, as follows. There is no need to compute the probabilities first.

next_token_id = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(next_token_id)[0])
next_token_id, tokenizer.convert_ids_to_tokens(next_token_id)
Solution to Exercise 2

This is because $\log(\cdot) + c$ is strictly increasing.
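As a quick numerical check with made-up logits, applying a strictly increasing map (adding a constant and then exponentiating) does not change which entry is the largest:

import torch

# Made-up logits for a toy vocabulary of 4 tokens
l = torch.tensor([2.0, -1.0, 0.5, 1.5])

# A strictly increasing map does not change the location of the maximum
assert torch.argmax(l) == torch.argmax(torch.exp(l + 3.0))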

# YOUR CODE HERE
raise NotImplementedError
p
show(p.cpu().numpy().tolist())
# tests
# check whether p is stochastic.
assert ((0 <= p) & (p <= 1)).all() and (p.sum(dim=-1) == 1).all()
assert (torch.argmax(p, dim=-1) == torch.argmax(logits, dim=-1)).all()

Embedding in First Layer

The first layer of the model is called the embedding layer:

show_module(0)

The layer uses a pretrained embedding function to embed each token into a high-dimensional vector space:

$g$ is the embedding function that embeds a token into the $d$-dimensional vector space. It is pretrained such that the distances between the embeddings of different tokens capture the differences in the meanings of the tokens. For instance, consider the embeddings of ‘dog’, ‘cat’, and ‘car’:

token_ids = torch.tensor(tokenizer.encode('dog cat car')).to(model.device)  # move to the model's device

with torch.no_grad():
    embeddings = modules[0]['module'].forward(token_ids)

embeddings

‘dog’ is more similar to ‘cat’ than to ‘car’ according to the cosine similarity:

\begin{align}
(a,b)\in \mathbb{R}^{d}\times \mathbb{R}^{d} \mapsto \frac{ab^{\intercal}}{\norm{a}\norm{b}}
&:= \frac{\sum_{i\in [d]} a_i b_i}{\sqrt{\sum_{i\in [d]} a_i^2} \sqrt{\sum_{i\in [d]} b_i^2}}\\
&= \cos \theta_{ab}
\end{align}

where $\theta_{ab}$ is the angle between the vectors $a$ and $b$.

# dog vs cat
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
# dog vs car
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[2], dim=-1)

Note that cosine_similarity broadcasts over its inputs, so we can compute all the similarities in one go:

torch.nn.functional.cosine_similarity(embeddings[0], embeddings, dim=-1)
# YOUR CODE HERE
raise NotImplementedError
similarity_matrix
# tests
assert (
    (
        similarity_matrix.cpu() ** 2
        - torch.tensor(
            [[1.0000, 0.0533, 0.0157], [0.0533, 1.0000, 0.0177], [0.0157, 0.0177, 1.0000]]
        )
    ).abs()
    < 1e-4
).all()

Attention Mechanism

An important component of the transformer architecture is the attention mechanism proposed by Vaswani et al. 2017. It can be trained and computed efficiently, and it allows the model to focus on the most relevant parts of a context that can be much longer than what traditional sequential architectures such as RNNs and LSTMs can handle.

For the current model, there is a stack of 32 decoder modules, each of which implements the attention mechanism.

show_module(5)

The attention function is defined as follows and the source code of the attention module can be found here.

The query $Q$, key $K$, and value $V$ are derived from the input embeddings $X$, as in $\mu_{W}(X, X, X)$. Specifically, $Q$ and $K$ are used to calculate the attention scores, while $V$ holds the actual values to be attended to based on these scores.
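The following is a minimal single-head sketch of the scaled dot-product attention described above, with made-up projection matrices W_q, W_k, and W_v standing in for the learned weights, and no causal mask:

import torch

d, d_k = 8, 4  # embedding and head dimensions (made-up for illustration)
n = 5  # context length
X = torch.randn(n, d)  # made-up input embeddings

# Made-up projection matrices playing the role of the learned weights
W_q, W_k, W_v = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # queries, keys, and values derived from the same X
scores = Q @ K.T / d_k**0.5  # scaled dot-product attention scores
weights = torch.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V  # weighted combination of the values
print(output.shape)  # (n, d_k)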

Note that the different rows of $Q$ go through the same linear transformations, i.e., the position information is immaterial in the calculation of the attention score. In order to weight the importance of a token in the context based on its position relative to the new token to be generated, an additional positional encoding is needed.

show_module(3)

An example is the rotary positional encoding (not the one used above) proposed by Su et al. 2021, defined as follows:

$Z$ or its linear transformations are passed to the attention function as in $\mu_{W}(Z, Z, Z)$.
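The following is a minimal sketch of this idea, assuming an even embedding dimension; it follows the interleaved-pair formulation of the paper rather than the exact implementation used by the model above:

import torch

def rotary_encode(x, base=10000.0):
    """Rotate each consecutive pair of dimensions of x by a position-dependent angle.

    x: tensor of shape (seq_len, d) with d even.
    """
    seq_len, d = x.shape
    # One rotation frequency per pair of dimensions, decreasing geometrically
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]  # the two coordinates of each pair
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation applied to each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

Z = rotary_encode(torch.randn(5, 8))  # made-up embeddings for a context of 5 tokens
print(Z.shape)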

References
  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. 10.48550/ARXIV.1706.03762
  2. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv. 10.48550/ARXIV.2104.09864