from __init__ import install_dependencies, show
await install_dependencies()
import os
from IPython.display import JSON
import ipywidgets as widgets
from collections.abc import Iterable
import transformers as tfm
import torch
The success of large language models (LLMs) can be attributed to
- the advancement of computing devices, such as Graphics Processing Units (GPUs),
- the availability of a large corpus of data, and
- the development of deep learning architectures and techniques
that make it computationally feasible to train sophisticated language models on a large amount of data. In this notebook, we will introduce the basic architecture of LLMs, which can be trained to capture important information for generating text.
Neural Network¶
Let’s visualize the training process! Click the play button (▶) below to train a neural network that predicts the color of a point:
Figure 1: A single neuron for a linearly separable dataset. (open in new tab)
The output of the neural network is plotted as a heatmap:
- The blue region is the decision region for classifying a point to be positive/blue.
- The orange region is the decision region for classifying a point to be negative/orange.
- The white line is a decision boundary separating the decision regions.
The process of computing the output of a neural network is called the forward propagation. For the single neuron above with input $(x_1, x_2)$, the output is
$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b), \qquad \sigma(z) := \frac{1}{1 + e^{-z}},$$
where $w_1, w_2$ are the weights, $b$ is the bias, and $\sigma$ is the sigmoid activation function.
For accurate prediction, the neural network is trained on examples, called the training set, to minimize an objective function called the loss function, such as the cross-entropy loss
$$L := -\big[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big]$$
averaged over the training set, where $y$ is the true label. The optimization is done step-by-step using a numerical method called the gradient descent:
$$w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}, \qquad b \leftarrow b - \eta \frac{\partial L}{\partial b},$$
where $\eta > 0$ is the learning rate controlling the step size.
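The interactive figure hides the details, but the whole procedure fits in a few lines of code. Below is a minimal sketch in torch (the dataset, learning rate, and number of steps are arbitrary choices for illustration, not those of the widget) of forward propagation, the cross-entropy loss, and gradient descent for a single neuron:
import torch

torch.manual_seed(0)
X = torch.randn(100, 2)                 # 100 random 2D points
y = (X[:, 0] + X[:, 1] > 0).float()     # linearly separable labels
w = torch.zeros(2, requires_grad=True)  # weights
b = torch.zeros(1, requires_grad=True)  # bias
lr = 0.1                                # learning rate

for step in range(100):
    y_hat = torch.sigmoid(X @ w + b)                            # forward propagation
    loss = torch.nn.functional.binary_cross_entropy(y_hat, y)   # loss function
    loss.backward()                                             # compute gradients
    with torch.no_grad():                                       # gradient descent step
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(f"final loss: {loss.item():.4f}")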
Training speed and stability depend on the choice of the activation function. Another common non-linear activation function is ReLU (Rectified Linear Unit), defined as
$$\operatorname{ReLU}(z) := \max(0, z).$$
ReLU makes training faster as the slope does not diminish at the tails of the function. Sigmoid makes training more stable as it is smooth. SiLU (Sigmoid Linear Unit) is another activation function that combines the benefits of both. It can be defined as
$$\operatorname{SiLU}(z) := z\,\sigma(z),$$
where $\sigma$ is the sigmoid function.
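To compare these activations numerically, we can evaluate them on a few sample inputs (a quick check using the standard torch implementations):
import torch

z = torch.linspace(-3, 3, 7)
print("z      :", z)
print("sigmoid:", torch.sigmoid(z))             # smooth, but saturates at the tails
print("ReLU   :", torch.relu(z))                # max(0, z): constant slope for z > 0
print("SiLU   :", torch.nn.functional.silu(z))  # z * sigmoid(z): smooth, non-saturating for z > 0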
To classify more complicated data, a neural network can have more neurons arranged in modules:
Figure 2: A more complex neural network and dataset. (open in new tab)
YOUR ANSWER HERE
What is a neural network?
Figure 3: What is a neural network? (open in new tab)
Transformer¶
Recall the following code, which
- creates a tokenizer from the configuration files under model_path using AutoTokenizer.from_pretrained, and
- loads the language model, using the GPU whenever possible, and quantizes it to 8 bits per parameter to reduce the memory footprint.
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)
# Load the model
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
model_path,
quantization_config=bnb_config,
low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)
The model is composed of interconnected modules.
Model loaded on device: cuda:0
Phi3ForCausalLM(
(model): Phi3Model(
(embed_tokens): Embedding(32064, 3072, padding_idx=32000)
(embed_dropout): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0-31): 32 x Phi3DecoderLayer(
(self_attn): Phi3SdpaAttention(
(o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
(qkv_proj): Linear8bitLt(in_features=3072, out_features=9216, bias=False)
(rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
)
(mlp): Phi3MLP(
(gate_up_proj): Linear8bitLt(in_features=3072, out_features=16384, bias=False)
(down_proj): Linear8bitLt(in_features=8192, out_features=3072, bias=False)
(activation_fn): SiLU()
)
(input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
(resid_attn_dropout): Dropout(p=0.0, inplace=False)
(resid_mlp_dropout): Dropout(p=0.0, inplace=False)
(post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
)
)
(norm): Phi3RMSNorm((3072,), eps=1e-05)
)
(lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
The source code can be found here. To inspect the input and output of each module, we can register a function, called a hook, to be executed whenever a module computes an output:
# list of handles of registered hook
try:
handles # avoid overwriting if defined
except NameError:
handles = [] # list of handles to the hook
# A hook to run every time after the forward method has computed an output
def hook(module, args, output):
modules.append({"module": module, "args": args, "output": output})
# Register the forward hook to each module
modules = [*model.modules()]
for module in modules:
handles.append(module.register_forward_hook(hook))
To generate a new token and record the outputs of the modules:
# Initialize the list that stores the modules and their inputs and outputs
modules = []
# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
encoding = encoding.to("cuda")
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_new_tokens=1)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
To display the modules:
@widgets.interact(i=widgets.Dropdown(
options=[(str(i)+': '+repr(modules[i]['module']),i) for i in range(len(modules))],
value=2,
description='Module:',
layout=widgets.Layout(width="90%")
))
def show_module(i):
def show_values(seq):
for v in seq:
print(' '*2 + str(type(v)) + (isinstance(v, torch.Tensor) and f' with shape: {v.shape}' or ''))
show(v)
print("Module:")
show(modules[i]['module'])
print("Input(s):")
show_values(modules[i]['args'])
print("Output(s):")
outputs = modules[i]['output']
if not isinstance(outputs, Iterable):
outputs = [outputs]
show_values(outputs)
To clean up the hooks so they do not get executed again:
# Clean up: Remove the added hook
for handle in handles:
handle.remove()
handles = []
Logits in Final Layer¶
Recall that an auto-regressive language model uses the conditional distribution $p(x_{n+1} \mid x_1, \dots, x_n)$ to generate a new token $x_{n+1}$ from an existing sequence of tokens $x_1, \dots, x_n$.
The last registered output of the model contains the logits, which are the log likelihood probabilities shifted by some constant $c$:
$$\text{logit}_i = \log p(x_{n+1} = i \mid x_1, \dots, x_n) + c.$$
logits = modules[-1]["output"].logits
show(logits.cpu().numpy().tolist())
The logits are the output of the last layer of the neural network:
show_module(-2)
By default, the new token to generate is obtained by hardening the logits directly, i.e., by taking the argmax, as follows. There is no need to compute the likelihood probabilities first.
next_token_id = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(next_token_id)[0])
next_token_id, tokenizer.convert_ids_to_tokens(next_token_id)
Solution to Exercise 2
This is because the exponential function is strictly increasing, so the softmax does not change the ordering of the logits.
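In symbols, writing the probabilities $p_i$ as the softmax of the logits $z_i$,
$$\arg\max_i p_i = \arg\max_i \frac{e^{z_i}}{\sum_j e^{z_j}} = \arg\max_i e^{z_i} = \arg\max_i z_i,$$
since the denominator is the same positive constant for every $i$.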
# YOUR CODE HERE
raise NotImplementedError
p
show(p.cpu().numpy().tolist())
# tests
# check whether p is stochastic.
assert ((0 <= p) & (p <= 1)).all() and (p.sum(dim=-1) == 1).all()
assert (torch.argmax(p, dim=-1) == torch.argmax(logits, dim=-1)).all()
Embedding in First Layer¶
The first layer of the model is called the embedding layer:
show_module(0)
The layer uses a pretrained embedding function to embed each token into a high-dimensional vector space:
The embedding function maps a token to a vector in the $d$-dimensional space $\mathbb{R}^d$, where $d = 3072$ for the current model. It is pretrained such that the distances between the embeddings of different tokens capture the differences in the meanings of the tokens. For instance, consider the embeddings of ‘dog’, ‘cat’, and ‘car’:
token_ids = torch.tensor(tokenizer.encode('dog cat car')).to(model.device)  # move to the model's device
with torch.no_grad():
embeddings = modules[0]['module'].forward(token_ids)
embeddings
Dog is more similar to cat than to car using the cosine similarity
$$\operatorname{sim}(\boldsymbol{u}, \boldsymbol{v}) := \frac{\boldsymbol{u} \cdot \boldsymbol{v}}{\lVert\boldsymbol{u}\rVert\,\lVert\boldsymbol{v}\rVert} = \cos\theta,$$
where $\theta$ is the angle between the vectors $\boldsymbol{u}$ and $\boldsymbol{v}$.
# dog vs cat
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)
# dog vs car
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[2], dim=-1)
Note that cosine_similarity is a universal function, so we can compute the similarities in one go:
torch.nn.functional.cosine_similarity(embeddings[0], embeddings, dim=-1)
# YOUR CODE HERE
raise NotImplementedError
similarity_matrix
# tests
assert (
    (
        similarity_matrix.cpu() ** 2
        - torch.tensor(
            [[1.0000, 0.0533, 0.0157], [0.0533, 1.0000, 0.0177], [0.0157, 0.0177, 1.0000]]
        )
    ).abs()
    < 1e-4
).all()
Attention Mechanism¶
An important component of the transformer architecture is the attention mechanism proposed by Vaswani et al. (2017), which can be trained and computed efficiently to focus on the most relevant parts of the context, and which can handle contexts longer than those of traditional sequential architectures such as RNNs and LSTMs.
For the current model, there is a stack of 32 decoder layers, each of which implements the attention mechanism.
show_module(5)
The attention function is defined as
$$\operatorname{Attention}(Q, K, V) := \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$
where $d_k$ is the dimension of the keys, and the source code of the attention module can be found here.
The query $Q$, key $K$, and value $V$ are derived from the input embeddings $E$ by linear transformations, as in $Q = E W_Q$, $K = E W_K$, and $V = E W_V$. Specifically, $Q$ and $K$ are used to calculate the attention scores, while $V$ holds the actual values to be attended to based on these scores.
Note that the different rows of $E$ go through the same linear transformations, i.e., the position information is immaterial in the calculation of the attention scores. In order to weight the importance of a token in the context based on its position relative to the new token to be generated, an additional positional encoding is needed (see the sketch below).
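The following is a minimal sketch of the attention function with random weights (the names W_q, W_k, W_v and the dimensions are arbitrary, not those of the Phi-3 module). It also checks that permuting the input rows merely permutes the output rows, confirming that the attention function itself carries no position information:
import torch

torch.manual_seed(0)
n, d = 4, 8                                   # number of tokens, head dimension
E = torch.randn(n, d)                         # input embeddings, one row per token
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

def attention(E):
    """Scaled dot-product attention (no mask, single head)."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v       # linear transformations of the embeddings
    scores = Q @ K.T / d**0.5                 # scaled dot-product attention scores
    return torch.softmax(scores, dim=-1) @ V  # weighted sum of the values

# Permuting the tokens merely permutes the outputs:
perm = torch.randperm(n)
assert torch.allclose(attention(E)[perm], attention(E[perm]), atol=1e-5)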
show_module(3)
An example is the rotary positional encoding (RoPE, though not exactly the one used above) proposed by Su et al. (2021), which encodes the position $m$ of a token by rotating each pair of dimensions of its embedding by an angle proportional to $m$. The rotated embedding, or its linear transformations $Q$ and $K$, is then passed to the attention function.
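For concreteness, here is a minimal sketch of the basic interleaved rotary encoding (the function name rope is hypothetical and the base 10000 follows Su et al. 2021; this is not the Phi3LongRoPEScaledRotaryEmbedding variant used by the model above):
import torch

def rope(x, base=10000.0):
    """Rotate each pair of dimensions (2i, 2i+1) of the token at position m
    by the angle m * base**(-2i/d), where d is the embedding dimension."""
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # (d/2,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                       # paired dimensions
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(start_dim=-2)                                  # interleave the pairs back

x = torch.randn(5, 8)          # 5 token embeddings of dimension 8
q = rope(x)
# A rotation preserves the norm of each embedding:
print(torch.allclose(q.norm(dim=-1), x.norm(dim=-1), atol=1e-5))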
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv. 10.48550/ARXIV.1706.03762
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv. 10.48550/ARXIV.2104.09864