from __init__ import install_dependencies, show
await install_dependencies()
import os
from IPython.display import JSON
import transformers as tfm
import torch
Problem Formulation¶
What is a language model?
From the wiki page:
A language model is a probabilistic model of a natural language.
To put it simply, a causal language model completes an input prompt such as
A language model is ...
into a realistic text like the one from the wiki page. More formally:
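One standard formalization (the notation here is assumed for this sketch: $s_1, \dots, s_n$ denotes a sequence of tokens) is that a causal language model assigns probabilities auto-regressively,
$$
p(s_1, \dots, s_n) = \prod_{t=1}^{n} p(s_t \mid s_1, \dots, s_{t-1}),
$$
so completing a prompt amounts to repeatedly sampling the next token from the conditional distribution above.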
The above proof relies on the strict convexity of , i.e.:
YOUR ANSWER HERE
Tokenization¶
Just like we compose a text using words from a vocabulary, a language model also generates a text from a vocabulary consisting of meaningful units called tokens.
The following code creates a tokenizer from the configuration files under model_path using AutoTokenizer.from_pretrained:
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)
show(tokenizer)
The configuration of the tokenizer is specified in JSON format, which is a collection of key/value pairs where the keys are names given as strings:
JSON(filename=os.path.join(model_path, "tokenizer_config.json"))
JSON(filename=os.path.join(model_path, "special_tokens_map.json"))
JSON(filename=os.path.join(model_path, "tokenizer.json"))
To encode and decode a text using the tokenizer:
text = "A language model is a probabilistic model of a natural language."
ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(ids)
assert text == decoded_text
ids
show(tokenizer.encode)
show(tokenizer.decode)
For an efficient implementation of encode and decode, tokens are represented by integers known as token IDs. The mapping from tokens to IDs is provided by the dictionary below:
show(tokenizer.vocab)
To obtain the tokens from IDs, we can use the method convert_ids_to_tokens:
tokens = tokenizer.convert_ids_to_tokens(ids)
tokens
def reverse_dict(d):
# YOUR CODE HERE
raise NotImplementedError
# tests
reversed_vocab = reverse_dict(tokenizer.vocab)
assert tokens == [reversed_vocab[i] for i in ids]
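A minimal sketch of one possible approach (assuming the vocabulary is a one-to-one mapping from tokens to IDs) simply swaps each key/value pair:
# One possible approach (a sketch): swap keys and values, assuming the mapping
# is one-to-one so that no two tokens share the same ID.
def reverse_dict_sketch(d):
    return {v: k for k, v in d.items()}
# Example usage: map a token ID back to its token.
reverse_dict_sketch(tokenizer.vocab)[ids[-1]]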
Note that a token need not be an English word. For instance:
tokens[5], tokens[6], tokens[-1]
The tokens can be punctuation marks such as . and even subwords that carry meaning on their own, such as ▁probabil and istic.
YOUR ANSWER HERE
Generation¶
A language model generates a text one token at a time, just like we speak a text word by word. The model is probabilistic in the sense that each token is generated randomly according to some distribution. The sequence of randomly generated tokens is called a stochastic/random process. If each token is generated based only on some of the previously generated tokens, the process is said to be auto-regressive.
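As a toy illustration (the numbers below are made up and this is not how the model loaded later works internally), the following cell samples a short sequence in which each new token depends only on the one before it, whereas a real language model conditions on all previous tokens in its context:
# Toy auto-regressive process with a made-up next-token distribution.
# A real language model conditions on all previous tokens in the context,
# not just the last one as in this simplified sketch.
vocab_size = 5
torch.manual_seed(0)  # for reproducibility
probs = torch.softmax(torch.randn(vocab_size, vocab_size), dim=-1)
sequence = [0]  # start from token 0
for _ in range(10):
    next_token = torch.multinomial(probs[sequence[-1]], num_samples=1).item()
    sequence.append(next_token)
sequence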
A potential confusion is to think that a token cannot depend on other tokens outside the context.
YOUR ANSWER HERE
If the input is a sequence of tokens shorter (longer) than the context length, it can be left-padded (left-truncated) by the tokenizer:
text = "A language model is a probabilistic model of a natural language."
encoding = tokenizer(text, padding='max_length', truncation=True)
encoding.keys(), len(encoding.input_ids), len(encoding.attention_mask)
show(tokenizer.__call__)
The above call to tokenizer returns a dictionary consisting of two lists, both with the same length as the context length:
tokenizer.model_max_length
input_ids points to the list of token IDs:
show(encoding.input_ids)
Note that input_ids is left-padded with the padding token ID:
tokenizer.pad_token_id, tokenizer.pad_token
Intuitively, the padding tokens should not be used to generate new tokens. To avoid unnecessary computations, the attention mask explicitly gives 0 attention/importance/weight to those special tokens:
show(encoding.attention_mask)
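As a quick sanity check (a sketch using the encoding computed above), the number of positions with 0 attention should match the number of padding tokens in input_ids:
# Count padding tokens and zero-attention positions (they should match here).
num_pad = sum(1 for i in encoding.input_ids if i == tokenizer.pad_token_id)
num_masked = sum(1 for a in encoding.attention_mask if a == 0)
num_pad, num_masked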
There are also other special tokens that should normally be masked off:
tokenizer.special_tokens_map_extended
To load a language model:
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
model_path,
quantization_config=bnb_config,
low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)
A language model is a type of neural network consisting of layers of computational units called neurons. To generate text quickly, the above code attempts to utilize a Graphics Processing Unit (GPU) whenever available. It further quantizes the model to a lower precision, specifically 8-bit instead of the original 16-bit, to reduce the memory footprint.
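The reduction in memory can be checked directly; transformers provides get_memory_footprint for this (the exact number depends on the model and the quantization settings):
# Approximate memory footprint of the quantized model, in gigabytes.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")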
Finally, to generate the text, run the following cell:
# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
encoding = encoding.to("cuda")
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_length=100)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
Note that repeatedly running the above code will generate the same text. This is because, instead of sampling from the distribution in (12), it makes a hard decision:
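In other words, model.generate performs greedy decoding by default: at each step it outputs the most likely next token. With notation assumed for this sketch (it may differ from the equation referenced above), where $\mathcal{V}$ denotes the vocabulary,
$$
\hat{s}_t = \arg\max_{v \in \mathcal{V}} \, p\left(v \mid \hat{s}_1, \dots, \hat{s}_{t-1}\right).
$$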
To perform the sampling, we can pass the keyword argument do_sample=True to model.generate as follows:
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_length=100, do_sample=True)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
Verify that the code generates the tokens randomly by running it repeatedly.
The generated text might have been cut off in the middle of a sentence. Although you can increase max_length to a sufficiently large value to ensure that generation terminates with an end-of-sequence token (eos_token), this can result in excessively long outputs. Fortunately, there are other stopping criteria implemented that can help control the length and content of the generated text more effectively.
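For example, one way to build a custom criterion (a minimal sketch; the class name and stop string below are illustrative choices, not the required solution) is to subclass StoppingCriteria and wrap it in a StoppingCriteriaList:
# Sketch of a custom stopping criterion: stop as soon as a chosen string
# (here a newline, as an example) appears in the newly generated text.
class StopOnString(tfm.StoppingCriteria):
    def __init__(self, tokenizer, prompt_length, stop_string="\n"):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip
        self.stop_string = stop_string

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the newly generated part and look for the stop string.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return self.stop_string in new_text

example_criteria = tfm.StoppingCriteriaList(
    [StopOnString(tokenizer, encoding.input_ids.shape[1])]
)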
# Assign the desired stopping criteria to `stopping_criteria`.
# YOUR CODE HERE
raise NotImplementedError
stopping_criteria
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding,
max_length=2000, # Make this big as the default is 20
stopping_criteria=stopping_criteria
)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat.strip())
# test
assert "A language model is" in shat
assert "\n" not in shat.strip()
Chat Completion¶
A language model can also be trained to complete a chat, following the Chat Completion API popularized by ChatGPT. A chat can be represented as a list of chat messages:
chat = [
{"role": "system", "content": "You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background."},
{"role": "user", "content": "What is a language model?"}
]
Each message is associated with a role:
- The system message sets the behavior for the AI assistant.
- The user message represents the user's query.
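For illustration (the content below is made up), a longer chat would also include assistant messages that hold the model's earlier replies:
# Hypothetical multi-turn chat: "assistant" messages hold the model's earlier replies.
longer_chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A token is a small unit of text, such as a word or a subword."},
    {"role": "user", "content": "And what is a vocabulary?"},
]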
The tokenizer can be used to convert the list of messages into a single text for the language model to complete in the same way as before:
# Apply the chat template
formatted_chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("Formatted chat:\n", formatted_chat)
Note that <|system|>, <|user|>, <|assistant|>, and <|end|> are special tokens used to mark the different chat messages. The chat template can be printed as follows:
chat_template = tokenizer.get_chat_template()
print("Chat template:\n", chat_template)
This is a Jinja template, which uses Python-like syntax such as loops and conditionals to render the text from an input list of message dictionaries named messages.
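To get a feel for how such a template works, the following toy template (not the tokenizer's actual template; it assumes the jinja2 package is available) loops over the messages and wraps each one in role markers:
# Toy Jinja template (not the tokenizer's actual chat template).
import jinja2
toy_template = jinja2.Template(
    "{% for m in messages %}<|{{ m.role }}|> {{ m.content }}<|end|>\n{% endfor %}<|assistant|>"
)
print(toy_template.render(messages=chat))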
We can now call the language model to complete the text as before:
# Tokenize input text and generate output
u = formatted_chat
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
encoding = encoding.to("cuda")
# Generate response
with torch.no_grad():
shat_ids = model.generate(**encoding, max_length=200)
# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)
def decode_chat_messages(ids):
roles = {32006: "system", 32010: "user", 32001: "assistant"}
output = []
# YOUR CODE HERE
raise NotImplementedError
return output
# tests
generated_text = """
<|system|> You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background.<|end|>
<|user|> What is a language model?<|end|>
<|assistant|> A language model is a type of artificial intelligence (AI) system that is designed to understand, interpret, and generate human language. It is a mathematical representation of how words and phrases are likely to occur in a given language. Language models are used in various applications, such as speech recognition, machine translation, text generation, and natural language processing (NLP).
"""
assert decode_chat_messages(tokenizer.encode(generated_text)) == [
{
"role": "system",
"content": "You are an AI engineer who knows language models so well that you can explain the theory to a first-year undergraduate without any background.",
},
{"role": "user", "content": "What is a language model?"},
{
"role": "assistant",
"content": "A language model is a type of artificial intelligence (AI) system that is designed to understand, interpret, and generate human language. It is a mathematical representation of how words and phrases are likely to occur in a given language. Language models are used in various applications, such as speech recognition, machine translation, text generation, and natural language processing (NLP).\n",
},
]
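One possible approach is sketched below (a sketch only: the <|end|> token ID is looked up from the tokenizer rather than hardcoded, and whitespace handling may need adjustment to match the tests above exactly):
# Sketch of one possible approach: open a new message at each role-marker token
# and close it at the <|end|> token.
def decode_chat_messages_sketch(ids):
    roles = {32006: "system", 32010: "user", 32001: "assistant"}
    end_id = tokenizer.convert_tokens_to_ids("<|end|>")
    output = []
    role, content_ids = None, []
    for i in ids:
        if i in roles or i == end_id:
            if role is not None:  # close the current message
                output.append({"role": role, "content": tokenizer.decode(content_ids)})
            role = roles.get(i)   # None if the marker was <|end|>
            content_ids = []
        elif role is not None:    # token belongs to the current message
            content_ids.append(i)
    if role is not None:          # a final message without a closing <|end|>
        output.append({"role": role, "content": tokenizer.decode(content_ids)})
    return output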