Skip to article frontmatterSkip to article content
from __init__ import install_dependencies

await install_dependencies()
%reload_ext divewidgets

Python is a popular tool among hackers and engineers. In this lab, you will learn Cryptology in cybersecurity, which covers:

Symmetric key cipher

We first implement a simple cipher known as the Caesar cipher. Named after Julius Caesar, who used it to protect his private correspondence, this cipher involves shifting letters in the alphabet by a fixed number of positions:

The Caesar cipher

Encrypt/decrypt a character

How to encrypt a character?

The following code encrypts a character char using a non-negative integer key.

cc_n = 1114112


def cc_encrypt_character(char, key):
    """
    Return the encryption of a character by an integer key using Caesar cipher.

    Parameters
    ----------
    char: str
        a unicode (UTF-8) character to be encrypted.
    key int:
        secret key to encrypt char.
    """
    char_code = ord(char)
    shifted_char_code = (char_code + key) % cc_n
    encrypted_char = chr(shifted_char_code)
    return encrypted_char

For example, to encrypt the letter 'A' using a secret key 5:

cc_encrypt_character("A", 5)

The character 'A' is encrypted to the character 'F' as follows:

  1. ord(char) return the integer 65, which is the code point (integer representation) of the unicode of 'A'.
  2. (char_code + key) % cc_n cyclic shifts the code by the key 5.
  3. chr(shifted_char_code) converts the shifted code back to a character, which is 'F'.
Encryption steps
char...ABCDEF...
ord(char)...656667686970...
(ord(char) + key) % cc_n...707172737475...
chr((ord(char) + key) % cc_n)...FGHIJK...

You may learn more about ord and chr from their docstrings:

help(ord)
help(chr)

How to decrypt a character?

Mathematically, we define the encryption and decryption of a character for Caesar cipher as

fk(x):=x+kmodn(encryption)gk(y):=ykmodn(decryption),\begin{align} f_k(x) &:= x + k \mod n & \text{(encryption)} \\ g_k(y) &:= y - k \mod n & \text{(decryption),} \end{align}

where xx is the character code and kk is the secret key, both of which are in {0,,n1}\{0,\dots,n-1\}. mod operator above is the modulo operator. In Mathematics, it has a lower precedence than addition and multiplication and is typeset with an extra space accordingly.

The encryption and decryption satisfy the recoverability condition

gk(fk(x))=xg_k(f_k(x)) = x

so two people with a common secret key can encrypt and decrypt a character, but others without the key cannot. This defines a symmetric cipher.

The following code decrypts a character using a key.

def cc_decrypt_character(char, key):
    """
    Return the decryption of a character by the key using Caesar cipher.

    Parameters
    ----------
    char: str
        a unicode (UTF-8) character to be decrypted.
    key: int
        secret key to decrypt char.
    """
    char_code = ord(char)
    shifted_char_code = (char_code - key) % cc_n
    decrypted_char = chr(shifted_char_code)
    return decrypted_char

For instance, to decrypt the letter 'F' by the secret key 5:

cc_decrypt_character("F", 5)

The character 'F' is decrypted back to 'A' because (char_code - key) % cc_n reverse cyclic shifts the code by the key 5.

Encryption stepsDecryption steps
char...ABCDEF...chr((ord(char) - key) % cc_n)
ord(char)...656667686970...(ord(char) - key) % cc_n
(ord(char) + key) % cc_n...707172737475...ord(char)
chr((ord(char) + key) % cc_n)...FGHIJK...char

YOUR ANSWER HERE

Encrypt a plaintext and decrypt a ciphertext

Of course, it is more interesting to encrypt a string instead of a character. The following code implements this in one line.

def cc_encrypt(plaintext, key):
    """
    Return the ciphertext of a plaintext by the key using the Caesar cipher.

    Parameters
    ----------
    plaintext: str
        A unicode (UTF-8) message to be encrypted.
    public_key: int
        Public key to encrypt plaintext.
    """
    return "".join([chr((ord(char) + key) % cc_n) for char in plaintext])

The above function encrypts a message, referred to as the plaintext, by replacing each character with its encryption. This is referred to as a substitution cipher.

def cc_decrypt(ciphertext, key):
    """
    Return the plaintext that encrypts to ciphertext by the key using Caesar cipher.

    Parameters
    ----------
    ciphertext: str
        message to be decrypted.
    key: int
        secret key to decrypt the ciphertext.
    """
    # YOUR CODE HERE
    raise NotImplementedError
Source
# tests
assert cc_decrypt(r"bcdefghijklmnopqrstuvwxyz{", 1) == "abcdefghijklmnopqrstuvwxyz"
assert cc_decrypt(r"Mjqqt1%\twqi&", 5) == "Hello, World!"
Source
# hidden tests

Another symmetric key cipher is the columnar transposition cipher. A transposition cipher encrypts a text by permuting instead of substituting characters.

# YOUR CODE HERE
raise NotImplementedError
Source
# tests
key = "ZEBRAS"
plaintext = "WEAREDISCOVEREDFLEEATONCE"
ciphertext = "EVLNACDTESEAROFODEECWIREE"
assert ct_encrypt(plaintext, key) == ciphertext
assert ct_decrypt(ciphertext, key) == plaintext
Source
# hidden tests

Brute-force attack

A result from the work of Claude Shannon, the Father of Information Theory, work is that the security of a secrecy system is fundamentally linked to the length of the key. A brute-force attack is a method of defeating a cryptographic system by systematically trying all possible keys until the correct one is found. The feasibility of a brute-force attack depends on the length of the key and the computational power available to the attacker.

Create an English dictionary

You will perform a brute-force attack to guess the key that encrypts an English text. The process is straightforward:

  • Attempt to decrypt the ciphertext using various keys.
  • Evaluate which of the resulting plaintexts is the most English-like.

To check whether a plaintext is English-like, we need to have a list of English words. One way is to type them out but this is tedious. Alternatively, we can obtain the list from the Natural Language Toolkit (NLTK):

%pip install nltk
import nltk

nltk.download("words")
from nltk.corpus import words

words.words() returns a list of words. We can check whether a string is in the list using the operator in.

for word in "Ada", "ada", "Hello", "hello":
    print("{!r} in dictionary? {}".format(word, word in words.words()))

However, there are two issues:

  • Checking membership is slow for a long list.
  • Both ‘Hello’ and ‘ada’ are English-like but not in the words list.
# YOUR CODE HERE
raise NotImplementedError
Source
# tests
assert isinstance(dictionary, set) and len(dictionary) == 234377
assert all(word in dictionary for word in ("ada", "hello"))
assert all(word not in dictionary for word in ("Ada", "hola"))
Source
# hidden tests

Identify English-like text

To determine how English-like a text is, we calculate the following score:

number of English words in the textnumber of tokens in the text\frac{\text{number of English words in the text}}{\text{number of tokens in the text}}

where tokens are substrings, not necessarily English words, separated by white space characters in the text.

def tokenizer(text):
    """Returns the list of tokens of the text."""
    return text.split()


def get_score(text):
    """Returns the fraction of tokens which appear in dictionary."""
    tokens = tokenizer(text)
    words = [token for token in tokens if token in dictionary]
    return len(words) / len(tokens)


# tests
get_score("hello world"), get_score("Hello, World!")

As shown in the tests above, the code fails to handle text with punctuations and uppercase letters properly.
In particular,

  • while get_score recognizes hello world as English-like and returns the maximum score of 1,
  • it fails to recognize Hello, World! as English-like and returns the minimum score of 0.

Why? Every word in dictionary

  • are in lowercase, and
  • have no leading/trailing punctuations.
import string


def tokenizer(text):
    """Returns the list of tokens of the text such that
    1) each token has no leading or trailing spaces/punctuations, and
    2) all letters in each token are in lowercase."""
    # YOUR CODE HERE
    raise NotImplementedError
Source
# tests
assert tokenizer("Hello, World!") == ["hello", "world"]
assert get_score("Hello, World!") >= 0.99999
assert tokenizer("Do you know Jean-Pierre?") == ["do", "you", "know", "jean-pierre"]
assert get_score("Do you know Jean-Pierre?") >= 0.99999
Source
# hidden tests

Launch a brute-force attack

def cc_attack(ciphertext, threshold=0.6):
    """Returns a generator that generates the next guess of the key that
    decrypts the ciphertext to a text with get_score(text) at least the threshold.
    """
    # YOUR CODE HERE
    raise NotImplementedError
Source
# tests
ciphertext = cc_encrypt("Hello, World!", 12345)
key_generator = cc_attack(ciphertext)
key_guess = next(key_generator)
assert key_guess == 12345
text = cc_decrypt(ciphertext, key_guess)
print("guess of the key: {}\nscore: {}\ntext :{}".format(key_guess, get_score(text), text))
Source
# hidden tests
References
  1. Shannon, C. E. (1949). Communication Theory of Secrecy Systems*. Bell System Technical Journal, 28(4), 656–715. 10.1002/j.1538-7305.1949.tb00928.x