
Abstract

Composite data types, such as tuples and lists, allow programmers to group multiple objects together efficiently. This notebook focuses on sequence types, where objects are ordered. Readers will learn how to construct sequences using enclosure and comprehension; how to access items using subscription and slicing; the concept of mutation; and methods that operate on sequences, including those that cause mutations. Understanding these concepts helps in managing collections of data more effectively, leading to cleaner, more maintainable, and scalable code.

from __init__ import install_dependencies

await install_dependencies()

Motivation

The following code calculates the average of two numbers:

def average_of_two(x0, x1):
    return (x0 + x1) / 2


average_of_two(0, 1)

How can we calculate the average of more numbers? For instance, the average of the four numbers 0, 1, 2, 3 is:

average_of_two(average_of_two(0, 1), average_of_two(2, 3))

But what about 5 numbers 0, 1, 2, 3, 4?

average_of_two(average_of_two(average_of_two(0, 1), average_of_two(2, 3)), 4)
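
A quick check makes the problem with the nested call concrete. The sketch below redefines average_of_two so that it runs on its own; the names nested and true_mean are introduced only for illustration. The last number receives half of the total weight, so the result differs from the true mean:

```python
def average_of_two(x0, x1):
    return (x0 + x1) / 2


# Nested pairwise averaging of 0, 1, 2, 3, 4 as in the cell above.
nested = average_of_two(average_of_two(average_of_two(0, 1), average_of_two(2, 3)), 4)

# The actual mean gives every number the same weight 1/5.
true_mean = (0 + 1 + 2 + 3 + 4) / 5

print(nested, true_mean)  # 2.75 2.0
```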

Repeatedly applying the function does not always work. It is also not possible to specify an arbitrary number of optional arguments:

def average(x0, x1=None, x2=None, x3=None, ...):
    ...

What is needed is a composite data type (or container):

def average(*args):
    return sum(args) / len(args)


average(0, 1, 2, 3, 4)

Recall that args is a tuple that can keep a variable number of items in a sequence.
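
As a small illustration of this point, the following sketch inspects the packed arguments directly. The function show_args is a hypothetical helper, not part of the notebook:

```python
def show_args(*args):
    """Return the type and contents of the packed positional arguments."""
    return type(args), args


kind, packed = show_args(0, 1, 2)
print(kind.__name__, packed)  # tuple (0, 1, 2)

# Calling with no arguments still yields a tuple, namely the empty one.
print(show_args())
```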

In calculating the average, sum and len return the sum and length of an iterable. There are also other built-in functions that can apply to an iterable directly:

min, max, sorted, enumerate, reversed, zip, map, filter, slice

We can do this for average as well:

def average(seq):
    return sum(seq) / len(seq)


seq = range(100)
average(seq)
max?
seq = (0, 1, 2, 3, 4)
max(seq), max(*seq)
from collections.abc import Iterable


def average(*args):
    ### BEGIN SOLUTION
    if len(args) == 1 and isinstance(args[0], Iterable):
        args = args[0]
    return sum(args) / len(args)
    ### END SOLUTION
# tests
assert average(seq) == 2 == average(*seq)

Construction

How to store a sequence of items?

We created objects of sequence types before:

  • str is used to store a sequence of characters, but the items are limited to characters.
  • range is used to generate a sequence of numbers, but the numbers must form an arithmetic sequence.

In order to store items of possibly different types, we can use the built-in types tuple and list:

%%optlite -l -h 400
a_list = "1 2 3".split()
a_tuple = (lambda *args: args)(1, 2, 3)

How to create a tuple/list?

Mathematicians often represent a collection of items in two different ways:

  1. Roster notation, which enumerates the elements, e.g.,

    $$\{0, 1, 4, 9, 16, 25, 36, 49, 64, 81\}.$$
  2. Set-builder notation, which describes the content using a rule for constructing the elements, e.g.,

    $$\{x^2 \mid x\in \mathbb{N}, x< 10 \},$$

    namely the set of perfect squares strictly less than 100, which is the same as (1). $\mathbb{N}$ denotes the set of natural numbers (including 0).

Python also provides two corresponding ways to create a collection of items:

  1. Enclosure, which uses brackets to group elements together.
  2. Comprehension, which uses concise syntax similar to iterations and conditionals to generate elements.
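
As a rough preview of the two approaches, the sketch below builds the list of squares from the mathematical example in both ways. The variable names are chosen here for illustration, and range(10) plays the role of the natural numbers below 10:

```python
# Enclosure (roster notation): enumerate every element inside brackets.
squares_by_enclosure = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# Comprehension (set-builder notation): state a rule that generates the elements.
squares_by_comprehension = [x**2 for x in range(10)]

print(squares_by_enclosure == squares_by_comprehension)  # True
```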
%%ai chatgpt -f text
What is the proper way to write a sequence in set-builder notation?
%%ai chatgpt -f math
List some mathematical symbols used for common sets of numbers.

Enclosure

For instance, to create a tuple, we enclose a comma-separated sequence of values in parentheses:

%%optlite -h 450
empty_tuple = ()
singleton_tuple = (0,)   # why not (0)?
heterogeneous_tuple = (
    singleton_tuple, (1, 2.0), 
    print
)
enclosed_starred_tuple = (
    *range(2), 
    *"23"
)

Note from the above code that:

  • 2nd assignment: If the enclosed sequence has one term, there must be a comma after the term.
  • 3rd assignment: The elements of a tuple can have different types.
  • 4th assignment: The unpacking operator * can unpack an iterable into a sequence in an enclosure.

To create a list, we use square brackets instead of parentheses to enclose objects.

%%optlite -h 400
empty_list = []
singleton_list = [0]  # no need to write [0,]
heterogeneous_list = [
    singleton_list, 
    (1, 2.0), 
    print
]
enclosed_starred_list = [
    *range(2), 
    *"23"
]

We can also create a tuple/list from other iterables using the constructors tuple/list as well as addition and multiplication similar to str.

%%optlite -l -h 900
str2list = list("Hello")
str2tuple = tuple("Hello")
range2list = list(range(5))
range2tuple = tuple(range(5))
tuple2list = list((1, 2, 3))
list2tuple = tuple([1, 2, 3])
concatenated_tuple = (1,) + (2, 3)
concatenated_list = [1, 2] + [3]
duplicated_tuple = (1,) * 2
duplicated_list = 2 * [1]
print((1 + 2) * 2, (1 + 2,) * 2, sep="\n")
Solution to Exercise 2

(1+2)*2 evaluates to 6 but (1+2,)*2 evaluates to (3,3).

  • The parentheses in (1+2) indicate the addition needs to be performed first, but
  • the parentheses in (1+2,) create a tuple.

Hence, a singleton tuple must have a comma after the item to differentiate these two use cases.

%%ai chatgpt -f text
In Python, why must a singleton tuple have a comma after the item?

Comprehension

How to use a rule to construct a tuple/list?

We can define the rules for constructing a sequence using a comprehension, a technique we’ve previously applied in a generator expression. For example, the following Python one-liner returns a generator for prime numbers:

from math import isqrt


def prime_sequence(stop):
    return (x for x in range(2, stop) if
            all(x % d for d in range(2, isqrt(x) + 1)))


print(*prime_sequence(100))

There are two comprehensions used in the return value:

  1. (x for x in range(2, stop) if ...): The comprehension creates a generator of numbers from 2 to stop-1 that satisfy the condition of the if clause.
  2. (x % d for d in range(2, isqrt(x) + 1)): The comprehension passes a generator of remainders to the function all, which returns True if all the remainders are non-zero and False otherwise.
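
To make the two comprehensions easier to read, here is a sketch of the same logic written with explicit loops. The function prime_list_with_loops is a hypothetical helper for comparison only, and it returns a list rather than a generator:

```python
from math import isqrt


def prime_list_with_loops(stop):
    """Loop-based counterpart of the nested comprehensions."""
    primes = []
    for x in range(2, stop):  # outer comprehension: candidates 2, ..., stop-1
        is_prime = True
        for d in range(2, isqrt(x) + 1):  # inner comprehension: possible divisors
            if x % d == 0:  # a zero remainder means d divides x
                is_prime = False
                break
        if is_prime:
            primes.append(x)
    return primes


print(prime_list_with_loops(30))
```

The comprehension version expresses the same rule in a single expression and produces items lazily.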
### BEGIN SOLUTION
def composite_sequence(stop):
    return (x for x in range(2, stop) if any(x % d == 0 for d in range(2, x)))


### END SOLUTION

print(*composite_sequence(100))

Comprehension can also be used to construct a list instead of a generator. An example of list comprehension is as follows:

[x ** 2 for x in range(10)]  # Enclose comprehension by square brackets
%%timeit
tuple(x for x in range(100))
%%timeit
tuple([x for x in range(100)])
%%timeit
sum(x for x in range(10000))
%%timeit
sum([x for x in range(10000)])
Solution to Exercise 4

It appears so from the above tests.

Do you think the AI can predict which is faster? Why or why not?

%%ai chatgpt -f text
Explain whether it is faster to iterate through elements of a list comprehension 
than those of a generator in the following Python code:
--
sum(x for x in range(10000))
--
sum([x for x in range(10000)])

As a demonstration of list comprehension, consider simulating the coin tossing game:

With list comprehension, we can easily simulate a sequence of biased coin flips as follows:

from random import random as rand

p = 1302/10000  # unknown chance of head
coin_flips = ["H" if rand() <= p else "T" for i in range(1000000)]
print("Chance of head:", p)
print("Coin flips:", *coin_flips[:100], "...")

p should be kept secret, while coin_flips can be shown to the player:

  • H means a head comes up, and
  • T means a tail comes up.
head_indicators = [1 if outcome == "H" else 0 for outcome in coin_flips]
phat = average(head_indicators)
print("Fraction of heads observed:", phat)

Does the estimate look reasonable? How accurate is this estimate?

Let’s formulate the problem mathematically. Denote the total number of coin flips by $n$. For $1\leq i\leq n$, define

$$x_i := \begin{cases} 1 & \text{if a head comes up in the $i$-th coin-flip,}\\ 0 & \text{otherwise,} \end{cases}$$

which is called an indicator variable.

The estimate above can be expressed in terms of $n$ and $x_i$’s as follows:

$$\hat{p} := \frac{\sum_{i=1}^n x_i}{n},$$

namely, the sample average of $x_i$’s. (Why?) This is an example of an M-estimator.

The variation of the estimate can be calculated from the sample variance:[1]

$$\begin{align} v &:= \frac{\sum_{i=1}^n (x_i- \hat{p})^2}{n} \\ &= \left(\frac1n\sum_{i=1}^n x_i^2\right) - \hat{p}^2. \end{align}$$

Except for a small chance of $5\%$,

$$p \approx \hat{p} \pm 2\sqrt{\frac{v}{n}},$$

which is called the $95\%$-confidence interval estimate.[2]

def variance(seq):
    ### BEGIN SOLUTION
    return (
        (sum(i**2 for i in seq) / len(seq) - average(seq) ** 2)
    )
    ### END SOLUTION

v = variance(head_indicators)
n = len(head_indicators)

delta = 2 * (v / n) ** 0.5
print(f"p \u2248 {phat:.4f} \u00B1 {delta:.4f} except for 5% chance.")
print(
    "95% confidence interval estimate of p: [{:.4f},{:.4f}]".format(
        phat - delta, phat + delta
    )
)

There is a simpler way to calculate the variance for coin tosses, which follow a Bernoulli distribution:

v = phat * (1 - phat)
print(f"p \u2248 {phat:.4f} \u00B1 {2*(v/n)**0.5:.4f}")
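
The shortcut can be justified in a few steps. Each indicator $x_i$ is either 0 or 1, so $x_i^2 = x_i$. Substituting this into the sample variance, using only the definitions of $v$ and $\hat{p}$ given earlier, gives:

```latex
\begin{align}
v &= \left(\frac1n\sum_{i=1}^n x_i^2\right) - \hat{p}^2 \\
  &= \left(\frac1n\sum_{i=1}^n x_i\right) - \hat{p}^2 \\
  &= \hat{p} - \hat{p}^2 \\
  &= \hat{p}(1 - \hat{p}).
\end{align}
```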
%%ai chatgpt -f text
Explain the formula for the variance of samples of Bernoulli variables.

Operations

Selection

How to traverse a tuple/list?

Instead of calling the dunder method directly, we can use a for loop to iterate over all the items in order.

a = (*range(5),)
for item in a:
    print(item, end=" ")

To do it in reverse, we can use the reversed function.

reversed?
a = [*range(5)]
for item in reversed(a):
    print(item, end=" ")

We can also traverse multiple tuples/lists simultaneously by zipping them.

zip?
a = (*range(5),)
b = reversed(a)
for item1, item2 in zip(a, b):
    print(item1, item2)

How to select an item in a sequence?

We can select an item of a sequence a by subscription

a[i]

where a is a sequence and i is an integer index.

A non-negative index indicates the distance from the beginning.

$$\boldsymbol{a} = (a_0, ... , a_{n-1})$$
%%optlite -h 500
a = (*range(10),)
print(a)
print("Length:", len(a))
print("First element:", a[0])
print("Second element:", a[1])
print("Last element:", a[len(a) - 1])
print(a[len(a)])  # IndexError

A negative index represents a negative offset from an imaginary element one past the end of the sequence.

$$\begin{aligned} \boldsymbol{a} &= (a_0, ... , a_{n-1})\\ & = (a_{-n}, ..., a_{-1}) \end{aligned}$$
%%optlite -h 500
a = [*range(10)]
print(a)
print("Last element:", a[-1])
print("Second last element:", a[-2])
print("First element:", a[-len(a)])
print(a[-len(a) - 1])  # IndexError

How to select multiple items?

We can use slicing to select a range of items as follows:

a[start:stop]
a[start:stop:step]

The selected items correspond to those indexed using range:

(a[i] for i in range(start, stop))
(a[i] for i in range(start, stop, step))
a = (*range(10),)
print(a[1:4])
print(a[1:4:2])

Unlike range, the parameters for slicing take their default values if missing or equal to None:

a = [*range(10)]
print(a[:4])  # start defaults to 0
print(a[1:])  # stop defaults to len(a)
print(a[1:4:])  # step defaults to 1

The parameters can also take negative values:

print(a[-1:])
print(a[:-1])
print(a[::-1])  # What are the default values used here?

A mixture of negative and positive values is also okay:

print(a[-1:1])      # equal [a[-1], a[0]]?
print(a[1:-1])      # equal []?
print(a[1:-1:-1])   # equal [a[1], a[0]]?
print(a[-100:100])  # result in IndexError like subscription?
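
One way to inspect how such parameters are resolved is the indices method of the built-in slice type, which returns the (start, stop, step) that would be used for a sequence of a given length. The sketch below applies it to the four slices above and checks that slicing agrees with subscripting over the resulting range:

```python
a = [*range(10)]

for s in (slice(-1, 1), slice(1, -1), slice(1, -1, -1), slice(-100, 100)):
    # Negative values are offset by len(a); out-of-range values are clipped.
    start, stop, step = s.indices(len(a))
    # The slice selects exactly the items indexed by the resolved range.
    assert a[s] == [a[i] for i in range(start, stop, step)]
    print(s, "->", (start, stop, step), a[s])
```

Unlike subscription, none of these slices raises an IndexError.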

Can AI explain the rules for slicing?

%%ai chatgpt -f text
Explain how the default values of start, stop, and step are determined in 
the following slicing operations in python:
print(a[-1:1])      # equal [a[-1], a[0]]?
print(a[1:-1])      # equal []?
print(a[1:-1:-1])   # equal [a[1], a[0]]?
print(a[-100:100])  # result in IndexError like subscription?
def sss(a, i=None, j=None, k=None):
    ### BEGIN SOLUTION
    l = len(a)
    step = 1 if k is None else k
    m = l if step > 0 else l - 1
    # Use >= 0 so that an explicit index 0 is not offset by the length.
    start = 0 if i is None else min(i if i >= 0 else max(i + l, 0), m)
    stop = l if j is None else min(j if j >= 0 else max(j + l, 0), m)
    ### END SOLUTION
    return start, stop, step


a = [*range(10)]
assert sss(a, -1, 1) == (9, 1, 1)
assert sss(a, 1, -1) == (1, 9, 1)
assert sss(a, 1, -1, -1) == (1, 9, -1)
assert sss(a, -100, 100) == (0, 10, 1)
import random


def quicksort(seq):
    """Return a sorted list of items from seq."""
    if len(seq) <= 1:
        return list(seq)
    i = random.randint(0, len(seq) - 1)
    pivot, others = seq[i], [*seq[:i], *seq[i + 1 :]]
    left = quicksort([x for x in others if x < pivot])
    right = quicksort([x for x in others if x >= pivot])
    return [*left, pivot, *right]


seq = [random.randint(0, 99) for i in range(10)]
print(seq, quicksort(seq), sep="\n")
Solution to Exercise 7

The above recursion creates a sorted list as [*left, pivot, *right] where

  • pivot is a randomly selected item in seq,
  • left is the sorted list of items smaller than pivot, and
  • right is the sorted list of items no smaller than pivot.

The base case happens when seq contains at most one item, in which case seq is already sorted.

Quick sort is an example of a randomized algorithm. In particular, the pivot is randomly chosen. Why?

%%ai chatgpt -f text
For the quick sort algorithm, is it okay to pick the pivot deterministically,
say the first element of the sequence?
%%ai chatgpt -f text
What is a randomized algorithm and how does randomization help?

Mutation

%%ai chatgpt -f text
Explain in a paragraph or two why one would prefer tuple over list in Python, 
given that list is mutable but tuple is not?

For list (but not tuple), subscription and slicing can also be used as the target of an assignment operation to mutate the list:

%%optlite -h 350
b = [*range(10)]  # aliasing
b[::2] = b[:5]
b[0:1] = b[:5]
b[::2] = b[:5]  # fails

The last assignment fails because [::2], with a step size not equal to 1, is an extended slice, which can only be assigned an iterable with the same number of items as the slice.
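
The difference between the two kinds of slice assignment can be sketched as follows. The values are arbitrary and chosen only for illustration:

```python
b = [*range(10)]

# Basic slice (step 1): the replacement may have any length,
# so the list can grow or shrink.
b[0:1] = ["x", "y", "z"]
print(len(b))  # 12

# Extended slice (step != 1): the number of items must match exactly.
try:
    b[::2] = [0, 0, 0]  # b[::2] selects 6 items, but only 3 are supplied
except ValueError as err:
    print("ValueError:", err)

# Supplying exactly as many items as the slice selects succeeds.
b[::2] = [0] * len(b[::2])
print(b)
```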

%%ai chatgpt -f text
Explain the following limitation of extended slice in python as compared to
the basic slice:
When assigning to an extended slice, the list on the right hand side of the 
statement must contain the same number of items as the slice it is replacing.

What is the difference between mutation and aliasing?

In the previous code:

  • The first assignment b = [*range(10)] is aliasing, which gives the list the target name/identifier b.
  • Other assignments such as b[::2] = b[:5] are mutations that call __setitem__ because the target b[::2] is not an identifier.
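
The distinction can be observed with the identity operator is. In the sketch below, assigning to an identifier only binds a name, whereas assigning to a subscription changes the object itself:

```python
a = [0, 1, 2]
b = a  # aliasing: b names the same list object as a
same_before = a is b

b[0] = 99  # mutation: changes the shared object, visible through both names
after_mutation = list(a)

b = [0, 1, 2]  # assigning to the identifier b rebinds it to a new list
same_after = a is b

print(same_before, after_mutation, same_after)  # True [99, 1, 2] False
```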
list.__setitem__?
# %%optlite -l -h 400
a = b = [0]
b[0] = a[0] + 1
print(a[0] < b[0])
Solution to Exercise 8
  • The first line a = b = [0] makes a and b aliases of the same list object, and so
  • the second line mutates the same object a and b point to.
  • Hence, a[0] == b[0].
a = [0, 1]
i = 0
a.__setitem__(i := i + 1, i)
print(a)
a = [0, 1]
i = 0
a[i := i + 1] = a[i]
print(a)
Solution to Exercise 9

a[i := i + 1] = a[i] is not the same as calling a.__setitem__(i := i + 1, i). According to the Python documentation,

  • the expression to be assigned, i.e., a[i], is first evaluated to a[0] and therefore 0;
  • since the target a[i := i + 1] is a subscription, its primary expression a is evaluated next, which yields the list [0, 1],
  • followed by the subscript i := i + 1, which evaluates to 1;
  • finally, a.__setitem__ is called with the subscript, 1, and the value to be assigned, 0, and so the list a points to is mutated to [0, 0].

In comparison, directly calling a.__setitem__(i := i + 1, i)

  • first evaluates the first argument i := i + 1, which gives 1 that is assigned to i, and
  • then evaluates the second argument i to 1, and so
  • a.__setitem__(1, 1) is called instead, which does not change the list a points to.
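
The two evaluation orders can be observed by recording the arguments that reach __setitem__. The class LoggedList below is a hypothetical helper written for this illustration, and the subscript is parenthesized so that the code also parses on Python versions before 3.10:

```python
class LoggedList(list):
    """A list that records the arguments of each __setitem__ call."""

    def __init__(self, items):
        super().__init__(items)
        self.calls = []

    def __setitem__(self, index, value):
        self.calls.append((index, value))
        super().__setitem__(index, value)


a = LoggedList([0, 1])
i = 0
a[(i := i + 1)] = a[i]  # a[i] is read with i == 0, then the subscript becomes 1
assignment_result = (a.calls, list(a))

a = LoggedList([0, 1])
i = 0
a.__setitem__(i := i + 1, i)  # arguments are evaluated left to right: both are 1
direct_call_result = (a.calls, list(a))

print(assignment_result)   # ([(1, 0)], [0, 0])
print(direct_call_result)  # ([(1, 1)], [0, 1])
```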

Let’s see if AI has the correct understanding:

%%ai chatgpt -f text
Explain what gets printed when running the following python code:
--
a = [0, 1]
i = 0
a[i := i + 1] = a[i]
print(a)

Why mutate a list?

The following is another implementation of composite_sequence that takes advantage of the mutability of list.

%%optlite -r
def sieve_composite_sequence(stop):
    is_composite = [False] * stop  # initialization
    for factor in range(2, stop):
        if is_composite[factor]:
            continue
        for multiple in range(factor ** 2, stop, factor):
            is_composite[multiple] = True
    return (x for x in range(4, stop) if is_composite[x])


for x in sieve_composite_sequence(100):
    print(x, end=" ")

The algorithm

  1. changes is_composite[x] from False to True if x is a multiple of a smaller number factor, and
  2. returns a generator that generates composite numbers according to is_composite.
%%ai chatgpt -f text
Should `factor ** 2` be `factor * 2` in the following 
function that attempts to generate a sequence of composite numbers up to and
excluding stop?
--
def sieve_composite_sequence(stop):
    is_composite = [False] * stop  # initialization
    for factor in range(2, stop):
        if is_composite[factor]:
            continue
        for multiple in range(factor ** 2, stop, factor):
            is_composite[multiple] = True
    return (x for x in range(4, stop) if is_composite[x])
%%ai chatgpt -f text
Explain why `factor ** 2` is used instead of `factor * 2` in the following 
function that attempts to generate a sequence of composite numbers up to and
excluding stop.
--
def sieve_composite_sequence(stop):
    is_composite = [False] * stop  # initialization
    for factor in range(2, stop):
        if is_composite[factor]:
            continue
        for multiple in range(factor ** 2, stop, factor):
            is_composite[multiple] = True
    return (x for x in range(4, stop) if is_composite[x])
# A sample if you did not define composite_sequence before.
from math import isqrt


def composite_sequence(stop):
    return (x for x in range(2, stop) if
            any(x % d == 0 for d in range(2, isqrt(x) + 1)))
%%timeit
for x in composite_sequence(10000): pass
%%timeit
for x in sieve_composite_sequence(10000): pass
for x in sieve_composite_sequence(10000000): pass
Solution to Exercise 10

The line if is_composite[factor]: continue avoids the redundant computations of checking composite factors.

%%optlite -h 300
a = [[0] * 2] * 2
a[0][0] = a[1][1] = 1
print(a)
### BEGIN SOLUTION
a = [[0] * 2 for i in range(2)]
### END SOLUTION
a[0][0] = a[1][1] = 1
print(a)
%%ai chatgpt -f text
Explain the different levels of copy for python lists.

Methods

There is also a built-in function sorted for sorting a sequence:

sorted?
sorted(seq)

Is quicksort quicker?

%%timeit
quicksort(seq)
%%timeit
sorted(seq)

Python's built-in sort is based on the Timsort algorithm, a hybrid of merge sort and insertion sort that is very efficient, especially on partially ordered data.

What are other operations on sequences?

The following compares the lists of public attributes for tuple and list.

list_attributes = dir(list)
tuple_attributes = dir(tuple)

print(
    'Common attributes:', ', '.join([
        attr for attr in list_attributes
        if attr in tuple_attributes and attr[0] != '_'
    ]))

print(
    'Tuple-specific attributes:', ', '.join([
        attr for attr in tuple_attributes
        if attr not in list_attributes and attr[0] != '_'
    ]))

print(
    'List-specific attributes:', ', '.join([
        attr for attr in list_attributes
        if attr not in tuple_attributes and attr[0] != '_'
    ]))
  • There are no public tuple-specific attributes, and
  • all the list-specific attributes are methods that mutate the list, except copy.

Among the common attributes,

  • the count method returns the number of occurrences of a value in a tuple/list, and
  • the index method returns the index of the first occurrence of a value in a tuple/list.
%%optlite -l -h 450
a = (1,2,2,4,5)
count_of_2 = a.count(2)
index_of_1st_2 = a.index(2)

The reverse method reverses the list in place and returns None, instead of returning a reversed list.

%%optlite -h 300
a = [*range(10)]
print(reversed(a))
print(*reversed(a))
print(a.reverse())
  • The copy method returns a shallow copy of a list.
  • tuple does not have the copy method, but it is easy to create a copy by slicing.
%%optlite -h 400
a = [*range(10)]
b = tuple(a)
a_reversed = a.copy()
a_reversed.reverse()
b_reversed = b[::-1]
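
To illustrate what shallow means here, the following sketch copies a nested list. The names nested and shallow are chosen only for this example. The outer list is duplicated, but the inner lists are shared between the original and the copy:

```python
nested = [[0, 1], [2, 3]]
shallow = nested.copy()  # new outer list that refers to the same inner lists

shallow.append([4, 5])  # changes only the outer list of the copy
shallow[0][0] = 99      # mutates an inner list shared by both

print(nested)   # [[99, 1], [2, 3]]
print(shallow)  # [[99, 1], [2, 3], [4, 5]]
print(nested[0] is shallow[0])  # True
```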

The sort method sorts the list in place and returns None, instead of returning a sorted list.

%%optlite -h 300
import random
a = [random.randint(0,10) for i in range(10)]
print(sorted(a))
print(a.sort())
  • The extend method extends a list in place instead of creating a new concatenated list.
  • The append method adds an object to the end of a list.
  • The insert method inserts an object at a specified position.
%%optlite -h 300
a = b = [*range(5)]
print(a + b)
print(a.extend(b))
print(a.append('stop'))
print(a.insert(0,'start'))
  • The pop method deletes and returns the last item of the list.
  • The remove method removes the first occurrence of a value in the list.
  • The clear method removes all items from the list.

We can also use the del statement to delete a selection of a list.

%%optlite -h 300
a = [*range(10)]
del a[::2]
print(a.pop())
print(a.remove(5))
print(a.clear())
Footnotes
  1. If $n$ is small (fewer than 100), the unbiased sample variance should be used.

  2. If $n$ is small, the factor 2 needs to be increased by looking up the $t$-value from Student’s $t$-distribution.