Abstract¶
Associative containers allow programmers to store and retrieve values using keys that are not restricted to consecutive integers starting from 0. Python’s dictionary and set will be introduced as examples of associative containers. Similar to sequence types, readers will learn how to construct these containers using enclosure and comprehension, access their keys and values, and utilize various methods to operate on them. To explain the near constant-time lookups, the idea of hashing will be introduced, along with the issues of hash collisions, the basic conflict resolution strategy, and the intricate relationship between hashability and immutability.
from __init__ import install_dependencies
await install_dependencies()
Motivation¶
The following code simulates the outcomes of rolling a die multiple times.
dice_rolls = [random.randint(1, 6) for i in range(10)]
print(*dice_rolls)
What is the distribution?
plt.figure(0, clear=True)
distribution = [dice_rolls.count(i) / len(dice_rolls) for i in range(7)]
plt.stem(range(7), distribution)
plt.xlabel("Outcomes")
plt.title("Distribution")
plt.ylim(0, 1)
plt.show()
In the above code, distribution[i]
stores the fractional count of outcome i
.
However, distribution[0]
is 0
because a dice does not have outcome 0
. Can we avoid such redundancy?
distinct_outcomes = [
outcome for outcome in range(1, 7) if dice_rolls.count(outcome) > 0
]
distribution = [
dice_rolls.count(distinct_outcomes[i]) / len(dice_rolls)
for i in range(len(distinct_outcomes))
]
plt.figure(1, clear=True)
plt.stem(distinct_outcomes, distribution)
plt.xlabel("Outcomes")
plt.title("Distribution")
plt.ylim(0, 1)
plt.show()
In the above code,
distinct_outcomes
stores the list of distinct outcomes, anddistribution[distinct_outcomes[i]]
stores the fractional count of thei
-th distinct outcome.
What about finding the distribution of characters in an article?
There are over 1 million unicode characters. Can we:
- Obtain the distribution efficiently without creating an entry for each unicode character?
- Avoid keeping a separate set of distinct characters in addition to their distribution?
- Access the fractional count of a particular character efficiently without searching through the list of distinct outcomes?
It is desirable to have a composite data type that
- can keep a set of unique keys of different types (such as the characters in our example), and
- associate to different keys possibly different values of any types (such as the fractional counts of the characters).
Such a data structure is called an associative container.
Construction¶
In Python, there are two built-in classes for associative containers:
set
can store unique keys of possibly different types.dict
ionary can store a set of key-value pairs.
We have already used sets and dictionaries before.
%%optlite -h 400
a = (lambda **kwargs: kwargs)(start=0, stop=5, step=1)
b = set([1, 1, 2, 3, 3, 3])
assert len(a) == len(b)
Both set
and dict
implement the len
method that returns the number of keys. They are mutable, so we can add/remove keys and values and modify their object references.
Similar to tuple/list, we can create a set/dictionary using enclosure and comprehension.
Enclosure¶
For dict
, enclose a comma-separated sequence of key : value
pairs by braces {
and }
.
%%optlite -h 400
empty_dictionary = {}
a = {"a": 0, "b": 1}
b = {**a, "c": 0, "d": 1}
For set
, omit : value
.
%%optlite -h 300
a = {(1, 2.0), print, *range(2), *"23"}
empty_set = {*()} # Why not use {}?
We can also create a set/dictionary from other objects using their constructors set
/dict
.
%%optlite -l -h 550
empty_set = set()
string2set = set("abc")
range2set = set(range(2))
list2set = set(["abc", range(2)])
set2set = set(list2set)
%%optlite -l -h 650
empty_dict = dict()
enumerate2dict = dict(enumerate("abc"))
zip2dict = dict(zip("abc", "123"))
kwargs2dict = dict(one=1, two=2)
dict2dict = dict(kwargs2dict)
dict.fromkeys?
### BEGIN SOLUTION
fromkeys_dict = dict.fromkeys(range(100), 0)
### END SOLUTION
# test
assert all(fromkeys_dict[k] == 0 for k in fromkeys_dict)
Comprehension¶
The following function uses a one-line dictionary comprehension to return the distribution of items in a sequence:
def distribute(seq):
return {k: seq.count(k) / len(seq) for k in set(seq)}
def plot_distribution(seq, fig_num=None):
dist = distribute(seq)
plt.figure(fig_num, clear=True)
plt.stem(
dist.keys(), # set-like view of the keys
dist.values(), # view of the values
)
plt.xlabel("Items")
plt.title("Distribution")
plt.ylim(0, 1)
plt.show()
plot_distribution("What is the distribution of different characters?", 2)
- The object methods
keys
andvalues
provide a dynamic view of the keys. - Unlike a copy, subsequent changes to the dictionary are also reflected in a previously returned view.
items
provides a set-like view of the key-value pairs.
%%optlite -h 500
a = dict(enumerate("abc"))
views = a.keys(), a.values(), a.items()
print(a.pop(1)) # remove the key 1 and its associated value
print(a.popitem()) # remove and return one key-value pair
a.clear() # clear the dictionary
set
has pop
and clear
but not popitem
. However, set.pop
behaves like dict.popitem
instead of dict.pop
.
%%optlite -h 250
a = set("abc")
print(a.pop()) # remove and return an element
a.clear() # clear the set
def composite_set(stop):
### BEGIN SOLUTION
return {x for factor in range(2, stop) for x in range(factor ** 2, stop, factor)}
### END SOLUTION
print(*sorted(composite_set(100)))
Fast access to items¶
Hashing¶
Why are some keys missing in the following dictionary/set?
%%optlite -h 350
a = {0: "a", 0.0: "b", 2: "b"}
b = {0j, 0, 0.0, "", False}
assert 0 == 0.0 == 0j == False != ""
Associative containers are implemented as hash tables for efficient lookup of keys.
Hash Tables
Python also uses dictionaries to implement global/local frames, and hash collisions can slow down the lookup process.
What are hashable?
def hashable(obj):
try:
hash(obj)
except Exception:
return False
return True
for i in 0, 0.0, 0j, "", False, (), [], {}, set(), frozenset(), ([],):
if hashable(i):
print("{} may be hashable. E.g., hash({!r}) == {}".format(type(i), i, hash(i)))
else:
print("{} may not be hashable.".format(type(i)))
set
has an immutable counterpart called frozenset
, which is hashable.
hashable(frozenset())
assert 0 == 0.0 == 0j == False != ""
assert hash(0) == hash(0.0) == hash(0j) == hash(False) == hash("")
Solution to Exercise 3
- To avoid duplicate keys occupying different entries in a hash table.
- Hash collision can be detected by
==
and handled by collision resolution techniques. To keep the hash table small, it is unavoidable to have hash collisions.
Solution to Exercise 4
No. hashable(([],))
returns False in particular because the value of an immutable object can still be changed if it points to a mutable object.
Solution to Exercise 5
No because id
remains unchanged even if the value of an object changes.
%%ai chatgpt -f text
Why most mutable objects are not hashable?
%%ai chatgpt -f text
Why `dict` does not have any immutable counterpart but `set` has an immutable
counterpart called `frozenset`?
Selection¶
How to traverse a set/dictionary?
Set and dictionaries are iterable:
%%optlite -h 500
s = set("abcde")
d = dict(enumerate("abcde"))
print(*(element for element in s))
print(*((k, d[k]) for k in d))
s[0] # TypeError
The above raises KeyError
because -1
is not a key in the dictionary b
.
Dictionary implements the __setitem__
method so we can enter a key-value pair to a dictionary using the assignment operator.
%%optlite -h 400
d = {}
d[-1] = d
del d[-1]
d[-1]
To avoid KeyError
, we can check if a key is in a dictionary efficiently (due to hashing) using the in
operator.
The following is a different implementation of distribute
.
def distribute(seq):
dist = {}
for i in seq:
dist[i] = (dist[i] if i in dist else 0) + 1 / len(seq)
return dist
plot_distribution("What is the distribution of different characters?", 3)
Solution to Exercise 6
It is more efficient because
- the alternative implementation traverses
seq
once with near constant time lookup of the key, but - the list comprehension can traverse
seq
a multiple times linear inlen(seq)
, since every call toseq.count
has to traverseseq
once.
Shorter code needs not be more efficient.
dict.get?
def distribute(seq):
dist = {}
for i in seq:
### BEGIN SOLUTION
dist[i] = dist.get(i, 0) + 1 / len(seq)
### END SOLUTION
return dist
plot_distribution("What is the distribution of different characters?", 4)
We can apply the function sorted
to a set/dictionary to return a sorted list of the keys.
%%optlite -h 600
a = set(reversed("abcde"))
b = dict(reversed([*enumerate("abcde")]))
sorted_elements = sorted(a)
sorted_keys = sorted(b)
def plot_distribution(seq, fig_num=None):
dist = distribute(seq)
plt.figure(fig_num, clear=True)
# plt.stem(dist.keys(), dist.values())
### BEGIN SOLUTION
plt.stem((_ := sorted(dist.keys())), [dist[k] for k in _])
## Alternative
# dist_list = sorted(dist.items(), key=lambda p: p[0])
# plt.stem(
# [p[0] for p in dist_list], [p[1] for p in dist_list]
# )
### END SOLUTION
plt.xlabel("Items")
plt.title("Distribution")
plt.ylim(0, 1)
plt.show()
plot_distribution("What is the distribution of different characters?", 5)
Instead of subscription, set
has the add
/discard
/remove
methods for adding/removing elements.
%%optlite -h 400
a = set("abc")
a.add("d")
a.discard("a")
a.remove("b")
a.clear()
a.discard("a") # no error
a.remove("b") # KeyError
%%ai chatgpt -f text
Identify and explain any errors below:
--
a = set("abc")
a.add("d")
a.discard("a")
a.remove("b")
a.clear()
a.discard("a")
a.remove("b")
%%ai chatgpt -f text
Explain the differences between discard, remove, and del.
Methods¶
Unlike str
/tuple
/list
, set
and dict
do not implement addition +
and multiplication *
:
any(
hasattr(container, attr)
for attr in ("__add__", "__mult__")
for container in (dict, set, frozenset)
)
set1 = set("abc")
set2 = set("cde")
### BEGIN SOLUTION
concatenated_set = {*set1, *set2}
### END SOLUTION
concatenated_set
dict1 = dict(enumerate("abc"))
dict2 = dict(enumerate("def", start=2))
### BEGIN SOLUTION
concatenated_dict = {**dict1, **dict2}
### END SOLUTION
concatenated_dict
set
overloads many other operators:
%%optlite -h 550
a, b = {1, 2}, {2, 3}
union = a | b
assert all(i in union for i in a) and all(i in union for i in b)
intersection = a & b
assert all(i in a and i in b for i in intersection)
assert intersection <= a <= union # subset
assert union > b > intersection # proper superset
assert len(a) + len(b) == len(intersection) + len(union)
symmetric_difference = a ^ b
assert all((i in a or i in b) and not (i in a and i in b) for i in symmetric_difference)
assert symmetric_difference == union - intersection
assert set.isdisjoint(intersection, symmetric_difference)
assert len(union) == len(intersection) + len(symmetric_difference)
The following uses &
and -
to compare the sets of public attributes for set
and dict
:
set_attributes = {attr for attr in dir(set) if attr[0] != "_"}
dict_attributes = {attr for attr in dir(dict) if attr[0] != "_"}
print("Common attributes:", ", ".join(set_attributes & dict_attributes))
print("dict-specific attributes:", ", ".join(dict_attributes - set_attributes))
print("set-specific attributes:", ", ".join(set_attributes - dict_attributes))
For set
, the intersection operation &
can also be performed by
- the class method
intersection
, which returns the intersection of its arguments, and - the object method
intersection_update
, which mutates a set object by intersecting the set with the arguments.
%%optlite -h 300
a = {0, 1, 2}
b = {1, 2, 3}
c = set.intersection(a, b, {2, 3, 4})
a.intersection_update(b, c)
- All other set-specific methods have an associated operator except
isdisjoint
as shown below. - The object method for
union
isupdate
notunion_update
.
class method | object method | operator |
---|---|---|
union | update | | |
intersection | intersection_update | & |
symmetric_difference | symmetric_difference_update | ^ |
issubset | <= | |
issuperset | >= | |
isdisjoint |
dict
also has an update
method that can update a dictionary using dictionary, iterables, and keyword arguments:
%%optlite -h 300
a = {}
a.update(enumerate("a"), b=2)
b = a.copy()
a.update(b, c=3)
def group_by_type(seq):
group = {}
for i in seq:
### BEGIN SOLUTION
group.setdefault(repr(type(i)), []).append(i)
### END SOLUTION
return group
group_by_type(
[
*range(3),
*"abc",
*[i / 2 for i in range(3)],
*[(i,) for i in range(3)],
*[[i] for i in range(3)],
*[{i} for i in range(3)],
*[{i: i} for i in range(3)],
print,
hash,
int,
str,
float,
set,
dict,
(i for i in range(10)),
enumerate("abc"),
range(3),
zip(),
set.add,
dict.copy,
]
)
%%ai chatgpt -f text
Explain when we should use each of the methods or functions on Python `dict`:
del, discard, remove, pop, popitem