Avatar
README.md (06 Dec 2021)

Python’s Dict, from scratch

Motivation

Python’s Dict is everywhere, and if you code in Python, you’re probably using it everyday. For instance, the famous Two Sum puzzle can easily be solved using a Dict. Why? Because contrary to a List, a Dict can be used for O(1) lookup, insertion and removal.

What we mean when we say O(1), is that looking for, inserting or removing an element from a Dict is done in constant time, and does not depend on the number of elements currently in the Dict. List also have an O(1) insertion time (simply add the element at the end of the list) but have lookup and removal that are O(N): to look up a specific element you need to go through the entire List (consider the case where the element is the last, or worst: not in the list!) and to remove the element, you need first to look it up, then rebuild the list after the removed element.

This blog post attempts to explain the magic behind Python’s Dict, by reconstructing it from scratch using only Python code. Most of the information contained in this blog post can be traced back either directly to cpython source code or the two top answers of How are Python’s Built In Dictionaries Implemented? on StackOverflow.

Introduction

From our point of view, a dict is an object which we can populate with values associated to keys. In the following example, we’re linking capital cities to their respective countries:

1country_to_capital = dict()
2country_to_capital['France'] = 'Paris'
3country_to_capital['UK'] = 'London'
4
5print(country_to_capital.get('UK'))
6>>> 'London'

If we’d want to do that with a List, we’d have to do it the hard way

1country_to_capital = list()
2country_to_capital.append(['France', 'Paris'])
3country_to_capital.append(['UK', 'London'])
4
5for key, value in country_to_capital:
6  if key == 'UK':
7    print(value)
8    break

Dict is practical, but also fast: looking for UK’s capital is done at the same speed, no matter the size of our Dict (i.e. the number of country-capital pairs already inserted), contrasting with List where you could possibly end up parsing the entire list (which happens everytime you’re looking for something that’s not there!).

But how does this work behind the curtains? Well, from the point of view of the Python FAQ itself, “CPython’s dictionaries are implemented as resizable hash tables”. What does this mean? Let’s find out!

Hashing and building the base dict

When we’re looking for a specific key in a List, we’re doing extactly that: looking for it: we have to parse the List until we find it. And this is why it’s O(n). To get to O(1) is pretty intuitive: when we want to look for a key in a List, we need to know where it is, withtout having to look for it at every place.

That’s where hash tables come in. Hash tables are Lists that use hashes to know where things are stored. As a quick reminder, Python’s hash function can be seen as a function that assigns an integer (called a hash) to any object (let’s say a string or an integer), with the property that (almost always) different object will have different hashes.

To be really precise, Python can only assign hashes to immutable objects. These are objects that cannot be changed, such as ints and strings. It doesn’t make sense to hash a mutable object, such as a List, because the hash must represent the object very specifically. A mutable object can change at any time, and its hash wouldn’t reflect those changes.

1print(hash("What's the hash of this sentence??"))
2>>> 4343814758193556824
3
4# any int will be equal to its hash
5print(hash(12))
6>>> 12

This can help us achieve the wanted O(1) complexity we’re looking for.

The way hash tables are implemented is very simple: instead of starting with an empty List and appending objects arbitrarily one after the other, we start with a List that has enough room already, and place objects at specific locations when they arrive. That way, since each object has a specific location, we can simply search that location to check if it’s there or not.

Where do hashes come in? Well, we know the size of our List, let’s say N, and the object’s hash, let’s say h, is an integer. Simply taking h mod N will give us a specific place to look for! In the Python code below, this can be seen at lines 12 and 19.

Interestingly, when N is a power of 2, let’s say N = 2**k, computing h mod N is the same as taking the last k binary factors of h, which is extra cheap to do!

 1class PythonDict:
 2    """
 3    Our own basic implementation of Python Dict!
 4    """
 5
 6    def __init__(self):
 7        # empty List that will store key-value pairs
 8        self.storage = [None] * 8
 9
10    def add(self, key, value):
11        # compute the hash and its modulo
12        idx = hash(key) % len(self.storage)
13
14        # Add the key-value pair at that slot
15        self.storage[idx] = (key, value)
16
17    def remove(self, key):
18        # compute the hash and its modulo
19        idx = hash(key) % len(self.storage)
20
21        # remove the key-value pair at that slot
22        self.storage[idx] = None
23
24    def __getitem__(self, key):
25        """ Magic method used as in `d[key]` """
26        idx = hash(key) % len(self.storage)
27        value = self.storage[idx]
28        if value is None:
29            raise KeyError(f'{key=}')
30        return value
31
32    def __setitem__(self, key, value):
33        """ Magic method used as in `d[key] = value` """
34        self.add(key, value)

Let’s check that it works:

1countries_to_capital = PythonDict()
2countries_to_capital['France'] = 'Paris'
3countries_to_capital['UK'] = 'London'
4
5print(countries_to_capital['UK'])
6>>> 'London'

Neat, it does work! Or does it?

Dealing with collisions: Open Addressing and Tombs

You are probably wondering about what happens when two keys have the same hash (not probable), or when their hashes have the same modulo (happens often). We call such situations collisions.

1letters = PythonDict()
2letters[1] = 'a'
3letters[9] = 'i'
4
5print(letters[1])
6>>> 'i'

In this case, the above behavior happens because an int is always equal to its hash, leading to 1 mod 8 == 9 mod 8. In other words, 1 and 9 make us search for them at the same place!

When it comes to collisions, we have two solutions.

  • Tired: simply have a list at each place for all the objects with same hash/modulo. This works but then we’re back to iterating over the list when there are several objects at the same place. At least, these lists are smaller!
  • Wired: find another place. But we can’t choose at random: remember that we’re trying to be efficient and we must know where things are.

Finding another place is called Open Addressing, and can be done in many ways. The easiest way is simply to try and go directly to the next spot, until we find an empty spot. This is called linear probing, and it does works. Unfortunately, it is undesirable to be implemented by default by Python, due to the hash functions already mapping int to themselves. When you’re creating a dict from a range of integers, you will already fill spots in your dicts that are contiguous, because when hashes differ by one, so does their modulos!. This can lead to a crowding effect, where large contiguous part of your storage are occupied and linear probing leads back to O(N)!

Python solves this by having emprically found a “good” probing function. This probing function is used to know the correct place for any key-value pairs. Now, because we’ll be using this function at different places (for look-up, insertion and removal), we’ll make it a method on its own! In the Python code below, the smart_search method is implemented at lines 14-50.

To put it briefly, the method first looks at the intuitive spot, h mod N at line 30, and if there’s a collision looks for the next spot (at line 47) following this formula:

next_spot = ((5 x current_spot) + 1) mod N

In fact, CPython’s probing function is a bit more complicated and involves playing around with bits. You can see all details in the source code if you are interested, but I have chosen to keep the simplied version for this post.

Given a key-value pair and an occupied spot, there are two possibilities for why that spot is already occupied: 1) we already placed the key-value pair here and 2) an other key-value pair was already placed here. To differentiate between the two (it all happens at line 42), we first compare the hashes: they could be different but still lead to the same modulo hence the collision. If they’re the same, we fall back to python ==method.

Checking for equality between keys can actually be arbitrarily costly, depending on the type of keys. In practice, keys are often strings, ints or tuples, for wich equality is fast to check.

Now we can deal with the insertion and look-up of any object, without worrying about collisions.

1letters = PythonDict()
2letters[1] = 'a'
3letters[9] = 'i'
4
5print(letters[1])
6>>> 'a'
7
8print(letters[9])
9>>> 'i'

All good right? Wait a minute, what about removing objects? In fact, we’re curently in trouble. Consider the situation from the above snippet: we’ve added a key-value pair p1 (1, 'a') – line 2, and then try and add a second pair p2 (9, 'i') – line 3 – that unfortunately collides with p1. We use our smart_search to find the next good empty spot, and place p2 there. Good. If we’re to search for p2 again, no problem: we’ll compute its hash and modulo, see that there is p1 in its place, reuse the smart_search and eventually arrive a the correct spot – lines 5-9.

But what if we delete p1 and then search for p2? That’s where our troubles come from: since p1 is not in our storage anymore, smart_search’s first guess using p2’s computed hash and modulo will find an empty spot (which before collided with p1), and we will simply think that p2 has never been added – lines 7-8.

1letters = PythonDict()
2letters[1] = 'a'
3letters[9] = 'i'
4
5letters.remove(1)
6
7print(letters[9])
8>>> KeyError('key=9')

This is why we add tombs: everytime we delete an object, we have to remember that it was there at some point, and leave a mark so that smart search knows to skip this spot, and look further. This happens at line 68.

Note that in the below implementation, now we also keep the hash of the keys we store. This is done to avoid recomputing many hashes many times.

 1class PythonDict:
 2    """
 3    Our own (less so) basic implementation of Python Dict!
 4    Now with:
 5        - smart_search to deal with collisions
 6    """
 7
 8    DEADSLOT = '<slot cannot be used>'
 9
10    def __init__(self):
11        # empty List that will store hash-key-value pairs
12        self.storage = [None] * 8
13
14    def smart_search(self, h, key):
15        """
16        1) We compute <idx> the position in storage using the key's hash
17        2) At <idx> in storage, we either find:
18            - None: no key with that hash exists
19            - a triplet : we have already placed an object at that slot
20        3) If we found an triplet at 2), then we compare keys at that position:
21            - if not same hash (collision due to modulo): we continue our search
22            - if same hash, but not same key (collision due to hash function): we continue our search
23            - if same hash and same key: we set exists to True
24
25        RETURNS:
26            idx: index of the slot of where to put the object
27            exists: whereas object was already there, this is for convenience
28        """
29        # Computing intial place to store the object
30        idx = h % len(self.storage)
31
32        # Deterministic search while there is an object at that position
33        while self.storage[idx] is not None:
34
35            collision = self.storage[idx]
36
37            # The collision could be due to a dead slot
38            if collision != self.DEADSLOT:
39
40                # comparing hash and key of an occupied slot
41                other_h, other_key, _ = collision
42                if h == other_h and key == other_key:
43                    return idx, True  # True means obj exists already
44
45            # If collision is due to a dead or occupied slot
46            # We look for another place with a (simplified) non-linear probing algorithm
47            idx = ((5 * idx) + 1) % len(self.storage)
48
49        # At this point, we found an empty slot that we can return
50        return idx, False  # False means obj doesn't exist yet
51
52    def add(self, key, value):
53        # Computing hash once
54        h = hash(key)
55
56        # Find the correct slot for the given key
57        idx, exists = self.smart_search(h, key)
58
59        # Add the hash-key-value triplet at that slot
60        self.storage[idx] = (h, key, value)
61
62    def remove(self, key):
63        # Find the correct slot for the given key
64        idx, exists = self.smart_search(hash(key), key)
65
66        # remove the key-value pair at that slot
67        # placing a DEADSLOT to avoid breaking self.smart_search
68        self.storage[idx] = self.DEADSLOT
69
70    def __getitem__(self, key):
71        """ Magic method used as in `d[key]` """
72        idx, exists = self.smart_search(h, key)
73        if not exists:
74            raise KeyError(f'{key=}')
75        return self.storage[idx]
76
77    def __setitem__(self, key, value):
78        ...

Variable size Dict

Ok, but what about pigeon holes? We’re starting with an internal storage List of size 8, but Python’s Dict should be able to handle more than eight key-value pairs! So what’s the trick? The trick is the resizable of resizable hash tables. When we’re nearing completion of the storage list, we simply rebuild the dict with a bigger initial list (line 53-54). That’s it, that’s the trick.

In practice, the load factor is kept under 2/3, meaning that the Dict is resized when it is more than 2/3rd full.

In the following snippet, we only show the method that changed from the previous snippet. Note that we could also resize down when removing too many objects, but this would needlessly obsfucate the code at this stage.

 1class PythonDict:
 2	"""
 3	Our own nearly complete implementation of Python Dict!
 4    Now with:
 5        - smart_search to deal with collisions
 6        - dynamic resizing
 7	"""
 8
 9    DEADSLOT = '<slot cannot be used>'
10
11    def __init__(self):
12        # empty List that will store hash-key-value triplets
13        self.storage = [None] * 8
14
15				# Keeping track of objects and deadslots for resizing purposes
16        self._len, self._n_dead_slots = 0, 0
17
18    def __len__(self):
19        return self._len
20
21    def smart_search(self, h, key):
22        ...
23
24    def resize(self, new_size):
25        """Resize the internal storage to new_size"""
26        # Initialize new storage, keeping the previous one (all our objs are in it!)
27        storage, self.storage = self.storage, [None] * new_size
28
29        # Reinitializing _len and _n_dead_slots
30        self._len, self._n_dead_slots = 0, 0
31
32        # Re-adding all valid objects
33        for obj in storage:
34            if obj != self.DEADSLOT:
35                h, key, value = obj
36                self.add(key, value, h)
37
38    def add(self, key, value, h=None):
39        # Computing hash once if needed
40        h = hash(key) if h is None else h
41
42        # Find the correct slot for the given key
43        idx, exists = self.smart_search(h, key)
44
45        if not exists:
46
47            # Increasing count of existing objects
48            self._len += 1
49
50            # Add the hash-key-value triplet at that slot
51            self.storage[idx] = (h, key, value)
52
53            # If the number of occupied slot (legit + tombs) is more than 2/3rd
54            # we resize the dict.
55            if self._len + self._n_dead_slots > (len(self.storage) * 2/3):
56                self.resize(len(self.storage) * 2)
57
58        else:
59            # We do not increase the count of existing objects
60            # simply replace the old value by the newest one
61            self.storage[idx] = (h, key, value)
62
63
64    def remove(self, key):
65        # Find the correct slot for the given key
66        idx, exists = self.smart_search(hash(key), key)
67
68        if exists:
69
70            # Decreasing count of existing objects
71            # and increasing count of dead slots
72            self._len -= 1
73            self._n_dead_slots += 1
74
75            # remove the key-value pair at that slot
76            # placing a DEADSLOT to avoid breaking self.smart_search
77            self.storage[idx] = self.DEADSLOT
78
79    def __getitem__(self, key):
80        ...
81
82    def __setitem__(self, key, value):
83        ...

Memory Efficiency and Dict Ordering

Now we have most of the functionality. We can optimize things further, by noting that we are using an internal storage List consisting of mostly empty spots reserved for future hash-key-value triplets, and that takes a lot of unused space.

For instance, on a 64 bits (i.e. 8 bytes) architecture, each slot in the dict takes 3x8=24 bytes.

Instead, Python’s Dict simply stores indices that map to a much smaller, compact List. Everything stays the same as before, but now our mostly-empty storage List has slots of size 8 bytes, reducing the amount of unused memory by 3!

Note that now, we can iterate over the compact List, with method PythonDict.items(), which gives us all the objects in the dict sorted by insertion order!

Notice that almost all methods have changed this time.

  1class PythonDict:
  2	"""
  3	Our own complete implementation of Python Dict!
  4    Now with:
  5        - smart_search to deal with collisions
  6        - dynamic resizing
  7        - memory efficiency and ordering by insertion order
  8	"""
  9
 10    DEADSLOT = '<slot cannot be used>'
 11
 12    def __init__(self):
 13        # storage List now only stores index of occupied slots
 14        self.storage = [None] * 8
 15
 16        # List that will store actual hash-key-value triplets
 17        self._items = list()
 18
 19        # Keeping track of objects and deadslots for resizing purposes
 20        self._len, self._n_dead_slots = 0, 0
 21
 22    def __len__(self):
 23        return self._len
 24
 25    def items(self):
 26        """Go through Dict's items in insertion order"""
 27        for obj in self._items:
 28            if obj != self.DEADSLOT:
 29                h, key, value = obj
 30                yield key, value
 31
 32    def smart_search(self, h, key):
 33        """
 34        1) We compute <idx> the position in index_storage using the hash
 35        2) At <idx> in index_storage, we either find:
 36            - None: no key with that hash exists
 37            - an index: we have to look in storage to know more
 38        3) If we found an index at 2), then we compare keys at that position:
 39            - if not same hash (collision due to modulo): we continue our search
 40            - if same hash, but not same key (collision due to hash function): we continue our search
 41            - if same hash and same key: we set exists to True
 42
 43
 44        RETURNS:
 45            idx: index of the slot of where to put the object
 46            exists: whereas object was already there, this is for convenience
 47        """
 48        # Computing intial place to store the object
 49        idx = h % len(self.storage)
 50
 51        # Deterministic search while there is an index at that position
 52        while self.storage[idx] is not None:
 53
 54            # fetching the actual object in items
 55            collision = self._items[self.storage[idx]]
 56
 57            # The collision could be due to a dead slot
 58            if collision != self.DEADSLOT:
 59
 60                # comparing hash and key of an occupied slot
 61                other_h, other_key, _ = collision
 62                if h == other_h and key == other_key:
 63                    return idx, True  # True means obj exists already
 64
 65            # If collision is due to a dead or occupied slot
 66            # We look for another place with a (simplified) non-linear probing algorithm
 67            idx = ((5 * idx) + 1) % len(self.index_storage)
 68
 69        # At this point, we found an empty slot that we can return
 70        return idx, False  # False means obj doesn't exist yet
 71
 72    def resize(self, new_size):
 73        """Resize the internal storage to new_size"""
 74
 75        # Initialize new storage (no need to keep the previous one,
 76        #                         because it contains useless indices)
 77        self.storage = [None] * new_size
 78
 79        # Initialize new _items, keeping the previous ones
 80        _items, self._items = self._items, list()
 81
 82        # Reinitializing _len and _n_dead_slots
 83        self._len, self._n_dead_slots = 0, 0
 84
 85        # Re-adding all valid objects
 86        for obj in _items:
 87            if obj != self.DEADSLOT:
 88                h, key, value = obj
 89                self.add(key, value, h)
 90
 91    def add(self, key, value, h=None):
 92        # Computing hash once if needed
 93        h = hash(key) if h is None else h
 94
 95        # Find the correct slot for the given key
 96        idx, exists = self.smart_search(h, key)
 97
 98        if not exists:
 99
100            # Increasing count of existing objects
101            self._len += 1
102
103            # Append the hash-key-value triplet to _items
104            # And add the index of newly added object to storage
105            self.storage[idx] = len(self._items)
106            self._items.append(tuple((h, key, value)))
107
108            if self._len + self._n_dead_slots > (len(self.storage) * 2/3):
109                self.resize(len(self.storage) * 2)
110
111    def remove(self, key):
112        # Find the correct slot for the given key
113        idx, exists = self.smart_search(hash(key), key)
114
115        if exists:
116
117            # Decreasing count of existing objects
118            # and increasing count of dead slots
119            self._len -= 1
120            self._n_dead_slots += 1
121
122            # remove the key-value pair at that slot
123            # placing a DEADSLOT to avoid breaking self.smart_search
124            # Notice that we don't remove the index, because we keep the deadslot
125            self._items[self.storage[idx]] = self.DEADSLOT
126
127    def __getitem__(self, key):
128        ...
129
130    def __setitem__(self, key, value):
131        ...

And voilà, Python’s Dict from scratch in pure Python!

Future improvements:

  • Implement all functionalities. To obtain a real Python Dict we would need to implement a number of methods. For instance __contains__() in order to check if a key has already been added. Or the often used keys() and values(). We could also create its __repr__ for pretty printing!
  • Be more efficient with memory. While we need to keep tombs, we could certainly replace them when adding a new object. As an idea, the smart_search method could track the first DEADSLOT encountered, and if exists=False could then return this index rather than the last one. We could also resize our PythonDict when removing objects!