Hi, today I would like to deal with a very special kind of search trees. Today it is about the prefix tree or also called Trie.
This data structure, is a search tree, which splits whole words into individual letters, and so stores as a string. The name Trie, comes from the term "retrieval".
The words, as mentioned, are divided into individual letters and stored as a string. Words that begin the same have a common prefix, which means that the prefix is only stored once. One sees that here a light form of the compression is contained.
This data structure, is often used for indexing as well as searching for strings.
A trie looks like the one shown in the picture:
We see here in the image that our Trie, has stored the terms : hello, half, halt, hal9000 and halogen. All of them have the same prefix: hal.
For each word, the last node has an additional mark that contains the information that the word is over.
An example shows what I mean:
All red markings symbolize the end of a word. Here, however, the case can also occur that in a word, another word is contained, e.g. in the word halogen, the title of the game Halo.
The implementation of a trie is quite simple. The Trie consists, like the search trees, only of 2 classes Node and Trie.
The class Node is quite simple. Here we need an attribute, which stores a letter. We also need a list, which stores the adjacent nodes. An attribute, which marks the node as leaf or end of the word, is also needed. That's all it really is.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
class Node(object):
__char = 0x00
__childs = None
__isLeaf = False
def __init__(self, char, leaf=False):
self.__char = char
self.__isLeaf = leaf
self.__childs = []
def getChar(self):
return self.__char
def isLeaf(self):
return self.__isLeaf
def setLeaf(self, x):
self.__isLeaf = x
def hasChild(self, char):
for n in self.__childs:
if n.getChat() == char:
return True
return False
def appendChild(self, node):
self.__childs.append(node)
def getChilds(self):
return self.__childs
If we look at the class, we see that the structure is similar to the class Node from a BST. Nevertheless, we have striking differences here. The getters, which return us once the letter and all the child nodes, are quite simple. The method isLeaf, tells us if the node is a leaf, if the word ends with these nodes. The method addChild, adds new child nodes.
With this we can build the structure, like on the picture. The method hasChild checks if a child node exists with the passed letter.
The class Trie is built a little bit more bulky. Let's go into it here, piece by piece.
Here we see that the class contains 2 attributes. The root node, from where we always start. This node has no letter that appears in any word. The node is only the access point to the search tree. In the class, we have an attribute that specifies the maximum word length. The maximum word length represents both the absolute height of the tree and the length of the longest word stored in the trie.
If we extend our class by the method add, we can now store words in the trie.
def insert(self, word):
n = self.__root
if self.__max_word_len < len(word):
self.__max_word_len = len(word)
for char in word:
found = False
for child in n.getChilds():
if char == child.getChar():
found = True
n = child
if not found:
newNode = node.Node(char, False)
n.appendChild(newNode)
n = newNode
n.setLeaf(True)
The method is structured in such a way that a node n is created first. This gets the root node assigned. The word, which we got passed, is processed here piece by piece. First, the length of the word is compared with the current maximum length max_word_len. If the new word is longer than the maximum length, the word length becomes the new maximum length.
The word is now processed letter by letter. Here always a flag found is set, which indicates whether the current letter is already present in a prefix or not. If it is present, the letter is simply skipped. The node n is set to its successor. If a letter is not present in the prefix, they are stored in new nodes and added as children to the end of the prefix. The last created node from the word, is marked as leaf or end of word.
In order to determine the height of the triangle or the length of the longest word, we create the method longestWord. This simply returns the value, max_word_len.
def longestWord(self):
return self.__max_word_len
To check if a word exists in the trie, we extend the class with the wordExists method.
def wordExists(self, word):
n = self.__root
exist = False
if len(n.getChilds()) == 0:
return False
for char in word:
if n.isLeaf():
exist = True
for child in n.getChilds():
if char == child.getChar():
exist = True
n = child
break
else:
exist = False
return exist
Here the whole word is processed and the trie is run through. Here, each letter is checked to see if it is on the path. If this is the case, our flag exist is set to True, otherwise False. After processing the word, we get back either a True or False, if the word exists and the last letter of the word is a leaf.
If we want to output all results that match the requested prefix, we need to extend the class with the search method. This method has a submethod traverse.
def search(self, word):
n = self.__root
for w in word:
match = False
for node in n.getChilds():
if node.getChar() == w:
n = node
match = True
if not match:
return []
# traverse currnt node to all leaf nodes
result = []
self.traverse(n, list(word), result)
return [''.join(r) for r in result]
def traverse(self, root, prefix, result):
if root.isLeaf():
result.append(prefix[:])
for n in root.getChilds():
prefix.append(n.getChar())
self.traverse(n, prefix, result)
prefix.pop(-1)
The method search, goes through each letter of the prefix or the word, checking if there are child nodes that match the word. If there is no match, an empty result list is returned. If there are matches, it traverses the tree until our prefix has been processed. From this point, using the traverse method, every word that matches the prefix is traversed and added to the result list.
If we build a main for this, it looks like this.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import trie
if __name__ == '__main__':
wtrie = trie.Trie()
wtrie.insert("hallo")
wtrie.insert("car")
wtrie.insert("cart")
wtrie.insert("helgoland")
wtrie.insert("lotto")
wtrie.insert("autobahn")
wtrie.insert("cat")
wtrie.insert("caesar")
wtrie.insert("bst")
wtrie.insert("lamp")
wtrie.printAll()
print("The word car exists: ", wtrie.wordExists("car"))
print("The word kar exists: ", wtrie.wordExists("kar"))
print(wtrie.search(input("ENTER PREFIX > ")))
The output looks something like this. At the beginning we see all words, which are stored in the trie. The words "car" and "kar" are checked for existence. Here we see that the word "car" exists in the trie, but the other one does not. In the last output, we see that for the prefix "ca" all possible words are suggested.
hallo
helgoland
car
cart
cat
caesar
lotto
lamp
autobahn
bst
The word car exists: True
The word kar exists: False
ENTER PREFIX > ca
['car', 'cart', 'cat', 'caesar']
A more special form of the Tries is the Patricia-Trie. I will go into this in another post.
I hope I was able to give a little insight into the trie 😄