A Radix Tree is a data structure that can be used for writing autocomplete functions. The basic process is iterating all the autocomplete options, adding them to the tree, creating new nodes for each word. And then when a user has typed something, you take what they type, traverse the tree, and then collect all children in the subtree you are left with.
Radix trees are actually optimized (compressed) prefix trees, so I’ll explain what those are, how radix trees improve the idea. Then I’ll describe how a radix tree can be generated in JS, and then how searching for words works.
If you are just interested in the final code, it can be found here
A prefix tree is a tree datastructure with each node representing a character in a word. The root representing the beginning of a word. If cat was added to a prefix tree it would go root -> c -> a -> t, and it car was added as well the a node would have a r and t child node.
Nodes that represent an end of a word have to be specifically marked as word nodes. So c, and a in the above example would not be word nodes, but t, and r would be.
When adding words you simply loop through the word’s characters, start at root and look for the child representing the current character. If you reach the last character, and there is a node for the last character, mark it as a word. If you reach a point where the current node doesn’t have a child for the next character, start adding children nodes for each remaining character.
When searching words given a prefix, you traverse the tree similar to how you did when adding words until you reach the end of the prefix, or the tree is missing a character in the prefix. If the tree is missing a character then there are no words with the prefix. However if you reach the last character of the prefix you can then perform a search of the subtree rooted at that last characters node, and find all child nodes that are marked as words. These will be the words that begin with the prefix.
A radix tree takes the prefix tree and optimizes it. There are a lot of unecessary nodes in a prefix tree. In the example above, there is no need for a c or an a node, and they can be combined into one node. That node would then have the two children r and t
Radix trees typically use edges to represent a string of characters, instead of the nodes. So the radix tree of the node above has one edge coming off the root node, with the string CA. That means that all words in the subtree rooted at that first child begin with CA.
Where this gets complicated is if we wanted to add a word like company which would split up our CA edge up, and modify the already existing tree. This is the payoff of radix trees: increased complexity when inserting words for a more efficient data structure with faster lookup.
I’ll start with some skeleton code for the classes I’ll use…
Each node will be initialized with the edge label that leads to it, and it will have an object of all it’s children.
To make the code easier to write later, I’ll store the children as a dictionary, with the key bing the first character of that child’s edge label.
Now let’s figure out how the first word will be added
So we make a new RadixNode instance, with the given word, and make it a child of the root node. The root node (currentNode) has a property children that we treat like a dictionary. We add a key word which is the first character of the given word. We then set the value to the new node we made.
So we now have a radix tree with two nodes and one word.
Adding other words gets more complicated, because we need logic to split apart the existing nodes to make room for the new nodes.
We now iterate over each character of the given word, and check to see if that character is the beginning of one of the current nodes edges.
If there isn’t an edge that matches, we simply create a new child node, and make the edge label the remaining characters of the given word.
For the complicated part we’ll need a helper function to get the common prefix between two strings.
Add this function outside of both classes.
Now when the current node has an edge we can follow there are 4 scenarios.
The edge label is exactly the same as what’s left of the word.
The edge label contains all of what’s left of the word plus some extra. (edge label is facebook and the word is face)
The word contains all of the edge label plus some extra. (edge label is face and the word is facebook)
The edge label and the word share a common prefix, but both differ at some point. (edge label is farm and the word is face)
The easiest case is if the edge label and what’s left of the word are exactly the same.
If the what’s left of the word is all contained in the edge label, we just make a new word splitting up the edge label.
The complicated one, is if what’s left of the word and the edge label share a common prefix, but both differ at some point.
The last option is that the edge label is entirely contained in what’s left of the word, so we just follow it and update the current node, taking off the edge labels characters from what’s left of the word.
Here is what the whole addWord function looks like.
Searching for words is much simpler than adding words. We traverse the tree finding edges that match a given prefix, and then perform a depth first search and the node we end on.
If we make it through traversing the tree, and making sure the edge label’s match what’s left of the prefix, Than we know there are some words in the tree that begin with the given prefix.
Conveniently that currentNode variable happens to be at the root of the subtree that contains all those words, so we only need to do a depth first search to find them all.
Finally, We can make the getWords more efficient by making it asynchronous, since the order it finds words doesn’t matter.
The whole file then will look like.
And if we run it…
We’ll get the output…
Event Loop Blocking
When I was implementing this code, I was building an Express endpoint that would take a prefix a user had typed and return autocomplete suggestions for movie titles.
There are a little over 500,000 movie titles according to IMDb, and that’s a lot of words to fill a Radix Tree. Luckily it was very fast except in a few edgecases.
If the prefix was only one or two characters, or was the beginning of ‘the ‘ (lots of movie titles start with ‘the ‘), the tree that was searched using depth first search was particularly big, and took up to a few seconds.
It turned out that my code was blocking the event loop. This was noticable because as a user typed the first character would trigger a search and block the event loop, preventing the next search once the user had typed more characters.
The solution to this wasn’t that complicated, I simply had to use node’s setImmediate function to tell the getWords function to allow the event loop to continue, freeing up other asyncronous code (in this case Express).
To start, I’ll use Node’s promisify function to make setImmediate easier to use.
Reight before the recursive call to the dfs function, we can await the setImmediate function.
This will make each recursive call happen on the next event loop iteration. Thus allowing other asyncronous code to be able to execute.
This will slow the getWords function down a bit, so I ended up only awaiting the setImmediate call on prefixes I could predict being expensive (the edge cases I describe above). But that’s a decision that will have to be made depending on the needs of your code.