Open Source For Geeks: 05/08/16

Sunday, 8 May 2016

Understanding Tries data structure

Standard Tries

The standard Trie for a set of strings S is an ordered tree such that

Each node but the node is labelled with a character
The children of a node are alphabetically ordered
The path from external nodes to the root yield the String of S.

For example see following picture -

How would you represent this in code? Each node will have an array of size 26 (assuming all english characters) where each index of array will store pointer to next node which is essentially next character of the string.

Running time for operations

A standard trie uses O(W) space.
Operations find, insert, delete take time O(dm) each where

w = total size of strings in S,
m = size of string involved in operation
d = alphabet size

Compressed Tries

Trie with nodes of degree at least 2
Obtained from standart trie by compressing chains of redundant nodes.

Applications

Trie has many useful applications like

Find first prefix in a text (pattern matching)
or auto complete a text (which is why it can be used in a search engine). The index of a search engine is stored into a compressed trie. Each leaf of trie is associated with a word and has a list of pages (URLs) containing that word called occurrence list. Trie is kept in memory while the occurrence lists are kept in external memory and are ranked by relevance.
Suffix tree (rooted directed tree whose edges are labelled with non empty substrings of S)
The suffix tree for a text X of size N from an alphabet of size d stores all the N suffixes of X is O(N) space. NOTE : There might be a case where a suffix is a prefix of another suffix in which case the suffix will end on an internal mode. To counter this case you can end the alphabet with a special character eg. $. Time complexity to build suffix tree - O(N^2)