# String processing

Most of the time, when we’ve talked about the asymptotic performance of data structures, we have implicitly assumed that the keys we are looking up are of constant size. This means that computing a hash function or comparing two keys (as in a binary search tree) takes O(1) time. What if this is not the case?

If we consider an m-character string, any reasonable hash function is going to take O(m) time since it has to look at all of the characters. Similarly, comparing two m-character strings may also take O(m) time. If we charge for this (as we should!) then the cost of hash table operations goes from O(1) to O(m) and the cost of binary search tree operations, even in a balanced tree, goes from O(log n) to O(m log n). Even sorting becomes more expensive: a sorting algorithm that does O(n log n) comparisons may now take O(mn log n) time. But maybe we can exploit the structure of strings to get better performance.
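
To see why hashing an m-character string is inherently O(m), consider a typical multiplicative string hash (a sketch using the common djb2-style constants, which are a conventional choice and not part of this chapter's code):

```c
#include <assert.h>
#include <stddef.h>

/* a typical multiplicative string hash: every character feeds into
 * the result, so hashing an m-character string costs O(m) */
static size_t
string_hash(const char *s)
{
    size_t h = 5381;

    while(*s) {
        h = h * 33 + (unsigned char) *s++;
    }

    return h;
}
```

Any hash that skipped some characters would systematically collide on keys that differ only in the skipped positions.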

#### In this chapter

Radix search refers to a variety of data structures that support searching for strings considered as sequences of digits in some large base (or radix). These are generally faster than simple binary search trees because they usually only require examining one byte or less of the target string at each level of the tree, as compared to every byte in the target in a full string comparison. In many cases, the best radix search trees are even faster than hash tables, because they only need to look at a small part of the target string to identify it.

We’ll describe several radix search trees, starting with the simplest and working up.

### Tries

A trie is a binary tree (or more generally, a k-ary tree where k is the radix) where the root represents the empty bit sequence and the two children of a node representing sequence x represent the extended sequences x0 and x1 (or generally x0, x1, …, x(k − 1)). So a key is not stored at a particular node but is instead represented bit-by-bit (or digit-by-digit) along some path. Typically a trie assumes that the set of keys is prefix-free, i.e. that no key is a prefix of another; in this case there is a one-to-one correspondence between keys and leaves of the trie. If this is not the case, we can mark internal nodes that also correspond to the ends of keys, getting a slightly different data structure known as a digital search tree. For null-terminated strings as in C, the null terminator ensures that any set of strings is prefix-free.

Given this simple description, a trie storing a single long key would have a very large number of nodes. A standard optimization is to chop off any path with no branches in it, so that each leaf corresponds to the shortest unique prefix of a key. This requires storing the key in the leaf so that we can distinguish different keys with the same prefix.

The name trie comes from the phrase “information re*trie*val.” Despite the etymology, trie is now almost always pronounced like try instead of tree to avoid confusion with other tree data structures.

#### Searching a trie

Searching a trie is similar to searching a binary search tree, except that instead of doing a comparison at each step we just look at the next bit in the target. The time to perform a search is proportional to the number of bits in the longest path in the tree matching a prefix of the target. This can be very fast for search misses if the target is wildly different from all the keys in the tree.

#### Inserting a new element into a trie

Insertion is more complicated for tries than for binary search trees. The reason is that a new element may add more than one new node. There are essentially two cases:

1. (The simple case.) In searching for the new key, we reach a null pointer leaving a non-leaf node. In this case we can simply add a new leaf. The cost of this case is essentially the same as for search plus O(1) for building the new leaf.
2. (The other case.) In searching for the new key, we reach a leaf, but the key stored there isn’t the same as the new key. Now we have to generate a new path for as long as the old key and the new key have the same bits, branching out to two different leaves at the end. The cost of this operation is within a constant factor of the cost for searching for the new leaf after it is inserted, since that’s how long the newly-built search path will be.

In either case, the cost is bounded by the length of the new key, which is about the best we can hope for in the worst case for any data structure.

#### Implementation

A typical trie implementation in C might look like this. We extract bits within each byte starting from the most significant bit, so that keys sort in the same order as strcmp.

typedef struct trie_node *Trie;

#define EMPTY_TRIE (0)

/* returns 1 if trie contains target */
int trie_contains(Trie trie, const char *target);

/* add a new key to a trie */
/* and return the new trie */
Trie trie_insert(Trie trie, const char *key);

/* free a trie */
void trie_destroy(Trie);

/* debugging utility: print all keys in trie */
void trie_print(Trie);


examples/trees/trie/trie.h

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#include "trie.h"

#define BITS_PER_BYTE (8)

/* extract the n-th bit of x */
/* here we process bits within bytes in MSB-first order */
/* this sorts like strcmp */
#define GET_BIT(x, n) ((((x)[(n) / BITS_PER_BYTE]) & (0x1 << (BITS_PER_BYTE - 1 - (n) % BITS_PER_BYTE))) != 0)

#define TRIE_BASE (2)

struct trie_node {
char *key;
struct trie_node *kids[TRIE_BASE];
};

#define IsLeaf(t) ((t)->kids[0] == 0 && (t)->kids[1] == 0)

/* returns 1 if trie contains target */
int
trie_contains(Trie trie, const char *target)
{
int bit;

for(bit = 0; trie && !IsLeaf(trie); bit++) {
/* keep going */
trie = trie->kids[GET_BIT(target, bit)];
}

if(trie == 0) {
/* we lost */
return 0;
} else {
/* check that leaf really contains the target */
return !strcmp(trie->key, target);
}
}

/* gcc -pedantic kills strdup! */
static char *
my_strdup(const char *s)
{
char *s2;

s2 = malloc(strlen(s) + 1);
assert(s2);

strcpy(s2, s);
return s2;
}

/* helper functions for insert */
static Trie
make_trie_node(const char *key)
{
Trie t;
int i;

t = malloc(sizeof(*t));
assert(t);

if(key) {
t->key = my_strdup(key);
assert(t->key);
} else {
t->key = 0;
}

for(i = 0; i < TRIE_BASE; i++) t->kids[i] = 0;

return t;
}

/* add a new key to a trie */
/* and return the new trie */
Trie
trie_insert(Trie trie, const char *key)
{
int bit;
int bitvalue;
Trie t;
Trie kid;
const char *oldkey;

if(trie == 0) {
return make_trie_node(key);
}
/* else */
/* first we'll search for key */
for(t = trie, bit = 0; !IsLeaf(t); bit++, t = kid) {
kid = t->kids[bitvalue = GET_BIT(key, bit)];
if(kid == 0) {
/* woohoo! easy case */
t->kids[bitvalue] = make_trie_node(key);
return trie;
}
}

/* did we get lucky? */
if(!strcmp(t->key, key)) {
/* nothing to do */
return trie;
}

/* else */
/* hard case---have to extend the trie */
oldkey = t->key;
#ifdef EXCESSIVE_TIDINESS
t->key = 0;      /* not required but makes data structure look tidier */
#endif

/* walk the common prefix */
while(GET_BIT(oldkey, bit) == (bitvalue = GET_BIT(key, bit))) {
kid = make_trie_node(0);
t->kids[bitvalue] = kid;
bit++;
t = kid;
}

/* then split */
t->kids[bitvalue] = make_trie_node(key);
t->kids[!bitvalue] = make_trie_node(oldkey);

return trie;
}

/* kill it */
void
trie_destroy(Trie trie)
{
int i;

if(trie) {
for(i = 0; i < TRIE_BASE; i++) {
trie_destroy(trie->kids[i]);
}

if(IsLeaf(trie)) {
free(trie->key);
}

free(trie);
}
}

static void
trie_print_internal(Trie t, int bit)
{
int i;
int kid;

if(t != 0) {
if(IsLeaf(t)) {
for(i = 0; i < bit; i++) putchar(' ');
puts(t->key);
} else {
for(kid = 0; kid < TRIE_BASE; kid++) {
trie_print_internal(t->kids[kid], bit+1);
}
}
}
}

void
trie_print(Trie t)
{
trie_print_internal(t, 0);
}


examples/trees/trie/trie.c

Here is a short test program that demonstrates how to use it:

#include <stdio.h>
#include <stdlib.h>

#include "trie.h"

/* test for trie.c */
/* reads lines from stdin and echoes lines that haven't appeared before */

/* read a line of text from stdin
* and return it (without terminating newline) as a freshly-malloc'd block.
* Returns 0 on error or EOF.
*/
char *
getline(void)
{
char *line;         /* line buffer */
int n;              /* characters read */
int size;           /* size of line buffer */
int c;

size = 1;
line = malloc(size);
if(line == 0) return 0;

n = 0;

while((c = getchar()) != '\n' && c != EOF) {
while(n >= size - 1) {
size *= 2;
line = realloc(line, size);
if(line == 0) return 0;
}
line[n++] = c;
}

if(c == EOF && n == 0) {
/* got nothing */
free(line);
return 0;
} else {
line[n++] = '\0';
return line;
}
}

int
main(int argc, char **argv)
{
Trie t;
char *line;

t = EMPTY_TRIE;

while((line = getline()) != 0) {
if(!trie_contains(t, line)) {
puts(line);
}

/* try to insert it either way */
/* this tests that insert doesn't blow up on duplicates */
t = trie_insert(t, line);

free(line);
}

puts("===");

trie_print(t);

trie_destroy(t);

return 0;
}


examples/trees/trie/test_trie.c

### Patricia trees

Tries perform well when all keys are short (or are distinguished by short prefixes), but can grow very large if one inserts two keys that have a long common prefix. The reason is that a trie has to store an internal node for every bit of the common prefix until the two keys become distinguishable, leading to long chains of internal nodes each of which has only one child. An optimization (described in this paper) known as a Patricia tree eliminates these long chains by having each node store the number of the bit to branch on, like this:

struct patricia_node {
char *key;
int bit;
struct patricia_node *kids[2];
};

typedef struct patricia_node *Patricia;


Now when searching for a key, instead of using the number of nodes visited so far to figure out which bit to look at, we just read the bit number out of the node. This means in particular that we can skip over any bits that we don’t actually branch on. We do however have to be more careful to make sure we don’t run off the end of our target key, since it is possible that when skipping over intermediate bits we might skip over some that distinguish our target from all keys in the table, including longer keys. For example, a Patricia tree storing the strings abc and abd will first test bit position 21, since that’s the first bit at which abc and abd differ (bit 5 of the third byte, in MSB-first order). This can be trouble if we are looking for x, which differs from both keys long before bit 21.
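
The bit position at which two keys first diverge can be computed directly with the same GET_BIT convention as the trie code; the helper below is a sketch of that scan (the function name is mine, not from the chapter):

```c
#include <assert.h>

#define BITS_PER_BYTE (8)

/* extract the n-th bit of x, MSB-first within each byte,
 * matching the trie code's convention */
#define GET_BIT(x, n) ((((x)[(n) / BITS_PER_BYTE]) & (0x1 << (BITS_PER_BYTE - 1 - (n) % BITS_PER_BYTE))) != 0)

/* return the first bit position at which strings a and b differ;
 * requires a and b to be distinct (guaranteed for unequal
 * null-terminated strings, since no key is a prefix of another) */
static int
first_diff_bit(const char *a, const char *b)
{
    int bit;

    for(bit = 0; GET_BIT(a, bit) == GET_BIT(b, bit); bit++)
        ;

    return bit;
}
```

This is the scan a Patricia insert performs to decide which bit a new internal node should branch on.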

Here’s the search code:

int
patricia_contains(Patricia t, const char *key)
{
int key_bits;

key_bits = BITS_PER_BYTE * (strlen(key)+1);   /* +1 for the NUL */

while(t && !IsLeaf(t)) {
if(t->bit >= key_bits) {
/* can't be there */
return 0;
} else {
t = t->kids[GET_BIT(key, t->bit)];
}
}

return t && !strcmp(t->key, key);
}


The insertion code is similar in many respects to the insertion code for a trie. The differences are that we never construct a long chain of internal nodes when splitting a leaf (although we do have to scan through both the old and new keys to find the first bit position where they differ), but we may sometimes have to add a new internal node between two previously existing nodes if a new key branches off at a bit position that was previously skipped over.

In the worst case Patricia trees are much more efficient than tries, in both space (linear in the number of keys instead of linear in the total size of the keys) and time complexity, often needing to examine only a very small number of bits for misses (hits still require a full scan in strcmp to verify the correct key). The only downside of Patricia trees is that since they work on bits, they are not quite perfectly tuned to the byte or word-oriented structure of modern CPUs.

### Ternary search trees

Ternary search trees were described by Jon Bentley and Bob Sedgewick in an article in the April 1998 issue of Dr. Dobb’s Journal.

The basic idea is that each node in the tree stores one character from the key and three child pointers lt, eq, and gt. If the corresponding character in the target is equal to the character in the node, we move to the next character in the target and follow the eq pointer out of the node. If the target is less, follow the lt pointer but stay at the same character. If the target is greater, follow the gt pointer and again stay at the same character. When searching for a string, we walk down the tree until we either reach a node that matches the terminating NUL (a hit), or follow a null pointer (a miss).

A TST acts a bit like a 256-way trie, except that instead of storing an array of 256 outgoing pointers, we build something similar to a small binary search tree for the next character. Note that no explicit balancing is done within these binary search trees. From a theoretical perspective, the worst case is that we get a 256-node deep linked-list equivalent at each step, multiplying our search time by 256 = O(1). In practice, only those characters that actually appear in some key at this stage will show up, so the O(1) is likely to be a small O(1), especially if keys are presented in random order.

TSTs are one of the fastest known data structures for implementing dictionaries using strings as keys, beating both hash tables and tries in most cases. They can be slower than Patricia trees if there are many keys with long matching prefixes; however, a Patricia-like optimization can be applied to give a compressed ternary search tree that works well even in this case. In practice, regular TSTs are usually good enough.

Here is a simple implementation of an insert-only TST. The C code includes two versions of the insert helper routine; the first is the original recursive version and the second is an iterative version generated by eliminating the tail recursion from the first.

typedef struct tst_node *TST;

#define EMPTY_TST (0)

/* returns 1 if t contains target */
int tst_contains(TST t, const char *target);

/* add a new key to a TST */
/* and return the new TST */
TST tst_insert(TST t, const char *key);

/* free a TST */
void tst_destroy(TST);


examples/trees/tst/tst.h

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#include "tst.h"

struct tst_node {
char key;                   /* value to split on */
struct tst_node *lt;        /* go here if target[index] < value */
struct tst_node *eq;        /* go here if target[index] == value */
struct tst_node *gt;        /* go here if target[index] > value */
};

/* returns 1 if t contains key */
int
tst_contains(TST t, const char *key)
{
assert(key);

while(t) {
if(*key < t->key) {
t = t->lt;
} else if(*key > t->key) {
t = t->gt;
} else if(*key == '\0') {
return 1;
} else {
t = t->eq;
key++;
}
}

return 0;
}

/* original recursive insert */
static void
tst_insert_recursive(TST *t, const char *key)
{
if(*t == 0) {
*t = malloc(sizeof(**t));
assert(*t);
(*t)->key = *key;
(*t)->lt = (*t)->eq = (*t)->gt = 0;
}

if(*key < (*t)->key) {
tst_insert_recursive(&(*t)->lt, key);
} else if(*key > (*t)->key) {
tst_insert_recursive(&(*t)->gt, key);
} else if(*key == '\0') {
/* do nothing, we are done */
;
} else {
tst_insert_recursive(&(*t)->eq, key+1);
}
}

/* iterative version of above, since somebody asked */
/* This is pretty much standard tail-recursion elimination: */
/* The whole function gets wrapped in a loop, and recursive
* calls get replaced by assignment */
static void
tst_insert_iterative(TST *t, const char *key)
{
for(;;) {
if(*t == 0) {
*t = malloc(sizeof(**t));
assert(*t);
(*t)->key = *key;
(*t)->lt = (*t)->eq = (*t)->gt = 0;
}

if(*key < (*t)->key) {
t = &(*t)->lt;
} else if(*key > (*t)->key) {
t = &(*t)->gt;
} else if(*key == '\0') {
/* do nothing, we are done */
return;
} else {
t = &(*t)->eq;
key++;
}
}
}

/* add a new key to a TST */
/* and return the new TST */
TST
tst_insert(TST t, const char *key)
{
assert(key);

#ifdef USE_RECURSIVE_INSERT
tst_insert_recursive(&t, key);
#else
tst_insert_iterative(&t, key);
#endif
return t;
}

/* free a TST */
void
tst_destroy(TST t)
{
if(t) {
tst_destroy(t->lt);
tst_destroy(t->eq);
tst_destroy(t->gt);
free(t);
}
}


examples/trees/tst/tst.c

And here is some test code, almost identical to the test code for tries: test_tst.c.

The Dr. Dobb’s article contains additional code for doing deletions and partial matches, plus some optimizations for inserts.

### Radix sort

The standard quicksort routine is an example of a comparison-based sorting algorithm. This means that the only information the algorithm uses about the items it is sorting is the return value of the compare routine. This has the rather nice advantage of making the algorithm very general, but the disadvantage that the algorithm can extract only one bit of information from every call to compare. Since there are n! possible ways to reorder the input sequence, this means we need at least log(n!) = Ω(n log n) calls to compare to finish the sort. If we are sorting something like strings, this can get particularly expensive, because calls to strcmp may take time linear in the length of the strings being compared. In the worst case, sorting n strings of length m each could take O(nm log n) time.

### Bucket sort

The core idea of radix sort is that if we want to sort values from a small range, we can do it by making one bucket for each possible value and throwing any object with that value into the corresponding bucket. In the old days, when Solitaire was a game played with physical pieces of cardboard, a player who suspected that one of these “cards” had fallen under the couch might sort the deck by dividing it up into Spade, Heart, Diamond, and Club piles and then sorting each pile recursively. The same trick works in a computer, but there the buckets are typically implemented as an array of some convenient data structure.
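
For values in a small range, the bucket idea reduces to a counting sort: count how many of each value there are, then write the values back out in order. A minimal sketch over bytes (the function name is mine):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* sort an array of bytes using one bucket per possible value:
 * count occurrences, then emit each value count[v] times in
 * increasing order; runs in O(n + UCHAR_MAX) time */
static void
bucketSortBytes(size_t n, unsigned char *a)
{
    size_t count[UCHAR_MAX+1] = {0};
    size_t i;
    size_t j;
    int v;

    for(i = 0; i < n; i++) {
        count[a[i]]++;
    }

    j = 0;
    for(v = 0; v < UCHAR_MAX+1; v++) {
        for(i = 0; i < count[v]; i++) {
            a[j++] = (unsigned char) v;
        }
    }
}
```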

If the number of possible values is too big, we may still be able to use bucket sort digit-by-digit. The resulting algorithms are known generally as radix sort. These are a class of algorithms designed for sorting strings in lexicographic order—the order used by dictionaries, where one string is greater than another if the first character on which they differ is greater. One particular variant, most-significant-byte radix sort or MSB radix sort, has the beautiful property that its running time is not only linear in the size of the input in bytes, but is also linear in the smallest number of characters in the input that need to be examined to determine the correct order. This algorithm is so fast that it takes not much more time to sort data than it does to read the data from memory and write it back. But it’s a little trickier to explain than the original least-significant-byte radix sort or LSB radix sort.

This is the variant used for punch cards, and works well for fixed-length strings. The idea is to sort on the least significant position first, then work backwards to the most significant position. This works as long as each sort is stable, meaning that it doesn’t reorder values with equal keys. For example, suppose we are sorting the strings:

sat
bad
bat


The first pass sorts on the third column, giving:

bad
sat
bat


The second pass sorts on the second column, producing no change in the order (all the characters are the same). The last pass sorts on the first column. This moves the s after the two bs, but preserves the order of the two words starting with b. The result is:

bad
bat
sat
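
The passes above amount to one stable counting sort per column, rightmost column first. A sketch for fixed-length strings (the function name and scratch-array approach are mine, not the chapter's code):

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* LSB radix sort for n strings of exactly m characters each:
 * stable counting sort on each column, working backwards from
 * the least significant (rightmost) position */
static void
lsbRadixSort(int n, int m, const char **a)
{
    const char **out;       /* scratch array for one distribution pass */
    int count[UCHAR_MAX+1]; /* counts, then starting output positions */
    int i;
    int k;
    int c;
    int total;
    int tmp;

    out = malloc(n * sizeof(const char *));
    assert(out);

    for(k = m - 1; k >= 0; k--) {
        /* count occurrences of each character in column k */
        memset(count, 0, sizeof(count));
        for(i = 0; i < n; i++) {
            count[(unsigned char) a[i][k]]++;
        }

        /* exclusive prefix sum: count[c] becomes the first output
         * slot for character c */
        total = 0;
        for(c = 0; c < UCHAR_MAX+1; c++) {
            tmp = count[c];
            count[c] = total;
            total += tmp;
        }

        /* distribute in input order, which preserves the relative
         * order of strings with equal characters (stability) */
        for(i = 0; i < n; i++) {
            out[count[(unsigned char) a[i][k]]++] = a[i];
        }

        memcpy(a, out, n * sizeof(const char *));
    }

    free(out);
}
```

Distributing in input order is what makes each pass stable, which is exactly the property the hand trace above depends on.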


There are three downsides to LSB radix sort:

1. All the strings have to be the same length (this is not necessarily a problem if they are really fixed-width data types like ints).
2. The algorithm used to sort each position must be stable, which may require more effort than most programmers would like to take.
3. It may be that the late positions in the strings don’t affect the order, but we have to sort on them anyway. If we are sorting aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa and baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, we spend a lot of time matching up a’s against each other.

For these reasons, MSB radix sort is used more often. This is basically the radix sort version of quicksort, where instead of partitioning our input data into two piles based on whether each element is less than or greater than a random pivot, we partition the input into 256 piles, one for each initial byte. We can then sort each pile recursively using the same algorithm, taking advantage of the fact that we know that the first byte (or later, the first k bytes) are equal and so we only need to look at the next one. The recursion stops when we get down to 1 value, or in practice when we get down to a small enough number of values that the cost of switching to a different sorting algorithm becomes lower than the cost of creating and tearing down the data structures for managing the piles.

#### Issues with recursion depth

The depth of recursion for MSB radix sort is equal to the length of the second-longest string in the worst case. Since strings can be pretty long, this creates a danger of blowing out the stack. The solution (as in quicksort) is to use tail recursion for the largest pile. Now any pile we recurse into with an actual procedure call is at most half the size of the original pile, so we get stack depth at most O(log n).

#### Implementing the buckets

There is a trick analogous to the Dutch flag algorithm that lets us sort the array in place. The idea is that we first count the number of elements that land in each bucket and assign a block of the array to each bucket, keeping track in each block of an initial prefix of values that belong in the bucket, with the rest not yet processed. We then walk through the buckets, swapping any element sitting just past a bucket’s finished prefix into the bucket where it belongs. This procedure puts at least one element in the right bucket for each swap, so we reorder everything correctly in at most n swaps and O(n) additional work.

To keep track of each bucket, we use two pointers: bucket[i] for the first element of the bucket and top[i] for the first unused element. We could use integer array indices instead, but this slows the code down by about 10%, apparently because the resulting double indexing is complicated enough that the compiler can’t optimize out the extra array lookups.

#### Further optimization

Since we are detecting the heaviest bucket anyway, there is a straightforward optimization that speeds the sort up noticeably on inputs with a lot of duplicates: if everything would land in the same bucket, we can skip the bucket-sort and just go directly to the next character.

#### Sample implementation

Here is an implementation of MSB radix sort using the ideas above:

#include <assert.h>
#include <limits.h>
#include <string.h>

/* in-place MSB radix sort for null-terminated strings */

/* helper routine for swapping */
static void
swapStrings(const char **a, const char **b)
{
const char *temp;

temp = *a;
*a = *b;
*b = temp;
}

/* this is the internal routine that assumes all strings are equal for the
* first k characters */
static void
radixSortInternal(int n, const char **a, int k)
{
int i;
int count[UCHAR_MAX+1];  /* number of strings with given character in position k */
int mode;                /* most common position-k character */
const char **bucket[UCHAR_MAX+1]; /* position of character block in output */
const char **top[UCHAR_MAX+1];    /* first unused index in this character block */

/* loop implements tail recursion on most common character */
while(n > 1) {

/* count occurrences of each character */
memset(count, 0, sizeof(int)*(UCHAR_MAX+1));

for(i = 0; i < n; i++) {
count[(unsigned char) a[i][k]]++;
}

/* find the most common nonzero character */
/* we will handle this specially */
mode = 1;
for(i = 2; i < UCHAR_MAX+1; i++) {
if(count[i] > count[mode]) {
mode = i;
}
}

if(count[mode] < n) {

/* generate bucket and top fields */
top[0] = bucket[0] = a;
for(i = 1; i < UCHAR_MAX+1; i++) {
top[i] = bucket[i] = bucket[i-1] + count[i-1];
}

/* reorder elements by k-th character */
/* this is similar to dutch flag algorithm */
/* we start at bottom character and swap values out until everything is in place */
/* invariant: for every pointer p with bucket[i] <= p < top[i], p[0][k] == i */
for(i = 0; i < UCHAR_MAX+1; i++) {
while(top[i] < bucket[i] + count[i]) {
if((unsigned char) top[i][0][k] == i) {
/* already in the right block; advance top */
top[i]++;
} else {
/* swap with top of appropriate block */
swapStrings(top[i], top[(unsigned char) top[i][0][k]]++);
}
}
}

/* we have now reordered everything */
/* recurse on all but 0 and mode */
for(i = 1; i < UCHAR_MAX+1; i++) {
if(i != mode) {
radixSortInternal(count[i], bucket[i], k+1);
}
}

/* tail recurse on mode */
n = count[mode];
a = bucket[mode];
k = k+1;

} else {

/* tail recurse on whole pile */
k = k+1;
}
}
}

void
radixSort(int n, const char **a)
{
radixSortInternal(n, a, 0);
}

Some additional files: radixSort.h, test_radixSort.c, Makefile, sortInput.c. The last is a program that sorts lines on stdin and writes the result to stdout, similar to the GNU sort utility. When compiled with -O3 and run on my machine, this runs in about the same time as the standard sort program on a 4.7 million line input file consisting of a random shuffle of 20 copies of /usr/share/dict/words, provided sort is run with LANG=C to keep it from having to deal with locale-specific collating issues. On other inputs, sort is faster. This is not bad given how thoroughly sort has been optimized, but is a sign that further optimization is possible.