Input description: A text string t of length n. A pattern string p of length m.
Problem description: Find the first (or all) instances of the pattern in the text.
Discussion: String matching is fundamental to database and text processing applications. Every text editor must contain a mechanism to search the current document for arbitrary strings. Pattern matching programming languages such as Perl and Awk derive much of their power from their built-in string matching primitives, making it easy to fashion programs that filter and modify text. Spelling checkers scan an input text for words in the dictionary and reject any strings that do not match.
Several issues arise in identifying the right string matching algorithm for the job:
For very short patterns, say , you can't hope to beat brute force by much, if at all, and you shouldn't try. Further, since we can reject the possibility of a match at a given starting position the instant we observe a text/pattern mismatch, we expect much better than O(mn) behavior for typical strings. Indeed, the trivial algorithm usually runs in linear time. But the worst case certainly can occur, as with pattern and text .
Even better in practice is the Boyer-Moore algorithm, although it offers similar worst-case performance. Boyer-Moore matches the pattern against the text from right to left, in order to avoid looking at large chunks of text on a mismatch. Suppose the pattern is abracadabra, and the eleventh character of the text is x. This pattern cannot match in any of the first eleven starting positions of the text, and so the next necessary position to test is the 22nd character. If we get very lucky, only n/m characters need ever be tested. The Boyer-Moore algorithm involves two sets of jump tables in the case of a mismatch: one based on pattern matched so far, the other on the text character seen in the mismatch. Although somewhat more complicated than Knuth-Morris-Pratt, it is worth it in practice for patterns of length m > 5.
Performing a linear-time search for each of these patterns yields an O(k (m+n)) algorithm. If k is large, a better solution builds a single finite automaton that recognizes each of these patterns and returns to the appropriate start state on any character mismatch. The Aho-Corasick algorithm builds such an automaton in linear time. Space savings can be achieved by optimizing the pattern recognition automaton, as discussed in Section . This algorithm was used in the original version of fgrep.
Sometimes multiple patterns are specified not as a list of strings, but concisely as a regular expression. For example, the regular expression matches any string on (a,b,c) that begins and ends with an a, including a itself. The best way to test whether input strings are described by a given regular expression R is to construct the finite automaton equivalent to R and then simulate the machine on the string. Again, see Section for details on constructing automata from regular expressions.
Implementations: SPARE Parts is a string pattern recognition toolkit, written in C++ by Bruce Watson. It provides production-quality implementations of all major variants of the classical string matching algorithms for single patterns (both Knuth-Morris-Pratt and Boyer-Moore) and multiple patterns (both Aho-Corasick and Commentz-Walter). SPARE Parts is available by anonymous ftp from ftp.win.tue.nl in /pub/techreports/pi/watson.phd/. A greatly improved commercial version is available from www.RibbitSoft.com.
XTango (see Section ) provides animations for both the Boyer-Moore and Knuth-Morris-Pratt algorithms. The C source code for each animation is included.
Implementations in C and Pascal of several algorithms for exact and approximate string matching appear in [GBY91]. Sedgewick provides similar implementations of Knuth-Morris-Pratt, Rabin-Karp, and Boyer-Moore in C++. See Section for details on both codes.
Implementations in C of the Boyer-Moore, Aho-Corasick, and regular expression matching algorithms appear in [BR95]. The code for these algorithms is printed in the text and available on disk for a modest fee.
Notes: All books on string algorithms contain thorough discussions of exact string matching, including [CR94, Ste94, Gus97]. Good expositions on the Boyer-Moore [BM77] and Knuth-Morris-Pratt algorithms [KMP77] include [Baa88, CLR90, Man89].
Aho [Aho90] provides a good survey on algorithms for pattern matching in strings, particularly where the patterns are regular expressions instead of strings, and for the Aho-Corasick algorithm for multiple patterns [AC75]. An algorithm merging Aho-Corasick and Boyer-Moore can be faster for small numbers of patterns [CW79], but the window where it wins is apparently fairly narrow.
Empirical comparisons of string matching algorithms include [DB86, Hor80, dVS82]. Which algorithm performs best depends upon the properties of the strings and the size of the alphabet. For long patterns and texts, I recommend that you use the best implementation of Boyer-Moore that you can find.
An interesting classical problem is determining the minimum number of comparisons needed to perform exact string matching. A version of Boyer-Moore never makes more than 2n comparisons independent of the number of occurrences of the pattern in the text [AG86]. More recent results are very technical, depending upon the details of the model and the alphabet size. There is a lower bound of n-m+1 text characters that any algorithm must be examine in the worst case [Riv77]. The history of string matching algorithms is somewhat checkered because several published proofs were incorrect or incomplete [Gus97].
The Karp-Rabin algorithm [KR87] uses a hash function to perform string matching in linear expected time. Its worst-case time remains quadratic, and its performance in practice appears somewhat worse than the character comparison methods described above.
Related Problems: Suffix trees (see page ), approximate string matching (see page ).