Input description: A text string t and a pattern string p. An edit cost bound k.
Problem description: Can we transform t to p using at most k insertions, deletions, and substitutions?
Discussion: Approximate string matching is fundamental to text processing, because we live in an error-prone world. Any spelling correction program must be able to identify the closest match for any text string not found in a dictionary. Genbank has become a fundamental tool for molecular biology by supporting homology (similarity) searches on DNA sequences. Suppose you were to sequence a new gene in man, and you discovered that it is similar to the hemoglobin gene in rats. It is likely that this new gene also produces hemoglobin, and any differences are the result of genetic mutations during evolution.
I once encountered approximate string matching in evaluating the performance of an optical character recognition system that we built. After scanning and recognizing a test document, we needed to compare the correct answers with those produced by our system. To improve our system, it was important to count how often each pair of letters were getting confused and to identify gibberish when the program was trying to make out letters where none existed. The solution was to do an alignment between the two texts. Insertions and deletions corresponded to gibberish, while substitutions signaled errors in our recognizers. This same principle is used in file difference programs, which identify the lines that have changed between two versions of a file.
When no errors are permitted, the problem becomes that of exact string matching, which is presented in Section . Here, we restrict our discussion to dealing with errors.
Dynamic programming provides the basic approach to approximate string matching, as discussed in Section . Let D[i,j] denote the cost of editing the first i characters of the text string t into the first j characters of the pattern string p. The recurrence follows because we must have done something with and . The only options are matching them, substituting one for the other, deleting , or inserting a match for . Thus D[i,j] is the minimum of the costs of these possibilities:
Several issues remain before we can make full use of this recurrence:
Now suppose that the pattern may occur anywhere within the text. The proper cost of D[i,0] is now 0, since there should be no penalty for starting the alignment in the ith position. The cost of D[0,j] remains j, however, since the only way to match the first j pattern characters with nothing is to delete all of them. The cost of the best pattern match against the text will be given by .
The most common cost assignment charges the same for insertion, deletion, or substitution. Charging a substitution cost of more than insertion + deletion is a good way to ensure that substitutions never get performed, since it will be cheaper to edit both characters out of the string. With just insertion and deletion to work with, the problem reduces to longest common subsequence, discussed in Section . Often, it pays to tweak the edit distance costs and study the resulting alignments until you find a set that does the job.
To save space, we can use Hirschberg's clever recursive algorithm. During one pass of the linear-space algorithm above to compute D[m,n], we maintain all the values for the (m/2)nd column and identify which middle-element cell D[m/2,x] was used to optimize D[m,n]. This reduces our problem to finding the best paths from D[1,1] to D[m/2,x] and from D[m/2,x] to D[m/2,n], both of which can be solved recursively. Each time we throw away half of the matrix elements from consideration, and so the total time remains O(mn). This linear-space algorithm proves to be a big win in practice on long strings, although it is somewhat more difficult to program.
The algorithm works by dropping vowels and silent letters, removing doubled letters, and then assigning the remaining letters numbers from the following classes: BFPV gets a 1, CGJKQSXZ gets a 2, DT gets a 3, L gets a 4, MN gets a 5, and R gets a 6. The code starts with the first letter and contains at most three digits. Although this all sounds very hokey, experience shows that it works reasonably well. Experience indeed: Soundex has been used since the 1920's.
Implementations: The best available software tools for approximate pattern matching are glimpse and agrep [WM92a, WM92b], developed by Manber and Wu at the University of Arizona and available from http://glimpse.cs.arizona.edu/. Glimpse is a tool for building and using an index to search through file systems, while agrep (approximate general regular expression pattern matcher) is a tool supporting text search with spelling errors. Both programs are widely used and respected.
ht://Dig is an alternative WWW text search engine from Andrew Scherpbier, which contains implementations of Soundex and Metaphone. It is available from http://htdig.sdsu.edu/ and is released under the GNU general public license.
Implementations in C of the Soundex and dynamic programming edit-distance algorithms appear in [BR95]. The code for these algorithms is printed in the text and is available on disk for a modest fee.
Bare bones implementations in C and Pascal of several algorithms for exact and approximate string matching appear in [GBY91]. See Section for further details.
Notes: The wide range of applications for approximate string matching was made apparent in Sankoff and Kruskal's book [SK83], which remains a useful historical reference for the problem. Surveys on approximate pattern matching include [HD80]. The basic dynamic programming alignment algorithm is attributed to [WF74], although it is apparently folklore. The edit distance between two strings is sometimes referred to as the Levenshtein distance. Expositions of dynamic programming to compute Levenshtein distance include [Baa88, CLR90, Man89]. Expositions of Hirschberg's linear-space algorithm [Hir75] include [CR94, Gus97].
Masek and Paterson [MP80] compute the edit distance between m- and n-length strings in time for constant-sized alphabets, using ideas from the four Russians algorithm for Boolean matrix multiplication [ADKF70]. The shortest path formulation leads to a variety of algorithms that are good when the edit distance is small, including an algorithm due to Myers [Mye86]. Longest increasing subsequence can be done in time [HS77], as presented in [Man89].
Soundex was invented and patented by M. K. Odell and R. C. Russell. Expositions on Soundex include [BR95, Knu73b]. Metaphone is a recent attempt to improve on Soundex [BR95, Par90].
Related Problems: String matching (see page ), longest common substring (see page ).