Input description: A set of strings .
Problem description: Find the shortest string S' that contains each element of S as a substring.
Discussion: Shortest common superstring arises in a variety of applications, including sparse matrix compression. Suppose we have an matrix with most of the elements being zero. We can partition each row into m/k runs of k elements each and construct the shortest common superstring S' of these runs. We now have reduced the problem to storing the superstring, plus an array of pointers into the superstring denoting where each of the runs starts. Accessing a particular element M[i,j] still takes constant time, but there is a space savings when |S| << mn.
Another application arises in DNA sequencing. It happens to be easy to sequence small fragments of DNA, say up to about 500 base pairs or characters. However, the real interest is in sequencing large molecules. The standard approach to large-scale, ``shotgun'' sequencing clones many copies of the target molecule, breaks them randomly into small fragments, sequences the fragments, and then proposes the shortest superstring of the fragments as the correct sequence. While it is an article of faith that the shortest superstring will be the most likely sequence, this seems to work reasonably well in practice.
Finding a superstring of all the substrings is not difficult, as we can simply concatenate them all together. It is finding the shortest such string that is problematic. Indeed, shortest common superstring remains NP-complete under all reasonable classes of strings.
The problem of finding the shortest common superstring can easily be reduced to that of the traveling salesman problem (see Section ). Create an overlap graph G where vertex represents string . Edge will have weight equal to the length of minus the overlap of with . The path visiting all the vertices of minimum total weight defines the shortest common superstring. The edge weights of this graph are not symmetric, after all, the overlap of and is not the same as the overlap of and . Thus only programs capable of solving asymmetric TSPs can be applied to this problem.
The greedy heuristic is the standard approach to approximating the shortest common superstring. Find the pair of strings with the maximum number of characters of overlap. Replace them by the merged string, and repeat until only one string remains. Given the overlap graph above, this heuristic can be efficiently implemented by inserting all of the edge weights into a heap (see Section ) and then merging if the appropriate ends of the two strings have not yet be used, which can be maintained with an array of Boolean flags.
The potentially time-consuming part of this heuristic is in building the overlap graph. The brute-force approach to finding the maximum overlap of two length-l strings takes , which must be repeated times. Faster times are possible by appropriately using suffix trees (see Section ). Build a tree containing all suffixes of all reversed strings of S. String overlaps with if a suffix of matches a suffix of the reverse of . The longest overlap for each fragment can be found in time linear in its length.
How well does the greedy heuristic perform? If we are unlucky with the input, the greedy heuristic can be fooled into creating a superstring that is at least twice as long as optimal. Usually, it will be a lot better in practice. It is known that the resulting superstring can never be more than 2.75 times optimal.
Building superstrings becomes more difficult with positive and negative substrings, where negative substrings cannot be substrings of the superstring. The problem of deciding whether any such consistent substring exists is NP-complete, unless you are allowed to add an extra character to the alphabet to use as a spacer.
Implementations: CAP (Contig Assembly Program) [Hua92] by Xiaoqiu Huang is a C language program supporting DNA shotgun sequencing by finding the shortest common superstring of a set of fragments. As to performance, CAP took 4 hours to assemble 1,015 fragments of a total of 252,000 characters on a Sun SPARCstation SLC. Certain parameters will need to be tweaked to make it accommodate non-DNA data. It is available by anonymous ftp from cs.mtu.edu in the pub/huang directory.
Notes: The shortest common superstring problem and its application to DNA shotgun assembly is ably surveyed in [Wat95]. Kececioglu and Myers [KM95] report on an algorithm for the more general version of shortest common superstring, where the strings may have errors. Their paper is recommended reading to anyone interested in fragment assembly.
Blum et al. [BJL 91] gave the first constant-factor approximation algorithms for shortest common superstring, with a variation of the greedy heuristic. More recent research has beaten the constant down to 2.75, progress towards the expected factor-two result.
Related Problems: Suffix trees (see page ), text compression (see page ).