Next: Searching Up: Combinatorial Problems Previous: Combinatorial Problems

## Sorting

Input description: A set of n items.

Problem description: Arrange the items in increasing order.

Discussion: Sorting is the fundamental algorithmic problem in computer science. Learning the different sorting algorithms is like learning scales for a musician.   Sorting is the first step in solving a host of other algorithm problems, as discussed in Section . Indeed, ``when in doubt, sort'' is one of the first rules of algorithm design.

Sorting is also used to illustrate the standard paradigms of algorithm design.   The result is that most programmers are familiar with many different sorting algorithms, which sows confusion as to which should be used for a given application. The following criteria can help you decide:

• How many keys will you be sorting? - For small amounts of data (say n ≤ 100), it really doesn't matter much which of the quadratic-time algorithms you use. Insertion sort is faster, simpler, and less likely to be buggy than bubblesort. Shellsort is much faster than insertion sort, but it involves somewhat trickier programming and looking up good increment sequences in Knuth [Knu73b].

If you have more than 100 items to sort, it is important to use an O(n log n)-time algorithm, like heapsort, quicksort, or mergesort. If you have more than 1,000,000 items to sort, you probably need an external-memory algorithm that minimizes disk access. Both types of algorithms are discussed below.
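For the small-input case, insertion sort can be sketched in a few lines of C. The function name is mine, and a production version would take an element size and comparator as the standard qsort() does; this is the bare algorithm:

```c
#include <stddef.h>

/* Insertion sort: quadratic in the worst case, but simple, hard to
   get wrong, and fast on small or nearly-sorted inputs. */
void insertion_sort(int a[], size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];   /* shift larger elements one slot right */
            j--;
        }
        a[j] = key;            /* drop the key into its place */
    }
}
```

Note that on an already-sorted input the inner loop never iterates, so the routine runs in linear time, which is why insertion sort does well on partially sorted data.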

• Will there be duplicate keys in the data? -   When all items have distinct keys, the sorted order is completely defined. However, when two items share the same key, something else must determine which one comes first. For many applications it doesn't matter, so any sorting algorithm is equally good.   Often, ties are broken by sorting on a secondary key, like the first name or initial if the family names collide.

Occasionally, ties need to be broken by their initial position in the data set. If the 5th and 27th items of the initial data set share the same key in such a case, the 5th item must be before the 27th in the final order. A stable sorting algorithm preserves the original ordering in case of ties. Most of the quadratic-time sorting algorithms are stable, while many of the O(n log n) algorithms are not. If it is important that your sort be stable, it is probably better to explicitly use the initial position as a secondary key rather than trust the stability of your implementation.
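One way to realize this advice is to tag each record with its initial position and compare on it as the secondary key. Here is an illustrative sketch for the standard qsort(), which is not guaranteed to be stable; the struct and names are my own:

```c
#include <stdlib.h>

/* A record tagged with its original position, so ties on key are
   broken by initial order even though qsort() itself may not be stable. */
struct rec {
    int    key;
    size_t pos;   /* initial position, recorded before sorting */
};

static int cmp_rec(const void *a, const void *b)
{
    const struct rec *x = a, *y = b;
    if (x->key != y->key)                       /* primary key */
        return (x->key > y->key) - (x->key < y->key);
    return (x->pos > y->pos) - (x->pos < y->pos); /* secondary key */
}
```

Before calling qsort(r, n, sizeof r[0], cmp_rec), set r[i].pos = i for each record; the result is then stable regardless of the library's behavior.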

• What do you know about your data? - In special applications, you can often exploit knowledge about your data to get it sorted faster or more easily. Of course, general sorting is a fast O(n log n) algorithm, so if the time spent sorting is really the bottleneck in your application, you are a fortunate person indeed.
• Is the data already partially sorted? If so, algorithms like insertion sort perform better than they otherwise would.
• Do you know the distribution of the keys?     If the keys are randomly or uniformly distributed, a bucket or distribution sort makes sense. Throw the keys into bins based on their first letter, and recur until each bin is small enough to sort by brute force.   This is very efficient when the keys get evenly distributed into buckets. However, bucket sort would be bad news sorting names on the mailing list of the ``Smith Society.''
• Are your keys very long or hard to compare?   If your keys are long text strings, it might pay to use a radix or bucket sort instead of a standard comparison sort, because the time of each comparison can get expensive.   A radix sort always takes time linear in the number of characters in the file, instead of O(n log n) times the cost of comparing two keys.
• Is the range of possible keys very small?   If you want to sort a subset of n/2 distinct integers, each with a value from 1 to n, the fastest algorithm would be to initialize an n-element bit vector, turn on the bits corresponding to keys, then scan from left to right and report which bits are on.
• Do I have to worry about disk accesses? - In massive sorting problems, it may not be possible to keep all data in memory simultaneously. Such a problem is called external sorting, because one must use an external storage device.     Traditionally, this meant tape drives, and Knuth [Knu73b] describes a variety of intricate algorithms for efficiently merging data from different tapes.   Today, it usually means virtual memory and swapping.   Any sorting algorithm will work with virtual memory, but most will spend all their time swapping.

The simplest approach to external sorting loads the data into a B-tree (see Section ) and then does an in-order traversal of the tree to read the keys off in sorted order. Other approaches are based on mergesort. Files containing portions of the data are sorted using a fast internal sort, and then these files are merged in stages using 2- or k-way merging. Complicated merging patterns based on the properties of the external storage device can be used to optimize performance.
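The bit-vector trick for small key ranges, mentioned above, can be sketched as follows. The function name is mine, and for simplicity it spends a byte per flag rather than a packed bit; the keys must be distinct, since duplicates would collapse to one flag:

```c
#include <stdlib.h>

/* Sort m distinct integers, each with a value from 1 to n, by marking
   flags and scanning: O(n) time and n flags of space.  out[] must hold
   at least m entries; returns the number of keys written (0 on failure). */
size_t bitvector_sort(const int *keys, size_t m, int n, int *out)
{
    unsigned char *flag = calloc((size_t)n + 1, 1); /* one flag per key value */
    size_t written = 0;
    if (flag == NULL)
        return 0;
    for (size_t i = 0; i < m; i++)
        flag[keys[i]] = 1;              /* turn on the key's flag */
    for (int v = 1; v <= n; v++)        /* scan from left to right */
        if (flag[v])
            out[written++] = v;
    free(flag);
    return written;
}
```

A real implementation would pack eight flags per byte, which is exactly the n-element bit vector described above.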

• How much time do you have to write and debug your routine? - If I had under an hour to deliver a working routine, I would probably just use a simple selection sort. If I had an afternoon to build an efficient sort routine, I would probably use heapsort, for it delivers reliable performance without tuning. If I were going to take the time required to build a fast system sort routine, I would carefully implement quicksort.
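As a sketch of the "afternoon" option, here is a plain heapsort in C: O(n log n) in the worst case, in place, and with nothing to tune. The function names are mine:

```c
#include <stddef.h>

/* Sift the element at index i down into the max-heap occupying a[0..n-1]. */
static void sift_down(int a[], size_t n, size_t i)
{
    for (;;) {
        size_t largest = i;
        size_t l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i)
            return;                       /* heap property restored */
        int t = a[i]; a[i] = a[largest]; a[largest] = t;
        i = largest;
    }
}

/* Heapsort: build a max-heap, then repeatedly swap the maximum to the
   end of the array and shrink the heap. */
void heap_sort(int a[], size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n / 2; i-- > 0; )     /* build the heap bottom-up */
        sift_down(a, n, i);
    for (size_t end = n - 1; end > 0; end--) {
        int t = a[0]; a[0] = a[end]; a[end] = t;  /* extract the max */
        sift_down(a, end, 0);
    }
}
```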

The best general-purpose sorting algorithm is quicksort (see Section ), although it requires considerable tuning effort to achieve maximum performance.   Indeed, you are probably better off using a library function instead of doing it yourself. A poorly written quicksort will likely run more slowly than a poorly written heapsort.

If you are determined to implement your own quicksort, use the following heuristics, which make a big difference in practice:

• Use randomization -   By randomly permuting (see Section ) the keys before sorting, you can eliminate the potential embarrassment of quadratic-time behavior on nearly-sorted data.
• Median of three -   For your pivot element, use the median of the first, last, and middle elements of the array, to increase the likelihood of partitioning the array into roughly equal pieces. Some experiments suggest using a larger sample on big subarrays and a smaller sample on small ones.
• Leave small subarrays for insertion sort -   Terminating the quicksort recursion and switching to insertion sort makes sense when the subarrays get small, say fewer than 20 elements. You should experiment to determine the best switchpoint for your implementation.
• Do the smaller partition first -   Assuming that your compiler is smart enough to remove tail recursion, you can minimize runtime memory by processing the smaller partition before the larger one. Since each successive stacked call is at most half as large as the one before, only O(log n) stack space is needed.

Before you get started, see Bentley's article on building a faster quicksort [Ben92b].
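Three of the heuristics above - median-of-three pivoting, a small-subarray cutoff finished by one insertion-sort pass, and recursing on the smaller partition first - can be combined in a single sketch. Randomization is omitted here; it would be a pre-pass that permutes the array. All function names and the CUTOFF value are illustrative, not canonical:

```c
#include <stddef.h>

enum { CUTOFF = 20 };  /* subarrays below this are left for insertion sort */

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* One insertion-sort pass over the whole array.  After the cutoff,
   every element sits in an unsorted run of fewer than CUTOFF slots,
   so this pass costs only O(n * CUTOFF). */
static void insertion_pass(int a[], size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

static void quick_rec(int a[], long lo, long hi)
{
    while (hi - lo + 1 > CUTOFF) {
        /* Median-of-three: order a[lo], a[mid], a[hi]; pivot ends at mid. */
        long mid = lo + (hi - lo) / 2;
        if (a[mid] < a[lo]) swap_int(&a[mid], &a[lo]);
        if (a[hi]  < a[lo]) swap_int(&a[hi],  &a[lo]);
        if (a[hi]  < a[mid]) swap_int(&a[hi], &a[mid]);
        int pivot = a[mid];

        long i = lo, j = hi;             /* Hoare-style partition */
        while (i <= j) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j)
                swap_int(&a[i++], &a[j--]);
        }
        /* Recurse on the smaller partition and loop on the larger,
           so at most O(log n) calls are stacked at once. */
        if (j - lo < hi - i) {
            quick_rec(a, lo, j);
            lo = i;
        } else {
            quick_rec(a, i, hi);
            hi = j;
        }
    }
}

/* Tuned quicksort: median-of-three pivots, cutoff to insertion sort,
   smaller partition first. */
void hybrid_quicksort(int a[], size_t n)
{
    if (n > 1)
        quick_rec(a, 0, (long)n - 1);
    insertion_pass(a, n);
}
```

Deferring all the small runs to one final insertion-sort pass, rather than sorting each one as the recursion bottoms out, is a well-known variant of the cutoff heuristic; you should time both against your data before committing to either.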

Implementations: Pascal implementations of all the primary sorting algorithms are available from [MS91]. See Section for details. Timing comparisons show an optimized version of quicksort to be the winner.

Bare bones implementations of all basic sorting algorithms, in C and Pascal, appear in [GBY91].   Most notable is the inclusion of implementations of external memory sorting algorithms. Sedgewick includes similar sort routine fragments in C++.   See Section for details.

XTango (see Section ) is an algorithm animation system for UNIX and X-windows, which includes animations of all the basic sorting algorithms, including bubblesort, heapsort, mergesort, quicksort, radix sort, and shellsort. Many of these are quite interesting to watch. Indeed, sorting is the canonical problem for algorithm animation.

Algorithm 410 [Cha71] and Algorithm 505 [Jan76] of the Collected Algorithms of the ACM are Fortran codes for sorting. The latter is an implementation of Shellsort on linked lists. Both are available from Netlib (see Section ).

C language implementations of Shellsort, quicksort, and heapsort appear in [BR95]. The code for these algorithms is printed in the text and available on disk for a modest fee.

A bare bones implementation of heapsort in Fortran from [NW78] can be obtained in Section .     A bare bones implementation of heapsort in Mathematica from [Ski90] can be obtained in Section .

Notes: Knuth [Knu73b] is the best book that has been written on sorting and indeed is the best book that will ever be written on sorting. It is now almost twenty-five years old, and a revised edition is promised, but it remains fascinating reading. One area that has developed since Knuth is sorting under presortedness measures.   A newer and noteworthy reference on sorting is [GBY91], which includes pointers to algorithms for partially sorted data and includes implementations in C and Pascal for all of the fundamental algorithms.

Expositions on the basic internal sorting algorithms appear in every algorithms text, including [AHU83, Baa88, CLR90, Man89]. Treatments of external sorting are rarer but include [AHU83]. Heapsort was first invented by Williams [Wil64]. Quicksort was invented by Hoare [Hoa62], with careful analysis and implementation by Sedgewick [Sed78].   Von Neumann is credited with having produced the first implementation of mergesort, on the EDVAC in 1945.   See Knuth for a full discussion of the history of sorting, dating back to the days of punched-card tabulating machines.

Sorting has a well-known Ω(n log n) lower bound under the algebraic decision tree model [BO83].   Determining the exact number of comparisons required for sorting n elements, for small values of n, has generated considerable study. See [Aig88, Raw92] for expositions.

This lower bound does not hold under different models of computation. Fredman and Willard [FW93] present an algorithm for sorting under a model of computation that permits arithmetic operations on keys. Under a similar model, Thorup [Tho96] developed a priority queue supporting O(log log n) time operations, implying an O(n log log n) sorting algorithm.

Related Problems: Dictionaries (see page ), searching (see page ), topological sorting (see page ).


Algorithms
Mon Jun 2 23:33:50 EDT 1997