
# Lecture 6 - linear sorting

7.1-2: Show that an n-element heap has height ⌊lg n⌋.

Since a heap is a balanced binary tree, its height is clearly Θ(lg n), but the problem asks for an exact answer.

The height is defined as the number of edges in the longest simple path from the root. The number of nodes in a complete balanced binary tree of height h is 2^(h+1) - 1.

Thus the height increases only when n = 2^h, or in other words when lg n is an integer, so an n-element heap has height ⌊lg n⌋.
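The closed form is easy to check directly; a minimal sketch in Python (the function name is my own):

```python
def heap_height(n):
    """Height of an n-element heap: floor(lg n), computed exactly
    via the bit length of n (no floating-point log needed)."""
    assert n >= 1
    return n.bit_length() - 1

# The height bumps up by one exactly when n reaches a power of two:
print([heap_height(n) for n in range(1, 9)])  # [0, 1, 1, 2, 2, 2, 2, 3]
```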

7.1-5: Is a reverse-sorted array a heap?

In a heap, each element is greater than or equal to each of its descendants.

In the array representation of a heap, the descendants of the ith element are the 2ith and (2i+1)th elements.

If A is sorted in reverse order, then i ≤ j implies that A[i] ≥ A[j].

Since 2i > i and 2i+1 > i, we have A[i] ≥ A[2i] and A[i] ≥ A[2i+1].

Thus by definition A is a heap!
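A quick sanity check of the argument, using the lecture's 1-indexed array convention (the checker function is my own):

```python
def is_max_heap(A):
    """Verify the heap property on a 1-indexed array; A[0] is unused."""
    n = len(A) - 1
    for i in range(1, n + 1):
        for child in (2 * i, 2 * i + 1):      # children of element i
            if child <= n and A[i] < A[child]:
                return False
    return True

A = [None] + [8, 7, 6, 5, 4, 3, 2, 1]         # reverse sorted
print(is_max_heap(A))  # True
```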

Can we sort in better than Θ(n lg n)?

Any comparison-based sorting program can be thought of as defining a decision tree of possible executions.

Running the same program twice on the same permutation causes it to do exactly the same thing, but running it on different permutations of the same data causes a different sequence of comparisons to be made on each. Claim: the height of this decision tree is the worst-case complexity of sorting.

Once you believe this, a lower bound on the time complexity of sorting follows easily.

Since any two different permutations of n elements require a different sequence of steps to sort, there must be at least n! different paths from the root to leaves in the decision tree, i.e. at least n! different leaves in the tree.

Since only binary comparisons (less than or greater than) are used, the decision tree is a binary tree.

Since a binary tree of height h has at most 2^h leaves, we know n! ≤ 2^h, or h ≥ lg(n!).
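Numerically, lg(n!) really does grow like n lg n; a small check (using Python's math library):

```python
import math

n = 100
h_min = math.log2(math.factorial(n))  # lg(n!): minimum decision-tree height
print(round(h_min, 1))                # about 524.8
print(round(n * math.log2(n), 1))     # n lg n is about 664.4, the same order
```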

By inspection n! > (n/2)^(n/2), since the last n/2 terms of the product are each greater than n/2, so h ≥ lg(n!) > (n/2) lg(n/2) = Ω(n lg n). By Stirling's approximation, a better bound is n! > (n/e)^n, where e = 2.718.

Non-Comparison-Based Sorting

All the sorting algorithms we have seen assume binary comparisons as the basic primitive, questions of the form ``is x before y?''.

Suppose you were given a deck of playing cards to sort. Most likely you would set up 13 piles and put all cards with the same number in one pile.

A 2 3 4 5 6 7 8 9 10 J Q K
A 2 3 4 5 6 7 8 9 10 J Q K
A 2 3 4 5 6 7 8 9 10 J Q K
A 2 3 4 5 6 7 8 9 10 J Q K

(13 piles, one per rank, each ending up with the four cards of that rank, one per suit)

With only a constant number of cards left in each pile, you can use insertion sort to order by suit and concatenate everything together.

If we can find the correct pile for each card in constant time, and each pile gets O(1) cards, this algorithm takes O(n) time.
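The pile-based card sort can be sketched as follows (the representation and names are my own, and I use Python's built-in sort in place of insertion sort for the tiny per-pile sorts):

```python
import random

def card_sort(cards):
    """Sort (rank, suit) pairs, ranks 1..13: drop each card into its
    rank pile in O(1), then sort each O(1)-sized pile by suit."""
    piles = [[] for _ in range(13)]
    for rank, suit in cards:
        piles[rank - 1].append((rank, suit))
    result = []
    for pile in piles:                  # concatenate the piles in rank order
        result.extend(sorted(pile))     # at most 4 cards per pile
    return result

deck = [(r, s) for s in "CDHS" for r in range(1, 14)]
random.shuffle(deck)
print(card_sort(deck) == sorted(deck))  # True
```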

Bucketsort

Suppose we are sorting n numbers from 1 to m, where we know the numbers are approximately uniformly distributed.

We can set up n buckets, each responsible for an interval of m/n numbers from 1 to m. Given an input number x, it belongs in bucket number ⌈xn/m⌉.

If we use an array of buckets, each item gets mapped to the right bucket in O(1) time.

With uniformly distributed keys, the expected number of items per bucket is 1. Thus sorting each bucket takes O(1) time!

The total effort of bucketing, sorting buckets, and concatenating the sorted buckets together is O(n).
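A minimal bucketsort sketch under these assumptions (the clamping and the 0-based bucket storage are details of my own):

```python
import math

def bucketsort(items, m):
    """Sort n numbers drawn from [1, m], assumed roughly uniform."""
    n = len(items)
    buckets = [[] for _ in range(n)]
    for x in items:
        b = min(n, math.ceil(x * n / m)) - 1  # bucket ceil(xn/m), 0-based
        buckets[b].append(x)
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))            # expected O(1) items per bucket
    return out

print(bucketsort([7, 2, 9, 4, 1], 10))  # [1, 2, 4, 7, 9]
```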

What happened to our lower bound? It applies only to comparison-based sorting, and bucketsort inspects the key values themselves rather than comparing pairs of elements.

We can use bucketsort effectively whenever we understand the distribution of the data.

However, bad things happen when we assume the wrong distribution.

Suppose in the previous example all the keys happened to be 1. After the bucketing phase, every item sits in the same bucket: we spent linear time distributing our items into buckets and learned nothing. Perhaps we could split the big bucket recursively, but it is not certain that we will ever win unless we understand the distribution.

Problems like this are why we worry about the worst-case performance of algorithms!

Such distribution techniques can be used on strings instead of just numbers. The buckets will correspond to letter ranges instead of just number ranges.

The worst case ``shouldn't'' happen if we understand the distribution of our data.

Real World Distributions

Consider the distribution of names in a telephone book.

• Will there be a lot of Skiena's?
• Will there be a lot of Smith's?
• Will there be a lot of Shifflett's?

Either make sure you understand your data, or use a good worst-case or randomized algorithm!

The Shifflett's of Charlottesville

For comparison, note that there are seven Shifflett's (of various spellings) in the 1000 page Manhattan telephone directory.

Rules for Algorithm Design

The secret to successful algorithm design, and problem solving in general, is to make sure you ask the right questions. Below, I give a possible series of questions for you to ask yourself as you try to solve difficult algorithm design problems:

1. Do I really understand the problem?

1. What exactly does the input consist of?
2. What exactly are the desired results or output?
3. Can I construct some examples small enough to solve by hand? What happens when I solve them?
4. Am I trying to solve a numerical problem? A graph algorithm problem? A geometric problem? A string problem? A set problem? Might my problem be formulated in more than one way? Which formulation seems easiest?

2. Can I find a simple algorithm for the problem?

1. Can I solve my problem exactly by searching all subsets or arrangements and picking the best one?

1. If so, why am I sure that this algorithm always gives the correct answer?
2. How do I measure the quality of a solution once I construct it?
3. Does this simple, slow solution run in polynomial or exponential time?
4. If I can't find a slow, guaranteed correct algorithm, am I sure that my problem is well defined enough to permit a solution?
2. Can I solve my problem by repeatedly trying some heuristic rule, like picking the biggest item first? The smallest item first? A random item first?
1. If so, on what types of inputs does this heuristic rule work well? Do these correspond to the types of inputs that might arise in the application?
2. On what types of inputs does this heuristic rule work badly? If no such examples can be found, can I show that in fact it always works well?
3. How fast does my heuristic rule come up with an answer?

3. Are there special cases of this problem I know how to solve exactly?

1. Can I solve it efficiently when I ignore some of the input parameters?
2. What happens when I set some of the input parameters to trivial values, such as 0 or 1?
3. Can I simplify the problem to create a problem I can solve efficiently? How simple do I have to make it?
4. If I can solve a certain special case, why can't this be generalized to a wider class of inputs?

4. Which of the standard algorithm design paradigms seem most relevant to the problem?

1. Is there a set of items which can be sorted by size or some key? Does this sorted order make it easier to find what might be the answer?
2. Is there a way to split the problem in two smaller problems, perhaps by doing a binary search, or a partition of the elements into big and small, or left and right? If so, does this suggest a divide-and-conquer algorithm?
3. Are there certain operations being repeatedly done on the same data, such as searching it for some element, or finding the largest/smallest remaining element? If so, can I use a data structure to speed up these queries, like hash tables or a heap/priority queue?

5. Am I still stumped?

1. Why don't I go back to the beginning of the list and work through the questions again? Do any of my answers from the first trip change on the second?

Algorithms
Mon Jun 2 09:21:39 EDT 1997