That look in his eyes should have warned me even before he started talking.

``We want to use a parallel supercomputer for a numerical calculation up to 1,000,000,000, but we need a faster algorithm to do it.''

I'd seen that distant look before. Eyes dulled from too much exposure to the raw horsepower of supercomputers. Machines so fast that brute force seemed to eliminate the need for clever algorithms. So it always seemed, at least until the problems got hard enough.

``I am working with a Nobel prize winner to use a computer on a famous problem in number theory. Are you familiar with Waring's problem?''

I knew some number theory [NZ80]. ``Sure. Waring's problem asks whether every integer can be expressed in at least one way as the sum of at most four integer squares. For example, 78 = 8^2 + 3^2 + 2^2 + 1^2 = 7^2 + 5^2 + 2^2. It's a cute problem. I remember proving that four squares suffice in my undergraduate number theory class. Yes, it's a famous problem, sure, but one that got solved about 200 years ago.''

``No, we are interested in a different version of Waring's problem.
A *pyramidal number* is a number of the form (*m*^3 - *m*)/6, for *m* ≥ 2.
Thus the first several pyramidal numbers are 1, 4, 10, 20, 35, 56, 84, 120,
and 165.
The conjecture since 1928 is that every integer can be represented by
the sum of at most five such pyramidal numbers.
We want to use a supercomputer to prove this conjecture on all
numbers from 1 to 1,000,000,000.''
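The formula makes the pyramidal numbers themselves trivial to produce; a quick sketch in Python (the function name is my own, not from his program):

```python
def pyramidal_numbers(n):
    """All pyramidal numbers (m**3 - m) // 6 for m >= 2 that are at most n."""
    nums, m = [], 2
    while (m**3 - m) // 6 <= n:
        nums.append((m**3 - m) // 6)
        m += 1
    return nums

print(pyramidal_numbers(165))   # -> [1, 4, 10, 20, 35, 56, 84, 120, 165]
```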

``Doing a billion of anything will take a serious amount of time,'' I warned. ``The time you spend to compute the minimum representation of each number will be critical, because you are going to do it one billion times. Have you thought about what kind of an algorithm you are going to use?''

``We have already written our program and run it on a parallel supercomputer. It works very fast on smaller numbers. Still, it takes much too much time as soon as we get to 100,000 or so.''

``Terrific,'' I thought.
Our supercomputer junkie had discovered asymptotic growth.
No doubt his algorithm ran in something like quadratic time, and
he got burned as soon as *n* got large.

``We need a faster program in order to get to one billion. Can you help us? Of course, we can run it on our parallel supercomputer when you are ready.''

I'll confess that I am a sucker for this kind of challenge, finding better algorithms to speed up programs. I agreed to think about it and got down to work.

I started by looking at the program that the other guy had
written.
He had built an array of all the pyramidal
numbers from *1* to *n*.
To test each number *k* in this range, he did a brute force test
to establish whether it was the sum of two pyramidal numbers.
If not, the program tested whether it was the sum of three of them,
then four, and finally five of them, until it first got an answer.
About 45% of the integers are expressible as the sum of three pyramidal
numbers, while
most of the remaining 55% require the sum of four; usually each
can be represented many different ways.
Only 241 integers are known to require the sum of five pyramidal numbers,
the largest one being 343,867.
For about half of the *n* numbers, this algorithm presumably went through
all of the three-tests and at least some of the four-tests before terminating.
Since there are O(*n*^(1/3)) pyramidal numbers less than *n*, testing
all triples of them costs O((*n*^(1/3))^3) = O(*n*) time per integer.
Thus the total time for this algorithm would be at least
O(*n*^2) time, where *n* = 1,000,000,000.
No wonder his program cried uncle.
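His brute-force test can be sketched roughly as follows (a reconstruction in Python under my assumptions, not his actual program):

```python
def min_terms_bruteforce(k, pyr):
    """Fewest pyramidal numbers summing to k: try one term, then two, three, ..."""
    pyr_set = set(pyr)

    def sum_of(target, count):
        if count == 1:
            return target in pyr_set
        # try every pyramidal number as the next term
        return any(sum_of(target - p, count - 1) for p in pyr if p < target)

    count = 1
    while not sum_of(k, count):
        count += 1
    return count

# 17 = 10 + 4 + 1 + 1 + 1 is one of the integers that needs five terms
print(min_terms_bruteforce(17, [1, 4, 10]))   # -> 5
```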

Anything that was going to do significantly better on a problem this large
had to avoid explicitly testing all triples.
For each value of *k*, we were seeking the smallest number of pyramidal
numbers that sum exactly to *k*.
This problem has a name, the *knapsack problem*, and it is discussed in
Section .
In our case of interest,
the weights are the set of pyramidal
numbers no greater than *n*, with an additional constraint that the
numbers in the knapsack sum to exactly *k*.

The standard approach to solving knapsack precomputes the
sum of smaller subsets of the items for use in computing larger subsets.
In other words, if we want to know whether *k* is expressible as the sum of
three numbers, and we have a table of all sums of two numbers,
all we have to do is ask whether our number
is expressible as the sum of a single number plus a number in this two-table.

Thus what I needed was a table of all integers less than *n* that could
be expressed as the sum of two of the 1,818 pyramidal numbers less than
1,000,000,000.
There could be at most 1,818^2 = 3,305,124 of them.
Actually, there would be only about half
this many because we could eliminate duplicates, or any sum bigger than
our target.
Building a data structure, such as a sorted array,
to hold these numbers would be no big deal.
Call this data structure the *two*-table.
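Building the *two*-table amounts to forming all pairwise sums, then discarding duplicates and anything past the target; a Python sketch (names are mine):

```python
def build_two_table(pyr, n):
    """Sorted, duplicate-free list of all sums of two pyramidal numbers <= n."""
    sums = set()
    for i, p in enumerate(pyr):
        for q in pyr[i:]:          # q >= p, so each pair is formed only once
            if p + q <= n:         # drop any sum bigger than the target
                sums.add(p + q)
    return sorted(sums)

print(build_two_table([1, 4, 10, 20, 35], 40))
# -> [2, 5, 8, 11, 14, 20, 21, 24, 30, 36, 39, 40]
```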

To find the minimum decomposition for a given *k*, I would start out
by checking whether it was one of the 1,818 pyramidal numbers.
If not, I would then search to see whether *k* was in the sorted table of
the sums of
two pyramidal numbers.
If it wasn't,
to see whether *k* was expressible as the sum of three such numbers,
all I had to do
was check whether *k* - *p*[*i*] was in the *two*-table
for all 1 ≤ *i* ≤ 1,818.
This could be done quickly using binary search.
To see whether *k* was expressible as the sum of four pyramidal
numbers, I had to
check whether *k* - *two*[*i*] was in the *two*-table for all entries *two*[*i*] < *k*.
However, since almost every *k* was expressible in many ways as the sum of
four pyramidal numbers, this latter test would terminate quickly, and the
time taken would be dominated by the cost of the threes.
Testing whether *k* was the sum of three pyramidal
numbers would
take O(*n*^(1/3) log *n*) time, a binary search in the *two*-table
for each of the O(*n*^(1/3)) pyramidal numbers.
Running this on each of the *n* integers gives an
O(*n*^(4/3) log *n*) algorithm for the complete job.
Comparing this to his O(*n*^2) algorithm
for *n* = 1,000,000,000
suggested that my algorithm was
a cool *30,000* times faster than the original!
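The whole decomposition test can be sketched as follows (a Python sketch; a binary search via `bisect` stands in for my sorted-array search, and the names are mine):

```python
import bisect

def in_table(table, x):
    """Binary-search membership test on a sorted list."""
    i = bisect.bisect_left(table, x)
    return i < len(table) and table[i] == x

def min_terms(k, pyr, two):
    """Fewest pyramidal numbers summing to k, given the sorted two-table."""
    if k in set(pyr):
        return 1
    if in_table(two, k):
        return 2
    # three terms: some pyramidal p with k - p in the two-table
    if any(in_table(two, k - p) for p in pyr if p < k):
        return 3
    # four terms: some two-sum t with k - t in the two-table
    if any(in_table(two, k - t) for t in two if t < k):
        return 4
    return 5                      # the conjecture says five always suffice

pyr = [1, 4, 10]                  # pyramidal numbers up to 17
two = sorted({p + q for p in pyr for q in pyr})
print([min_terms(k, pyr, two) for k in (4, 14, 15, 17)])   # -> [1, 2, 3, 5]
```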

My first attempt to code this up solved up to *n*=1,000,000 on my crummy
Sparc ELC in about 20 minutes.
From here, I experimented with different data structures to represent the
sets of numbers and different algorithms to search these tables.
I tried using hash tables and bit vectors instead of sorted arrays
and experimented with variants of binary search such as interpolation search
(see Section ).
My reward for this work was solving
up to *n*=1,000,000 in under three minutes,
a factor of six improvement over my original program.

With the real thinking done, I worked to tweak a little more performance
out of the program.
I avoided doing a sum-of-four computation on any *k* when *k-1* was the
sum-of-three, since 1 is a pyramidal number, saving about 10% of
the total run time on this trick alone.
Finally, I got out my profiler and tried some low-level tricks to squeeze
a little more performance out of the code.
For example,
I saved another 10% by replacing a single procedure call with in-line code.
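The *k*-1 shortcut drops into the main loop like this (a Python sketch under my assumptions; names are mine):

```python
def classify_all(n, pyr, two_set):
    """Minimum-term counts for 1..n, skipping the four-test whenever
    k-1 took three terms (1 is pyramidal, so k then takes exactly four)."""
    pyr_set = set(pyr)
    counts = [0] * (n + 1)
    for k in range(1, n + 1):
        if k in pyr_set:
            counts[k] = 1
        elif k in two_set:
            counts[k] = 2
        elif any((k - p) in two_set for p in pyr if p < k):
            counts[k] = 3
        elif counts[k - 1] == 3:
            counts[k] = 4          # k = (k-1) + 1: four terms, no search needed
        elif any((k - t) in two_set for t in two_set if t < k):
            counts[k] = 4
        else:
            counts[k] = 5
    return counts

pyr = [1, 4, 10, 20]
two_set = {p + q for p in pyr for q in pyr}
counts = classify_all(30, pyr, two_set)
print(counts[7], counts[15], counts[17])   # -> 4 3 5
```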

At this point, I turned the code over to the supercomputer guy. What he did with it is a depressing tale, which is reported in Section .

In writing up this war story, I went back to rerun our program almost five years later. On my desktop Sparc 5, getting to 1,000,000 now took 167.5 seconds using the cc compiler without turning on any compiler optimization. With level 2 optimization, the job ran in only 81.8 seconds, quite a tribute to the quality of the optimizer. The gcc compiler with optimization did even better, taking only 74.9 seconds to get to 1,000,000. The run time on my desktop machine had improved by a factor of about three over a four-year period. This is probably typical for most desktops.

The primary importance of this war story is to illustrate the enormous potential for algorithmic speedups, as opposed to the fairly limited speedup obtainable via more expensive hardware. I sped his program up by about 30,000 times. His million-dollar computer had 16 processors, each reportedly five times faster on integer computations than the $3,000 machine on my desk. That gave a maximum potential speedup of less than 100 times. Clearly, the algorithmic improvement was the big winner here, as it is certain to be in any sufficiently large computation.

Mon Jun 2 23:33:50 EDT 1997