High-Performance Matrix Multiplication on the IBM SP High Node
André Henriksson1 and Isak Jonsson1
Abstract
The computing performance of processors in high-performance
computers is increasing steadily, but overall memory bandwidth has not grown at the same rate. Instead, memory hierarchies have become more complex,
with a larger number of caches. Programs that need to utilize the full power of the processors have to adapt their data reference patterns to fit the
memory model. In this paper, we show a way of organizing algorithms and their corresponding data structures for linear algebra routines that enables
automatic tuning for an arbitrary number of caches by using recursive techniques. We show how performance for matrix multiplication is increased by
51% compared to existing routines for the IBM PowerPC 604 by using fine-tuned kernels, algorithmic prefetching, and recursive algorithms and data
structures. We also present an algorithm for scheduling matrix multiplication on an SMP node. Discussions of how kernels should be implemented and of a
cache simulation model are also included.
1 Department of Computing Science and HPC2N, Umeå University, SE-901 87 Umeå, Sweden
E-mail: {andropov,isak}@cs.umu.se