Recently, a first version of our GEMM-based level 3 BLAS for superscalar processors was announced. As in our previous GEMM-based work, all the other level 3 BLAS routines perform the dominant part of their computations in calls to DGEMM.
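As a rough illustration of the GEMM-based idea, the sketch below computes a triangular matrix product L*B by handing every off-diagonal block to a general matrix multiply and treating only the small diagonal blocks specially. It is a minimal pure-Python sketch, not the library's implementation: the names `gemm`, `sub`, and `trmm_gemm_based`, the block size `nb`, and the choice of the triangular product as the example are all assumptions made here, and `gemm` merely stands in for a tuned DGEMM.

```python
def gemm(a, b, c):
    """Accumulate c += a * b for dense matrices stored as lists of rows.

    Hypothetical stand-in for a tuned DGEMM kernel.
    """
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            for j in range(len(b[0])):
                c[i][j] += aik * b[k][j]


def sub(m, r0, r1, c0, c1):
    """Copy the submatrix m[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in m[r0:r1]]


def trmm_gemm_based(l, b, nb):
    """Return L * B for lower triangular L, using block size nb.

    Off-diagonal blocks of L are multiplied with gemm; only the
    nb-by-nb diagonal blocks use a dedicated triangular loop.
    """
    n, p = len(l), len(b[0])
    out = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, nb):
        i1 = min(i0 + nb, n)
        ci = [[0.0] * p for _ in range(i1 - i0)]
        # Blocks strictly below the diagonal: general matrix multiplies.
        for j0 in range(0, i0, nb):
            j1 = j0 + nb
            gemm(sub(l, i0, i1, j0, j1), sub(b, j0, j1, 0, p), ci)
        # Diagonal block: small triangular product, entries above the
        # diagonal of L are never read.
        for i in range(i0, i1):
            for k in range(i0, i + 1):
                for j in range(p):
                    ci[i - i0][j] += l[i][k] * b[k][j]
        out[i0:i1] = ci
    return out
```

For an n-by-n triangular matrix and a block size much smaller than n, almost all of the arithmetic in this sketch happens inside `gemm`, which is the property the GEMM-based approach relies on.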
In this talk, the evolution of the superscalar GEMM-based level 3 BLAS is presented. New developments are also described, including techniques that make the library applicable to symmetric multiprocessing (SMP) systems. Among these, algorithmic prefetching and recursive blocking give substantial speedups.
Recursive blocked data formats and recursive blocked BLAS are introduced and applied to dense linear algebra algorithms typified by LAPACK. The new data formats maintain data locality at every level of the memory hierarchy and hence provide high performance on today's memory-tiered processors. The new data format is hybrid. It contains blocking parameters, chosen so that the associated submatrices of a block-partitioned matrix A fit into level 1 cache. The recursive part of the data format chooses a linear order of the blocks that maintains the two-dimensional data locality of A in a one-dimensional tiered memory structure. Locality is maintained because our algorithms are also recursive and do their computations on submatrices that follow the new recursive data structure definition. This is in keeping with the well-known principle that the data structure should be matched to the algorithm.
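The hybrid format described above can be sketched as follows: the matrix is cut into nb-by-nb blocks, and the blocks are then laid out in memory in an order produced by recursively halving the block grid. This is a minimal sketch under stated assumptions, not the format used in the talk. The particular quadrant order (top-left, bottom-left, top-right, bottom-right), the column-major storage inside each block, and the names `block_order` and `pack_recursive_blocked` are all choices made here for illustration.

```python
def block_order(r0, c0, nr, nc):
    """Yield block coordinates of an nr-by-nc block grid in recursive order.

    The grid is split into four quadrants, visited as top-left,
    bottom-left, top-right, bottom-right (an assumed order), and each
    quadrant is ordered the same way.
    """
    if nr == 0 or nc == 0:
        return
    if nr == 1 and nc == 1:
        yield (r0, c0)
        return
    rh = (nr + 1) // 2  # block rows in the top half
    ch = (nc + 1) // 2  # block columns in the left half
    yield from block_order(r0, c0, rh, ch)
    yield from block_order(r0 + rh, c0, nr - rh, ch)
    yield from block_order(r0, c0 + ch, rh, nc - ch)
    yield from block_order(r0 + rh, c0 + ch, nr - rh, nc - ch)


def pack_recursive_blocked(a, nb):
    """Pack matrix a (list of rows) into one flat list.

    Blocks of size at most nb-by-nb are stored contiguously, in the
    order given by block_order; inside a block, elements are stored
    column by column. Returns the flat list and a dict mapping each
    block coordinate to its starting offset.
    """
    m, n = len(a), len(a[0])
    grid_rows = -(-m // nb)  # ceiling division
    grid_cols = -(-n // nb)
    packed, offsets = [], {}
    for bi, bj in block_order(0, 0, grid_rows, grid_cols):
        offsets[(bi, bj)] = len(packed)
        for j in range(bj * nb, min((bj + 1) * nb, n)):
            for i in range(bi * nb, min((bi + 1) * nb, m)):
                packed.append(a[i][j])
    return packed, offsets
```

In such a layout, nb plays the role of the blocking parameter that would be tuned so a block fits in level 1 cache, while the recursive ordering keeps blocks that are close in the two-dimensional matrix close in the one-dimensional address space.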
Performance results in support of our recursive approach and prefetching techniques are also presented.