LGen: A Basic Linear Algebra Compiler

Explanation

We introduce LGen, a compiler that produces performance-optimized basic linear algebra computations (BLACs) on matrices of fixed size with or without structure. By "basic linear algebra" we mean computations on matrices, vectors, and scalars that are composed from matrix multiplication, matrix addition, transposition, and scalar multiplication.

Examples of BLACs include:

  1. Simple computations, such as y = Ax.
  2. Computations that closely match the BLAS interface, e.g., C = αABᵀ + βC and Su = AAᵀ + Su.
  3. Computations that need more than one BLAS call, e.g., γ = xᵀ(A+B)y + δ and A = LU + Sl.

The input to LGen is a BLAC including the (fixed) size of its operands and information about their structure (e.g., L and U are lower and upper triangular while Sl(u) is symmetric and only its lower (upper) part is used). The output is an optimized C function, optionally using intrinsics for vectorization, that computes the BLAC.
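To make the output concrete, the following is a sketch of the *kind* of fixed-size C function LGen produces for a BLAC such as y = Ax with A of size 4 x 4. The function name, signature, and row-major layout are illustrative assumptions, not LGen's actual output, and the code is plain scalar C with no optimizations applied:

```c
/* Illustrative sketch only: fixed-size scalar C for the BLAC y = A*x,
 * with A a 4x4 single-precision matrix stored row-major.
 * All loop bounds are compile-time constants, as in LGen's output;
 * the name and interface here are hypothetical. */
static void blac_gemv_4x4(const float A[16], const float x[4], float y[4]) {
    for (int i = 0; i < 4; ++i) {
        float acc = 0.0f;                 /* accumulate row i of A times x */
        for (int j = 0; j < 4; ++j)
            acc += A[4 * i + j] * x[j];
        y[i] = acc;
    }
}
```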

Code Generation Overview

The picture below provides an overview of LGen’s internal structure. LGen’s design follows the version of Spiral that produces looped fixed-size code [5].

A valid input to LGen is a BLAC with input and output operands of fixed size, specified in a DSL that we call Linear algebra Language (LL). In (1), a tiling is fixed for the computation. The resulting fully tiled computation is translated (2) into a second DSL called Σ-LL, which is based on the Σ-OL representation used in Spiral [4]. A Σ-LL expression is still a valid mathematical representation of the original BLAC but makes loops and access functions explicit. The translation in (2) uses ideas and tools from the polyhedral framework: we capture the structures of the matrices involved mathematically using the isl library [6] and synthesize the final Σ-LL computations with CLooG [7]. Between (2) and (3), LGen performs loop-level optimizations, such as loop merging and loop exchange, and decomposes the computation into a set of computational building blocks called nu-BLACs (see below). Next (4), the Σ-LL expression is translated into a C intermediate representation (C-IR) for code-level optimizations, such as loop unrolling and scalar replacement. Finally, the C-IR code is unparsed into C, and the measured performance drives the autotuning feedback loop (5).
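As a hand-written illustration of the two code-level optimizations performed at the C-IR stage (not LGen's actual output), the sketch below applies full loop unrolling and scalar replacement to a dot product of fixed length 4:

```c
/* Rolled form, as it might look when generated directly from a
 * Sigma-LL expression (names are illustrative): */
static float dot4_rolled(const float x[4], const float y[4]) {
    float acc = 0.0f;
    for (int i = 0; i < 4; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* After loop unrolling and scalar replacement: the loop is gone and
 * array accesses are hoisted into scalars, which the C compiler can
 * keep in registers. */
static float dot4_unrolled(const float x[4], const float y[4]) {
    float x0 = x[0], x1 = x[1], x2 = x[2], x3 = x[3];
    float y0 = y[0], y1 = y[1], y2 = y[2], y3 = y[3];
    return x0 * y0 + x1 * y1 + x2 * y2 + x3 * y3;
}
```

Because all sizes are fixed at generation time, such transformations can be applied without runtime checks.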

Vectorization is introduced in (2-3) by decomposing the computation into a fixed set of nu-BLACs, pre-implemented once for every vector architecture.

More details about LGen and its vectorization process can be found in [1] and [2].
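For a flavor of what such a pre-implemented building block could look like, here is a hypothetical SSE sketch of a nu-BLAC for 4 x 4 single-precision matrix addition. The actual nu-BLAC set and its interfaces are defined in [1] and [2]; the name, layout, and use of unaligned loads below are illustrative assumptions:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical nu-BLAC sketch: C = A + B for 4x4 single-precision
 * matrices, row-major, one SSE vector per row. */
static void nu_add_4x4(const float A[16], const float B[16], float C[16]) {
    for (int i = 0; i < 4; ++i) {
        __m128 a = _mm_loadu_ps(A + 4 * i);          /* row i of A */
        __m128 b = _mm_loadu_ps(B + 4 * i);          /* row i of B */
        _mm_storeu_ps(C + 4 * i, _mm_add_ps(a, b));  /* row i of C */
    }
}
```

Larger computations are then vectorized by tiling them down to such fixed-size blocks rather than by vectorizing arbitrary loops.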

General Matrices Benchmarks

The plots below show performance results for four different classes of single-precision floating-point BLACs on an Intel Xeon X5680 with SSE 4.2 and a 32 kB L1 D-cache. In the first three cases we use matrices with narrow rectangular shapes (panels) or small squares (blocks). The panel sizes are either n x 4 or 4 x n, chosen to fit into the L1 D-cache. For the last case (micro-BLACs) the matrices are all n x n, with 2 ≤ n ≤ 10. In all cases the matrices involved have no particular structure.

Case 1: Simple BLAC - y = Ax

Plots: A is n x 4; A is 4 x n.

Case 2: BLAC that closely matches BLAS - C = αAB + βC

Plots: A is n x 4, B is 4 x 4; A is 4 x 4, B is 4 x n; A is 4 x n, B is n x 4; A is n x 4, B is 4 x n.

Case 3: BLAC that needs more than one BLAS call - C = α(A0+A1)ᵀB + βC

Plots: A0 is 4 x n, B is 4 x 4; A0 is 4 x 4, B is 4 x n; A0 is n x 4, B is n x 4; A0 is 4 x n, B is 4 x n.

Case 4: Micro-BLACs - n x n, with 2 ≤ n ≤ 10

Plots: y = Ax; C = AB; δ = xᵀAy.

Structured Matrices Benchmarks

The plots below show performance results for two different classes of double-precision floating-point structured BLACs (sBLACs) on an Intel Core i7-2600 CPU (Sandy Bridge microarchitecture) with AVX, a 32 kB L1 D-cache, and a 256 kB L2 cache. In the following, L, U, and Sl(u) are lower triangular, upper triangular, and symmetric, respectively. In the case of Sl(u) we also specify that only the lower (upper) part is used.

Case 1: sBLAC that closely matches BLAS - Su = AAᵀ + Su

Plots: A is n x 4, Su is n x n; A is n x 4, Su is n x n with 4 | n (n divisible by 4).
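As a scalar reference for this sBLAC (not LGen's generated code), the sketch below computes Su = AAᵀ + Su while reading and writing only the upper triangle of Su, as the structure annotation specifies. The fixed sizes, row-major layout, and function name are illustrative assumptions:

```c
enum { N = 6, K = 4 };  /* illustrative fixed sizes: A is N x K, Su is N x N */

/* Rank-K symmetric update Su = A*A^T + Su that touches only the
 * upper triangle of Su; the lower triangle is never read or written. */
static void syrk_upper(const float A[N][K], float Su[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = i; j < N; ++j) {   /* j >= i: upper triangle only */
            float acc = Su[i][j];
            for (int k = 0; k < K; ++k)
                acc += A[i][k] * A[j][k];
            Su[i][j] = acc;
        }
}
```

Exploiting the structure this way roughly halves the work and the traffic on Su compared with treating it as a general matrix.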

Case 2: sBLAC that needs more than one BLAS call - A = LU + Sl

Plots: L and U are n x n; L and U are n x n with 4 | n (n divisible by 4).

References

  1. Daniele G. Spampinato and Markus Püschel  
    A Basic Linear Algebra Compiler for Structured Matrices  
    Proc. International Symposium on Code Generation and Optimization (CGO), pp. 117-127, 2016  
    CGO 2016 highest ranked artifact
  2. Nikolaos Kyrtatas, Daniele G. Spampinato and Markus Püschel 
    A Basic Linear Algebra Compiler for Embedded Processors 
    Proc. Design, Automation and Test in Europe (DATE), pp. 1054-1059, 2015
  3. Daniele G. Spampinato and Markus Püschel 
    A Basic Linear Algebra Compiler 
    Proc. International Symposium on Code Generation and Optimization (CGO), pp. 23-32, 2014  
    Best paper award nominee, among 4 out of 29
  4. Franz Franchetti, Frédéric de Mesmay, Daniel McFarlin and Markus Püschel 
    Operator Language: A Program Generation Framework for Fast Kernels 
    Proc. IFIP Working Conference on Domain Specific Languages (DSL WC), Lecture Notes in Computer Science, Springer, Vol. 5658, pp. 385-410, 2009
  5. Franz Franchetti, Yevgen Voronenko and Markus Püschel 
    Formal Loop Merging for Signal Transforms 
    Proc. Programming Languages Design and Implementation (PLDI), pp. 315-326, 2005
  6. Sven Verdoolaege 
    isl: An Integer Set Library for the Polyhedral Model 
    Mathematical Software (ICMS), Lecture Notes in Computer Science, Springer, Vol. 6327, pp. 299-302, 2010
  7. Cédric Bastoul 
    Code Generation in the Polyhedral Model Is Easier Than You Think 
    Proc. International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 7-16, 2004

More information

Contact: Daniele Spampinato, [first].[last] AT inf.ethz.ch