We introduce LGen, a compiler that produces performance-optimized basic linear algebra computations (BLACs) of fixed size. By "basic linear algebra" we mean computations on matrices, vectors, and scalars that are composed from matrix multiplication, matrix addition, transposition, and scalar multiplication.
Examples of BLACs include:
The input to LGen is a BLAC including the (fixed) size of its operands. The output is an optimized C function, optionally using intrinsics for vectorization, that computes the BLAC.
The picture below provides an overview of LGen’s internal structure. LGen is designed after the version of Spiral that produces looped fixed size code [3].
A valid input to LGen is a BLAC with input and output operands of fixed size. The BLAC is specified in a DSL that we call Linear algebra Language (LL). In the first step, the tiling is fixed for the computation. The resulting fully-tiled computation is translated into a second DSL called Σ-LL, which is based on Σ-SPL used in Spiral [3]. The latter is still a valid mathematical representation of the original BLAC but makes loops and access functions explicit. At this level LGen performs loop-level optimizations such as loop merging and exchange. Next, the Σ-LL expression is translated into a C intermediate representation (C-IR) for code-level optimizations, such as loop unrolling and translation into SSA form. Finally, the C-IR code is unparsed into C and performance results are used in the autotuning feedback loop.
Vectorization is done by decomposing the computation into a fixed set of nu-BLACs pre-implemented once for every vector architecture:
More details about LGen and its vectorization process can be found in [1].
The plots below show performance results for four different classes of single precision floating point BLACs on an Intel Xeon X5680 with SSE 4.2 and 32 kB L1 D-cache. In the first three cases we use matrices with narrow rectangular shapes (panels) or small squares (blocks). The panel sizes are either n x 4 or 4 x n, chosen to fit into L1 D-cache. For the last case (micro-BLACs) the matrices are all n x n, with 2 ≤ n ≤ 10.
Case 1: Simple BLAC - y = Ax
A is n x 4 | A is 4 x n |
Case 2: BLAC that closely matches BLAS - C = αAB + βC
A is n x 4, B is 4 x 4 | A is 4 x 4, B is 4 x n | |
A is 4 x n, B is n x 4 | A is n x 4, B is 4 x n |
Case 3: BLAC that needs more than one BLAS call - C = α(A_{0}+A_{1})^{T}B + βC
A_{0} is 4 x n, B is 4 x 4 | A_{0} is 4 x 4, B is 4 x n | |
A_{0} is n x 4, B is n x 4 | A_{0} is 4 x n, B is 4 x n |
Case 4: Micro-BLACs - n x n, with 2 ≤ n ≤ 10
y = Ax | C = AB | |
δ = x^{T}Ay |
Contact: Daniele Spampinato, [first].[last] AT inf.ethz.ch