The recently introduced roofline model [2] plots the performance of executed code against its operational intensity (operation count divided by memory traffic). Unfortunately, to date the model has been used almost exclusively with back-of-the-envelope calculations and not with measured data. In this work [1] we show how to produce roofline plots with measured data on recent generations of Intel platforms.

The roofline model includes two platform-specific performance ceilings: the processor's peak performance and a ceiling derived from the memory bandwidth, which is relevant for code with low operational intensity. The model thus makes the notions of memory- and compute-bound more precise and, despite its simplicity, provides an insightful visualization of bottlenecks. As such it is valuable for guiding manual code optimization as well as in education. The picture below sketches such a roofline plot for a platform with peak performance π = 2 flops/cycle and bandwidth β = 1 byte/cycle.
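The two ceilings combine into a single bound: at operational intensity I, attainable performance is min(π, β·I). A minimal sketch, using the example values π = 2 flops/cycle and β = 1 byte/cycle from above:

```python
# Roofline bound: attainable performance [flops/cycle] at operational
# intensity I [flops/byte] is capped by both peak and bandwidth * I.
def roofline(intensity, peak=2.0, bandwidth=1.0):
    return min(peak, bandwidth * intensity)

# The ridge point sits at I = peak/bandwidth = 2 flops/byte:
# below it code is memory-bound, above it compute-bound.
print(roofline(0.5))  # memory-bound region: 0.5 flops/cycle
print(roofline(8.0))  # compute-bound region: 2.0 flops/cycle
```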

We show, for the first time, a set of roofline plots with measured data for common numerical functions on a variety of platforms. In [1] we explain which hardware performance counters we use to obtain the appropriate measurements and discuss the associated caveats. Furthermore, we guide the reader through the experiments we conducted and the insights gained from measured roofline plots.

## Different levels of BLAS and FFT from Intel MKL on a Core i7-3930K (Windows), single threaded

We run a set of kernels known to have different operational intensities: the BLAS 1–3 functions and the FFT. The figure above shows the performance of these kernels when executed sequentially on a Core i7-3930K running Windows. The input sizes n are chosen to exercise different levels of the memory hierarchy. For daxpy we use n = 10000 + 30000i² (0 <= i < 10), for dgemv and dgemm n = 100 + 300i (0 <= i < 10), and for the FFT n = 2^k (5 <= k < 23). Daxpy and dgemv are memory-bound as expected, reaching about 95% and 90% of the corresponding bandwidth ceiling. Dgemm is compute-bound for all sizes, also reaching about 95% of the peak π. Note that the FFT is plotted with measured flops and not with pseudo-flops (fixing the op count to 5n log2(n), an overestimation), as is commonly done.
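Why daxpy stays memory-bound while dgemm becomes compute-bound can be seen from back-of-the-envelope intensity estimates (these are illustrative formulas assuming compulsory, cache-ideal traffic, not the measured values from [1]):

```python
# Rough operational-intensity estimates in flops/byte, assuming
# double precision (8 bytes/element) and compulsory traffic only.
def daxpy_intensity(n):
    flops = 2 * n              # one multiply + one add per element
    traffic = 8 * (2 * n + n)  # read x and y, write y
    return flops / traffic     # = 1/12, independent of n

def dgemm_intensity(n):
    flops = 2 * n**3           # n^2 dot products of length n
    traffic = 8 * 4 * n**2     # read A, B, C and write C (ideal reuse)
    return flops / traffic     # = n/16, grows with n

print(daxpy_intensity(10**6))  # ~0.083: firmly memory-bound
print(dgemm_intensity(1000))   # 62.5: firmly compute-bound
```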

## Comparing various FFT implementations on a Xeon X5680 (Linux)

We compare various FFT kernels in the figure above. As in the previous plots, we use sizes n = 2^k (5 <= k < 23), except for the Spiral-generated kernels, where we use an upper bound of 4096 (the code is available via the software section of the spiral.net page). The performance order of the kernels is roughly as expected: MKL and one version of Spiral use the SSE instruction set, while the rest of the implementations are unvectorized. The only surprise is the straightforward Numerical Recipes code, which performs well because it calculates the twiddle constants at runtime for all sizes; this increases the flop count and thus both the measured performance and the operational intensity.
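The difference between pseudo-flops and actually executed flops mentioned above is substantial. As a sketch, comparing the 5n log2(n) convention against the exact flop count of split-radix (a common efficient algorithm; the specific implementations in the figure may execute different counts):

```python
import math

# Conventional pseudo-flop count for an n-point FFT (n a power of two).
def fft_pseudo_flops(n):
    return 5 * n * math.log2(n)

# Exact flop count of the split-radix FFT: 4 n log2(n) - 6 n + 8.
# Efficient implementations execute roughly this many flops, so
# plotting pseudo-flops overstates both performance and intensity.
def split_radix_flops(n):
    return 4 * n * math.log2(n) - 6 * n + 8

n = 2**10
print(fft_pseudo_flops(n))                        # 51200.0
print(split_radix_flops(n))                       # 34824.0
print(fft_pseudo_flops(n) / split_radix_flops(n)) # ~1.47x overestimate
```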

## Comparing different optimization levels of MMM on a Xeon E5-2660

The figure above shows the roofline plot for the dgemm kernel at different levels of optimization. In addition to ATLAS and MKL we consider a hand-coded standard triple loop and a six-fold loop obtained through blocking with block size 50. The MKL and ATLAS kernels provide improved ILP, SIMD vectorization, and multi-thread parallelism.
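The two hand-coded variants can be sketched as follows (a Python illustration of the loop structure; the actual kernels are compiled code, and block size 50 is the value used above):

```python
NB = 50  # block size used in the experiment above

def mmm_naive(A, B, C, n):
    """Standard triple loop: C += A * B for n x n matrices."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]

def mmm_blocked(A, B, C, n):
    """Six-fold blocked loop: same result, but operating on
    NB x NB blocks improves cache reuse of A, B, and C."""
    for i0 in range(0, n, NB):
        for j0 in range(0, n, NB):
            for k0 in range(0, n, NB):
                for i in range(i0, min(i0 + NB, n)):
                    for j in range(j0, min(j0 + NB, n)):
                        for k in range(k0, min(k0 + NB, n)):
                            C[i][j] += A[i][k] * B[k][j]
```

Blocking changes only the loop order, not the arithmetic, which is why both variants sit at the same operational intensity per block but the blocked version moves far less data from memory for large n.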

- [1] Georg Ofenbeck, Ruedi Steinmann, Victoria Caparros, Daniele G. Spampinato and Markus Püschel, **Applying the Roofline Model**, Proc. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 76-85, 2014
- [2] Samuel Williams, Andrew Waterman and David Patterson, **Roofline: an insightful visual performance model for multicore architectures**, Communications of the ACM 52(4): 65-76, 2009
- [3] Ruedi Steinmann, **Applying the Roofline Model**, Master thesis, ETH Zürich, 2012

Contact: Georg Ofenbeck