The Spiral Sorting Network IP Generator automatically generates customized sorting networks in synthesizable RTL Verilog. The generator is powered by our formula-driven hardware compilation tool. The networks can be latency or throughput optimized.
Sorting networks operate in parallel over the input elements, processing them through a network of comparison-exchange elements. After every comparison-exchange stage, the elements are permuted such that a sorted sequence is obtained at the final stage.
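To make the comparison-exchange element concrete, the following Python sketch models it in software (the function names are ours, chosen for illustration; the actual generator emits RTL Verilog). A stage applies several such elements to disjoint index pairs, so they could all fire in parallel in hardware:

```python
def compare_exchange(v, i, j):
    """Order positions i and j of v so that v[i] <= v[j]."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]

def apply_stage(v, pairs):
    """One network stage: all comparison-exchange elements in `pairs`
    act on disjoint positions, so applying them sequentially in
    software is equivalent to applying them in parallel in hardware."""
    for i, j in pairs:
        compare_exchange(v, i, j)
    return v
```

Because the index pairs are fixed at design time and independent of the data values, the structure maps directly onto a fixed hardware datapath.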
Sorting networks offer great performance and are particularly attractive for hardware implementations due to their inherent parallelism and input-independent structure. However, this solution becomes prohibitively expensive for large data sets, either because of the area cost or because all elements can no longer be provided simultaneously. We introduce sorting networks that can offer a variable streaming width while maintaining high throughput. This approach dramatically reduces the footprint of sorting networks by reusing hardware components to operate sequentially over the input elements.
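The idea behind streaming reuse can be sketched as a behavioral model: instead of instantiating n/2 comparators for a stage of adjacent-pair comparisons, only w/2 physical comparators are built, and the n inputs arrive w per cycle over n/w cycles. This is an illustrative software model under our own naming, not the generator's actual folding transformation:

```python
def stream_stage(v, w):
    """Fold one stage of adjacent-pair comparators onto w/2 physical
    comparison-exchange units: inputs arrive w per cycle over n/w
    cycles, reusing the same hardware each cycle (behavioral model)."""
    n = len(v)
    assert w % 2 == 0 and n % w == 0
    out = []
    for cycle in range(n // w):              # sequential reuse over time
        chunk = v[cycle * w:(cycle + 1) * w]
        for k in range(0, w, 2):             # the w/2 physical comparators
            if chunk[k] > chunk[k + 1]:
                chunk[k], chunk[k + 1] = chunk[k + 1], chunk[k]
        out.extend(chunk)
    return out
```

Shrinking w trades throughput for area: the same stage now occupies n/w cycles, but its comparator count drops by the same factor.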
We introduce a small domain-specific language (DSL) to represent sorting networks, borrowing concepts from  and  and using the streaming permutations from . The elements of the language are structured operators (viewed as data flow graphs) that map vectors to vectors of the same length. Our DSL is designed to represent sorting networks with a regular structure, specifically those based on bitonic sorting [5, pp. 230].
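A stage-wise view of the classical bitonic construction can be sketched in Python; the comparator index pairs per stage play the role of the stage operators in the DSL. The function names are ours and the code follows the textbook construction from [5], not our DSL's actual syntax:

```python
def bitonic_pairs(n):
    """Comparator stages of a bitonic sorting network for n = 2^m inputs.
    Each stage is a list of (lo, hi) pairs meaning: ensure v[lo] <= v[hi].
    Follows the classical construction (bitonic merge of growing blocks)."""
    stages = []
    k = 2
    while k <= n:                 # merge bitonic sequences of length k
        j = k // 2
        while j >= 1:             # sub-stages with comparator span j
            pairs = []
            for i in range(n):
                l = i ^ j
                if l > i:
                    # sort direction alternates between blocks of size k
                    pairs.append((i, l) if i & k == 0 else (l, i))
            stages.append(pairs)
            j //= 2
        k *= 2
    return stages

def sort_bitonic(v):
    """Run the full network stage by stage."""
    v = list(v)
    for stage in bitonic_pairs(len(v)):
        for lo, hi in stage:
            if v[lo] > v[hi]:
                v[lo], v[hi] = v[hi], v[lo]
    return v
```

For n = 8 this yields 6 stages of 4 comparators each; the regular, data-independent structure of these stages is what the DSL operators capture and what the generator maps to hardware.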
A selection of five different sorting network architectures, derived from the original definition of bitonic sorting , is offered to the user. The figure below shows an example of one of these architectures (SN3) applied to an input set of 8 samples, and how every stage is represented in our DSL.
In addition to the sorting network architecture, the user has control over a variety of parameters that control the functionality and cost/performance tradeoffs such as area and throughput. Each architecture exposes different tradeoffs when combined with different configurations of the implementation parameters. A short explanation of each architecture is given by clicking (?) in the architecture field of the form below. For more information, please see  and other related previous work in [2,3,6,7].
Our complete tool flow has considerably more flexibility than the web version presented here. In particular, we have constrained iterative reuse options to maximum reuse and minimum reuse in order to simplify the user interface. Generated sorting networks are configured to sort in down-up order. If you have an application that could benefit from our work, please feel free to contact us at the address given at the bottom of this page.
Please select parameters in order (from top to bottom), as the parameters chosen may restrict the choices below. For more information about any parameter, please click on (?) in the explanation column.
The design space of solutions generated by the available algorithmic and implementation choices has been evaluated and compared against other sorting algorithms for hardware. BRAM budget was only evaluated with configurations 0 (no BRAMs) and -1 (no limit). All designs are synthesized using Xilinx ISE, and all cost and performance data are collected after place and route.
Design space. The plots below show an evaluation of the design space for the sizes n = 16, 256, 2048. All designs that fit onto the target FPGA (Xilinx Virtex-6 XC6VLX760) are shown. In each plot, the x-axis is the number of FPGA slices used, the y-axis the throughput in Giga samples per second, and the size of the marker is proportional to the number of BRAMs used. Plot (c) zooms into a small region of plot (b). Pareto-optimal designs are connected by a line. The smallest Pareto-optimal design in all cases is obtained with SN5, maximum iterative reuse and minimum streaming width (2). For comparison purposes, we added to our design space an implementation of a linear sorter, as described in [4, pp. 5]. Linear sorters input one element at a time, sorting them into a register array as they are received.
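The linear sorter used for comparison can be modeled in a few lines of Python (an illustrative behavioral sketch under our own naming, not the design from [4] itself):

```python
def linear_sorter(stream):
    """Behavioral model of a linear sorter: elements arrive one per
    cycle and are inserted into a sorted register array as received."""
    regs = []
    for x in stream:
        # In hardware, every register compares against x in parallel
        # and the array shifts in one cycle; here we model the result
        # by finding the insertion point sequentially.
        pos = 0
        while pos < len(regs) and regs[pos] <= x:
            pos += 1
        regs.insert(pos, x)
    return regs
```

The model makes the cost structure apparent: a linear sorter needs n registers and n comparators but accepts only one element per cycle, which is why its throughput saturates for large n.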
The plot below shows the area and latency of designs for size n = 2048 that fit into 10,000 slices.
Comparison to prior work. To the best of our knowledge, we generate the first designs for sorting networks with streaming reuse to dramatically reduce area cost, even for high-throughput designs. This is illustrated in the figure below, which shows the area cost (y-axis) of various sorters for increasing n (x-axis) that fit onto the target FPGA. With prior work it was possible to build the designs of the four uppermost lines [4,8,9]. Our work enables, for example, the two bottom lines: fully streaming designs (SN1 fully streaming with streaming width 2) and a design with maximal reuse (SN5 iterative with streaming width 2), i.e., it uses only one configurable 2-input sorter. Both designs require more BRAM but considerably fewer slices. Even a sorter for 2^19 elements can be fit onto the target FPGA.
We also compared our designs with the interleaved linear sorters (ILS) presented in . For n = 64, the plot below shows throughput and area of our Pareto-optimal designs connected by a line and the ILS designs with w = 1, 2, 4, 8 from .
Contact: Marcela Zuluaga, [first].[last] AT inf.ethz.ch