Abstract
We present a hardware generator for computations with regular structure, including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize the computation at different levels of abstraction and to produce a RAM- and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator, including the DSLs, is implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. In particular, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo-generated IEEE floating-point operators, while ensuring type safety. Benchmarks show that the generated FFTs, sorting networks, and Walsh-Hadamard transforms outperform those produced by prior generators.
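To make "regular structure" and "genericity over number representation" concrete, here is a minimal software sketch, written in Python rather than the generator's Scala and not taken from the generator itself. It computes an unnormalized Walsh-Hadamard transform as log2(n) stages of identical add/subtract butterflies, and runs the same routine on built-in floats and on a toy fixed-point type. The `Fixed` class, its 8-bit fractional width, and the function name `wht` are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixed:
    """Hypothetical signed fixed-point value: a raw integer with `frac` fractional bits."""
    raw: int
    frac: int = 8

    @staticmethod
    def from_float(x, frac=8):
        # Quantize to the nearest representable fixed-point value.
        return Fixed(round(x * (1 << frac)), frac)

    def __add__(self, other):
        return Fixed(self.raw + other.raw, self.frac)

    def __sub__(self, other):
        return Fixed(self.raw - other.raw, self.frac)

    def to_float(self):
        return self.raw / (1 << self.frac)


def wht(xs):
    """Unnormalized Walsh-Hadamard transform of a power-of-two-length list.

    Only + and - are applied to the elements, so any element type that
    supports those two operators can be used.
    """
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(xs)
    h = 1
    while h < n:  # log2(n) stages, each with the same butterfly pattern
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # one butterfly
        h *= 2
    return out


floats = [1.0, 0.5, -0.25, 2.0]
float_result = wht(floats)
fixed_result = [v.to_float() for v in wht([Fixed.from_float(v) for v in floats])]
```

The sketch relies on Python's duck typing, so nothing is checked before run time. The abstract describes a design in which Scala's type system ensures type safety when switching between number representations.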
DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks