research-article

DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks

Published: 19 December 2019

Abstract

We present a hardware generator for computations with regular structure, including the fast Fourier transform (FFT), sorting networks, and others. The input of the generator is a high-level description of the algorithm; the output is a token-based, synchronized design in the form of RTL Verilog. Building on prior work, the generator uses several layers of domain-specific languages (DSLs) to represent and optimize at different levels of abstraction to produce a RAM- and area-efficient hardware implementation. Two of these layers and DSLs are novel. The first one allows the use and domain-specific optimization of state-of-the-art streaming permutations. The second DSL enables the automatic pipelining of a streaming hardware dataflow and the synchronization of its data-independent control signals. The generator, including the DSLs, is implemented in Scala, leveraging its type system, and uses concepts from lightweight modular staging (LMS) to handle the constraints of streaming hardware. In particular, these concepts offer genericity over hardware number representation, including seamless switching between fixed-point arithmetic and FloPoCo-generated IEEE floating-point operators, while ensuring type safety. We show benchmarks of generated FFTs, sorting networks, and Walsh-Hadamard transforms that outperform prior generators.
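As a rough illustration of what "regular structure" and "genericity over number representation" can mean, the following is a minimal software sketch of a Walsh-Hadamard transform written as log2(n) stages of identical add/subtract butterflies, with the arithmetic passed in as parameters. It is not the generator's code, DSL, or output: it is written in Python rather than Scala, it models no streaming, pipelining, or RAM use, and the names `wht`, `to_fixed`, `from_fixed`, and `FRAC_BITS` are hypothetical and chosen only for this example.

```python
import operator

FRAC_BITS = 8  # assumed number of fractional bits for the toy fixed-point format


def to_fixed(value, frac_bits=FRAC_BITS):
    """Encode a real number as a scaled integer (toy fixed-point, no overflow handling)."""
    return round(value * (1 << frac_bits))


def from_fixed(raw, frac_bits=FRAC_BITS):
    """Decode a scaled integer back to a float."""
    return raw / (1 << frac_bits)


def wht(x, add, sub):
    """Unnormalized Walsh-Hadamard transform (natural order) of a power-of-two-length list.

    The same butterfly (a + b, a - b) is applied in every stage; only the
    stride h changes. `add` and `sub` supply the arithmetic, so the
    dataflow is independent of the number representation.
    """
    n = len(x)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:  # log2(n) stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = add(a, b), sub(a, b)
        h *= 2
    return y


samples = [0.5, 1.25, -0.75, 2.0]

# Floating-point arithmetic.
as_float = wht(samples, operator.add, operator.sub)

# Same dataflow on fixed-point encodings (integer add/sub on scaled values).
as_fixed = [from_fixed(v) for v in
            wht([to_fixed(s) for s in samples], operator.add, operator.sub)]

print(as_float)  # [3.0, -3.5, 0.5, 2.0]
print(as_fixed)  # [3.0, -3.5, 0.5, 2.0]
```

Passing the operators as arguments is only a loose, dynamically typed stand-in for the statically checked genericity the abstract attributes to Scala's type system; nothing in this sketch enforces type safety.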

