A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory

Published: 03 October 2020

Abstract

Compiling sequential C programs for Connex-S is challenging. Connex-S is a competitive, scalable, and customizable wide vector accelerator for intensive embedded applications, with 32 to 4,096 16-bit integer lanes and a limited-capacity local scratchpad memory.

Our compiler toolchain uses the LLVM framework and targets OPINCAA, a JIT vector assembler and coordination C++ library for Connex-S that accelerates computations for an arbitrary CPU. Therefore, in the compiler middle end we address aspects of efficient vectorization, communication, and synchronization. We perform a quantitative static analysis of the program that is useful, among other things, for the symbolic-size compiler memory allocator and the coordination mechanism of OPINCAA. We also discuss the LLVM back end for the Connex-S processor and the methodology to automatically generate instruction selection code for efficiently emulating arithmetic and logical operations on non-native types such as 32-bit integer and 16-bit floating point.
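To illustrate what emulating a non-native type on 16-bit lanes involves, the following is a minimal sketch of a lane-wise 32-bit addition built only from 16-bit operations with explicit carry propagation. It is not the instruction sequence the compiler generates; the `Vec32Emulated` type and the `add32` and `lane_value` helpers are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: a vector of 32-bit values stored as two vectors of
// 16-bit halves (low and high), one element per lane.
struct Vec32Emulated {
    std::vector<std::uint16_t> lo;
    std::vector<std::uint16_t> hi;
};

// Lane-wise 32-bit addition using only 16-bit additions and a comparison.
Vec32Emulated add32(const Vec32Emulated& a, const Vec32Emulated& b) {
    assert(a.lo.size() == a.hi.size() && b.lo.size() == b.hi.size());
    assert(a.lo.size() == b.lo.size());
    const std::size_t lanes = a.lo.size();
    Vec32Emulated r{std::vector<std::uint16_t>(lanes),
                    std::vector<std::uint16_t>(lanes)};
    for (std::size_t i = 0; i < lanes; ++i) {
        const std::uint16_t lo = static_cast<std::uint16_t>(a.lo[i] + b.lo[i]);
        // Unsigned wraparound of the low half signals a carry into the high half.
        const std::uint16_t carry = lo < a.lo[i] ? 1 : 0;
        r.lo[i] = lo;
        r.hi[i] = static_cast<std::uint16_t>(a.hi[i] + b.hi[i] + carry);
    }
    return r;
}

// Reassembles lane i into a single 32-bit value for inspection.
std::uint32_t lane_value(const Vec32Emulated& v, std::size_t i) {
    return (static_cast<std::uint32_t>(v.hi[i]) << 16) | v.lo[i];
}
```

Each 32-bit operation costs several 16-bit operations in this scheme, which is why the quality of the generated emulation code matters for performance.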

By using JIT vector assembling and by encoding the vector length of Connex-S as a parameter in the generated OPINCAA program, we achieve vector-length agnosticism, which supports execution on distinct embedded devices, such as several digital cameras with different resolutions, each equipped with a custom-width Connex-S accelerator meant to save energy for the image processing kernels.
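The idea of treating the vector length as a parameter can be sketched in plain C++ as a strip-mined loop whose chunk size is only known at run time. This is an illustrative sketch, not OPINCAA code; the function `add_vla` and its `vector_length` parameter are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds two arrays in chunks of `vector_length` lanes. The lane count is a
// run-time parameter (hypothetical here), so the same code serves any width.
std::vector<std::int16_t> add_vla(const std::vector<std::int16_t>& a,
                                  const std::vector<std::int16_t>& b,
                                  std::size_t vector_length) {
    assert(a.size() == b.size() && vector_length > 0);
    std::vector<std::int16_t> out(a.size());
    for (std::size_t base = 0; base < a.size(); base += vector_length) {
        // The last chunk may be partial; only the active lanes are processed.
        const std::size_t active = std::min(vector_length, a.size() - base);
        for (std::size_t lane = 0; lane < active; ++lane) {
            out[base + lane] =
                static_cast<std::int16_t>(a[base + lane] + b[base + lane]);
        }
    }
    return out;
}
```

Calling `add_vla(a, b, 32)` and `add_vla(a, b, 4096)` yields the same result; only the number of chunks changes, which is the property that lets one generated program run on accelerators of different widths.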

Since Connex-S has a limited-capacity local scratchpad memory, normally 256 KB, we also present how we use the PPCG C-to-C code generator to perform data tiling that minimizes the total kernel execution time, subject to fitting larger program data in the local memory. We devise an accurate cost model for the Connex-S accelerator to choose performance-optimal tile sizes at compile time.
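Tile-size selection under a capacity constraint can be sketched as a search over candidate sizes that keeps only those fitting in the scratchpad and picks the cheapest one. The cost model below (a fixed overhead per tile plus a per-byte cost) is a placeholder assumed for illustration; it is not the cost model devised in the paper, and `CostModel`, `total_cost`, and `choose_tile` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cost model: a fixed overhead per tile transfer plus a per-byte
// cost. The constants are placeholders, not measured Connex-S values.
struct CostModel {
    double per_tile_overhead;  // cost paid once per tile
    double per_byte_cost;      // cost per byte moved and processed
};

double total_cost(const CostModel& m, std::size_t data_bytes,
                  std::size_t tile_bytes) {
    assert(tile_bytes > 0);
    // Ceiling division: number of tiles needed to cover all the data.
    const std::size_t tiles = (data_bytes + tile_bytes - 1) / tile_bytes;
    return tiles * m.per_tile_overhead + data_bytes * m.per_byte_cost;
}

// Returns the cheapest candidate that fits in the scratchpad, or 0 if none fits.
std::size_t choose_tile(const CostModel& m, std::size_t data_bytes,
                        std::size_t scratchpad_bytes,
                        const std::vector<std::size_t>& candidates) {
    std::size_t best = 0;
    double best_cost = 0.0;
    for (std::size_t tile : candidates) {
        if (tile == 0 || tile > scratchpad_bytes) continue;  // capacity constraint
        const double cost = total_cost(m, data_bytes, tile);
        if (best == 0 || cost < best_cost) {
            best = tile;
            best_cost = cost;
        }
    }
    return best;
}
```

With a 256 KB scratchpad (`256 * 1024` bytes) and this placeholder model, larger tiles amortize the per-tile overhead better, so the search settles on the largest candidate that still fits.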

We successfully compile several simple benchmarks that are frequently used in, for example, high-performance and computer vision embedded applications. We report speedup factors of up to 11.33 when running them on a Connex-S accelerator with 128 16-bit integer lanes, relative to the dual-core ARM Cortex-A9 host, which is clocked at a frequency 6.67 times higher and has a total of two 128-bit Neon SIMD units.
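As a rough frequency-normalized reading of these figures, assume the peak speedup and the clock ratio refer to the same run. Writing $S$ for the wall-clock speedup, $T$ for execution time, $f$ for clock frequency, and $C = T f$ for the cycle count (symbols introduced here, not taken from the paper), the ratio of host cycles to accelerator cycles is:

```latex
\[
\frac{C_{\text{host}}}{C_{\text{acc}}}
  = \frac{T_{\text{host}}\, f_{\text{host}}}{T_{\text{acc}}\, f_{\text{acc}}}
  = S \cdot \frac{f_{\text{host}}}{f_{\text{acc}}}
  \approx 11.33 \times 6.67
  \approx 75.6
\]
```

Under that assumption, the best-case benchmark takes roughly 75 times fewer clock cycles on the accelerator than on the host.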
