Abstract
Compiling sequential C programs for Connex-S is challenging. Connex-S is a competitive, scalable, and customizable wide vector accelerator for intensive embedded applications, with 32 to 4,096 16-bit integer lanes and a limited-capacity local scratchpad memory.
Our compiler toolchain uses the LLVM framework and targets OPINCAA, a JIT vector assembler and coordination C++ library that lets Connex-S accelerate computations for an arbitrary CPU. Consequently, the compiler middle end addresses efficient vectorization, communication, and synchronization. We perform quantitative static analysis of the program, which serves, among other uses, the symbolic-size compiler memory allocator and the coordination mechanism of OPINCAA. We also discuss the LLVM back end for the Connex-S processor and the methodology for automatically generating instruction selection code that efficiently emulates arithmetic and logical operations on non-native types such as 32-bit integer and 16-bit floating-point.
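To make the idea of emulating a non-native type concrete, the following is a minimal sketch of a lane-wise 32-bit integer addition built only from 16-bit operations. It is an illustration under stated assumptions, not the instruction sequence the compiler generates: the names `split32`, `join32`, and `add32_on_16bit_lanes` are hypothetical, and the vector lanes are modeled as plain Python lists.

```python
MASK16 = 0xFFFF  # each modeled lane holds one unsigned 16-bit value


def split32(x):
    """Split a 32-bit value into its (low, high) 16-bit halves."""
    return x & MASK16, (x >> 16) & MASK16


def join32(lo, hi):
    """Rebuild a 32-bit value from its 16-bit halves."""
    return (hi << 16) | lo


def add32_on_16bit_lanes(a, b):
    """Element-wise 32-bit add using only 16-bit adds and a compare.

    a and b are lists of unsigned 32-bit values, one per (hypothetical) lane.
    The result wraps modulo 2**32, as a native 32-bit add would.
    """
    result = []
    for x, y in zip(a, b):
        x_lo, x_hi = split32(x)
        y_lo, y_hi = split32(y)
        lo = (x_lo + y_lo) & MASK16        # 16-bit add of the low halves
        carry = 1 if lo < x_lo else 0      # unsigned wrap-around means a carry out
        hi = (x_hi + y_hi + carry) & MASK16  # 16-bit add of the high halves plus carry
        result.append(join32(lo, hi))
    return result
```

The carry is recovered with a comparison rather than a wider intermediate, which is the kind of step an emulation sequence needs when the lanes themselves are only 16 bits wide.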
By using JIT vector assembling and by encoding the vector length of Connex-S as a parameter in the generated OPINCAA program, we achieve vector-length agnosticism. This supports execution on distinct embedded devices, such as several digital cameras with different resolutions, each equipped with a custom-width Connex-S accelerator meant to save energy for the image processing kernels.
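The effect of keeping the vector length as a parameter can be sketched with a strip-mined loop. This is only a model of the idea, assuming a hypothetical function `vector_add_vla` and Python lists in place of vector registers; it does not use or reproduce the OPINCAA API.

```python
def vector_add_vla(a, b, vector_length):
    """Element-wise 16-bit add, written once for any vector length.

    vector_length is supplied at run time, so the same code serves
    accelerators of different widths (for example 32, 128, or 4096 lanes).
    """
    n = len(a)
    out = [0] * n
    for start in range(0, n, vector_length):
        end = min(start + vector_length, n)  # the last strip may be partial
        # Models one vector operation: every lane of the strip is processed together.
        out[start:end] = [(x + y) & 0xFFFF
                          for x, y in zip(a[start:end], b[start:end])]
    return out
```

Calling the function with different `vector_length` values changes only the number of strips, not the result, which is the property that allows one generated program to run on accelerators of different widths.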
Since the local scratchpad memory of Connex-S has a limited capacity, normally 256 KB, we also present how we use the PPCG C-to-C code generator to perform data tiling that minimizes the total kernel execution time, subject to fitting larger program data in the local memory. We devise an accurate cost model for the Connex-S accelerator to choose, at compile time, the tile sizes that give the best performance.
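The tile-size choice described here is a constrained minimization, which can be sketched as follows. The cost model below is a toy stand-in and not the model devised in the paper: the functions `choose_tile_size`, `matmul_tile_bytes`, and `toy_matmul_cost`, the square-tile matrix multiplication example, and the overhead constants are all assumptions made for illustration. Only the 256 KB capacity and the 16-bit element size come from the text.

```python
SCRATCHPAD_BYTES = 256 * 1024  # local memory capacity stated in the abstract
ELEMENT_BYTES = 2              # 16-bit elements


def matmul_tile_bytes(tile):
    """Footprint of three resident tile x tile blocks (A, B, and C)."""
    return 3 * tile * tile * ELEMENT_BYTES


def toy_matmul_cost(tile, n=1024, fixed_overhead=1000, cost_per_element=1):
    """Hypothetical transfer cost for an n x n tiled matrix multiplication.

    Each of the (n / tile)**3 tile steps pays a fixed overhead plus a
    per-element cost for moving two input tiles.
    """
    steps = (n // tile) ** 3
    return steps * (fixed_overhead + 2 * tile * tile * cost_per_element)


def choose_tile_size(candidates, footprint, cost, capacity=SCRATCHPAD_BYTES):
    """Return the cheapest candidate whose footprint fits in the scratchpad."""
    feasible = [t for t in candidates if footprint(t) <= capacity]
    if not feasible:
        raise ValueError("no candidate tile size fits in the local memory")
    return min(feasible, key=cost)


best = choose_tile_size([32, 64, 128, 256], matmul_tile_bytes, toy_matmul_cost)
```

Under these assumed costs larger tiles are always cheaper, so the capacity constraint is what decides the answer: a 256 x 256 tile would need 384 KB and is rejected, leaving 128 as the selected size.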
We successfully compile several simple benchmarks that are frequently used, for example, in high-performance and computer vision embedded applications. We report speedup factors of up to 11.33 when running them on a Connex-S accelerator with 128 16-bit integer lanes, relative to the dual-core ARM Cortex-A9 host, which is clocked at a frequency 6.67 times higher and has a total of two 128-bit Neon SIMD units.
Index Terms
A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory