PANDORA: An Architecture-Independent Parallelizing Approximation-Discovery Framework

Published: 11 November 2020
Abstract

In this article, we introduce a parallelizing approximation-discovery framework, PANDORA, for automatically discovering application- and architecture-specialized approximations of provided code. PANDORA complements existing compilers and runtime optimizers by generating approximations with a range of Pareto-optimal tradeoffs between performance and error, which enables adaptation to different inputs, different user preferences, and different runtime conditions (e.g., battery life). We demonstrate that PANDORA can create parallel approximations of inherently sequential code by discovering alternative implementations that eliminate loop-carried dependencies. For a variety of functions with loop-carried dependencies, PANDORA generates approximations that achieve speedups ranging from 2.3x to 81x, with acceptable error for many usage scenarios. We also demonstrate PANDORA’s architecture-specialized approximations via FPGA experiments, and highlight PANDORA’s discovery capabilities by removing loop-carried dependencies from a recurrence relation with no known closed-form solution.
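
To make the idea of trading error for the removal of a loop-carried dependency concrete, the following is a minimal illustrative sketch and is not taken from PANDORA itself. It assumes a simple first-order recurrence, y[i] = a*y[i-1] + x[i], and replaces it with a truncated weighted window over recent inputs, so that every output can be computed independently. The function names (`iir_sequential`, `iir_windowed`, `max_abs_error`) and the `taps` parameter are hypothetical, chosen only for this example; the windowing technique is a standard hand-derived approximation, not something the framework is claimed to produce.

```python
def iir_sequential(x, a):
    """Exact recurrence y[i] = a*y[i-1] + x[i].

    Each iteration reads the previous result, so the loop carries a dependency.
    """
    y = []
    prev = 0.0
    for xi in x:
        prev = a * prev + xi
        y.append(prev)
    return y


def iir_windowed(x, a, taps):
    """Approximation y[i] ~= sum over k < taps of a**k * x[i-k].

    No output depends on another output, so the outer loop could be
    distributed across threads, SIMD lanes, or hardware pipelines.
    """
    weights = [a ** k for k in range(taps)]

    def one_output(i):
        # Use only the inputs that exist near the start of the sequence.
        return sum(weights[k] * x[i - k] for k in range(min(taps, i + 1)))

    return [one_output(i) for i in range(len(x))]


def max_abs_error(x, a, taps):
    """Largest absolute difference between the exact and windowed outputs."""
    exact = iir_sequential(x, a)
    approx = iir_windowed(x, a, taps)
    return max(abs(e - p) for e, p in zip(exact, approx))


if __name__ == "__main__":
    signal = [1.0] * 50
    for taps in (4, 16, 50):
        print(taps, max_abs_error(signal, 0.5, taps))
```

Varying `taps` in this sketch yields a small family of implementations with different cost and error, which loosely mirrors the performance-versus-error tradeoff the abstract describes; for |a| < 1 the truncation error shrinks as the window grows.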
