Abstract
Due to its flexibility, compute mode is becoming increasingly attractive as a way to implement many of the algorithms that are part of a state-of-the-art rendering pipeline. A key problem commonly encountered in graphics applications is streaming vertex and geometry processing. In a typical triangle mesh, the same vertex is referenced six times on average. To avoid redundant computation during rendering, a post-transform cache is traditionally employed to reuse vertex processing results. However, such a vertex cache generally cannot be implemented efficiently in software and does not scale well as parallelism increases. We explore alternative strategies for reusing per-vertex results on-the-fly during massively-parallel software geometry processing. Given an input stream divided into batches, we analyze the effectiveness of sorting, hashing, and intra-thread-group communication for identifying and exploiting local reuse potential. We design and present four vertex reuse strategies tailored to modern GPU architectures. We demonstrate that, in a variety of applications, these strategies not only achieve effective reuse of vertex processing results, but can also boost performance by up to 2-3x compared to a naïve approach. Curiously, our experiments also show that our batch-based approaches exhibit behavior similar to the OpenGL implementation on current graphics hardware.
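To make the idea of batch-local reuse concrete, the following is a minimal CPU-side sketch, not one of the four strategies from the paper. It assumes an indexed triangle list that is cut into fixed-size batches, and it uses sorting followed by removal of adjacent duplicates to count how many vertex-shader invocations each batch would need. The names `batch_reuse` and `grid_indices` are hypothetical helpers introduced only for illustration.

```python
def grid_indices(n):
    """Index buffer for an n x n grid of quads (two triangles per quad).

    The grid has (n + 1) ** 2 vertices; interior vertices are shared by
    six triangles, matching the typical reuse factor of a triangle mesh.
    """
    width = n + 1
    indices = []
    for y in range(n):
        for x in range(n):
            a = y * width + x      # top-left corner of the quad
            b = a + 1              # top-right
            c = a + width          # bottom-left
            d = c + 1              # bottom-right
            indices += [a, b, c, b, d, c]
    return indices


def batch_reuse(indices, batch_size):
    """Return (vertex references, shader invocations after batch-local dedup).

    Assumption: batch_size is a multiple of 3 so triangles stay whole.
    Reuse is only detected inside a batch, never across batches.
    """
    invocations = 0
    for start in range(0, len(indices), batch_size):
        ordered = sorted(indices[start:start + batch_size])
        # After sorting, duplicates are adjacent; keep the first of each run.
        unique = [v for i, v in enumerate(ordered)
                  if i == 0 or v != ordered[i - 1]]
        invocations += len(unique)
    return len(indices), invocations


if __name__ == "__main__":
    mesh = grid_indices(32)
    for size in (3, 96, 384, len(mesh)):
        refs, shaded = batch_reuse(mesh, size)
        print(f"batch size {size}: {refs} references, {shaded} invocations")
```

With a batch size of 3 (one triangle per batch) no reuse is found, which corresponds to the naïve approach; larger batches expose more of the local reuse potential, at the cost of more work per batch to identify duplicates.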
Supplemental Material
Supplemental movie, appendix, image, and software files are available for On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing.