Abstract
We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1024 topics about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double-precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.
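For orientation, the conventional approach that the abstract contrasts against can be sketched as follows: build the complete table of partial sums of the relative probabilities, then binary-search it with a scaled uniform random value. This is a minimal sequential sketch in Python, not the butterfly-patterned SIMD method of the paper; the function name `draw_index` and its parameters are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def draw_index(weights, u):
    """Pick index i with probability weights[i] / sum(weights).

    weights: non-negative relative probabilities (need not sum to 1).
    u: a uniform random value in [0, 1), supplied by the caller.
    Hypothetical helper for illustration; this is the baseline scheme,
    not the butterfly-patterned table described in the abstract.
    """
    # Complete table of partial sums: prefix[i] = weights[0] + ... + weights[i].
    prefix = list(accumulate(weights))
    # Scale u by the total so the weights do not have to be normalized.
    target = u * prefix[-1]
    # Binary search for the smallest i with prefix[i] > target.
    i = bisect_right(prefix, target)
    # Guard against floating-point rounding pushing the index past the end.
    return min(i, len(weights) - 1)
```

With weights `[1, 2, 3, 4]` the partial sums are `[1, 3, 6, 10]`, so a value `u = 0.25` scales to 2.5 and selects index 1. The technique in the abstract keeps the binary search but replaces the complete prefix table with one that is cheaper to build on a GPU.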
Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming