research-article

Using Butterfly-Patterned Partial Sums to Draw from Discrete Distributions

Published: 26 January 2017

Abstract

We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute because it makes better use of coalesced memory accesses; from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUDA 7.5 on an NVIDIA Titan Black GPU show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1024 topics about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double-precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.
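To make the contrast in the abstract concrete, the following is a minimal sequential sketch in Python, not the paper's SIMD/CUDA code. The first function is the conventional approach: build the complete table of partial sums, then binary-search it. The second illustrates the general idea of reconstructing partial sums on the fly during a binary search, using a Fenwick (binary indexed) tree as a stand-in. The butterfly-patterned layout itself is not specified in the abstract, so the Fenwick tree is an assumption chosen only for illustration, and all function names here are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def draw_with_full_table(weights, u):
    """Baseline: complete table of partial sums, then a binary search.

    weights are non-negative relative probabilities; u is uniform in [0, 1).
    """
    prefix = list(accumulate(weights))      # every partial sum is materialized
    target = u * prefix[-1]
    # Smallest index whose partial sum exceeds the target.
    return min(bisect_right(prefix, target), len(weights) - 1)


def build_fenwick(weights):
    """Build a 1-based Fenwick tree in O(n); each slot holds a block sum.

    NOTE: a stand-in for illustration, not the paper's butterfly layout.
    """
    n = len(weights)
    tree = [0] + list(weights)
    for i in range(1, n + 1):
        parent = i + (i & -i)               # next slot whose block covers slot i
        if parent <= n:
            tree[parent] += tree[i]
    return tree


def draw_with_fenwick(tree, total, u):
    """Binary search that rebuilds partial sums on the fly from block sums."""
    n = len(tree) - 1
    remaining = u * total
    pos = 0
    step = 1 << (n.bit_length() - 1)        # largest power of two <= n
    while step:
        nxt = pos + step
        # Accept the block if the partial sum through nxt is still <= target.
        if nxt <= n and tree[nxt] <= remaining:
            pos = nxt
            remaining -= tree[nxt]
        step >>= 1
    return min(pos, n - 1)


weights = [1, 2, 3, 2]
tree = build_fenwick(weights)
samples = [(draw_with_full_table(weights, u),
            draw_with_fenwick(tree, sum(weights), u))
           for u in (0.0, 0.125, 0.5, 0.75)]
```

Both functions select the same index for a given `u`; the second never stores the complete partial sums, only block sums that the search combines as it descends. The paper's table serves a similar role but is laid out to suit coalesced GPU memory accesses.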
