Faster methods for random sampling

Online: 01 July 1984

Abstract

Several new methods are presented for selecting n records at random without replacement from a file containing N records. Each algorithm selects the records for the sample in a sequential manner, that is, in the same order the records appear in the file. The algorithms are online in that the records for the sample are selected iteratively with no preprocessing. The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations (of the form a^b, for real numbers a and b) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.
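
To make the problem setting concrete, here is a minimal Python sketch of the classical sequential baseline, usually called selection sampling (Algorithm S in Knuth's Vol. 2). It is not the paper's Algorithm D. The function name `selection_sample`, its parameters, and the optional `rng` argument are illustrative assumptions, not names from the paper.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def selection_sample(
    records: Iterable[T], N: int, n: int, rng: Optional[random.Random] = None
) -> Iterator[T]:
    """Yield n of the first N records, in file order, without replacement.

    Hypothetical helper for illustration: one uniform variate is drawn
    per record examined, and only two counters are kept (constant space).
    """
    if not 0 <= n <= N:
        raise ValueError("require 0 <= n <= N")
    rng = rng or random.Random()
    to_pick = n   # records still needed for the sample
    left = N      # records not yet examined
    for record in records:
        if to_pick == 0:
            break  # sample complete; the rest of the file is never read
        # Select the current record with probability to_pick / left.
        if rng.random() * left < to_pick:
            yield record
            to_pick -= 1
        left -= 1


if __name__ == "__main__":
    print(list(selection_sample(range(1, 101), N=100, n=10)))
```

Because each record costs one variate and the scan usually runs most of the way through the file, this baseline takes time proportional to N. That cost is what a method running in O(n) average time is designed to avoid.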

References

  1. Bentley, J.L. and Saxe, J.B. Generating sorted lists of random numbers. ACM Trans. Math. Softw. 6, 3 (Sept. 1980), 359-364.
  2. Ernvall, J. and Nevalainen, O. An algorithm for unbiased random sampling. Comput. J. 25, 1 (January 1982), 45-47.
  3. Fan, C.T., Muller, M.E., and Rezucha, I. Development of sampling plans by using sequential (item-by-item) selection techniques and digital computers. Am. Stat. Assn. J. 57 (June 1962), 387-402.
  4. Jones, T.G. A note on sampling a tape file. Commun. ACM 5, 6 (June 1962), 343.
  5. Kawarasaki, J. and Sibuya, M. Random numbers for simple random sampling without replacement. Keio Math. Sem. Rep. No. 7 (1982), 1-9.
  6. Knuth, D.E. The Art of Computer Programming, Vol. 2, Seminumerical Algorithms. Addison-Wesley, Reading, MA (second edition, 1981).
  7. Lindstrom, E.E. and Vitter, J.S. The design and analysis of BucketSort for bubble memory secondary storage. Tech. Rep. CS-83-23, Brown University, Providence, RI (September 1983). See also U.S. Patent Application Provisional Serial No. 500741 (filed June 3, 1983).
  8. Sedgewick, R. Algorithms. Addison-Wesley, Reading, MA (1983).
  9. Vitter, J.S. Random sampling with a reservoir. Tech. Rep. CS-83-17, Brown University, Providence, RI (July 1983).
  10. Vitter, J.S. Optimum algorithms for two random sampling problems. In Proceedings of the 24th IEEE Symposium on Foundations of Computer Science, Tucson, AZ (November 1983), 65-75.

                    Reviews

                    Robert M. Lynch

The Vitter paper describes several new algorithms for randomly sampling n records sequentially from a file containing N records. It is presented in nine sections, including the Appendix. The main text of the paper focuses on describing the algorithms and presenting performance details, and the Appendix presents the implementation of the algorithms in a pseudocode resembling Pascal. Each of the algorithms can be implemented without preprocessing the records, and each requires a constant amount of space.

In Section 1, the scope of the paper is presented. In Section 2, Algorithm S is described. Algorithm S was introduced by Fan et al. [1] and Jones [2], and it was again presented by Knuth [3]. Briefly, with Algorithm S, the next record in a file is either selected or skipped based upon the value of a generated uniform random variate. This select-skip process continues until the desired number of records n is selected.

In Section 4, Algorithm D is discussed at length, and it is this algorithm that is the focus of the paper. A naive implementation is presented in Section 4, but in Section 5, optimizations are given to improve its performance. These improvements reduce by half the number of variates that must be generated and the number of exponentiation operations that must be performed.

In Section 6, algorithm performance is presented, and in Section 7, empirical comparisons for a FORTRAN 77 implementation on an IBM 3081 are discussed. Run-time comparisons in microseconds are 16n for Algorithm S and 55n for Algorithm D. Further, for Algorithm S the expected number of generated random variates is [(N+1)n]/(n+1), with an expected run time of O(N). Algorithm D has an expected number of generated variates of n and an expected run time of O(n). Both the empirical and expected comparisons make D the algorithm of choice.

Section 8 brings it all together and discusses work related to the algorithms.

The paper is well written and well organized. Statisticians who do not have a computer science background will have little difficulty following the paper. Furthermore, the implementation of the algorithms appears relatively straightforward. As pointed out by Vitter, and by the paper's title, Algorithm D is a major improvement over the sampling algorithms in use today.

REFERENCES

[1] FAN, C.T.; MULLER, M.E.; and REZUCHA, I. Development of sampling plans by using sequential (item-by-item) selection techniques and digital computers, J. Am. Stat. Assoc. 57 (June 1962), 387-402. See CR 4, 1 (Jan.-Feb. 1963), Rev. 3,748.
[2] JONES, T.G. A note on sampling a tape file, Commun. ACM 5 (June 1962), 343.
[3] KNUTH, D.E. The art of computer programming, Vol. 2: seminumerical algorithms, Addison-Wesley, Reading, MA, 1981. See CR 22, 9 (Sept. 1981), Rev. 38,390.
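
The variate counts quoted in the review can be checked with a small simulation. The Python sketch below assumes that the select-or-skip procedure draws one uniform variate per record examined and stops once n records are chosen. Under that assumption the expected count is (N+1)n/(n+1), the expected position of the last selected record. The function names and the trial count are illustrative choices, not taken from the paper.

```python
import random


def variates_used(N: int, n: int, rng: random.Random) -> int:
    """Count uniform variates drawn by one select-or-skip pass that stops after n picks."""
    to_pick, left, used = n, N, 0
    while to_pick > 0:
        used += 1
        # Select the current record with probability to_pick / left.
        if rng.random() * left < to_pick:
            to_pick -= 1
        left -= 1
    return used


def average_variates(N: int, n: int, trials: int = 2000, seed: int = 0) -> float:
    """Average the variate count over repeated passes (hypothetical helper)."""
    rng = random.Random(seed)
    return sum(variates_used(N, n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    N, n = 100, 10
    closed_form = (N + 1) * n / (n + 1)
    print(f"simulated: {average_variates(N, n):.2f}  closed form: {closed_form:.2f}")
```

For n much smaller than N the closed form is close to N, which is why a method that needs only about n variates is attractive.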

                    • Published in

                      Communications of the ACM  Volume 27, Issue 7
                      July 1984
                      96 pages
                      ISSN:0001-0782
                      EISSN:1557-7317
                      DOI:10.1145/358105

                      Copyright © 1984 ACM

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States
