skip to main content
article
Free access

A linear-time probabilistic counting algorithm for database applications

Published: 01 June 1990 Publication History
  • Get Citation Alerts
  • Abstract

    We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design.

    References

    [1]
    ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K.-Y. Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12, 1 (1987), 11-15.
    [2]
    BITTON, D., AND DEWITT, D.J. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8, 2 (June 1983), 255-265.
    [3]
    CONTE, S. D., AND DE BOOR, C. Elementary Numerical Analysis: An Algorithmic Approach. 3rd ed., McGraw-Hill, New York, 1980.
    [4]
    DEMOLOMBE, R. Estimation of the number of tuples satisfying a query expressed in predicate calculus language. In Proceedings of the 6th International Conference on Very Large Data Bases, 1980, pp. 55-63.
    [5]
    FELLER, W. An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd ed., Wiley, New York, 1968.
    [6]
    GELENBE, R., AND GARDY, D. The size of projections of relations satisfying a functional dependency. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City, 1982), pp. 325-333.
    [7]
    JOHNSON, N. L., AND KOTZ, S. Urn Models and Their Application. Wiley, New York, 1977.
    [8]
    MOOD, A. M., GRAYBILL, F. A., AND BOES, D.C. Introduction to the Theory of Statistics. 3rd ed., McGraw-Hill, New York, 1974.
    [9]
    OLKEN, F., AND ROTEM, D. Simple random sampling from relational databases. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 160-169.
    [10]
    ROWE, N.C. Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. SE-11, 10 (Oct. 1985), 1081-1091.
    [11]
    SELINGER, P., ET AL. Access path selection in a relational database management system. In Proceedings of the International Conference on Management of Data, A CM SIGMOD, 1979, pp. 23-34.
    [12]
    SHERMAN, B. A random variable related to the spacing of sample values. Ann. Math. Statistics 21 (1950), 339-361.
    [13]
    VANDER-ZANDEN, B. W., TAYLOR, H. M., AND BITTON, D. Estimating block accesses when attributes are correlated. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 119-127.
    [14]
    WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Estimating block accesses in database organizations: A closed noniterative formula. Commum. ACM, 26, 11 (Nov. 1983), 940-944.
    [15]
    WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. Separability--An approach to physical database design. IEEE Trans. Comput. C-33, 3 (Mar. 1984), 209-222.
    [16]
    WIEDERHOLD, G., AND EL-MARS1, R. The structural model for database design, in Proceedings of the International Conference on Entity Relationship Approach (Los Angeles, Dec. 1979), pp. 247-267.
    [17]
    W|EDERHOLD, G. Database Design. 2nd ed., McGraw-Hill, New York, 1983.
    [18]
    YOUSSEFI, K., AND WONG, E. Query processing in a relational database management system. In Proceedings of the 5th International Conference on Very Large Data Bases, 1979, pp. 409-417.
    [19]
    Yu, C. T., LAM, K., SIu, M. K., AND OZSOYOGLU, M. Performance analysis of three related assignment problems. In Proceedings of the International Conference on Management of Data, A CM SIGMOD, 1979, pp. 82-92.

    Cited By

    View all
    • (2024)ComPipe: A Novel Flow Placement and Measurement Algorithm for Programmable Composite PipelinesElectronics10.3390/electronics1306102213:6(1022)Online publication date: 8-Mar-2024
    • (2024)An Accurate and Invertible Sketch for Super Spread DetectionElectronics10.3390/electronics1301022213:1(222)Online publication date: 3-Jan-2024
    • (2024)UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct CountingProceedings of the VLDB Endowment10.14778/3654621.365463217:7(1655-1668)Online publication date: 1-Mar-2024
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Database Systems
    ACM Transactions on Database Systems  Volume 15, Issue 2
    June 1990
    183 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/78922
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 1990
    Published in TODS Volume 15, Issue 2

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)559
    • Downloads (Last 6 weeks)41

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)ComPipe: A Novel Flow Placement and Measurement Algorithm for Programmable Composite PipelinesElectronics10.3390/electronics1306102213:6(1022)Online publication date: 8-Mar-2024
    • (2024)An Accurate and Invertible Sketch for Super Spread DetectionElectronics10.3390/electronics1301022213:1(222)Online publication date: 3-Jan-2024
    • (2024)UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct CountingProceedings of the VLDB Endowment10.14778/3654621.365463217:7(1655-1668)Online publication date: 1-Mar-2024
    • (2024)An LDP Compatible Sketch for Securely Approximating Set Intersection CardinalitiesProceedings of the ACM on Management of Data10.1145/36392812:1(1-27)Online publication date: 26-Mar-2024
    • (2024)From CountMin to Super kJoin Sketches for Flow Spread EstimationIEEE Transactions on Network Science and Engineering10.1109/TNSE.2023.327966511:3(2353-2370)Online publication date: May-2024
    • (2024)Multi-Resolution Odd Sketch for Mining Extended Jaccard Similarity of Dynamic Streaming setsIEEE Transactions on Network Science and Engineering10.1109/TNSE.2023.327580911:3(2399-2414)Online publication date: May-2024
    • (2024)Cardinality Estimation Adaptive Cuckoo Filters (CE-ACF): Approximate Membership Check and Distinct Query Count for High-Speed Network MonitoringIEEE/ACM Transactions on Networking10.1109/TNET.2023.330230632:2(959-970)Online publication date: 1-Apr-2024
    • (2024)Half-Xor: A Fully-Dynamic Sketch for Estimating the Number of Distinct Values in Big TablesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.335971036:7(3111-3125)Online publication date: Jul-2024
    • (2024)Learning complex predicates for cardinality estimation using recursive neural networksInformation Systems10.1016/j.is.2024.102402124(102402)Online publication date: Sep-2024
    • (2024)SuperGuardian: Superspreader removal for cardinality estimation in data streamingInformation Systems10.1016/j.is.2024.102351122(102351)Online publication date: May-2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media