research-article

Sublinear estimation of entropy and information distances

Published: 06 November 2009

Abstract

In many data mining and machine learning problems, the data items that need to be clustered or classified are not arbitrary points in a high-dimensional space, but are distributions, that is, points on a high-dimensional simplex. For distributions, natural measures are not ℓp distances, but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences. Similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component in algorithms for manipulating distributions. Since the datasets involved are typically massive, these algorithms need to have only sublinear complexity in order to be feasible in practice.
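For concreteness, the quantities named above can be computed exactly when the full distributions are available. The following is a minimal sketch, assuming distributions are given as plain Python lists that sum to 1. The function names are illustrative and not taken from the paper, and the Hellinger convention used here (sum of squared differences of square roots, with no 1/2 factor and no outer square root) is one of several in common use.

```python
import math

def entropy(p):
    """Shannon entropy of p in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Returns infinity when q assigns zero mass to an item that p supports.
    """
    total = 0.0
    for x, y in zip(p, q):
        if x > 0:
            if y == 0:
                return math.inf
            total += x * math.log2(x / y)
    return total

def hellinger(p, q):
    """Hellinger divergence: sum_i (sqrt(p_i) - sqrt(q_i))^2 (assumed convention)."""
    return sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))

# Two points on the 3-dimensional simplex (illustrative values).
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(entropy(p), kl_divergence(p, q), hellinger(p, q))
```

These exact computations read every coordinate, so they take time linear in the support size; the sublinear algorithms discussed in the abstract aim to approximate such quantities without doing so.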

We present a range of sublinear-time algorithms in various oracle models in which the algorithm accesses the data via an oracle that supports various queries. In particular, we answer a question posed by Batu et al. on testing whether two distributions are close in an information-theoretic sense given independent samples. We then present optimal algorithms for estimating various information-divergences and entropy with a more powerful oracle called the combined oracle that was also considered by Batu et al. Finally, we consider sublinear-space algorithms for these quantities in the data-stream model. In the course of doing so, we explore the relationship between the aforementioned oracle models and the data-stream model. This continues work initiated by Feigenbaum et al. An important additional component to the study is considering data streams that are ordered randomly rather than just those which are ordered adversarially.
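To illustrate what access through a combined oracle can look like, here is a minimal sketch, assuming the oracle supports two queries: drawing an index i according to p, and returning the probability p_i of a given index. The class `CombinedOracle` and the function `estimate_entropy` are hypothetical names for illustration. The estimator shown is the simple sample mean of log(1/p_i), whose expectation equals the entropy; it is not claimed to be the algorithm from the paper, and the sketch does not address how many samples are needed for a given accuracy guarantee.

```python
import math
import random

class CombinedOracle:
    """Hypothetical combined oracle over a hidden distribution p.

    sample() draws an index i with probability p_i;
    prob(i) returns p_i. Both calls are counted as queries.
    """

    def __init__(self, p, seed=0):
        self._p = list(p)
        self._rng = random.Random(seed)
        self.queries = 0

    def sample(self):
        self.queries += 1
        return self._rng.choices(range(len(self._p)), weights=self._p)[0]

    def prob(self, i):
        self.queries += 1
        return self._p[i]

def estimate_entropy(oracle, num_samples):
    """Average log2(1 / p_i) over sampled indices i.

    Since E_{i ~ p}[log2(1 / p_i)] = H(p), the average is an unbiased
    estimate of the entropy in bits. A sampled index always has p_i > 0.
    """
    total = 0.0
    for _ in range(num_samples):
        i = oracle.sample()
        total += math.log2(1.0 / oracle.prob(i))
    return total / num_samples

oracle = CombinedOracle([0.5, 0.25, 0.25], seed=1)
print(estimate_entropy(oracle, 20000), oracle.queries)
```

The estimator uses two oracle queries per sample, so its cost depends on the number of samples rather than on the size of the support.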

References

  1. Ali, S. M., and Silvey, S. D. 1966. A general class of coefficients of divergence of one distribution from another. J. Royal Statist. Soc. Series B 28, 131--142.
  2. Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147.
  3. Amari, S. 1985. Differential-Geometrical Methods in Statistics. Springer-Verlag, Berlin.
  4. Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705--1749.
  5. Bar-Yossef, Z. 2002. The complexity of massive data set computations. Ph.D. thesis, University of California at Berkeley.
  6. Bar-Yossef, Z., Kumar, R., and Sivakumar, D. 2001. Sampling algorithms: Lower bounds and applications. In Proceedings of the ACM Symposium on Theory of Computing. 266--275.
  7. Batu, T., Dasgupta, S., Kumar, R., and Rubinfeld, R. 2005. The complexity of approximating the entropy. SIAM J. Comput. 35, 1, 132--150.
  8. Batu, T., Fortnow, L., Rubinfeld, R., Smith, W. D., and White, P. 2000. Testing that distributions are close. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 259--269.
  9. Bhuvanagiri, L., and Ganguly, S. 2006. Estimating entropy over data streams. In Proceedings of the Annual European Symposium on Algorithms (ESA). 148--159.
  10. Chakrabarti, A., Cormode, G., and McGregor, A. 2007. A near-optimal algorithm for computing the entropy of a stream. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 328--335.
  11. Čencov, N. N. 1981. Statistical decision rules and optimal inference. Transl. Math. Monographs, American Mathematical Society (Providence).
  12. Chakrabarti, A., Do Ba, K., and Muthukrishnan, S. 2006. Estimating entropy and entropy norm on data streams. In Proceedings of the Symposium on Theoretical Aspects of Computer Science. 196--205.
  13. Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York.
  14. Csiszár, I. 1991. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist., 2032--2056.
  15. Demaine, E. D., López-Ortiz, A., and Munro, J. I. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms. 348--360.
  16. Dhillon, I. S., Mallela, S., and Kumar, R. 2003. A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265--1287.
  17. Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002a. An approximate L1 difference algorithm for massive data streams. SIAM J. Comput. 32, 1, 131--151.
  18. Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002b. Testing and spot-checking of data streams. Algorithmica 34, 1, 67--80.
  19. Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2001. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the International Conference on Very Large Databases. 79--88.
  20. Gu, Y., McCallum, A., and Towsley, D. 2005. Detecting anomalies in network traffic using maximum entropy estimation. In Proceedings of the Internet Measurement Conference. 345--350.
  21. Guha, S., and McGregor, A. 2006. Approximate quantiles and the order of the stream. In Proceedings of the ACM Symposium on Principles of Database Systems. 273--279.
  22. Guha, S., and McGregor, A. 2007a. Lower bounds for quantile estimation in random-order and multi-pass streaming. In Proceedings of the International Colloquium on Automata, Languages and Programming. 704--715.
  23. Guha, S., and McGregor, A. 2007b. Space-efficient sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). 169--176.
  24. Guha, S., McGregor, A., and Venkatasubramanian, S. 2006. Streaming and sublinear approximation of entropy and information distances. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 733--742.
  25. Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. In External Memory Algorithms, 107--118.
  26. Kearns, M. J., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., and Sellie, L. 1994. On the learnability of discrete distributions. In Proceedings of the ACM Symposium on Theory of Computing. 273--282.
  27. Krishnamurthy, B., Venkatasubramanian, S., and Madhyastha, H. V. 2005. On stationarity in internet measurements through an information-theoretic lens. In Proceedings of the ICDE Workshops. 1185.
  28. Lall, A., Sekar, V., Ogihara, M., Xu, J., and Zhang, H. 2006. Data streaming algorithms for estimating entropy of network traffic. In SIGMETRICS/Performance, R. A. Marie, P. B. Key, and E. Smirni, Eds. ACM, 145--156.
  29. Liese, F., and Vajda, F. 1987. Convex statistical distances. Teubner-Texte zur Mathematik, Band 95, Leipzig.
  30. Munro, J. I., and Paterson, M. 1980. Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315--323.
  31. Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379--423 and 623--656.
  32. Tishby, N., Pereira, F., and Bialek, W. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing. 368--377.
  33. Topsøe, F. 2000. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 46, 4, 1602--1609.
  34. Wagner, A., and Plattner, B. 2005. Entropy based worm and anomaly detection in fast IP networks. In Proceedings of the IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises. 172--177.
  35. Xu, K., Zhang, Z.-L., and Bhattacharyya, S. 2005. Profiling internet backbone traffic: Behavior models and applications. In Proceedings of the ACM/Data Communications Festival (SIGCOMM). 169--180.

      • Published in

        ACM Transactions on Algorithms, Volume 5, Issue 4
        October 2009
        281 pages
        ISSN: 1549-6325
        EISSN: 1549-6333
        DOI: 10.1145/1597036

        Copyright © 2009 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 November 2009
        • Revised: 1 October 2007
        • Accepted: 1 October 2007
        • Received: 1 December 2006

        Qualifiers

        • research-article
        • Research
        • Refereed
