Abstract
In many data mining and machine learning problems, the data items to be clustered or classified are not arbitrary points in a high-dimensional space but distributions, that is, points on a high-dimensional simplex. For distributions, the natural measures are not ℓp distances but information-theoretic measures such as the Kullback-Leibler and Hellinger divergences. Similarly, quantities such as the entropy of a distribution are more natural than frequency moments. Efficient estimation of these quantities is a key component in algorithms for manipulating distributions. Because the datasets involved are typically massive, these algorithms must have sublinear complexity to be feasible in practice.
We present a range of sublinear-time algorithms in several oracle models, in which the algorithm accesses the data via an oracle that supports various queries. In particular, we answer a question posed by Batu et al. on testing whether two distributions are close in an information-theoretic sense given independent samples. We then present optimal algorithms for estimating various information divergences and entropy with a more powerful oracle, called the combined oracle, that was also considered by Batu et al. Finally, we consider sublinear-space algorithms for these quantities in the data-stream model. In the course of doing so, we explore the relationship between the aforementioned oracle models and the data-stream model, continuing work initiated by Feigenbaum et al. An important additional component of the study is the consideration of randomly ordered data streams, not just adversarially ordered ones.
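For readers who want the quantities named above in concrete form, the following is a minimal Python sketch of the textbook definitions of entropy, Kullback-Leibler divergence, and Hellinger distance, plus a naive plug-in entropy estimate from independent samples. It is not one of the paper's algorithms: the function names are hypothetical, logarithms are taken base 2, and the Hellinger normalization (the factor 1/2 under the square root) is one common convention among several.

```python
import math
import random
from collections import Counter


def entropy(p):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log2(p_i / q_i); infinite if q_i = 0 < p_i."""
    total = 0.0
    for x, y in zip(p, q):
        if x > 0:
            if y == 0:
                return math.inf
            total += x * math.log2(x / y)
    return total


def hellinger(p, q):
    """Hellinger distance, normalized (by assumption) to lie in [0, 1]."""
    s = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return math.sqrt(0.5 * s)


def plugin_entropy(samples):
    """Naive plug-in estimate: entropy of the empirical distribution.

    This is only a baseline for illustration, not a sublinear algorithm
    from the paper.
    """
    counts = Counter(samples)
    m = len(samples)
    return entropy([c / m for c in counts.values()])


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    q = [0.25, 0.25, 0.5]
    print("H(p)       =", entropy(p))
    print("KL(p || q) =", kl_divergence(p, q))
    print("Hellinger  =", hellinger(p, q))

    # Draw independent samples from p and compare the plug-in estimate.
    rng = random.Random(0)
    samples = rng.choices(range(len(p)), weights=p, k=1000)
    print("plug-in H  =", plugin_entropy(samples))
```

The plug-in estimate reads every sample and stores a count per distinct item, which is exactly the cost that sublinear-time and sublinear-space algorithms aim to avoid.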
- Ali, S. M., and Silvey, S. D. 1966. A general class of coefficients of divergence of one distribution from another. J. Royal Statist. Soc. Series B 28, 131--142.
- Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147.
- Amari, S. 1985. Differential-Geometrical Methods in Statistics. Springer-Verlag, Berlin.
- Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. 2005. Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705--1749.
- Bar-Yossef, Z. 2002. The complexity of massive data set computations. Ph.D. thesis, University of California at Berkeley.
- Bar-Yossef, Z., Kumar, R., and Sivakumar, D. 2001. Sampling algorithms: Lower bounds and applications. In Proceedings of the ACM Symposium on Theory of Computing. 266--275.
- Batu, T., Dasgupta, S., Kumar, R., and Rubinfeld, R. 2005. The complexity of approximating the entropy. SIAM J. Comput. 35, 1, 132--150.
- Batu, T., Fortnow, L., Rubinfeld, R., Smith, W. D., and White, P. 2000. Testing that distributions are close. In Proceedings of the IEEE Symposium on Foundations of Computer Science. 259--269.
- Bhuvanagiri, L., and Ganguly, S. 2006. Estimating entropy over data streams. In Proceedings of the Annual European Symposium on Algorithms (ESA). 148--159.
- Chakrabarti, A., Cormode, G., and McGregor, A. 2007. A near-optimal algorithm for computing the entropy of a stream. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 328--335.
- Čencov, N. N. 1981. Statistical decision rules and optimal inference. Transl. Math. Monographs, American Mathematical Society (Providence).
- Chakrabarti, A., Do Ba, K., and Muthukrishnan, S. 2006. Estimating entropy and entropy norm on data streams. In Proceedings of the Symposium on Theoretical Aspects of Computer Science. 196--205.
- Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York.
- Csiszár, I. 1991. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist., 2032--2056.
- Demaine, E. D., López-Ortiz, A., and Munro, J. I. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms. 348--360.
- Dhillon, I. S., Mallela, S., and Kumar, R. 2003. A divisive information-theoretic feature clustering algorithm for text classification. J. Mach. Learn. Res. 3, 1265--1287.
- Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002a. An approximate L1 difference algorithm for massive data streams. SIAM J. Comput. 32, 1, 131--151.
- Feigenbaum, J., Kannan, S., Strauss, M., and Viswanathan, M. 2002b. Testing and spot-checking of data streams. Algorithmica 34, 1, 67--80.
- Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2001. Surfing wavelets on streams: One-Pass summaries for approximate aggregate queries. In Proceedings of the International Conference on Very Large Databases. 79--88.
- Gu, Y., McCallum, A., and Towsley, D. 2005. Detecting anomalies in network traffic using maximum entropy estimation. In Proceedings of the Internet Measurement Conference. 345--350.
- Guha, S., and McGregor, A. 2006. Approximate quantiles and the order of the stream. In Proceedings of the ACM Symposium on Principles of Database Systems. 273--279.
- Guha, S., and McGregor, A. 2007a. Lower bounds for quantile estimation in random-order and multi-pass streaming. In Proceedings of the International Colloquium on Automata, Languages and Programming. 704--715.
- Guha, S., and McGregor, A. 2007b. Space-Efficient sampling. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). 169--176.
- Guha, S., McGregor, A., and Venkatasubramanian, S. 2006. Streaming and sublinear approximation of entropy and information distances. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 733--742.
- Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. In External Memory Algorithms, 107--118.
- Kearns, M. J., Mansour, Y., Ron, D., Rubinfeld, R., Schapire, R. E., and Sellie, L. 1994. On the learnability of discrete distributions. In Proceedings of the ACM Symposium on Theory of Computing. 273--282.
- Krishnamurthy, B., Venkatasubramanian, S., and Madhyastha, H. V. 2005. On stationarity in internet measurements through an information-theoretic lens. In Proceedings of the ICDE Workshops. 1185.
- Lall, A., Sekar, V., Ogihara, M., Xu, J., and Zhang, H. 2006. Data streaming algorithms for estimating entropy of network traffic. In SIGMETRICS/Performance, R. A. Marie, P. B. Key, and E. Smirni, Eds. ACM, 145--156.
- Liese, F., and Vajda, F. 1987. Convex statistical distances. Teubner-Texte zur Mathematik, Band 95, Leipzig.
- Munro, J. I., and Paterson, M. 1980. Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315--323.
- Shannon, C. E. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379--423 and 623--656.
- Tishby, N., Pereira, F., and Bialek, W. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing. 368--377.
- Topsøe, F. 2000. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 46, 4, 1602--1609.
- Wagner, A., and Plattner, B. 2005. Entropy based worm and anomaly detection in fast IP networks. In Proceedings of the IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises. 172--177.
- Xu, K., Zhang, Z.-L., and Bhattacharyya, S. 2005. Profiling internet backbone traffic: Behavior models and applications. In Proceedings of the ACM/Data Communications Festival (SIGCOMM). 169--180.
Index Terms
Sublinear estimation of entropy and information distances