ABSTRACT
A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any sub-exponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.
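To make the confidence-computation problem concrete, the following minimal sketch handles a deliberately simplified case: the query is a deterministic finite automaton that only accepts or rejects, rather than a transducer that produces output. It is not the algorithm from the paper. The function name `acceptance_probability`, the dictionary-based encoding of the Markov sequence, and the example automaton are all illustrative assumptions. The sketch computes the probability that a random string drawn from the Markov sequence is accepted, using dynamic programming over pairs of (last symbol, automaton state).

```python
from collections import defaultdict


def acceptance_probability(initial, transitions, delta, start, accepting):
    """Probability that a Markov sequence is accepted by a DFA.

    Hypothetical encoding (an assumption, not taken from the paper):
      initial      dict: symbol -> probability of the first symbol
      transitions  list of dicts, one per step; transitions[i][a][b] is the
                   probability that the next symbol is b given the current is a
      delta        dict: (dfa_state, symbol) -> dfa_state; a missing entry rejects
      start        initial DFA state
      accepting    set of accepting DFA states
    """
    # weights[(a, q)]: probability that the prefix read so far ends in
    # symbol a and leaves the DFA in state q.
    weights = defaultdict(float)
    for a, p in initial.items():
        q = delta.get((start, a))
        if q is not None:
            weights[(a, q)] += p

    for step in transitions:
        nxt = defaultdict(float)
        for (a, q), w in weights.items():
            for b, p in step.get(a, {}).items():
                q2 = delta.get((q, b))
                if q2 is not None:
                    nxt[(b, q2)] += w * p
        weights = nxt

    return sum(w for (_, q), w in weights.items() if q in accepting)


# Example: length-3 sequence over {a, b} with uniform probabilities, and a
# DFA accepting strings that contain at least one 'a'.
uniform = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
contains_a = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}
prob = acceptance_probability(
    {"a": 0.5, "b": 0.5}, [uniform, uniform], contains_a, 0, {1}
)
```

In this example only the string `bbb` is rejected, so `prob` equals 1 - 0.5^3 = 0.875. The running time is polynomial in the sequence length, the alphabet size, and the number of automaton states. The sketch covers only this Boolean special case and does not address the transducer queries or the ranked-enumeration problem described in the abstract.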