DOI: 10.1145/1807085.1807090

Transducing Markov sequences

Published: 06 June 2010

ABSTRACT

A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any sub-exponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.
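The query semantics summarized above can be illustrated with a small brute-force sketch. It is not the paper's algorithm: it enumerates every possible world of a tiny Markov sequence, so it takes time exponential in the sequence length and only shows what "confidence of an answer" and "decreasing order of confidence" mean. It assumes a deterministic transducer without accepting-state conditions, and every name in it (`world_probability`, `transduce`, `answer_confidences`, the toy distributions) is hypothetical.

```python
from collections import defaultdict
from itertools import product


def world_probability(world, initial, transitions):
    """Probability of one possible world (a concrete string) of the sequence."""
    p = initial.get(world[0], 0.0)
    for i in range(len(world) - 1):
        # transitions[i] maps (symbol at position i, symbol at position i+1) to a probability
        p *= transitions[i].get((world[i], world[i + 1]), 0.0)
    return p


def transduce(world, start, delta):
    """Run a deterministic transducer; delta maps (state, symbol) to (next state, emitted string).

    Returns None when the transducer has no move, i.e. the world yields no answer.
    """
    state, out = start, []
    for symbol in world:
        if (state, symbol) not in delta:
            return None
        state, emitted = delta[(state, symbol)]
        out.append(emitted)
    return "".join(out)


def answer_confidences(alphabet, length, initial, transitions, start, delta):
    """Confidence of each answer: total probability of the worlds that produce it."""
    conf = defaultdict(float)
    for world in product(alphabet, repeat=length):
        answer = transduce(world, start, delta)
        if answer is not None:
            conf[answer] += world_probability(world, initial, transitions)
    return dict(conf)


# Toy Markov sequence of length 3 over {a, b} (assumed numbers, for illustration only).
initial = {"a": 0.5, "b": 0.5}
step = {("a", "a"): 0.75, ("a", "b"): 0.25, ("b", "a"): 0.5, ("b", "b"): 0.5}
transitions = [step, step]

# Toy one-state transducer: emit "1" for every "a", nothing for "b".
delta = {("q", "a"): ("q", "1"), ("q", "b"): ("q", "")}

confidences = answer_confidences("ab", 3, initial, transitions, "q", delta)
# Answers in order of decreasing confidence; a top-k list is a prefix of this ranking.
ranked = sorted(confidences.items(), key=lambda item: -item[1])
```

In this toy instance several worlds map to the same answer (for example `aab`, `aba`, and `baa` all yield `11`), which is why confidence is a sum over worlds rather than the probability of a single most likely string.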

