skip to main content
research-article

Fast XML document filtering by sequencing twig patterns

Authors Info & Claims
Published:14 October 2009Publication History
Skip Abstract Section

Abstract

XML-enabled publish-subscribe (pub-sub) systems have emerged as an increasingly important tool for e-commerce and Internet applications. In a typical pub-sub system, subscribed users specify their interests in a profile expressed in the XPath language. Each new data content is then matched against the user profiles so that the content is delivered only to the interested subscribers. As the number of subscribed users and their profiles can grow very large, the scalability of the service is critical to the success of pub-sub systems. In this article, we propose a novel scalable filtering system called iFiST that transforms user profiles of a twig pattern expressed in XPath into sequences using the Prüfer's method. Consequently, instead of breaking a twig pattern into multiple linear paths and matching them separately, FiST performs holistic matching of twig patterns with each incoming document in a bottom-up fashion. FiST organizes the sequences into a dynamic hash-based index for efficient filtering, and exploits the commonality among user profiles to enable shared processing during the filtering phase. We demonstrate that the holistic matching approach reduces filtering cost and memory consumption, thereby improving the scalability of FiST.

References

  1. Altinel, M. and Franklin, M. J. 2000. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the 26th VLDB Conference. 53--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Apache. Apache Xerces C++ parser. http://xml.apache.org/xerces-c/.Google ScholarGoogle Scholar
  3. Bar-Yossef, Z., Fontoura, M., and Josifovski, V. 2004. On the memory requirements of XPath evaluation over XML streams. In Proceedings of the 23rd ACM Symposium on Principles of Database Systems. 177--188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bar-Yossef, Z., Fontoura, M., and Josifovski, V. 2005. Buffering in query evaluation over XML streams. In Proceedings of the 24th ACM Symposium on Principles of Database Systems. 216--227. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Berglund, A., Boag, S., Chamberlin, D., Fernáandez, M. F., Kay, M., Robie, J., and Siméon, J. XML path language (XPath) 2.0. http://www.w3.org/TR/xpath20/.Google ScholarGoogle Scholar
  6. Boag, S., Chamberlin, D., Fernández, M. F., Florescu, D., Robie, J., and Siméon, J. XQuery 1.0: An XML query language. http://www.w3.org/TR/xquery/.Google ScholarGoogle Scholar
  7. Bow, C., Hughes, B., and Bird, S. 2003. Towards a general model of interlinear text. In Proceedings of EMELD Workshop.Google ScholarGoogle Scholar
  8. Bruno, N., Gravano, L., Koudas, N., and Srivastava, D. 2003. Navigation- vs. index-based XML multi-query processing. In Proceedings of the 19th IEEE International Conference on Data Engineering. 139--150.Google ScholarGoogle Scholar
  9. Bruno, N., Koudas, N., and Srivastava, D. 2002. Holistic twig joins: Optimal XML pattern matching. In Proceedings of the 2002 ACM-SIGMOD Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Candan, K. S., Hsiung, W.-P., Chen, S., Tatemura, J., and Agrawal, D. 2006. AFilter: Adaptable XML filtering with prefix-caching and suffix-clustering. In Proceedings of the 32nd VLDB Conference. 559--570. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Carzaniga, A., Rutherford, M. J., and Wolf, A. L. 2004. A routing scheme for content-based networking. In Proceedings of IEEE InfoCom 2004. 918--928.Google ScholarGoogle Scholar
  12. Castro, M., Druschel, P., marie Kermarrec, A., and Rowstron, A. 2002. Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE J. Select. Areas Comm. 20, 8, 1489--1499. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chan, C. Y., Felber, P., Garofalakis, M. N., and Rastogi, R. 2002a. Efficient filtering of XML documents with XPath expressions. In Proceedings of the 18th IEEE International Conference on Data Engineering. 235--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chan, C. Y., Felber, P., Garofalakis, M. N., and Rastogi, R. 2002b. Efficient filtering of XML documents with XPath expressions. VLDB J. 11, 4, 354--379. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Chan, C. Y. and Ni, Y. 2007. Efficient XML data dissemination with piggybacking. In Proceedings of the ACM-SIGMOD Conference. 737--748. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Chandramouli, B., Phillips, J., and Yang, J. 2007. Value-Based notification conditions in large-scale publish/subscribe systems. In Proceedings of the 33rd VLDB Conference. 878--889. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Chen, S., Li, H.-G., Tatemura, J., Hsiung, W.-P., Agrawal, D., and Candan, K. S. 2006a. Twig2 Stack: Bottom-Up processing of generalized-tree-pattern queries over XML documents. In Proceedings of the 32nd VLDB Conference. 283--294. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Chen, Y., Davidson, S. B., and Zheng, Y. 2006b. An efficient XPath query processor for XML streams. In Proceedings of the 22nd IEEE International Conference on Data Engineering. 79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Chiu, A.-T. and Hsu, J.-L. 2006. An automaton-based filtering system for streaming musicxml. In Proceedings of the International Conference on Semantic Web&Web Services. 177--178.Google ScholarGoogle Scholar
  20. Clark, J. 1999. XSL transformations (XSLT) version 1.0. http://www.w3.org/TR/xslt/.Google ScholarGoogle Scholar
  21. Diao, Y., Altinel, M., Michael J. Franklin, H. Z., and Fischer, P. 2003. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Datab. Syst. 28, 4, 467--516. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Diao, Y., Rizvi, S., and Franklin, M. J. 2004. Towards an Internet-scale xml dissemination service. In Proceedings of the International Conference on Very Large Databases. 612--623. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Diaz, A. L. and Lovell, D. 1999. XML generator. http://www.alphaworks.ibm.com/tech/xmlgenerator.Google ScholarGoogle Scholar
  24. Fenner, W. and Srivastava, D. 2005. XTreeNet: Scalable overlay networks for XML content dissemination and querying. In Proceedings of the 10th International Workshop on Web Content Caching and Distribution. 41--46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Gong, X., Yan, Y., Qian, W., and Zhou, A. 2005. Bloom filter-based XML packets filtering for millions of path queries. In Proceedings of the 21st IEEE International Conference on Data Engineering. 890--901. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Green, T. J., Gupta, A., Miklau, G., Onizuka, M., and Suciu, D. 2004. Processing XML streams with deterministic automata and stream indexes. ACM Trans. Datab. Syst. 29, 4, 752--788. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Green, T. J., Miklau, G., Onizuka, M., and Suciu, D. 2003. Processing XML streams with deterministic automata. In Proceedings of the 9th International Conference on Database Theory. 173--189. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Gupta, A. K. and Suciu, D. 2003. Stream processing of XPath queries with predicates. In Proceedings of the ACM-SIGMOD Conference. ACM Press, 419--430. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. He, B., Luo, Q., and Choi, B. 2005. Cache-Conscious automata for XML filtering. In Proceedings of the 21st IEEE International Conference on Data Engineering. 878--889. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. He, B., Luo, Q., and Choi, B. 2006. Cache-Conscious automata for XML filtering. IEEE Trans. Knowl. Data Eng. 18, 12, 1629--1644. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Hong, M., Demers, A. J., Gehrke, J., Koch, C., Riedewald, M., and White, W. M. 2007. Massively multi-query join processing in publish/subscribe systems. In Proceedings of the ACM-SIGMOD Conference. 761--772. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Hou, S. and Jacobsen, H.-A. 2006. Predicate-Based filtering of XPath expressions. In Proceedings of the 22nd IEEE International Conference on Data Engineering. 53. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Kwon, J., Rao, P., Moon, B., and Lee, S. 2005. FiST: Scalable XML document filtering by sequencing twig patterns. In Proceedings of the 31st VLDB Conference. 217--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Kwon, J., Rao, P., Moon, B., and Lee, S. 2007. Value-Based predicate filtering of streaming XML data. In Proceedings of the 1st International Workshop on Data Management in Ubiquitous Computing. 266--271.Google ScholarGoogle Scholar
  35. Kwon, J., Rao, P., Moon, B., and Lee, S. 2008. Value-Based predicate filtering of XML documents. Data Knowl. Eng. 67, 1, 51--73. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Lewis, W. D. Personal communications. http://zimmer.csufresno.edu/~wlewis/.Google ScholarGoogle Scholar
  37. Li, G., Hou, S., and Jacobsen, H.-A. 2008. Routing of XML and XPath queries in data dissemination networks. In Proceedings of the 28th International Conference on Distributed Computing Systems. 627--638. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Li, Q. and Moon, B. 2001. Indexing and querying XML data for regular path expressions. In Proceedings of the 27th VLDB Conference. 361--370. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Lu, J., Chen, T., and Ling, T. W. 2004. Efficient processing of xml twig patterns with parent child edges: A look-ahead approach. In Proceedings of ACM the 13th International Conference on Information and Knowledge Management. 533--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Ludäscher, B., Mukhopadhyay, P., and Papakonstantinou, Y. 2002. A transducer-based XML query processor. In Proceedings of the 28th VLDB Conference. 227--238. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Megginson, D. Simple API for XML. http://sax.sourceforge.net/.Google ScholarGoogle Scholar
  42. Miliaraki, I., Kaoudi, Z., and Koubarakis, M. 2008. Xml data dissemination using automata on top of structured overlay networks. In Proceedings of the 17th International World Wide Web Conference. ACM, New York, 865--874. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Milo, T., Zur, T., and Verbin, E. 2007. Boosting topic-based publish-subscribe systems with dynamic clustering. In Proceedings of the ACM-SIGMOD Conference. 749--760. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Moro, M. M., Bakalov, P., and Tsotras, V. J. 2007. Early profile pruning on XML-aware publish-subscribe systems. In Proceedings of the 33rd VLDB Conference. 866--877. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Müller, K. 2004. Semi-Automatic construction of a question treebank. In Proceedings of the 4th International Conference on Language Resources and Evaluation.Google ScholarGoogle Scholar
  46. MusicXML. MusicXML definition. http://www.recordare.com//xml.html.Google ScholarGoogle Scholar
  47. NITF. NITF: News industry text format. http://www.nitf.org/.Google ScholarGoogle Scholar
  48. Peng, F. and Chawathe, S. S. 2003. XPath queries on streaming data. In Proceedings of the ACM-SIGMOD Conference. ACM Press, 431--442. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Prüfer, H. 1918. Neuer Beweis eines satzes über permutationen. Archiv für Mathematik und Physik 27, 142--144.Google ScholarGoogle Scholar
  50. Ramasubramanian, V., Peterson, R., and Sirer, E. G. 2006. Corona: A high performance publish-subscribe system for the World Wide Web. In Proceedings of the 3rd Conference on Networked Systems Design&Implementation (NSDI'06). USENIX Association, 2--2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Rao, P., Cappos, J., Khare, V., Moon, B., and Zhang, B. 2007. Net-x: Unified data-centric Internet services. In Proceedings of 3rd International Workshop on Networking Meets Databases (NetDB'07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Rao, P. and Moon, B. 2004. PRIX: Indexing and querying XML using Prüfer sequences. In Proceedings of the 20th IEEE International Conference on Data Engineering. 288--299. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Rao, P. and Moon, B. 2006. Sequencing XML data and query twigs for fast pattern matching. ACM Trans. Datab. Syst. 31, 1, 299--345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Shah, R., Ramzan, Z., Jain, R., Dendukuri, R., and Anjum, F. 2004. Efficient dissemination of personalized information using content-based multicast. IEEE Trans. Mobile Comput. 3, 4, 394--408. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Tian, F., Reinwald, B., Pirahesh, H., Mayr, T., and Myllymaki, J. 2004. Implementing a scalable XML publish/subscribe system using a relational database system. In Proceedings of the ACM-SIGMOD Conference. 479--490. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Treebank. The Penn treebank project. http://www.cis.upenn.edu/~treebank/.Google ScholarGoogle Scholar
  57. XMark. XMark - An XML benchmark project. http://www.xml-benchmark.org/.Google ScholarGoogle Scholar
  58. Yan, T. W. and Garcia-Molina, H. 1999. The SIFT information dissemination system. ACM Trans. Datab. Syst. 24, 4, 529--565. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast XML document filtering by sequencing twig patterns

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in

                Full Access

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader
                About Cookies On This Site

                We use cookies to ensure that we give you the best experience on our website.

                Learn more

                Got it!