research-article

Storing semi-structured data on disk drives

Published: 12 June 2009

Abstract

Applications that manage semi-structured data are becoming increasingly commonplace. Current approaches for storing semi-structured data use existing storage machinery: they either map the data to relational databases or use a combination of flat files and indexes. While employing these existing storage mechanisms provides readily available solutions, their suitability for this class of data needs closer examination. In particular, retrofitting existing solutions for semi-structured data can result in a mismatch between the tree structure of the data and the access characteristics of the underlying storage device (disk drive). This study explores the design space of native storage solutions for semi-structured data, examining alternative approaches that match application data access characteristics to those of the underlying disk drive. To evaluate the effectiveness of the proposed native techniques relative to the existing solutions, we experiment with XML data using the XPathMark benchmark. Extensive evaluation reveals the strengths and weaknesses of the proposed native data layout techniques. While the existing solutions work well for deep-focused queries into a semi-structured document (those that retrieve entire subtrees), the proposed native solutions substantially outperform them on non-deep-focused queries, which we demonstrate are at least as important as deep-focused ones. We believe that native data layout techniques offer a unique direction for improving the performance of semi-structured data stores for a variety of important workloads. However, because the proposed native techniques require circumventing current storage stack abstractions, further investigation is warranted before they can be applied to general-purpose storage systems.
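The mismatch between tree structure and on-disk order described in the abstract can be illustrated with a toy model. The sketch below is not the paper's layout technique. It only compares two generic orderings of a complete tree, document order (pre-order, as in a serialized flat file) and level order, and counts how many fixed-size blocks and how many contiguous runs of slots a query touches. The fanout, depth, block size, and all function names are assumptions chosen for illustration, and contiguous runs serve only as a rough stand-in for disk seeks.

```python
from collections import deque


def build_tree(fanout, depth):
    """Build a complete tree as a list of child lists; node 0 is the root."""
    children = [[]]
    frontier = [0]
    for _ in range(depth):
        next_frontier = []
        for parent in frontier:
            for _ in range(fanout):
                children.append([])
                child = len(children) - 1
                children[parent].append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return children


def document_order(children):
    """Pre-order (depth-first) layout, as in a serialized flat file."""
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children[node]))
    return order


def level_order(children):
    """Breadth-first layout: siblings sit next to each other."""
    order, queue = [], deque([0])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children[node])
    return order


def subtree(children, root):
    """All nodes in the subtree rooted at `root` (a deep-focused query)."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(children[node])
    return nodes


def blocks_touched(order, wanted, block_size):
    """Number of distinct blocks holding the wanted nodes under a layout."""
    position = {node: i for i, node in enumerate(order)}
    return len({position[n] // block_size for n in wanted})


def contiguous_runs(order, wanted):
    """Number of maximal runs of adjacent slots (assumes `wanted` is non-empty)."""
    position = {node: i for i, node in enumerate(order)}
    slots = sorted(position[n] for n in wanted)
    return 1 + sum(1 for a, b in zip(slots, slots[1:]) if b != a + 1)


if __name__ == "__main__":
    tree = build_tree(fanout=4, depth=4)          # hypothetical document shape
    layouts = {"document order": document_order(tree),
               "level order": level_order(tree)}
    queries = {"children of root (non-deep)": tree[0],
               "subtree of node 1 (deep)": subtree(tree, 1)}
    for layout_name, order in layouts.items():
        for query_name, wanted in queries.items():
            print(layout_name, "|", query_name, "|",
                  "blocks:", blocks_touched(order, wanted, block_size=8),
                  "runs:", contiguous_runs(order, wanted))
```

In this toy setting, a whole-subtree read is a single contiguous run under document order, while a children-only read is a single run under level order. This loosely mirrors the abstract's point that no single layout serves deep-focused and non-deep-focused queries equally well.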


Published in

ACM Transactions on Storage, Volume 5, Issue 2
June 2009
95 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/1534912

Copyright © 2009 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 12 June 2009
• Accepted: 1 January 2009
• Revised: 1 December 2008
• Received: 1 December 2007

Qualifiers

• research-article
• Research
• Refereed
