Abstract
Applications that manage semi-structured data are becoming increasingly commonplace. Current approaches to storing semi-structured data rely on existing storage machinery: they either map the data to relational databases or use a combination of flat files and indexes. While these existing storage mechanisms provide readily available solutions, their suitability for this class of data needs closer examination. In particular, retrofitting existing solutions for semi-structured data can result in a mismatch between the tree structure of the data and the access characteristics of the underlying storage device (the disk drive). This study explores the design space of native storage solutions for semi-structured data, examining alternative approaches that match application data access characteristics to those of the underlying disk drive. To evaluate the effectiveness of the proposed native techniques relative to the existing solutions, we experiment with XML data using the XPathMark benchmark. Extensive evaluation reveals the strengths and weaknesses of the proposed native data layout techniques. While the existing solutions work well for deep-focused queries into a semi-structured document (those that retrieve entire subtrees), the proposed native solutions substantially outperform them for non-deep-focused queries, which we demonstrate are at least as important as deep-focused ones. We believe that native data layout techniques offer a unique direction for improving the performance of semi-structured data stores for a variety of important workloads. However, because the proposed native techniques require circumventing current storage stack abstractions, further investigation is warranted before they can be applied to general-purpose storage systems.
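To make the distinction between the two query classes concrete, the following minimal sketch contrasts a query that retrieves an entire subtree with one that touches only a single child of many sibling elements. It is an illustration under stated assumptions, not the paper's implementation: the tiny XMark-like document, the helper `subtree_size`, and the classification of the second query as non-deep-focused are all hypothetical readings of the definitions given in the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical, XMark-like toy document (not taken from the paper).
DOC = """
<site>
  <people>
    <person id="p0"><name>Ann</name><address><city>Miami</city><country>US</country></address></person>
    <person id="p1"><name>Bob</name><address><city>Lima</city><country>PE</country></address></person>
  </people>
</site>
"""

root = ET.fromstring(DOC)


def subtree_size(elem):
    """Count the element itself plus all of its descendants."""
    return sum(1 for _ in elem.iter())


# Deep-focused (assumed reading): select one element and retrieve
# its entire subtree, so every descendant must be read.
deep = root.findall("./people/person[@id='p1']")
deep_nodes = sum(subtree_size(e) for e in deep)

# Non-deep-focused (assumed reading): visit many siblings but read
# only one child of each, leaving the rest of each subtree untouched.
shallow = root.findall("./people/person/name")
shallow_names = [e.text for e in shallow]

print(deep_nodes)      # number of elements in the retrieved subtree
print(shallow_names)   # one value per person, without the address subtrees
```

A layout that stores each subtree contiguously favours the first query, while the second jumps between subtrees; that is the kind of access pattern the abstract argues a native layout should account for.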
Index Terms
Storing semi-structured data on disk drives