research-article
Open Access

Structure interpretation of text formats

Published: 13 November 2020

Abstract

Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, and mixtures of formats in which data in one format is embedded in another. Parsing these files into a structured tabular form, a prerequisite for any downstream data processing, is therefore a significant challenge.

We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc, and mixed-format files. The framework is easily extended to support previously unseen formats, and it lets the user provide examples to guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combinations of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which the file structure can be shared and refined by different techniques. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation, in which the variety of techniques supported by our system can be exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel extracts the intended table in many more cases than state-of-the-art techniques.




Published in

Proceedings of the ACM on Programming Languages, Volume 4, Issue OOPSLA
November 2020
3108 pages
EISSN: 2475-1421
DOI: 10.1145/3436718

Copyright © 2020 Owner/Author

Publisher: Association for Computing Machinery, New York, NY, United States
