Abstract
Data repositories often consist of text files in a wide variety of standard formats, ad-hoc formats, and mixtures in which data in one format is embedded in another. Parsing these files into a structured tabular form, a prerequisite for downstream data processing, is therefore a significant challenge.
We present Unravel, an extensible framework for structure interpretation of ad-hoc formats. Unravel can automatically, with no user input, extract tabular data from a diverse range of standard, ad-hoc, and mixed-format files. The framework is easily extended to support previously unseen formats, and it accepts examples from the user to guide the system when specialized data extraction is desired. Our key insight is to allow arbitrary combinations of extraction and parsing techniques through a concept called partial structures. Partial structures act as a common language through which different techniques can share and refine the file structure. This makes Unravel more powerful than applying the individual techniques in parallel or sequentially. Further, with this rule-based extensible approach, we introduce the novel notion of re-interpretation, in which the variety of techniques supported by our system is exploited to improve accuracy while optimizing for particular quality measures or restricted environments. On our benchmark of 617 text files gathered from a variety of sources, Unravel extracts the intended table in many more cases than state-of-the-art techniques.
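To make the idea of partial structures concrete, the following is a minimal sketch and not Unravel's actual implementation. It assumes a partial structure can be modeled as a tree whose leaves are either interpreted values or still-uninterpreted text, and that each technique is a rule that refines one uninterpreted leaf. The names `Raw`, `parse_json`, `split_lines`, `split_delimited`, and `refine`, as well as the sample log lines, are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Raw:
    """A region of text that no technique has interpreted yet (hypothetical)."""
    text: str


# A rule inspects one uninterpreted region and either refines it or declines.
Rule = Callable[[Raw], Optional[Any]]


def parse_json(node: Raw) -> Optional[Any]:
    """Interpret the region as a JSON object or array, if it is one."""
    try:
        value = json.loads(node.text)
    except ValueError:
        return None
    return value if isinstance(value, (dict, list)) else None


def split_lines(node: Raw) -> Optional[Any]:
    """Split a multi-line region into one uninterpreted region per line."""
    lines = [ln for ln in node.text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None
    return [Raw(ln) for ln in lines]


def split_delimited(node: Raw, sep: str = "|") -> Optional[Any]:
    """Split a region on a delimiter, leaving each field uninterpreted."""
    if sep not in node.text:
        return None
    return [Raw(part.strip()) for part in node.text.split(sep)]


def refine(node: Any, rules: List[Rule]) -> Any:
    """Apply rules to every uninterpreted region until none applies."""
    if isinstance(node, Raw):
        for rule in rules:
            result = rule(node)
            if result is not None:
                # The rule produced a more refined partial structure,
                # which other rules may refine further.
                return refine(result, rules)
        return node.text  # No rule applies: keep the text as a plain string.
    if isinstance(node, list):
        return [refine(child, rules) for child in node]
    if isinstance(node, dict):
        return {key: refine(value, rules) for key, value in node.items()}
    return node


RULES: List[Rule] = [parse_json, split_lines, split_delimited]

# A mixed-format input: delimited log lines with a JSON payload in one field.
sample = (
    '2019-11-20 | INFO | {"user": "a", "ms": 12}\n'
    '2019-11-21 | WARN | {"user": "b", "ms": 40}'
)
rows = refine(Raw(sample), RULES)
```

In this sketch no single rule can interpret the sample file: the line splitter, the delimiter splitter, and the JSON parser each handle only part of it, and they cooperate solely through the shared tree of `Raw` and interpreted nodes. The rule order is a design choice of the sketch (JSON is tried first so that a delimiter inside a JSON payload does not split it), not something stated in the abstract.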
Index Terms
Structure interpretation of text formats