ABSTRACT
Most prior work on information extraction has focused on extracting information from the text of digital documents. Often, however, the most important information reported in an article is presented in tabular form. If the data reported in tables can be extracted and stored in a database, it can be queried and joined with other data using database management systems. To prepare the data source for table search, accurate table boundary detection is crucial, because the later table structure decomposition depends on it. Table boundary detection and content extraction are challenging because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection by exploiting the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results compare the performance of the machine learning methods with that of a heuristics-based method and demonstrate the effectiveness of sparse line analysis for table boundary detection.
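The sparse-line idea can be illustrated with a minimal sketch. The abstract does not give the exact sparsity criteria or the eight label types, so everything below is an assumption made for illustration: the function names, the 80-character page width, the minimum gap of three spaces, and the 0.5 fill threshold are all hypothetical. The sketch only shows the preprocessing step of keeping candidate lines; a classifier such as a CRF or SVM would then label those candidates.

```python
import re


def is_sparse_line(line: str, page_width: int = 80,
                   min_gap: int = 3, max_fill: float = 0.5) -> bool:
    """Hypothetical sparsity test (not the paper's actual criteria).

    A line is treated as sparse if it contains a wide run of spaces
    between two non-space characters, or if its non-space characters
    fill only a small fraction of the assumed page width.
    """
    text = line.rstrip()
    if not text.strip():
        # Blank lines carry no content and are not candidates.
        return False
    # A run of min_gap or more spaces between tokens suggests columns.
    has_wide_gap = re.search(r"\S {%d,}\S" % min_gap, text) is not None
    # Fraction of the assumed page width covered by non-space characters.
    fill = len(text.replace(" ", "")) / page_width
    return has_wide_gap or fill < max_fill


def sparse_line_indices(lines):
    """Return the indices of lines that pass the sparsity test."""
    return [i for i, line in enumerate(lines) if is_sparse_line(line)]


if __name__ == "__main__":
    page = [
        "x" * 70,                      # dense body text, no wide gaps
        "Item      Count     Unit",    # column-like spacing
        "",                            # blank line
        "Short heading",               # short line: sparse but not a table row
    ]
    print(sparse_line_indices(page))
```

Note that a short heading also passes this test. Under these assumptions, sparse lines are only candidates, which is why a later labeling step is still needed to separate table rows from other sparse lines.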
Index Terms
Identifying table boundaries in digital documents via sparse line detection




