DOI: 10.1145/1458082.1458255 (CIKM conference proceedings)
research-article

Identifying table boundaries in digital documents via sparse line detection

Online: 26 October 2008

ABSTRACT

Most prior work on information extraction has focused on extracting information from text in digital documents. However, the most important information reported in an article is often presented in tabular form. If the data reported in tables can be extracted and stored in a database, it can be queried and joined with other data using database management systems. To prepare the data source for table search, accurate table boundary detection is crucial for the later table structure decomposition. Table boundary detection and content extraction are challenging because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection performance by exploiting the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results compare the performance of the machine learning methods against a heuristics-based method and demonstrate the effectiveness of sparse line analysis for table boundary detection.
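The abstract does not spell out how a sparse line is recognized, so the following is only an illustrative sketch of the general idea of filtering a page down to sparse lines before any boundary labeling. It assumes, hypothetically, that a text line counts as sparse when it contains a wide gap between words or fills only a small part of the page width. The function names, parameters, and thresholds (`is_sparse_line`, `candidate_table_blocks`, `page_width`, `min_gap`, `max_fill`, `min_rows`) are invented for this example and are not taken from the paper, and the sample lines are made up.

```python
import re


def is_sparse_line(line, page_width=80, min_gap=3, max_fill=0.5):
    """Hypothetical heuristic, not the paper's definition.

    A non-empty line is treated as sparse if it has a run of at least
    `min_gap` spaces between two words, or if its text fills less than
    `max_fill` of the assumed page width.
    """
    text = line.strip()
    if not text:
        return False
    has_wide_gap = re.search(r"\S {%d,}\S" % min_gap, text) is not None
    is_short = len(text) < max_fill * page_width
    return has_wide_gap or is_short


def candidate_table_blocks(lines, min_rows=2):
    """Group consecutive sparse lines into (start, end) index ranges, inclusive."""
    blocks = []
    start = None
    for i, line in enumerate(lines):
        if is_sparse_line(line):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_rows:
                blocks.append((start, i - 1))
            start = None
    # Close a block that runs to the end of the input.
    if start is not None and len(lines) - start >= min_rows:
        blocks.append((start, len(lines) - 1))
    return blocks


# Made-up sample page: two dense paragraph lines around two table-like rows.
sample = [
    "This is a full paragraph line of text that spans most of the page width here ok.",
    "name      col1      col2",
    "row1      a         b",
    "Another full-width paragraph line follows the table and continues the running text.",
]
blocks = candidate_table_blocks(sample)
```

A filter of this kind also flags short non-table lines such as headings, which is why a later labeling step (the abstract mentions CRF and SVM classifiers over eight line label types) would still be needed to decide which sparse lines actually belong to a table.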
