DOI: 10.1145/3395027.3419580
research-article

PDF2LaTeX: A Deep Learning System to Convert Mathematical Documents from PDF to LaTeX

Published: 29 September 2020

ABSTRACT

The mathematical content of scientific publications in PDF format cannot be easily analyzed by regular PDF parsers and OCR tools. In this paper, we propose a novel OCR system called PDF2LaTeX, which extracts math expressions and text from both PostScript-based and image-based PDF files and translates them into LaTeX markup. As a preprocessing step, PDF2LaTeX first renders each PDF page as an image and then uses projection profile cutting (PPC) to analyze the page layout. The analysis of math expressions and text is based on a series of deep learning algorithms. First, the system uses a convolutional neural network (CNN) as a binary classifier to detect math image blocks based on visual features. Next, it uses a conditional random field (CRF) to detect math-text boundaries by incorporating semantic and contextual information. Finally, it uses two different models based on a CNN-LSTM neural network architecture to translate image blocks of math expressions and plain text into their LaTeX representations. For testing, we created a new dataset composed of 102 PDF pages collected from publications on arXiv.org and compared the performance of PDF2LaTeX with that of the state-of-the-art commercial software InftyReader. The experimental results show that the proposed system achieves better recognition accuracy (81.1%), measured by the string edit distance between the predicted LaTeX and the ground truth.
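Two of the steps named in the abstract, projection profile cutting and accuracy based on string edit distance, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `min_gap` parameter, the binary-image representation (1 = ink, 0 = background), and the normalization of the edit distance by the longer string are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Optional, Tuple


def projection_profile(image: List[List[int]], axis: int) -> List[int]:
    """Count ink pixels (value 1) per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def cut_profile(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Split a profile into (start, end) runs containing ink, end exclusive.

    Runs separated by fewer than `min_gap` blank positions are merged
    (`min_gap` is a hypothetical tuning parameter, not taken from the paper).
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i                      # a run of ink begins
        elif count == 0 and start is not None:
            segments.append((start, i))    # the run ends at a blank position
            start = None
    if start is not None:
        segments.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], seg[1])  # gap too small: merge runs
        else:
            merged.append(seg)
    return merged


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_accuracy(predicted: str, truth: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, truth) / longest
```

Applying `cut_profile` to the row profile of a page image yields candidate text-line bands; applying it again to the column profile of each band would split the band into blocks, which is the usual recursive form of PPC.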


Published in

DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020
September 2020
130 pages
ISBN: 9781450380003
DOI: 10.1145/3395027

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 157 of 475 submissions, 33%
