ABSTRACT
The mathematical content of scientific publications in PDF format cannot be easily analyzed by regular PDF parsers and OCR tools. In this paper, we propose a novel OCR system called PDF2LaTeX, which extracts math expressions and text from both PostScript-based and image-based PDF files and translates them into LaTeX markup. As a preprocessing step, PDF2LaTeX first renders a PDF file into images and then uses projection profile cutting (PPC) to analyze the page layout. The analysis of math expressions and text is based on a series of deep learning algorithms. First, the system uses a convolutional neural network (CNN) as a binary classifier to detect math image blocks based on visual features. Next, it uses a conditional random field (CRF) to detect math-text boundaries by incorporating semantic and contextual information. Finally, it uses two different models based on a CNN-LSTM neural network architecture to translate image blocks of math expressions and plain text into their LaTeX representations. For testing, we created a new dataset composed of 102 PDF pages collected from publications on arXiv.org and compared the performance of PDF2LaTeX with that of the state-of-the-art commercial software InftyReader. The experimental results show that the proposed system achieves higher recognition accuracy (81.1%), as measured by the string edit distance between the predicted LaTeX and the ground truth.
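The projection profile cutting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the binary-image representation, the function names (`projection_profile`, `cut_runs`, `segment_page`), and the gap thresholds are all assumptions chosen for illustration. The idea is to sum ink pixels along rows to find text lines, then sum along columns within each line to find blocks separated by sufficiently wide blank gaps.

```python
from typing import List, Tuple

# Assumed representation: a binarized page, 1 = ink, 0 = background.
Image = List[List[int]]


def projection_profile(image: Image, axis: int) -> List[int]:
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def cut_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges of ink in a profile.

    Two ink regions are kept as one run unless at least `min_gap`
    consecutive blank positions separate them.
    """
    runs = []
    start = None  # index where the current run began, if any
    gap = 0       # length of the blank stretch seen since the last ink
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Trailing blanks shorter than min_gap are not part of the run.
        runs.append((start, len(profile) - gap))
    return runs


def segment_page(image: Image, line_gap: int = 1,
                 block_gap: int = 2) -> List[Tuple[int, int, int, int]]:
    """Cut a page into lines, then each line into blocks.

    Returns (top, bottom, left, right) boxes with half-open bounds.
    The thresholds are hypothetical and would need tuning per resolution.
    """
    blocks = []
    for top, bottom in cut_runs(projection_profile(image, 0), line_gap):
        line = image[top:bottom]
        for left, right in cut_runs(projection_profile(line, 1), block_gap):
            blocks.append((top, bottom, left, right))
    return blocks


page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
]
boxes = segment_page(page)
```

On this toy page the sketch finds two lines, the first of which is split into two blocks because its blank gap is at least `block_gap` columns wide. In a pipeline like the one described, each resulting block would then be passed on to a classifier.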
PDF2LaTeX: A Deep Learning System to Convert Mathematical Documents from PDF to LaTeX