ABSTRACT

In this demo paper, we present Docear's PDF Inspector (DPI). DPI extracts titles from academic PDF files by applying a simple heuristic: the largest text on the first page of a PDF is assumed to be the title. This heuristic achieves an accuracy of around 70% and outperforms the tools ParsCit and SciPlore Xtract in both run time and accuracy. In addition, DPI is written in Java, runs on any major operating system, and is released under the free, open-source GPL 2+ license at http://www.docear.org.
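The heuristic described above can be illustrated with a minimal Java sketch. This is not DPI's source code: the class `TitleHeuristicSketch`, the type `TextRun`, and the method `guessTitle` are hypothetical names introduced for illustration. The sketch assumes the text runs of the first page, with their font sizes, have already been obtained from some PDF library. It also assumes that a title spanning several lines appears as several runs of the same size, which are joined in reading order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "largest text on the first page" heuristic.
// This is an assumption-laden sketch, not the actual DPI implementation.
class TitleHeuristicSketch {

    // One run of text from the first page together with its font size in points.
    // A real tool would obtain these from a PDF library; here they are supplied by hand.
    static final class TextRun {
        final String text;
        final double fontSize;

        TextRun(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Returns the text set in the largest font size on the page.
    // Runs sharing that size are joined with a space, in the order given.
    // Returns an empty string if there are no non-blank runs.
    static String guessTitle(List<TextRun> firstPageRuns) {
        double largest = 0.0;
        for (TextRun run : firstPageRuns) {
            if (!run.text.trim().isEmpty()) {
                largest = Math.max(largest, run.fontSize);
            }
        }
        List<String> parts = new ArrayList<>();
        for (TextRun run : firstPageRuns) {
            String trimmed = run.text.trim();
            // Small tolerance so sizes such as 17.999 and 18.0 count as equal.
            if (!trimmed.isEmpty() && Math.abs(run.fontSize - largest) < 0.01) {
                parts.add(trimmed);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<TextRun> page = new ArrayList<>();
        page.add(new TextRun("Proceedings of an Example Conference", 8.0));
        page.add(new TextRun("Docear's PDF Inspector:", 18.0));
        page.add(new TextRun("Title Extraction from PDF Files", 18.0));
        page.add(new TextRun("ABSTRACT", 10.0));
        // Prints: Docear's PDF Inspector: Title Extraction from PDF Files
        System.out.println(guessTitle(page));
    }
}
```

The sketch also suggests why such a heuristic can fail: if a journal logo or a section heading is set larger than the title, that text is returned instead. This is consistent with the reported accuracy being well below 100%.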
References
- Beel, J., Gipp, B., Langer, S. and Genzmehr, M. 2011. Docear: An Academic Literature Suite for Searching, Organizing and Creating Academic Literature. Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries (2011), 465–466.
- Beel, J., Gipp, B., Shaker, A. and Friedrich, N. 2010. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size). Research and Advanced Technology for Digital Libraries, Proceedings of the 14th European Conference on Digital Libraries (ECDL'10) (Glasgow (UK), Sep. 2010), 413–416.
- Councill, I.G., Giles, C.L. and Kan, M.Y. 2008. ParsCit: An open-source CRF reference string parsing package. Proceedings of LREC (2008), 661–667.
- Han, H., Giles, C.L., Manavoglu, E., Zha, H., Zhang, Z. and Fox, E.A. 2003. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries (2003), 37–48.
- Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D. and Zheng, Q. 2006. Automatic extraction of titles from general documents using machine learning. Information Processing and Management. 42 (2006), 1276–1293.
- Peng, F. and McCallum, A. 2004. Accurate information extraction from research papers using conditional random fields. HLT-NAACL04 (2004), 329–336.
Index Terms
Docear's PDF inspector: title extraction from PDF files




