DOI: 10.1145/2467696.2467789
demonstration

Docear's PDF inspector: title extraction from PDF files

Online: 22 July 2013

ABSTRACT

In this demo paper, we present Docear's PDF Inspector (DPI). DPI extracts titles from academic PDF files by applying a simple heuristic: the largest text on the first page of a PDF is assumed to be the title. This heuristic achieves an accuracy of around 70% and outperforms the tools ParsCit and SciPlore Xtract in both run time and accuracy. In addition, DPI is released under the free, open-source GPL 2+ license at http://www.docear.org, is written in Java, and runs on any major operating system.
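The heuristic described in the abstract can be illustrated with a minimal Java sketch. This is not DPI's actual implementation. It assumes that text runs and their font sizes have already been extracted from the first page by some PDF library. The `TextSpan` type, the `extractTitle` method, the rounding tolerance, and the sample data are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TitleHeuristic {

    /** Hypothetical container: a run of first-page text and its font size. */
    static final class TextSpan {
        final String text;
        final float fontSize;

        TextSpan(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the text set in the largest font size among the given
     * first-page spans, joined in the order given, or an empty string
     * if there are no spans.
     */
    static String extractTitle(List<TextSpan> firstPageSpans) {
        // Pass 1: find the largest font size on the page.
        float maxSize = Float.NEGATIVE_INFINITY;
        for (TextSpan span : firstPageSpans) {
            maxSize = Math.max(maxSize, span.fontSize);
        }

        // Pass 2: collect every span set in that size.
        List<String> parts = new ArrayList<>();
        for (TextSpan span : firstPageSpans) {
            // Assumed tolerance to absorb rounding noise in extracted sizes.
            if (Math.abs(span.fontSize - maxSize) < 0.01f) {
                parts.add(span.text.trim());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Invented sample data standing in for a parsed first page.
        List<TextSpan> spans = new ArrayList<>();
        spans.add(new TextSpan("Proceedings of Some Conference", 9f));
        spans.add(new TextSpan("Docear's PDF Inspector:", 18f));
        spans.add(new TextSpan("Title Extraction from PDF Files", 18f));
        spans.add(new TextSpan("ABSTRACT", 12f));
        System.out.println(extractTitle(spans));
    }
}
```

In this sketch a title that wraps across two lines is recovered because both lines share the largest font size. The same rule would pick the wrong text on a page where something other than the title, such as a journal name, is set in the largest type, which is consistent with the abstract reporting an accuracy of around 70% rather than a perfect score.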

References

  1. Beel, J., Gipp, B., Langer, S. and Genzmehr, M. 2011. Docear: An Academic Literature Suite for Searching, Organizing and Creating Academic Literature. Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries (2011), 465--466.
  2. Beel, J., Gipp, B., Shaker, A. and Friedrich, N. 2010. SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size). Research and Advanced Technology for Digital Libraries, Proceedings of the 14th European Conference on Digital Libraries (ECDL'10) (Glasgow (UK), Sep. 2010), 413--416.
  3. Councill, I.G., Giles, C.L. and Kan, M.Y. 2008. ParsCit: An open-source CRF reference string parsing package. Proceedings of LREC (2008), 661--667.
  4. Han, H., Giles, C.L., Manavoglu, E., Zha, H., Zhang, Z. and Fox, E.A. 2003. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries (2003), 37--48.
  5. Hu, Y., Li, H., Cao, Y., Teng, L., Meyerzon, D. and Zheng, Q. 2006. Automatic extraction of titles from general documents using machine learning. Information Processing and Management. 42, (2006), 1276--1293.
  6. Peng, F. and McCallum, A. 2004. Accurate information extraction from research papers using conditional random fields. HLT-NAACL04 (2004), 329--336.

Published in

JCDL '13: Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
July 2013
480 pages
ISBN: 9781450320771
DOI: 10.1145/2467696

Copyright © 2013 Copyright is held by the owner/author(s)

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 334 of 1,195 submissions, 28%
