DOI: 10.1145/2756406.2756940
short-paper

No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving

Published: 21 June 2015

Abstract

The citation of resources is a fundamental part of scholarly discourse. Due to the popularity of the web, scholarly articles increasingly reference web resources (e.g., software and data). However, due to the dynamic nature of the web, the referenced links may become inaccessible ('rotten') some time after publication, returning a "404 Not Found" HTTP error. In this paper, we first present preliminary findings from a study of the persistence and availability of web resources referenced from papers in a large-scale scholarly repository. We reaffirm previous findings that link rot is a serious problem in the scholarly world and that current web archives do not preserve all rotten links. Therefore, a more pro-active archival solution needs to be developed to further preserve web content referenced in scholarly articles. To this end, we propose to apply machine learning techniques to train a link rot predictor for use by an archival framework to prioritise pro-active archiving of links that are more likely to become rotten. We demonstrate that we can obtain a fairly high link rot prediction AUC (0.72) with only a small set of features. By simulation, we also show that our prediction framework is more effective than current web archives at preserving links that are likely to become rotten. This work has potential impact for the scholarly world: publishers can use this framework to prioritise the archiving of links for digital preservation, especially when there is a large quantity of links to be archived.
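
To make the two ideas in the abstract concrete (scoring a predictor by AUC, and using its scores to prioritise archiving), here is a minimal Python sketch. It is not the authors' implementation: the function names `roc_auc` and `prioritise`, the `budget` parameter, the example URLs, and the scores and labels are all hypothetical and chosen only for illustration. AUC is computed here as the fraction of (rotten, live) link pairs in which the rotten link received the higher score, with ties counted as half.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a rotten link (label 1) outscores a live one (label 0)."""
    rotten = [s for y, s in zip(labels, scores) if y == 1]
    live = [s for y, s in zip(labels, scores) if y == 0]
    if not rotten or not live:
        raise ValueError("need at least one rotten and one live link")
    wins = 0.0
    for r in rotten:
        for l in live:
            if r > l:
                wins += 1.0   # rotten link ranked above live link
            elif r == l:
                wins += 0.5   # ties count as half a win
    return wins / (len(rotten) * len(live))


def prioritise(urls: Sequence[str], scores: Sequence[float], budget: int) -> List[str]:
    """Return up to `budget` URLs, highest predicted rot score first."""
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [url for url, _ in ranked[:budget]]


# Made-up example data (assumption): predicted rot scores and observed outcomes.
urls = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/c",
    "http://example.org/d",
]
scores = [0.9, 0.2, 0.6, 0.4]   # hypothetical predictor output
labels = [1, 0, 0, 1]           # 1 = link later found rotten, 0 = still live

print(roc_auc(labels, scores))      # 0.75 for this toy data
print(prioritise(urls, scores, 2))  # the two links to archive first
```

With a limited archiving budget, a ranking like this archives the links judged most at risk first; an AUC of 0.5 would mean the ranking is no better than random ordering.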

Published In

JCDL '15: Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2015
324 pages
ISBN: 9781450335942
DOI: 10.1145/2756406
General Chairs: Paul Logasa Bogen, Suzie Allard, Holly Mercer, Micah Beck
Program Chairs: Sally Jo Cunningham, Dion Goh, Geneva Henry
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. digital preservation
  2. repositories
  3. web persistence

Qualifiers

  • Short-paper

Conference

JCDL '15: 15th ACM/IEEE-CS Joint Conference on Digital Libraries
June 21 - 25, 2015
Knoxville, Tennessee, USA

Acceptance Rates

JCDL '15 paper acceptance rate: 18 of 60 submissions (30%)
Overall acceptance rate: 415 of 1,482 submissions (28%)
