Uncovering the unarchived web

Thaer Samar, H.C. Huurdeman, A. Ben-David, J. Kamps, and Arjen P. de Vries

2014-07-01

Presented at the Annual ACM SIGIR Conference, Gold Coast, QLD, Australia

Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain or by crawling a pre-defined list of websites selected by the archiving institution. With either method, crawling harvests more information than just the websites intended for preservation. This additional information can be used to reconstruct impressions of pages that existed on the live web at the crawl date but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's aura: the web documents that were not included in the archived collection but are known to have existed, because they are mentioned on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text, and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
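The idea of an aura can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the crawl has already been reduced to a list of archived URLs and a list of link records of the form (source URL, target URL, anchor text, crawl date). The names `AuraEntry` and `build_aura`, the record layout, and the example URLs are all hypothetical. For each link target that is not itself archived, the sketch collects the three kinds of evidence named in the abstract: anchor text, link structure (the linking pages), and crawl date distribution.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class AuraEntry:
    """Representation of one unarchived URL, built only from link evidence."""
    anchor_terms: Counter = field(default_factory=Counter)  # anchor text terms
    source_pages: set = field(default_factory=set)          # archived pages linking here
    crawl_dates: Counter = field(default_factory=Counter)   # when the links were seen


def build_aura(archived_urls, links):
    """Collect evidence about link targets that are not in the archive.

    `links` is an iterable of (source, target, anchor_text, crawl_date)
    tuples; this layout is an assumption made for illustration.
    """
    archived = set(archived_urls)
    aura = defaultdict(AuraEntry)
    for source, target, anchor_text, crawl_date in links:
        # Keep only links from archived pages to pages outside the archive.
        if source not in archived or target in archived:
            continue
        entry = aura[target]
        entry.anchor_terms.update(anchor_text.lower().split())
        entry.source_pages.add(source)
        entry.crawl_dates[crawl_date] += 1
    return dict(aura)


# Toy data: two archived pages, one of which is also a link target.
archived_urls = ["http://a.example/", "http://b.example/"]
links = [
    ("http://a.example/", "http://b.example/", "partner site", "2012-03-01"),
    ("http://a.example/", "http://c.example/news", "Latest news", "2012-03-01"),
    ("http://b.example/", "http://c.example/news", "news archive", "2012-06-15"),
]
aura = build_aura(archived_urls, links)

for url, entry in aura.items():
    print(url, dict(entry.anchor_terms), len(entry.source_pages))
```

In this toy data, only `http://c.example/news` ends up in the aura, because the link to `http://b.example/` points at a page that was archived. A real pipeline would also need URL normalization and duplicate handling, which the sketch leaves out.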

Additional Metadata
Keywords	Web Archives, Web Archiving, Web Crawlers, Anchor Text, Web Graph, Information Retrieval
THEME	Information (theme 2)
Publisher	ACM
Project	Web Archives Retrieval Tools
Conference	Annual ACM SIGIR Conference
Organisation	Human-Centered Data Analytics
Citation	Samar, T., Huurdeman, H. C., Ben-David, A., Kamps, J., & de Vries, A. (2014). Uncovering the unarchived web. In Proceedings of Annual ACM SIGIR Conference 2014 (SIGIR 37) (pp. 1199–1202). ACM.