Finding pages on the unarchived Web

Kamps, J.; Ben-David, A.; Huurdeman, H.C.; de Vries, Arjen; Samar, Thaer

J. Kamps, A. Ben-David, H.C. Huurdeman, A.P. de Vries (Arjen) and T. Samar (Thaer)

2014

Presented at the IEEE/ACM Joint Conference on Digital Libraries, London, United Kingdom

Web archives preserve the fast-changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies: most of the Web is unarchived and therefore lost to posterity. In this paper, we propose an approach to recover significant parts of the unarchived Web, by reconstructing descriptions of these pages based on links and anchors in the set of crawled pages, and experiment with this approach on the Dutch Web archive. Our main findings are threefold. First, the crawled Web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of the Web archive. Second, the link and anchor descriptions have a highly skewed distribution: popular pages such as home pages have more terms, but the richness tapers off quickly. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived Web: in a known-item search setting we can retrieve these pages within the first ranks on average.
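
The idea of describing an unarchived page through the links and anchors that point to it can be illustrated with a minimal sketch. This is not the authors' implementation: the input format (tuples of source URL, target URL, anchor text), the function names, the whitespace tokenization, and the term-frequency scoring are all assumptions chosen for brevity, and the URLs are made-up examples.

```python
from collections import Counter, defaultdict


def build_anchor_representations(links, crawled_urls):
    """Aggregate anchor-text terms for link targets that were never crawled.

    links: iterable of (source_url, target_url, anchor_text) tuples taken
           from crawled pages (hypothetical input format).
    crawled_urls: set of URLs that are present in the archive.
    """
    representations = defaultdict(Counter)
    for _source_url, target_url, anchor_text in links:
        if target_url in crawled_urls:
            continue  # target is archived, so nothing to reconstruct
        # Naive tokenization (assumption): lowercase and split on whitespace.
        representations[target_url].update(anchor_text.lower().split())
    return dict(representations)


def rank_unarchived(query, representations):
    """Rank unarchived URLs by the summed frequency of the query terms.

    A simple stand-in for a real retrieval model (assumption).
    """
    terms = query.lower().split()
    scores = {
        url: sum(counts[term] for term in terms)
        for url, counts in representations.items()
    }
    matching = [url for url, score in scores.items() if score > 0]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(matching, key=lambda url: (-scores[url], url))


# Made-up example data: two crawled pages linking to two uncrawled ones.
links = [
    ("http://a.example/", "http://x.example/", "Example news site"),
    ("http://b.example/", "http://x.example/", "news"),
    ("http://a.example/", "http://b.example/", "partner page"),
    ("http://b.example/", "http://y.example/about", "about us"),
]
crawled = {"http://a.example/", "http://b.example/"}

reps = build_anchor_representations(links, crawled)
results = rank_unarchived("news", reps)
```

In this toy setting, a page that receives more inbound links accumulates more terms, which mirrors the skew described in the abstract; a known-item query is then matched against these short aggregated descriptions.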

Additional Metadata
Keywords	Web Archives, Web Archiving, Web Crawlers, Anchor Text, Link evidence, Information Retrieval
Theme	Information (theme 2)
Project	Web Archives Retrieval Tools
Conference	IEEE/ACM Joint Conference on Digital Libraries
Organisation	Human-Centered Data Analytics
Citation	Kamps, J., Ben-David, A., Huurdeman, H. C., de Vries, A., & Samar, T. (2014). Finding pages on the unarchived Web.

Free Full Text (Final Version, 308kb)
