How Google Web Search copes with very similar documents

Mettrop, Wouter; Nieuwenhuysen, P.; Smulders, H.

2006

Presented at the Current Research in Information Sciences and Technologies: Multidisciplinary Approaches to Global Information Systems, Instituto Abierto del Conocimiento / Open Institute of Knowledge - Mérida, Spain

A significant portion of the computer files that carry documents, multimedia, programs, etc. on the Web are identical or very similar to other files on the Web. How do search engines cope with this? Do they perform some kind of “deduplication”? How should users take into account that Web search results are influenced by deduplication? We investigated the deduplication function of the Google Web search engine; the focus on Google Web Search is motivated by its high popularity. We developed a well-controlled experimental environment, with very similar test documents on various Web server computers in two countries and with automated scripts on a client computer, and we report the results of this investigation here. We found that users may miss documents due to deduplication, and that coping with this is not straightforward because of the following complications: we observed various types of deduplication, and the query result sets changed or fluctuated over time. Some of these changes occurred only once in a series of measurements, while others were continuous and persistent, and thus more significant. This work is also motivated by the following observation: deduplicating computer systems may treat variations in document content as small, which leads to hidden documents, while the same small variations can create quite different meanings for a human reader. This is probably the first investigation of deduplication in Web search from the user’s point of view.

Additional Metadata
Keywords	Google Web search, very similar documents, deduplication, fluctuations
Publisher	Instituto Abierto del Conocimiento / Open Institute of Knowledge
Conference	Current Research in Information Sciences and Technologies: Multidisciplinary Approaches to Global Information Systems
Organisation	Communication and Information
Citation	Mettrop, W., Nieuwenhuysen, P., & Smulders, H. (2006). How Google Web Search copes with very similar documents. Proceedings of Current Research in Information Sciences and Technologies: Multidisciplinary Approaches to Global Information Systems 2006, 59–63.

