A significant portion of the computer files that carry documents, multimedia, programs, etc. on the Web are identical or very similar to other files on the Web. How do search engines cope with this? Do they perform some kind of “deduplication”? How should users take into account that Web search results are influenced by “deduplication”? We have investigated the deduplication function of the Google Web search engine; the focus on Google Web Search is motivated by its high popularity. We developed a well-controlled experimental environment, with very similar test documents on various Web server computers in two countries and with automated scripts on a client computer. We report here the results of this investigation. We found that users may miss documents due to deduplication, and that coping with this is not straightforward because of the following complications: we observed various types of deduplication, and the query result sets changed over time. Some of these changes occurred only once in a series of measurements, while others were continuous and persistent, and thus more significant. This work is also motivated by the following: deduplicating computer systems can treat variations in the contents of documents as small, which leads to hidden documents, while the same small variations can create quite different meanings for a human reader. This is probably the first investigation of deduplication in Web search from the user’s point of view.

Instituto Abierto del Conocimiento / Open Institute of Knowledge
Current Research in Information Sciences and Technologies: Multidisciplinary Approaches to Global Information Systems
Communication and Information

Mettrop, W., Nieuwenhuysen, P., & Smulders, H. (2006). How Google Web Search copes with very similar documents. In Proceedings of Current research in information sciences and technologies: Multidisciplinary approaches to global information systems 2006 (pp. 59–63). Instituto Abierto del Conocimiento / Open Institute of Knowledge.