Google Patent Granted on Duplicate Content Detection in a Web Crawler System
Some patents from the search engines provide detailed looks at how those search engines might perform some of the core functions behind how they work. By “core functions,” I mean some of the basics such as crawling pages, indexing those pages, and displaying the results to searchers.
For example, last December I wrote a post titled Google Patent on Anchor Text and Different Crawling Rates, about a Google patent filed in 2003 which gave us a look at how the search engine crawled web pages, and collected the web addresses, or URLs, of pages that it came across.
The patent the post covered was Anchor tag indexing in a web crawler system, and it revealed how Google may determine how frequently to visit or revisit certain pages, including crawling some pages daily, and others on a real-time or near real-time basis, every few minutes in some cases. While there’s been a lot of online discussion in the past few months about real-time indexing of web pages, it’s interesting to note that the patent was originally filed in 2003.
That older patent also covered how a search engine crawler might handle temporary (302) redirects differently than permanent (301) redirects. The crawler might note and sometimes follow temporary redirects immediately (to decide which page to show in search results), while collecting the URLs associated with permanent redirects and putting them into a queue where they might be addressed later, up to a week or more later.
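To make the difference between the two redirect types concrete, here is a minimal Python sketch. The function name, the two containers, and the example URLs are hypothetical and are not taken from the patent; it only illustrates the idea of following temporary redirects right away and queuing permanent ones for later.

```python
from collections import deque

TEMPORARY, PERMANENT = 302, 301


def handle_redirect(status, source_url, target_url, crawl_now, crawl_later):
    """Route a redirect either to immediate crawling or to a deferred queue."""
    if status == TEMPORARY:
        # Temporary redirect: the target may be fetched right away.
        crawl_now.append(target_url)
    elif status == PERMANENT:
        # Permanent redirect: remember the pair and deal with it later.
        crawl_later.append((source_url, target_url))
    else:
        raise ValueError(f"not a redirect status handled here: {status}")


crawl_now, crawl_later = [], deque()
handle_redirect(302, "http://a.example/tmp", "http://a.example/page", crawl_now, crawl_later)
handle_redirect(301, "http://a.example/old", "http://a.example/new", crawl_now, crawl_later)
```

After these two calls, the temporary redirect's target is in the immediate list and the permanent redirect's source and target pair is waiting in the queue.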
It discussed how text surrounding links and anchor text found during the crawling of a page might be used as annotations for those links, and detailed some of the attributes that the search engine might be looking at when determining whether to associate that text with nearby links.
The patent also covered another very important topic – how to identify duplicate content that it might come across when crawling web pages, and how to identify the best address, or canonical URL for content. This is very important for a search engine – if the same content is found at multiple pages, a search engine may decide that it doesn’t want to spend time and resources indexing and displaying more than one source for the same content.
A related Google patent was granted this week that goes into more detail on how the search engine might handle duplicate content. It shares a couple of inventors with the patent on anchor text, and was filed on the same day. An early part of the description of this newly granted patent tells us why Google might look for duplicate content during the crawling of web pages:
Meanwhile, it is becoming more and more common that there are many duplicate copies of a document sharing identical content, even though they may be physically stored at different web servers.
On the one hand, these duplicate copies of document are welcome because they reduce the possibility that shutting down one web server will render the documents on the web server unavailable; but on the other hand, they can significantly increase the workload and lower the efficiency of a search engine on both its front end and back end, if not dealt with appropriately.
For example, on the back end of a search engine, if duplicate copies of a same document are treated as different documents not related with one another in terms of their content, this would cause the search engine to waste resources, such as disk space, memory, and/or network bandwidth, in order to process and manage the duplicate documents.
On the front end, retaining duplicate documents would cause the search engine to have to search through large indices and to use more processing power to process queries. Also, a user’s experience may suffer if diverse content that should be included in the search results is crowded out by duplicate documents.
For these reasons, it would be desirable to develop a system and method of detecting duplicate documents crawled by a search engine before the search engine makes any further effort to process these documents.
It would also be desirable to manage these duplicate documents in an efficient manner such that the search engine can efficiently furnish the most appropriate and reliable content when responding to a query whose result set includes any of these duplicate documents.
The patent is:
Duplicate document detection in a web crawler system
Invented by Daniel Dulitz, Alexandre A. Verstak, Sanjay Ghemawat, Jeffrey A. Dean
Assigned to Google
US Patent 7,627,613
Granted December 1, 2009
Filed: July 3, 2003
Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents.
Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
The description of the patent shares many details disclosed in the earlier granted patent on how Google might handle crawling and anchor text, describing for instance how some URLs for web pages are crawled on a periodic basis in a round-robin format over days or weeks or longer, some URLs are crawled daily, and other URLs are crawled multiple times during a day.
The duplicate document detection patent doesn’t focus too much upon anchor text, but instead provides more details on how a content filter from the search engine might work with a duplicate content server, or Dupserver as it’s called in the patent. The first step that the search engine may take after receiving a newly crawled page is to consult the Dupserver to see if it is a duplicate copy of another document, and if it is, to determine which version might be the canonical version.
This patent likely doesn’t cover all types of duplicate content that Google might find. Many pages that contain duplicate content may differ in a number of ways, such as using very different templates filled with boilerplate content (headers, footers, and sidebars) that changes from one URL to another. Other pages may contain some duplicated content and some unique content, or content duplicated from more than one source. The patent does define the kind of duplicate content that it covers, and also tells us how it might handle redirects and the duplicate content associated with them:
Duplicate documents are documents that have substantially identical content, and in some embodiments wholly identical content, but different document addresses.
Accordingly, there are at least three scenarios in which duplicate documents are encountered by a web crawler:
two pages, comprising any combination of regular web page(s) and temporary redirect page(s), are duplicate documents if they share the same page content, but have different URLs;
two temporary redirect pages are duplicate documents if they share the same target URL, but have different source URLs; and
a regular web page and a temporary redirect page are duplicate documents if the URL of the regular web page is the target URL of the temporary redirect page or the content of the regular web page is the same as that of the temporary redirect page.
A permanent redirect page is not directly involved in duplicate document detection because the crawlers are configured not to download the content of the target page. However, a regular web page or a temporary redirect page may contain a URL in its content, which happens to be the source URL of a permanent redirect page. Therefore, besides detecting duplicate documents, the Dupserver is also tasked with the job of replacing source URLs embedded in the content of a regular web page or a temporary redirect page with the corresponding target URLs of permanent redirects known to (i.e., stored in) the Dupserver.
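The three scenarios and the permanent redirect clean-up described in that passage can be sketched in Python. This is only an illustration under stated assumptions: the CrawledPage record, both function names, the URL regular expression, and the example addresses are hypothetical, and exact string equality stands in for whatever content comparison the Dupserver really uses.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CrawledPage:
    url: str
    content: Optional[str] = None          # body, if one was downloaded
    redirect_target: Optional[str] = None  # set only for temporary redirects


def are_duplicates(a: CrawledPage, b: CrawledPage) -> bool:
    """Apply the three quoted scenarios to a pair of crawled pages."""
    if a.url == b.url:
        return False  # duplicates must have different addresses
    # Scenario 1 (and the content half of scenario 3): same content.
    if a.content is not None and a.content == b.content:
        return True
    # Scenario 2: two temporary redirects that share a target URL.
    if a.redirect_target is not None and a.redirect_target == b.redirect_target:
        return True
    # Scenario 3: a regular page whose URL is the redirect's target.
    return (a.redirect_target == b.url and b.redirect_target is None) or (
        b.redirect_target == a.url and a.redirect_target is None
    )


# Assumed, simplified URL matcher; real link extraction would be more careful.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def rewrite_permanent_redirects(content: str, redirects: Dict[str, str]) -> str:
    """Swap known permanent-redirect source URLs for their target URLs."""
    return URL_PATTERN.sub(lambda m: redirects.get(m.group(0), m.group(0)), content)
```

For example, with a table mapping http://old.example/page to http://new.example/page, a link to the old address inside a page's content would be rewritten to the new one, while URLs not in the table are left alone. The sketch does not follow chains of redirects.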
The patent details some of the duplicate content detection methods that it might use, such as taking fingerprints of the content found on pages to match content from one page to another. It also describes how that information might be stored within content fingerprint tables, and how canonical URLs might be selected for content.
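A rough idea of how a content fingerprint table might work can be shown in a few lines of Python. The hash function (SHA-256), the whitespace normalization, and the class and method names here are assumptions made for illustration; the post does not say which fingerprinting function the patent uses.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(content: str) -> str:
    """Hash the page content after collapsing runs of whitespace (an assumed normalization)."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class FingerprintTable:
    """Hypothetical table mapping a content fingerprint to the URLs that share it."""

    def __init__(self):
        self._urls_by_fingerprint = defaultdict(set)

    def add(self, url: str, content: str) -> set:
        """Record a crawled page and return other URLs already seen with the same content."""
        fingerprint = content_fingerprint(content)
        known = set(self._urls_by_fingerprint[fingerprint]) - {url}
        self._urls_by_fingerprint[fingerprint].add(url)
        return known


table = FingerprintTable()
table.add("http://a.example/", "Hello   world")
duplicates = table.add("http://b.example/", "Hello world")
```

Here the second call returns the first URL, because both pages reduce to the same fingerprint. Looking up a hash is much cheaper than comparing a new page against every stored page, which is the appeal of fingerprinting for a crawler handling very large numbers of documents.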
A very quick reading of the patent might lead someone to think that the URL with the highest PageRank might be the version chosen as the canonical URL for that content, but the patent tells us that sometimes “a canonical page of an equivalence class is not necessarily the document that has the highest score (e.g., the highest page rank or other query-independent metric).”
We are given one example of this. Google might log all of the pages that it finds with duplicate content, and when it comes across a new duplicate, it might look at that URL’s PageRank (or some other query-independent ranking) to see whether it is higher than that of the current canonical URL by some significant margin before naming the new URL as the canonical URL. It’s possible that other factors are taken into consideration as well, though the patent doesn’t explicitly name them.
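The margin idea can be sketched in Python as follows. The 20% margin, the class and function names, and the example scores are all hypothetical; the post says only that the new URL's score must be higher by some significant margin, without giving a number.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EquivalenceClass:
    """Hypothetical record of one set of duplicate URLs and its canonical URL."""
    canonical_url: str
    scores: Dict[str, float] = field(default_factory=dict)


def add_duplicate(group: EquivalenceClass, url: str, score: float,
                  margin: float = 0.2) -> str:
    """Add a duplicate; replace the canonical only if the newcomer clearly beats it."""
    group.scores[url] = score
    current_score = group.scores[group.canonical_url]
    # Assumed rule: the new score must exceed the current one by the margin.
    if score > current_score * (1 + margin):
        group.canonical_url = url
    return group.canonical_url


group = EquivalenceClass("http://a.example/", {"http://a.example/": 5.0})
first = add_duplicate(group, "http://b.example/", 5.5)
second = add_duplicate(group, "http://c.example/", 6.5)
```

In this run the first newcomer scores higher than the canonical URL but not by enough, so the canonical stays put even though it no longer has the highest score in its class. The second newcomer clears the margin and takes over. A rule like this would keep the canonical URL from flipping back and forth over small score changes.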
Even though this patent on duplicate content and the related patent on anchor text were filed more than six years ago, they are worth spending some time with because of the way that they lay out in detail the ways that Google might crawl web pages, and collect and process information from those pages. If you are interested in how search engines work, these two documents provide some interesting insights into issues that arise during the crawling of web pages, such as:
- How Google may handle temporary and permanent redirects,
- How Google determines different crawling rates for pages,
- How Google may decide which URL with duplicate content might be considered the canonical URL,
- How text around links might be chosen to act as annotations for those links,
- How anchor text pointed to duplicate documents might be associated with the canonical version of the document.
I’ve written a number of posts on duplicate content before:
- Google to Help Content Creators Find Unauthorized Duplicated Text, Images, Audio, and Video?
- Same-Site Duplicate Pages at Different URLs
- New Google Process for Detecting Near Duplicate Content
- Google Omits Needless Words (On Your Pages?)
- Microsoft Explains Duplicate Content Results Filtering
- Solving Different URLs with Similar Text (DUST)
- Duplicate Content Issues and Search Engines