One of the technical issues that can cause problems when a search engine crawls a site to index its pages arises when the same content appears at more than one URL (Uniform Resource Locator, or web address) on that site.
Unfortunately, this problem happens more frequently than it should.
A new patent application from Yahoo explores how they might handle dynamic URLs to avoid this problem. What is nice about the patent application is that it identifies a number of the problems that might arise because of duplicate content at different web addresses on the same site, and some approaches that they might use to solve the problem.
While search engines like Yahoo can resolve some of the issues around duplicate content, it's often in the best interest of site owners not to rely upon search engines to fix this problem on their own.
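One step site owners can take is to normalize dynamic URLs themselves, so that each page is reachable at a single canonical address. Here is a minimal Python sketch of that idea; the list of content-neutral tracking parameters is made up for illustration:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical parameters that change the URL but not the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref"}

def canonicalize(url):
    """Reduce a dynamic URL to one canonical form by lowercasing the
    host, dropping content-neutral parameters, sorting the rest, and
    trimming trailing slashes from the path."""
    parts = urlparse(url)
    params = [(k, v) for k, v in parse_qsl(parts.query)
              if k.lower() not in IGNORED_PARAMS]
    query = urlencode(sorted(params))
    return urlunparse((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", "", query, ""))
```

With a mapping like this, several session-tagged variants of the same product page all collapse to one address before they are ever linked to.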
Avoiding the Crawling of Duplicate Pages
The order in which pages appear in the results of a search at a search engine may be influenced by the number of pages that link to a page, and by the rankings of the pages linking to it.
When a site is linked to by a popular and trusted domain, that link might provide more value (and a higher ranking) than a link from a site that is less popular and trusted.
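As a toy illustration of that idea (not any search engine's actual formula), a page's score might grow with the number of inbound links, each weighted by the linking page's own rank and trust. The domains and 0-1 scores below are made up:

```python
def link_score(inbound):
    """Sum each inbound link's contribution, weighted by the linking
    page's hypothetical rank and trust scores (both 0-1 here)."""
    return sum(rank * trust for rank, trust in inbound.values())

# One link from a popular, trusted domain can outweigh several weaker ones.
popular = {"bigsite.example": (0.9, 0.8)}
obscure = {"tiny.example": (0.1, 0.2), "other.example": (0.2, 0.3)}
```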
Ages of Linking Domains
A new patent application from Microsoft adds another twist, by also ranking domains based upon the ages of domains which link to those domains.
You go to a search engine, and type some query terms in the search box. A list of results is returned by the search engine, and you visit a link to one of the results that appears.
Looking through the page, you may not see your query terms on the page itself. Why would the search engine return that result to you?
Determining Relevance from Anchor Text
One reason might be that the search engine is looking at the anchor text in links pointing to the page to determine that the page is relevant for your query terms.
This can be very helpful when a page doesn't have much text on it, such as a video or an audio file, or where the amount of text is very limited or non-existent.
A patent application from Microsoft explores the use of anchor text to define the context of a page and terms that it might rank for that don’t appear upon that page.
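A toy sketch of the core idea: build an index that maps each anchor-text term to the pages those links point to, so a page can rank for terms that never appear on the page itself. The links below are made-up examples standing in for links discovered during a crawl:

```python
from collections import defaultdict

def build_anchor_index(links):
    """Map each term in a link's anchor text to the target page,
    so targets can be found by terms they don't contain.
    `links` is a list of (anchor_text, target_url) pairs."""
    index = defaultdict(set)
    for anchor, target in links:
        for term in anchor.lower().split():
            index[term].add(target)
    return index

# A video page with no on-page text still becomes findable by "cat".
links = [("funny cat video", "http://example.com/v/123"),
         ("cat clip", "http://example.com/v/123")]
index = build_anchor_index(links)
```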
If a search engine could understand the layout of a web page and identify the most important part of a web page, it could pay more attention to that section of the page when indexing content from the page.
It could give links found within that section of the page more weight than links found in other sections, and it could give information within that area more weight when determining what the page is about.
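A toy sketch of that weighting: discount a link's value by the importance of the page section it appears in. The section names and weights here are invented for illustration; real segmentation systems infer sections from page layout:

```python
# Hypothetical importance weights per page section.
SECTION_WEIGHTS = {"main": 1.0, "sidebar": 0.3, "footer": 0.1}

def weighted_link_value(link_section, base_value=1.0):
    """Scale a link's base value by the weight of the section it
    appears in; unknown sections get a neutral middle weight."""
    return base_value * SECTION_WEIGHTS.get(link_section, 0.5)
```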
We’ve seen the idea of breaking pages up into parts from a couple of the major commercial search engines:
On one level, a search engine indexes a web site by crawling that site one URL at a time, collecting information about what it finds at that address, and indexing the information found so that it can be served to visitors later.
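That one-URL-at-a-time process can be sketched as a simple queue-driven loop. This is a toy, assuming a hypothetical `fetch_links(url)` helper that stands in for downloading a page and extracting the links it contains:

```python
from collections import deque

def crawl(seed, fetch_links, limit=100):
    """Visit URLs breadth-first from a seed, recording each address
    once and queueing any newly discovered links."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)  # here a real crawler would index the page
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```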
But, the process can be more complicated than that.
For instance, a search engine may try to understand more about specific sites by collecting information on a site-wide basis.
Site-Wide Information about Websites
Information that a search engine might look at about a website on a site-wide level might include:
A new patent application on near duplicate content from Google explores using a combination of document similarity techniques to keep searchers from finding redundant content in search results.
The Web makes it easy for words to be copied and spread from one page to another, and the same content may be found at more than one web site, whether or not its author intended it to be.
How do search engines cope with finding the same words, the same content, at different places?
How might they recognize that the pages they would want to show searchers in response to a query contain the same information, and show only one version, so that searchers don't get buried in redundant information?
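One common family of near-duplicate techniques compares overlapping word sequences, or "shingles", between documents. The patent combines several similarity techniques; the sketch below shows only the basic shingle-comparison idea, with made-up example sentences:

```python
def shingles(text, k=3):
    """Break text into the set of overlapping k-word 'shingles'."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: 1.0 means identical,
    0.0 means no shingles in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two near-duplicate sentences and one unrelated sentence.
a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
c = shingles("an entirely unrelated sentence about search engines")
```

Pages whose similarity exceeds some threshold could be treated as redundant, with only one version shown to searchers.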
A lot of web pages and documents reuse the same text in sidebars and footers, such as navigation links and copyright notices.
Computer programmers will sometimes use the term “boilerplate” code to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what a contract is about.
It might be a good step for a search engine to ignore boilerplate text when it indexes pages, or uses the content of pages to create query suggestions for someone using desktop personalized search. Ignoring boilerplate in those same documents could also be helpful when using them to rerank search results in personalized search.
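One simple way to spot boilerplate (a sketch, not necessarily how any search engine does it) is to flag text blocks that recur across most pages of a site. The pages below are toy stand-ins for parsed HTML:

```python
from collections import Counter

def find_boilerplate(pages, threshold=0.8):
    """Return the text blocks that appear on at least `threshold`
    of the given pages. Each page is a list of text blocks."""
    counts = Counter(block for page in pages for block in set(page))
    return {block for block, n in counts.items()
            if n / len(pages) >= threshold}

pages = [["Copyright 2008 Example Corp", "article one text"],
         ["Copyright 2008 Example Corp", "article two text"],
         ["Copyright 2008 Example Corp", "article three text"]]
```

Blocks the function flags could then be skipped when indexing or when generating query suggestions.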
New York Times Boilerplate
Search engines are getting smarter about the phrases that they see and understand online, and Yahoo recently published a patent application that describes a number of the ways that they learn about and understand the use of phrases in documents on the Web.
Exploring how Yahoo might use phrases to rerank search results can show how they try to understand data from published documents on the Web, and from log files that collect the queries people use when searching for information about different concepts.
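As a toy sketch of the general idea (not Yahoo's actual method), a phrase learner might count word n-grams across crawled documents and keep the ones that recur often enough to look like meaningful phrases:

```python
from collections import Counter

def frequent_phrases(docs, n=2, min_count=2):
    """Count word n-grams across documents and keep those seen at
    least `min_count` times, as candidate phrases."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        counts.update(" ".join(words[i:i + n])
                      for i in range(len(words) - n + 1))
    return {phrase for phrase, c in counts.items() if c >= min_count}
```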
From Keyword Matching to Phrase-Based Indexing
A page's placement in search results for certain queries can depend on ranking criteria and algorithms that match documents against the keywords in those queries, based on things like: