The order in which pages appear in a search engine's results may be influenced by the number of pages that link to each page, and by the rankings of those linking pages.
When a site is linked to by a popular and trusted domain, that link might provide more value (and a higher ranking) than a link from a site that is less popular and trusted.
Ages of Linking Domains
A new patent application from Microsoft adds another twist, by also ranking domains based upon the ages of domains which link to those domains.
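As a rough illustration of the idea, here is a hypothetical sketch that weights inbound links by the age of the linking domain, so that links from long-established domains count for more. The ages and the logarithmic weighting are assumptions chosen for illustration, not details taken from the patent filing.

```python
import math

def link_score(linking_domain_ages_years):
    """Sum a weight for each inbound link, where older linking
    domains contribute larger weights (log1p dampens the growth)."""
    return sum(math.log1p(age) for age in linking_domain_ages_years)

# A link from a 10-year-old domain outweighs one from a 6-month-old domain.
print(link_score([10]) > link_score([0.5]))
```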
You go to a search engine and type some query terms into the search box. The search engine returns a list of results, and you follow a link to one of them.
Looking through the page, you may not see your query terms on the page itself. Why would the search engine return that result to you?
Determining Relevance from Anchor Text
One reason might be that the search engine is looking at the anchor text in links pointing to the page to determine that the page is relevant for your query terms.
This can be very helpful when a page holds little text of its own, such as a page for a video or an audio file, where the text may be very limited or non-existent.
A patent application from Microsoft explores the use of anchor text to define the context of a page, including terms the page might rank for even though they don't appear on the page itself.
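The general idea can be sketched as follows: collect the anchor text of links pointing at a page, and index those terms against the page so it can be retrieved for words that never appear in its own body. This is only an illustration of the concept; the data shapes and names here are assumptions, not the patent's actual method.

```python
from collections import defaultdict

def build_anchor_index(links):
    """links: iterable of (source_url, target_url, anchor_text) tuples.
    Returns a term -> {target_url: count} mapping, where each inbound
    link 'votes' its anchor terms toward its target page."""
    index = defaultdict(lambda: defaultdict(int))
    for source, target, anchor in links:
        for term in anchor.lower().split():
            index[term][target] += 1
    return index

links = [
    ("a.com", "example.com/video", "funny cat video"),
    ("b.com", "example.com/video", "best cat clip"),
]
index = build_anchor_index(links)
# The mostly text-free video page can now be found for "cat",
# a word that appears only in the anchors pointing at it.
print(index["cat"]["example.com/video"])  # -> 2
```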
If a search engine could understand the layout of a web page and identify the most important part of a web page, it could pay more attention to that section of the page when indexing content from the page.
It could give links found within that section more weight than links found in other sections of the page, and it could give information within that area more weight when determining what the page is about.
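A minimal sketch of that weighting idea, assuming a page has already been segmented into named sections (the section names and weight values below are invented for illustration):

```python
# Assumed section weights: the main content block counts most,
# a sidebar less, a footer least.
SECTION_WEIGHTS = {"main": 1.0, "sidebar": 0.3, "footer": 0.1}

def weighted_term_counts(sections):
    """sections: list of (section_name, text) pairs for one page.
    Terms found in more important sections accumulate more weight."""
    counts = {}
    for name, text in sections:
        weight = SECTION_WEIGHTS.get(name, 0.5)
        for term in text.lower().split():
            counts[term] = counts.get(term, 0.0) + weight
    return counts

page = [("main", "search engine patents"), ("footer", "copyright patents")]
# "patents" appears in both main (1.0) and footer (0.1) sections.
print(weighted_term_counts(page)["patents"])
```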
We’ve seen the idea of breaking pages up into parts from a couple of the major commercial search engines:
On one level, a search engine indexes a web site by crawling that site one URL at a time, collecting information about what it finds at that address, and indexing the information found so that it can be served to visitors later.
But, the process can be more complicated than that.
For instance, a search engine may try to understand more about specific sites by collecting information on a site-wide basis.
Site-Wide Information about Web Sites
Information that a search engine might look at about a web site on a site-wide level might include:
A new patent application on near duplicate content from Google explores using a combination of document similarity techniques to keep searchers from finding redundant content in search results.
The Web makes it easy for words to be copied and spread from one page to another, and the same content may be found on more than one web site, whether or not its author intended it to be.
How do search engines cope with finding the same words, the same content, at different places?
How might they recognize that the pages they would want to show searchers in response to a query contain the same information, and show only one version, so that searchers don't get buried in redundant results?
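One common family of techniques for this is w-shingling with Jaccard similarity: break each document into overlapping word sequences and measure how much the two sets overlap. The Google filing combines several similarity measures, so treat this as a sketch of the general idea only, with an assumed threshold.

```python
def shingles(text, w=3):
    """Break text into overlapping w-word sequences ('shingles')."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Fraction of shingles the two documents share (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over a lazy dog"
sim = jaccard(doc1, doc2)
# Above some threshold (say 0.5, an assumed value), a search engine
# could treat the pair as near duplicates and show only one copy.
print(round(sim, 2))  # -> 0.4
```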
A lot of web pages and documents reuse the same text in sidebars and in footers at the bottoms of pages, such as copyright notices and navigation menus.
Computer programmers will sometimes use the term “boilerplate” code to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what a contract is about.
It might be a good step for a search engine to ignore boilerplate text when it indexes pages, or when it uses the content of pages to create query suggestions in personalized desktop search. Ignoring boilerplate in those documents could also be helpful when using them to rerank search results in personalized search.
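One simple way to spot boilerplate, sketched here as an assumption rather than any particular patent's method, is to treat a text block that repeats across many pages of the same site as boilerplate and drop it before indexing. The repeat threshold is an invented value.

```python
from collections import Counter

def strip_boilerplate(pages, min_repeat=2):
    """pages: list of pages, each a list of text blocks.
    Blocks appearing on min_repeat or more pages are dropped."""
    freq = Counter(block for page in pages for block in set(page))
    return [[b for b in page if freq[b] < min_repeat] for page in pages]

site = [
    ["Unique article one", "Copyright Example Inc."],
    ["Unique article two", "Copyright Example Inc."],
]
# The repeated copyright block is removed; unique article text survives.
print(strip_boilerplate(site))
```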
New York Times Boilerplate
Search engines are getting smarter about the phrases that they see and understand online, and Yahoo recently published a patent application that describes a number of the ways that they learn about and understand the use of phrases in documents on the Web.
Exploring how Yahoo might use phrases to rerank search results can show how they try to understand data from published documents on the Web, and from log files that record the queries people use when searching for information about different concepts.
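As a hedged illustration of learning phrases from query logs: word pairs that recur across many queries can be treated as candidate phrases. The frequency cutoff here is an assumed value, not something taken from Yahoo's filing.

```python
from collections import Counter

def candidate_phrases(queries, min_count=2):
    """Count adjacent word pairs across queries; pairs seen at least
    min_count times become candidate phrases."""
    bigrams = Counter()
    for q in queries:
        words = q.lower().split()
        bigrams.update(zip(words, words[1:]))
    return {" ".join(b) for b, c in bigrams.items() if c >= min_count}

log = ["new york hotels", "new york weather", "cheap hotels"]
# "new york" recurs across queries, so it is learned as a phrase.
print(candidate_phrases(log))  # -> {'new york'}
```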
From Keyword Matching to Phrase-Based Indexing
A page’s placement in search results for certain queries can depend on ranking criteria and algorithms that match the keywords in search queries against documents, looking at things like:
Many patent filings and papers from the search engines discuss ways that they might shuffle around search results to try to provide more relevant responses to people’s searches.
Imagine a search engine changing around the results that you see, not based upon the time that a page is published, but rather on some estimate of the importance of a page to you, and how that importance might vary with time and your personal calendar. Sounds like a tricky proposition, doesn’t it?
Microsoft adds an element of time to search results by introducing a search system that pays more attention to what is happening on your computer and within your company intranet.
This more complex search system could be used to index both search results and information found on a person’s desktop and local network. It would pay more attention to the context of searches, and add personalization by building a user profile to distinguish how important different information might be to each individual searcher.
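The general shape of such a system might look like the sketch below: blend a base relevance score with a user-specific, time-sensitive importance score built from a profile and calendar. The field names and the 0.7/0.3 blend are assumptions for illustration, not details from Microsoft's filing.

```python
import datetime

def personalized_score(result, profile, now):
    """Blend base relevance with a profile-driven, time-aware boost."""
    base = result["relevance"]
    topic_boost = profile["topic_weights"].get(result["topic"], 0.0)
    # A document tied to an imminent calendar event matters more right now.
    days_until = (result.get("event_date", now) - now).days
    urgency = 1.0 / (1 + max(days_until, 0))
    return 0.7 * base + 0.3 * topic_boost * urgency

def rerank(results, profile, now):
    return sorted(results, key=lambda r: personalized_score(r, profile, now),
                  reverse=True)

now = datetime.date(2009, 6, 1)
profile = {"topic_weights": {"budget-review": 1.0, "cafeteria-menu": 0.1}}
results = [
    {"relevance": 0.6, "topic": "cafeteria-menu"},
    {"relevance": 0.5, "topic": "budget-review",
     "event_date": datetime.date(2009, 6, 2)},
]
# The lower-relevance budget document wins because the user's profile
# values it and its calendar event is tomorrow.
print([r["topic"] for r in rerank(results, profile, now)])
```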