Google’s Listings of Internal Site Links for Top Search Results
Sometimes when you see the top search result in Google, under it is a list of links to other pages on the same site. Ever wondered how and why that happens?
For example, a search for “wordpress” shows the wordpress.org page at the top of the search engine results page, with a link to the site and a snippet of text from the WordPress home page. Under that are links to other pages on that site, including Download, Hosting, Extend, and Blog, followed by “More results from wordpress.org.”
Here’s an image, from the US Patent and Trademark Office, of a result for the search “hp” which shows a link to the HP web site at the top of the results, and additional links to pages from that site:
I’ve had a few people ask me how Google does that, and while I could provide some ideas, I couldn’t provide more than that. Google published a patent application this week which gives a little more insight into the process.
Systems and methods for providing search results
Invented by Luis Castro, Walt Lin, and Benedict Gomes
US Patent Application 20060287985
Published December 21, 2006
Filed June 20, 2005
A method includes generating search results in response to a user query, where at least one of the search results includes a group of links. The group of links may represent links to web pages within a same web site and may be identified based on at least one factor associated with the links. The method may also include providing the search results to the user.
The questions that I had as I first started looking at this document were:
- How are the pages included in the list chosen?
- Why show lists of links for some web sites, and not others?
- Is it always only the first search result that will show additional links?
- What can I do, if anything, that might make it easier for the search engine to add a list of links for a site?
Which Pages are Listed?
How does the search engine choose which pages to show in these sitelinks? The patent application tells us that those pages are the ones that searchers might most likely want to access.
This could be based upon a log file analysis which tells the search engine:
- How many times the page has been accessed.
- How long visitors stayed on the page.
- Whether a visitor scrolled down the page, or clicked on a link without scrolling down.
- Information retrieval scores for the page, along with an indication of how good a match the page may be for the query that was used in the search.
- The likelihood that someone might make a purchase on that page.
- Other information that might indicate that someone would be interested in the page.
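To make the list above a little more concrete, here is a rough sketch in Python of how signals like these might be folded into a single per-page score. The patent application doesn't give a formula, so the weighted sum, the weights, the field names, and the 300-second cap on time spent are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page signals, loosely following the list above."""
    visits: int               # how many times the page was accessed
    avg_dwell_seconds: float  # how long visitors stayed on the page
    scroll_rate: float        # fraction of visits that scrolled (0..1)
    ir_score: float           # information retrieval score for the query (0..1)
    purchase_rate: float      # likelihood of a purchase on the page (0..1)


# Assumed weights; nothing in the patent application specifies these.
WEIGHTS = {
    "visits": 0.3,
    "dwell": 0.2,
    "scroll": 0.1,
    "ir": 0.3,
    "purchase": 0.1,
}


def quality_score(s: PageSignals, max_visits: int, max_dwell: float = 300.0) -> float:
    """Combine the signals into one number between 0 and 1."""
    # Normalize the unbounded signals so every term lies in 0..1.
    visits_norm = min(s.visits / max_visits, 1.0) if max_visits else 0.0
    dwell_norm = min(s.avg_dwell_seconds / max_dwell, 1.0)
    return (
        WEIGHTS["visits"] * visits_norm
        + WEIGHTS["dwell"] * dwell_norm
        + WEIGHTS["scroll"] * s.scroll_rate
        + WEIGHTS["ir"] * s.ir_score
        + WEIGHTS["purchase"] * s.purchase_rate
    )
```

A page that is visited often, holds attention, and matches the query well would score higher than a rarely visited page under this scheme, which is the general idea the patent application describes.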
How are Pages Chosen to Have Lists?
One possibility is that the pages have enough traffic so that Google can make some meaningful choices regarding which additional pages to show for a site from a log file analysis.
That log file information would be used to create a map of the pages of a site, and to maintain quality score information about those pages, based on signals like the ones I’ve listed above. Other information could also be used, such as the number of links pointing to those pages from other web pages (the patent application doesn’t explicitly distinguish here between internal site links and external ones).
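As a rough illustration of that idea, the Python sketch below aggregates raw log entries into a per-page map and then checks whether a site has enough traffic to work with. The log entry format (a URL plus seconds spent on the page), the function names, and the 1,000-visit threshold are hypothetical; the patent application doesn't say how much traffic is enough.

```python
from collections import defaultdict

# Assumed cutoff for "enough traffic"; not a figure from the patent application.
MIN_SITE_VISITS = 1000


def build_site_map(log_entries):
    """Aggregate (url, dwell_seconds) log entries into per-page statistics."""
    totals = defaultdict(lambda: {"visits": 0, "dwell": 0.0})
    for url, dwell_seconds in log_entries:
        totals[url]["visits"] += 1
        totals[url]["dwell"] += dwell_seconds
    # Convert running totals into visit counts and average time on page.
    return {
        url: {"visits": t["visits"], "avg_dwell": t["dwell"] / t["visits"]}
        for url, t in totals.items()
    }


def has_enough_traffic(site_map, min_visits=MIN_SITE_VISITS):
    """Return True if the site's total logged visits reach the threshold."""
    return sum(page["visits"] for page in site_map.values()) >= min_visits
```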
What Determines the Ordering of Those Additional Links?
A map, or list, of pages from the site would be created which includes a quality measure associated with those pages. The quality measure may represent:
- Popularity associated with a web page,
- Likelihood that the information on a web page will be accessed by a user,
- Likelihood that the information will be useful to a user submitting a search query, or
- Other factors associated with the quality of a web page.
The order of those pages in the list would be determined by the quality scores for the pages.
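In code, that ordering step could be as simple as a sort. The sketch below is in Python and assumes the quality scores have already been computed; the function name, the example scores, and the limit of four links are my own assumptions, not details from the patent application.

```python
def pick_sitelinks(quality_by_url, max_links=4):
    """Return the URLs with the highest quality scores, best first."""
    ranked = sorted(quality_by_url.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _score in ranked[:max_links]]


# Hypothetical scores for pages on a single site.
scores = {
    "/blog": 0.5,
    "/about": 0.1,
    "/download": 0.9,
    "/extend": 0.6,
    "/hosting": 0.7,
}
print(pick_sitelinks(scores))
```

With these made-up scores, the lowest-scoring page is left out and the remaining four are listed from highest score to lowest.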
Where Would the Search Engine Get the Log File Information?
The patent application describes how this mapping of pages and assignment of quality scores works, and its example uses information gathered from search engine and toolbar usage.
In the conclusion of the patent application, it notes that an alternative approach might be to allow the site owner to identify what they believe to be the most important pages of the site.
It also notes that it might be possible, based upon different users’ past search histories, to provide different lists of links to different searchers.
It’s interesting, but not terribly surprising, that so much of the generation of these additional links is based upon information about user behavior. The patent application does note that these additional links are shown only for the top result, so to have lists like this appear, it’s helpful to rank pretty well.
Beyond being number one, the first step in getting Google to show additional links from your site may be to get lots of traffic to your pages. It’s hard to tell how much is enough, but it has to be enough for Google to conclude that listing those pages will be a good experience for searchers.
The second may be to have a core group of pages that tend to get visited more than other pages of the site. The only reason to list pages like this is to make it easier for searchers to find what they may be looking for.
When someone visits the wordpress site, there is a small identifiable core group of things that they may want to do once there. When they visit the front page of Wikipedia or Digg, they may be interested in any number of pages. When you do a search for wordpress, you’ll see a list of links to additional pages under the wordpress site. When you search for Digg or Wikipedia, you don’t. (Both do have second indented results, which are relevant for the search term, with a link for “more results from” those sites – but that’s not the same thing.)