How well do search engines understand the linking structure of a web site? Do they have ways to organize and classify individual links and blocks of links that they see on the pages of a site?
Do they treat links and collections of links that they find on more than one page of a site differently than links and collections of links that appear on only one page? If they find more than one group of links on a page that contain many of the same links, such as at the top and bottom of the page, how might they treat those links?
I came across a patent filing from Microsoft from last summer that explored many of these topics, as well as others. It hadn’t drawn much attention, so I decided to take a closer look at it here.
Segmentation and Link Blocks
One of the things that I like to do for sites that I work upon is to create an SEO content inventory.
I find it helpful to have information all in one place about the content that might appear on different pages of a site, and it can be very useful as a planning tool. The idea isn't new, and usability.gov offers a nice description on their pages of why it can be helpful to conduct a content inventory from a design standpoint.
Jeffrey Veen also published a post a number of years ago about using a tool like this when he works on information architecture and design issues for clients, in Doing a Content Inventory (Or, A Mind-Numbingly Detailed Odyssey Through Your Web Site).
One of the differences between the approach that usability.gov and Jeffrey Veen use, and the one that I like to use is that I include more details involving search engine optimization. For instance, in my inventory, there’s a space for the “present” page title, meta description, and meta keywords, and “future” title, meta description, and meta keywords.
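As a rough sketch of what such an inventory can look like in practice (the column names here are my own, not a fixed standard), each page gets a row with "present" and "future" values for the title, meta description, and meta keywords, written out to a spreadsheet-friendly CSV file:

```python
import csv

# Columns for a simple SEO content inventory. The field names are
# illustrative; add whatever else helps your planning (headings,
# target keywords, notes, and so on).
FIELDS = [
    "url",
    "present_title", "present_meta_description", "present_meta_keywords",
    "future_title", "future_meta_description", "future_meta_keywords",
]

def write_inventory(rows, path="content_inventory.csv"):
    """Write inventory rows (dicts keyed by FIELDS) to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

# A made-up example row for a hypothetical site:
pages = [{
    "url": "https://example.com/",
    "present_title": "Home",
    "present_meta_description": "",
    "present_meta_keywords": "",
    "future_title": "Example Co. - Handmade Widgets",
    "future_meta_description": "Handmade widgets, shipped worldwide.",
    "future_meta_keywords": "widgets, handmade",
}]
write_inventory(pages)
```

Keeping the "present" and "future" values side by side makes it easy to see at a glance which pages still need rewritten titles or descriptions.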
Can looking at how many times rare words appear in a search engine's index give us an idea of the size of the database for that search engine?
About a week ago, I wrote about some of the most common English words in the indexes for Google, Yahoo, Bing, Ask, and Google Caffeine. I took a look at 50 words that are amongst the most frequently appearing words in English, and estimates from those search engines about the number of times that those words showed up.
Comparing the number of results between the different search engines for those common words really didn’t tell us anything about the relative sizes of the indexes for those search engines for a number of reasons.
One is that the numbers of results shown are rough estimates only. It's also possible that the way estimates are calculated differs considerably from one search engine to another. Some of the pages listed among those results are likely duplicate pages at different URLs, or may have contained misspellings of the words. Some of the words may also be abbreviations or acronyms (such as "it" being an abbreviation for information technology).
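The basic idea behind the question, though, is simple arithmetic: if a word is known to appear in some fraction of documents in a reference corpus, and a search engine reports a result count for that word, dividing one by the other gives a very rough estimate of index size. The figures below are entirely made up for the sake of the sketch, and all of the caveats above still apply:

```python
# Back-of-the-envelope index size estimate. All numbers here are
# invented for illustration; result counts are rough estimates, and a
# word's document frequency on the open Web is itself only a guess.

def estimate_index_size(reported_hits, doc_frequency):
    """doc_frequency: fraction of documents containing the word (0-1)."""
    return reported_hits / doc_frequency

# Suppose a rare word appears in roughly 0.002% of pages in a sample
# crawl, and an engine reports 4,000,000 results for it:
estimate = estimate_index_size(4_000_000, 0.00002)
print(f"{estimate:,.0f}")  # 200,000,000,000
```

Averaging such estimates over many rare words would smooth out some of the noise, but the duplicate-page and estimation problems noted above would still distort the answer.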
Sometimes Google, Yahoo, and Bing will show additional links for a search result under the description for that result. These are often referred to as Site Links, Sitelinks, or Quick Links by the search engines. An example is the sitelinks that Google shows in a search result for the WordPress site when someone searches at Google for wordpress:
None of the search engines presently allow site owners or webmasters to choose the links that show up as sitelinks or quicklinks in those search results. Google provided some hints as to how sitelinks might be chosen in the description of the patent on Google Sitelinks. Yahoo also gave us some information about the sources of information that they use when they include quicklinks for a search result in a white paper on Yahoo quicklinks. Microsoft also released a whitepaper on how they might include links such as sitelinks to give searchers a chance to find what they call final destination pages.
Some of the choices for sitelinks and the text for those links that Google chooses aren't really helpful for searchers or ideal for webmasters, such as the site link at the bottom right in an "SEO by the Sea" search result as seen in the next image:
Are large news agencies, with a wide scope of international coverage on multiple topics, with large numbers of reporters, and finely edited articles better sources of news than smaller and more local papers, or narrow niche blogs?
A patent on ranking articles in Google News was granted this week that was originally filed in 2003, and it discusses a number of ranking factors that Google might use to present news articles based upon the "quality" of the news sources involved.
What is very interesting about it is that it provides some insight into the assumptions behind those ranking factors. I suspect that Google may have changed their stance on some of the assumptions behind those factors since then.
The patent doesn’t include a full range of signals that Google probably considers in ranking news stories, such as the freshness of the news (as noted in Google’s patent filing on Universal Search), or whether or not a certain source is the original.
Just which words show up most frequently on the Web? I’m not sure that question can be answered, but it’s something I’ve wondered for a while.
With a beta version of Google's future update, code-named Caffeine, recently released for people to experiment with, I thought I would do a few comparisons.
I found a few lists of the most common words in the English language, and came up with a top 50 to see how frequently those were estimated to show up in Google, Yahoo, Bing, Ask, and Google Caffeine. Those are shown in a table and a chart below.
I'm not sure how informative this might be, even after looking at it. It's not a very scientific test, either. There are a few reasons for that:
With billions of pages on the Web, trying to find the right words to use when you want to search for something can often be hard, especially when you’re looking for information on a topic that you don’t know too much about.
As a designer or site owner, coming up with the words on your pages that searchers expect to see, and may use to search for what you have to offer can also be difficult.
Search engines often act as a middleman between searcher and site owner, helping to bring people to pages that may help them satisfy some kind of informational need, or accomplish some task. Again, that can be difficult: a search engine has to get some idea of what people are looking for from two or three words, and then show a list of web pages that might be helpful to those searchers.
Search engines will often offer search suggestions based upon the words that a searcher types into a search box, to try to make it easier for those searchers. Knowing more about how search engines find those suggestions may be helpful to searchers and to site owners.
You might see these query suggestions above or below the search results that you are shown when you search, or appearing in a dropdown under the search box as you type. They often are displayed with terms like the following showing up in front of them:
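The exact mechanisms the engines use are proprietary, but the simplest version of the idea is to draw suggestions from a popularity-weighted log of past queries and match on the prefix the searcher has typed so far. Here is a minimal sketch under that assumption, with invented queries and counts:

```python
import bisect

# A minimal prefix-lookup sketch for query suggestions, assuming the
# suggestions come from a popularity-weighted query log (real engines
# blend many more signals). Keeping the queries sorted means every
# prefix maps to one contiguous range of the list.
QUERY_LOG = {          # query -> times seen (made-up counts)
    "wordpress": 900,
    "wordpress themes": 400,
    "wordpress plugins": 350,
    "word games": 120,
}
SORTED_QUERIES = sorted(QUERY_LOG)

def suggest(prefix, k=3):
    """Return up to k logged queries starting with prefix, most popular first."""
    lo = bisect.bisect_left(SORTED_QUERIES, prefix)
    hi = bisect.bisect_left(SORTED_QUERIES, prefix + "\uffff")
    matches = SORTED_QUERIES[lo:hi]
    return sorted(matches, key=QUERY_LOG.get, reverse=True)[:k]

print(suggest("wordp"))
# ['wordpress', 'wordpress themes', 'wordpress plugins']
```

For a site owner, the takeaway is that suggestions reflect what people actually type, which is one reason it pays to know the query variations searchers use around your topics.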
If you look up when the last five movies from Jim Carrey were released, and were able to sneak a peek at Google’s query logs, you’d see that searches for Jim Carrey spiked on those dates.
Same with Ben Stiller, Edward Norton, Leonardo DiCaprio, and Tom Hanks.
We know this from a footnote in a recently published paper from researchers at Google.
The authors of Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries checked to see if there was some kind of time-based relationship between searches for those movies’ names (and release dates) and the names of those actors.
It sounds obvious that there would be, but it’s interesting to see actual data from Google that explores relationships like that.
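The core idea can be sketched with a standard statistic: two query terms whose daily search volumes rise and fall together are temporally related, and Pearson correlation measures how closely two series co-move. The daily counts below are invented to mimic an actor's name and a movie title spiking around a release date (the paper's own methods are more involved than this):

```python
# Pearson correlation between two query-volume time series. A value
# near 1 means the series spike and dip together; near 0 means no
# linear relationship.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up daily query counts around a movie release: the actor's name
# and the movie's title spike on the same days; a third, unrelated
# query stays flat.
actor     = [10, 12, 11, 80, 95, 60, 15]
movie     = [ 2,  3,  2, 70, 88, 50,  5]
unrelated = [30, 28, 31, 29, 30, 32, 28]

print(round(pearson(actor, movie), 3))      # close to 1
print(round(pearson(actor, unrelated), 3))  # much lower
```

Mining query logs for pairs of terms with strongly correlated spikes is one plausible way to surface the kinds of lexical relationships the paper describes.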
Relationships between Queries Based upon Time