Does Google determine categories for pages and for queries, and can those play a role in how it ranks pages in search results?
Almost every day, I receive visitors on a query for “bookshelf plans,” on the strength of a past post about Google’s plans for virtual bookshelves in Google library. Most of those visitors probably aren’t surprised that the page is about an online library given the title and snippet appearing for the post, but most of the search results preceding it describe wooden rather than virtual shelves. My page really doesn’t fit within the same category as the others.
When a search engine determines whether a page is relevant for a certain query, it does more than try to match the text of the query with a page that contains that text and look at the links pointing to the page. A Google patent filed in 2004 and granted today describes how the search engine may try to associate web pages with categories, and queries with categories, and come up with a category score for each to use in ranking those pages for queries.
We are told that this kind of category matching addresses a couple of different problems.
One problem is that text matching by search engines has been taken too literally – if I search for “car mechanic,” I’m also searching for “auto mechanic,” or “automobile mechanic.” If the word “car” doesn’t appear on a page about an “auto mechanic,” under a pure text matching approach, I wouldn’t see that page in search results.
Another problem is that words or phrases used in queries can often have more than one sense in which they are used. My bookshelves example illustrates that point. If you’re looking for “bookshelves plans,” chances are that you’re considering building a bookshelf and you want blueprints or instructions. My blog post was about Google’s plans to create virtual bookshelves. The word “plans” might be interpreted as actual blueprints for construction, or as a strategy for moving forward on a project. The term “bookshelves” might refer to furniture to hold books, or virtual places to hold information about books.
The patent is:
System and method for determining a composite score for categorized search results
Invented by Karl Pfleger and Brian Larson
Assigned to Google
US Patent 7,814,085
Granted October 12, 2010
Filed: February 26, 2004
Abstract
A system and method for scoring documents is described. One or more documents are identified responsive to a search criteria. A text match score indicating a quality of match of the identified documents is determined. A category match score is determined over categories. A document-categories score is determined indicating a quality of match between an identified document and a plurality of categories.
A search criteria-categories score is determined indicating a quality of match between the search criteria and the categories. An overall score is determined based on the text match score and the category match score.
Categories for pages and queries might be created manually, through an automated process, or by a combination of both. These categories could be defined as a list, or in a hierarchical manner, or in some other way, and documents and queries may be assigned more than one category. Pages and query terms may be associated with specific categories based upon some relative strength of correlation to those categories. The strength of those associations with categories might vary from one page to another, or from one query to another.
An association strength would be determined between each document and the categories, and between each query term and the categories. This association strength would be used together with a text-based matching score to determine which pages rank for which query terms.
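As a rough illustration of how an association strength and a text match score might be combined, here is a minimal Python sketch. The patent abstract says only that an overall score is based on the text match score and the category match score. The product-and-sum calculation, the linear blend, the weight of 0.5, and every name and number below are assumptions made for illustration, not Google's formula.

```python
def category_match_score(doc_categories, query_categories):
    """Sum, over shared categories, of document strength times query strength."""
    return sum(
        strength * query_categories[category]
        for category, strength in doc_categories.items()
        if category in query_categories
    )


def overall_score(text_score, doc_categories, query_categories, category_weight=0.5):
    """Blend a text match score with a category match score (hypothetical weighting)."""
    match = category_match_score(doc_categories, query_categories)
    return (1 - category_weight) * text_score + category_weight * match


# Hypothetical association strengths between 0 and 1 for the query "bookshelf plans".
query = {"woodworking": 0.9, "online libraries": 0.1}
woodworking_page = {"woodworking": 0.8}
virtual_shelf_page = {"online libraries": 0.9}

# Both pages are given the same (made-up) text match score.
wood = overall_score(0.7, woodworking_page, query)
virtual = overall_score(0.7, virtual_shelf_page, query)
print(wood > virtual)
```

With identical text scores, the page whose categories line up with the query's likely categories comes out ahead, which is the kind of reordering the patent seems to be after.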
Product Search Example
We’re told in the patent that pages about products tend to be easy to place into categories. For example, a hierarchical category structure could be created for “Household.” Under that larger category might be smaller categories such as “Cleaning Supplies,” “Lawn Care,” “Maintenance,” and “Decorative.” Even smaller categories could be created under those that might include things such as “Brooms,” “Mops,” “Vacuum Cleaners,” “Rakes,” “Mowers,” “Flamingos,” and “Gnomes.”
A page (or query) about “Flamingos” might fall within a categorized list, such as:
Household > Lawn Care > Decorative > Flamingos
When a page is indexed, it might be given a text-based score for ranking, as well as a category score. A page about Flamingos would be given a category score based upon how well it correlates with flamingos compared to other pages about flamingos.
A page about lawn decorations that includes information about flamingos and lawn gnomes might fit into both the flamingo category and the gnome category, but the page’s correlation score for flamingos might not be as high as the correlation score for a page only about flamingos.
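The flamingo example can be sketched in a few lines of Python. The child-to-parent map follows the category path shown above. The page names and correlation scores are invented for illustration, and nothing here is taken from the patent's actual data structures.

```python
def category_path(parents, category):
    """Walk up a child -> parent map and return the path from the root."""
    path = [category]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return " > ".join(reversed(path))


# Child -> parent links for the example path in the post.
parents = {
    "Flamingos": "Decorative",
    "Gnomes": "Decorative",
    "Decorative": "Lawn Care",
    "Lawn Care": "Household",
}

# Hypothetical correlation scores per page and category.
page_scores = {
    "flamingo-only page": {"Flamingos": 0.9},
    "lawn decorations page": {"Flamingos": 0.5, "Gnomes": 0.5},
}

print(category_path(parents, "Flamingos"))

# The page with the strongest correlation to the Flamingos category.
best = max(page_scores, key=lambda page: page_scores[page].get("Flamingos", 0.0))
print(best)
```

A page can carry scores for several categories at once, so the mixed lawn decorations page still counts for flamingos, just less strongly than the page devoted to them.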
Conclusion
It’s possible that Google has used category matching like this, between query terms and web pages at some point within the past 6 or 7 years. It’s also possible that Google has since moved on to other approaches that try to solve the problems this patent was intended to solve.
This kind of category matching described in the patent is one approach to determining relevance. There are a few other ways that the concept of relevance might be applied by a search engine.
One is a direct matching of keywords between a document and a query term or phrase. Under this sense of the word relevance, if the same words appear in a query and on a page, the page is relevant for the term. That’s the kind of relevance determination that this patent was aiming to improve upon.
More recently, search engines have been trying harder to interpret the intent behind queries. This kind of interpretation might be fairly simple, such as trying to return transactional sites in search results when someone’s query term might be written like “buy xxxxx”. It could focus upon returning informational pages when a query begins with a phrase such as “How to”. A search engine might strive to return a specific website to a searcher on a query that it believes is navigational in nature, meaning that the searcher was likely trying to find a specific web site, such as when I type ESPN in my search toolbar as a shortcut to get to the ESPN website.
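A toy version of that kind of simple intent interpretation might look like the Python below. The rules mirror the three examples in the paragraph above. The labels, the function name, and the one-entry list of known site names are assumptions for illustration, and a real search engine would rely on far more than prefixes.

```python
KNOWN_SITES = {"espn"}  # hypothetical list of names treated as navigational


def guess_intent(query):
    """Return a coarse intent label using simple, hand-written rules."""
    text = query.strip().lower()
    if text.startswith("buy "):
        return "transactional"
    if text.startswith("how to "):
        return "informational"
    if text in KNOWN_SITES:
        return "navigational"
    return "unknown"


for q in ["buy lawn gnome", "How to build a bookshelf", "ESPN", "bookshelf plans"]:
    print(q, "->", guess_intent(q))
```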
Another kind of relevance is one that fulfills a situational need. For example, if I search for “Pizza,” the search engines try to include within search results links to local pizzerias based upon the notion that the situational intent behind my search is to find a place to get a slice or two of pizza.
One method that Google is using to somewhat address one of the problems that I mentioned at the start of this post – search engines taking query terms too literally – has been to include synonyms for query terms when appropriate. That would solve the problem I described with a search for “car mechanic” not showing results for “auto mechanic.”
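To make the “car mechanic” case concrete, here is a minimal Python sketch of synonym expansion layered on top of plain word matching. The synonym table and function names are hypothetical, and the sketch ignores context, which matters a great deal in practice since a word like “auto” can have other senses.

```python
SYNONYMS = {"car": {"auto", "automobile"}}  # hypothetical synonym table


def expand(query):
    """Return, for each query word, the set of words accepted as a match."""
    return [{word} | SYNONYMS.get(word, set()) for word in query.lower().split()]


def matches(query, page_text):
    """True when every query word, or one of its synonyms, appears on the page."""
    page_words = set(page_text.lower().split())
    return all(options & page_words for options in expand(query))


print(matches("car mechanic", "Trusted auto mechanic in town"))
print(matches("car mechanic", "Bookshelf plans for beginners"))
```

Under pure text matching the first page would be missed because the word “car” never appears on it. With the expansion step it qualifies.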
Google may be using category matching as a ranking signal, but they’ve been developing other approaches that may yield better results since the filing of this patent.
I believe Google is still using this method, but to a lesser extent, like Bill explained in this post.
It basically boils down to LSI – Latent Semantic Indexing combined with categories.
Just look at how Google categorizes your website in AdWords or Analytics (if you use it).
I’ve actually heard about this method before, and I suppose they’re already implementing it. I think that Google Caffeine is, in part, an effort in this direction.
That is correct, Andrew – especially with the algo. responsible for English analysis. With odd languages, as far as I can tell, they’re still in a bit of a struggle, but it’s a matter of time before they catch up.
I’ve read somewhere before that it’s okay to use variations of keywords in a post because the search engine would know and can analyze those keywords. I think this has certainly addressed search engines taking query terms too literally.
I agree with Jeremy. Caffeine seems to be part of this which would explain why some have reported that long tail traffic has reduced. This could indicate some sites that were scooping up long tail were getting traffic from queries in which they were improperly categorized (as Google views it 🙂 ).
I believe the Wonder Wheel could support this. This would also explain the seeming effectiveness and growing popularity of “siloing” aka properly categorizing your pages and posts. Essentially, what you described about (Household > Lawn Care > Decorative > Flamingos) would be an exact siloing structure.
Interesting!
It’s kind of funny but also shows that you do not get quality anymore for your money nowadays. Google is worth billions, spent millions on its algorithm, analyzed billions of pages since its inception and still sends DIYers who are looking for ‘bookshelf plans’ to an ‘online marketing’ site.
Hi Jeremy,
It’s possible that Google is still looking at categories, as one of a number of potential ranking signals to use to identify and rank content. Ideally, this approach helps most with pages that don’t have many links pointing to them, don’t have much in the way of anchor text to help identify the content available at specific websites, and so on.
I’m really loath to say that LSI is involved in any way with what Google does for a number of reasons.
The first of them is that I’ve only seen LSI mentioned in one Google patent, and not as the basis of any indexing approaches that Google may be using, but as a possible area of future investigation for a form of fact extraction that Sergey Brin wrote about in a patent filed back in 2003.
The second reason is that LSI dates back to the late 80s and early 90s, and seems to focus primarily upon a form of information retrieval for smaller and fairly static databases that don’t change frequently the way that the Web does. There have been other indexing methods that have developed since then that seem to be more robust, and more suited to indexing content found upon the Web.
Google’s categorization of content through Adwords/Adsense is much more likely to still be based on the ontology approach developed by the Applied Semantics team in the early 2000s, which is definitely not LSI.
I wrote a little about a possible Applied Semantics approach to web indexing here: Search Based upon Concepts: Applied Semantics and Google
Hi Dan,
I’ve seen a lot about Google Caffeine, and what it potentially brought to Google, and much of it seems to be misinformation more than anything else. Google Caffeine was primarily a change to the way that Google’s database stores information, utilizes server resources, and updates indexing information in an incremental fashion.
The basic thrust of it is that more information can be stored on a hard drive because that information can be maintained in smaller chunks of the hard drive, more information can be accessed because the location of information stored on servers can be identified through more than one master server at a time, and when information is updated about a specific page, it can be done in pieces rather than updating all of the information about the page. That last part means, for instance, that if there is a new link on a page, only the section of information about links on the page needs to be updated, and not the information about the content on the page.
The value of the Caffeine update is that information about pages in Google’s index can be updated faster, more information can be collected on the same servers, and the work of collecting and retrieving this kind of information can be done more efficiently. What this means is that Google can use their computational resources more efficiently, and use some of that computing power to try out new things.
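The idea of updating a page's information in pieces can be pictured with a small Python sketch. This is only an analogy: the record layout, the version counters, and the function name are all invented for illustration and say nothing about how Google's storage actually works.

```python
# A hypothetical page record split into independently updatable sections.
index = {
    "example.com/page": {
        "content": {"version": 1, "terms": ["bookshelf", "plans"]},
        "links": {"version": 1, "outlinks": ["example.com/a"]},
    }
}


def update_section(index, url, section, **fields):
    """Replace fields in one section of a page record and bump only its version."""
    record = index[url][section]
    record.update(fields)
    record["version"] += 1


# A new link is found on the page: only the links section is rewritten.
update_section(index, "example.com/page", "links",
               outlinks=["example.com/a", "example.com/b"])

print(index["example.com/page"]["links"]["version"])
print(index["example.com/page"]["content"]["version"])
```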
Hi Andrew,
It’s never been a bad idea as a web publisher to use variations of a term on a page, include synonyms and related terms and phrases, include information about higher and lower level related concepts, and so on. Doing so creates a possibility that you will rank better for those related terms, synonyms, and concepts, and that Google may have an easier time of doing things like ranking your pages based upon things like categories, or reranking your pages based upon something like phrase-based indexing.
Intelligent approaches like that can make it easier for the search engines to index the content on your pages. Google is getting better at understanding things like when showing search results for a synonym for a query term within the appropriate context may be helpful to searchers, but it does help to take the kinds of steps that you’ve referred to.
Hi Dan,
There was an interesting post on the Google Public Policy Blog a couple of years ago about how Google is trying to use statistical language models to improve the identification of synonyms within context, that addresses the problem that you point out:
Making Search Better in Catalonia, Estonia, and everywhere else
One issue that they mention in using language models like that is that it can take much longer to collect user data about searches. A snippet from the post:
Hi Jake,
Categorization may be one reason why some pages stopped getting as much traffic from long tail queries, though it’s possible that the approach to categorization described in this patent may not be what caused some of the recently reported problems.
Chances are that it wasn’t Caffeine, which was an infrastructure change rather than a ranking change.
It’s just as likely that something else, such as a phrase-based indexing approach could have impacted those results.
Hi Andreas,
I’m going to disagree, slightly. 🙂
My post on Google’s virtual bookshelf plans was a good quality post, filled with useful information about online personal libraries, with links from librarians and others pointing to it, comments and discussion on it, and contained some information that is pretty much unavailable anywhere else. The quality of the post isn’t an issue, neither is the post’s relevance for the term “bookshelf plans.”
Chances are that people searching for “bookshelf plans” aren’t looking for information about virtual bookshelves, but the title and snippet shown for my post make it clear that the post doesn’t contain woodworking schematics or blueprints. The visitors who click through to that page, on a search for “bookshelf plans” probably aren’t surprised by the content found on the page, and I would guess that people interested in building bookshelves may also be interested in virtual bookshelves as well.
It’s possible that Google has reranked search results for my post, based upon some categorization approach, so that it shows up lower in search results for that query than it would based upon text-based relevance and quality (PageRank, plus whatever other signals they might use to measure quality). There are a number of pages that show up in results before it that do detail how to build bookshelves, and it’s quite possible that those were boosted because they likely better match the intent of DIY carpenters.
If my page were to rank number one in Google’s search results for “bookshelf plans,” I would be more likely to agree with you.
I actually found when I started doing SEO and marketing online that indeed there are categories and links to other words on searches. A few basic examples… financial and money or mortgages and real estate and so on.
Hi Eric,
There are some places where you may see categories in Google, such as categories assigned to specific businesses that are associated with web sites in Google Maps, or what looks like categories assigned to specific pages in Google’s content-based advertising.
And it may be easy to assign many types of pages on the Web to specific categories. But we don’t explicitly see categories assigned to pages in Google in a way that would indicate to us that Google might be using categories as an additional ranking factor.
I thought this might be the case since before 2004. I wasn’t aware of this patent or any related, but it has seemed to me that it’s easier to enhance search rankings in Google when you rely on related sites or sites “in category” to build content partnerships with. Further, I’ve seen benefits to closely linking related pages within a site to each other. It seems that this method is how our own brains store data, so why shouldn’t Google? I hope I’ve understood this post.
Do you think this will bring us better search results?
I think Google have been categorizing websites for a long time. In Google Webmaster tools it is clear what percentage importance Google gives to certain keywords and that shows that Google already has collected info about what your website is mainly based around (for example “KW1 – 70%” , KW2 – 10% etc).
All Google would do is take these main keywords and correlate them with a category for organic purposes which is easily done within Google’s complex algorithm.
I think the best way to make sure your website is categorized correctly is simply to make sure all onpage data is focused on the correct main keyword. All title tags, meta tags, H1’s H2’s and body text.
Wonder if breadcrumbs have a role to play in this depending on how and where they are used on the page. When you click ‘similar’ in results for the page in question, it picks ‘virtual bookshelf’ pages correctly.
I agree with what Donnie Lee has said. I have an online business with 4 sites and it’s much easier for me to know where to start with keywords and links for SEO if I search for a certain niche and see what Google “finds” for my query.
Like some of the other commenters here, I wasn’t aware of this. I’ve always based my categories in “keyword-type” format but I don’t think I receive any traffic on my sites directly from category names. It will be interesting to watch in the future and see how the search companies continue to handle them.
Hi Donnie Lee,
Those are good points. I have a followup to this post coming soon, and I’m going to explore the topic in much greater depth. I’ve been working on it for a few days, so hopefully it will answer questions that you might have.
Hi Paul,
It has been interesting seeing how Google presents information about your pages in Webmaster Tools, especially the section that shows which words appear on the pages of a site, and the frequency of their appearance. I’m not sure how helpful that is when it comes to thinking about how Google classifies pages, since the terms it shows us are single words rather than phrases, and since many categories may be best identified by more than one word.
I think focusing upon a single word or phrase is important, but I also think that it is extremely helpful to include related words and phrases upon a page as well.
Hi Marijan.
I do think that approaches like this one can help to bring us better search results, especially for pages which may not have many links to them, or much in the way of anchor text from links pointing to the pages.
Hi Dharne,
It’s possible that breadcrumbs may be a helpful signal that the search engines could use to identify the structure of a site, and possibly hierarchical categories for a site. If the search engines do use them as a signal like that, it might not be a bad idea to use them in conjunction with a number of other signals as well though. A site’s structure doesn’t always do the best job in helping to conceptually categorize the content of pages found on that site.
It is interesting that Google has sometimes been adding breadcrumb information in search snippets instead of a page’s URL since November of 2009, as described in this Official Google Webmaster blog post:
New site hierarchies display in search results
In the post, they tell us about one of their examples of the use of breadcrumbs, and there’s a mention of categorization:
Hi Jasmine,
I agree completely – it really helps to look at the actual search results that show up for a query term when doing keyword research to get a good idea of what appears in search results for the term, including categories, different meanings that might show up for the query terms, whether or not Google is showing synonyms or alternative spellings or universal search results such as images, videos, news, blog posts, etc.
I think it’s an essential step.
Hi Chris,
I’ve seen and worked on a number of sites that tend to rank well for the categories that they cover, and it can be nice to see that happen.
I’ll have more on categories in the very near future.
Um, should it be said, Thou shall index category pages and not tag pages to rank better, Bill?
I guess it makes a lot of sense as Google wants relevant results and categories are a good indication for relevance for a certain subject or Google search term. I have always used descriptive tags and categories for content within my own blogs but am not sure how productive that has been from an SEO point of view but has certainly not hurt so hopefully could be a positive thing with Google in the not too distant future. Thanks for sharing.
How are categories helpful in SEO strategies? I think tags under the posts are more useful than categories because they act as keywords.
Thank you.
Best Regards,
Alish 🙂
It seems like a coincidence. Just was checking some keywords that I monitor after a 10-day self-imposed hibernation (from the net) and saw that it’s categories rather than the main domain which are ranking.
You are spot on in this count.
Good ideas.
I think the hierarchy will make the difference better.
With proper planning of the hierarchy, we can incorporate a variety of keywords.
But you gave some great ideas that I intend to implement
Hi Lucky,
That’s something that I haven’t experimented with personally too much, but it is how I’ve set things up here.
It’s really easy to add a mix of tags to a blog post or page covering a wide range of topics. It’s also a good idea to try to limit the number of categories that you might assign to posts as well. I personally try to only use one category per post.
A video that Matt Cutts published in August, Do tag clouds help or hinder SEO? seems to hint that letting tags get indexed might not be a great idea, especially if you have a lot of them, and if they give the appearance of being an attempt to keyword stuff.
Hi James and Alish,
I’ve been wondering about the value of the tags I’ve been using on posts lately. Do they really help visitors and search rankings? I’m not sure if they do. I think maybe I’d rather include my keywords in my blog posts rather than in tags for the blog posts.
Hi John,
Interesting result – not sure if that is completely attributable to the process described in this patent filing. It’s also possible that the internal linking of your site may be playing a role in having those category pages rank well.
Hi amit,
One of the things that I found pretty interesting in this patent was the idea of hierarchical categories as well. The idea that Google is creating a hierarchical structure of categories that pages can be classified within is worth exploring in more detail by itself.
Bill
what do you think about using a keyword instead of “Home” in the navigation bar?
Hi Justine,
The primary purpose behind your main navigation is making sure that people can find their way around your site. The search engines rely upon that when they look at the anchor text for pages.
I think using a keyword in the link to the home page is fine, as long as people viewing that navigation can easily understand where they would end up when they click on the link.
I think that Google is not perfect, and the fact that the term bookshelves plans returns results for pages that include virtual bookshelves plans information is just an example of that. No matter how advanced the algorithm gets, it will make ranking mistakes.
Hi Jonathan,
There are issues with the ways that all of the major search engines rank pages, but they do seem to be improving. Hopefully those mistakes will lessen in the future (cross your fingers).
I agree with Alish. Tags are important
What I’d be interested to know is what percentage of importance they put on each thing, because in a lot of interviews they literally say “oh don’t worry about keywords in your url” etc.
Hi Matt,
I don’t think that we will see something from Google that tells us how much weight different signals might carry when a search engine ranks pages, and I suspect that it might be nearly impossible for them to even give a somewhat simple answer to a question like that.
Chances are that some ranking signals carry more weight or less weight based upon different circumstances, and in regards to different queries as well as the possible intents behind those queries. See my post on Time to Add Query Breadth to Your SEO Glossary? for an example of where the query being used might influence how much weight some ranking signals might have.
Hi,
This is great. I already use categories and tags for sorting my content, and in the URL, like categories + title.
Regards,
Irfan
Hi Irfan,
Using categories and tags to sort your content may help, but the search engine will likely look at more than just those things to try to determine a category for a page.
Bill,
This was a good read. Google is always innovating and seems to be getting closer with the instant search feature. I agree with you 100% on the last comment too. Adding a category name on a blog is a good practice but Google will be digging much deeper than just looking at what a blog owner lists as a category.
Thanks,
David
Hi David,
Thanks. Instant search these days is often a guess at a somewhat popular phrase based upon auto completing what you’ve typed in so far. I expect at some point in the future we will see more suggestions that might include synonyms for things we are typing in and related search queries that don’t necessarily start with the same letters.
I enjoyed this post just as much as the one you did on latent semantic indexing. I do believe that including categories’ names in our URLs as well as optimising the categories’ meta tags can be helpful.
Quote: One problem is that text matching by search engines has been taken too literally – if I search for “car mechanic,” I’m also searching for “auto mechanic,” or “automobile mechanic.” If the word “car” doesn’t appear on a page about an “auto mechanic,” under a pure text matching approach, I wouldn’t see that page in search results.
I come across people every day who are this “purist” about it. If that were the case, that would be too black and white, and narrow on top of it.
Hi Lisa,
Not sure that I did a post that focused upon Latent Semantic Indexing that I can recall, but I have written about a few other semantic approaches to how search engines may work.
Including a category in a URL might be helpful to a search engine in classifying a page, but I’m not sure that it’s something that they would rely upon alone. There are too many sites that don’t necessarily include helpful URLs like that.
Hi Emmo,
There have been efforts and approaches by Google and other search engines to understand when words might convey the same meanings, especially within contexts. The word “auto” by itself might mean car, or it might mean a setting on something (an alarm clock, for example) that needs to be considered by search engines.
See my post on synonyms at Google for a look at some of the ways that they are trying to understand synonyms.