How Google Might Determine Categories for Web Pages

Six years ago, Google started showing sitelinks for the top results for many queries. In a recent Google live experiment, Google started showing expanded sitelinks in search results with tabs above those sitelinks showing categories. These experimental results were written about last week by the team at SEO Consult, in the post Google’s New SERP Test: Tabular Mega Sitelinks.

In my last post, I asked the question Will Google Add Categories to Search Results, and Let You Edit Them? I didn’t anticipate Google testing categories within their presentation of sitelinks though. I did notice an interesting new version of an older Google patent published as a pending patent application on categories for AdSense-type advertisements.

The continuation patent filing had a fresh new claims section that detailed how Google might interpret the categories of pages for purposes of showing AdSense advertisements. That process might not be the exact method that Google would use to classify pages into categories for sitelinks, or for the explicit categories that Google could potentially show in search results so that searchers can narrow their results by clicking on those categories. But it does show some possibilities of how Google might classify pages.

The patent application is:

Methods and Apparatus for Serving Relevant Advertisements
Invented by Jeffrey A. Dean, Georges R. Harik, and Paul Buchheit
US Patent Application 20120173334
Published July 5, 2012
Filed: March 15, 2012

Abstract

The relevance of advertisements to a user’s interests is improved. In one implementation, the content of a web page is analyzed to determine a list of one or more topics associated with that web page. An advertisement is considered to be relevant to that web page if it is associated with keywords belonging to the list of one or more topics.

One or more of these relevant advertisements may be provided for rendering in conjunction with the web page or related web pages.

While this patent focuses upon AdSense and search engine marketing, it potentially has lessons for SEO on how pages might be classified for other purposes as well.

Here are a number of ways that Google might classify pages to decide which ads to show on them:

Ways that Pages May be Categorized

1. Every term on a page could be identified as a possible topic.

2. A threshold-based approach might be used so that if a term appears more than a certain number of times, it could be a topic for the page.

3. Terms that appear more frequently on a page might be given a higher weight than terms that appear less frequently.

4. Terms from the page that don’t appear very frequently on the Web might be given a higher weight than terms that appear more frequently on the Web. For example, the word “the” appears very frequently on the Web, while the word “chianti” appears much less frequently. If both “the” and “chianti” appear on the same page, the word “chianti” may be given a higher weight as a potential topic for the page.

5. Using any weighting approach for assigning weight to text found on a page, only the top scoring terms might be considered as topics for the page.

6. Anchor text pointing to a page might be used to find a category for a page. A page which is linked to with the anchor text “Travels in Italy” might be seen as about traveling in Italy.

7. The title of a page might be considered to be an indication of the topic the page covers.

8. If a page on a certain topic links to the page being classified, that link might be an indication that the pages are similar to one another and the topic for the first page might be used for the second page.

9. A top number (such as 1, 4, or 10) of the queries that a page ranks for in search results might be used as topics for that page.

10. The topics of pages that are related in ways such as sharing the same directory might be assigned to the page being targeted.

11. The search query history of users who visit the page might be used to identify a topic for the page. For example, if someone searches for “italian wine,” and in the same search query session visits a page about “Travels in Italy,” the topic of that page being visited might be used to determine that “italian” and/or “wine” are potential topics for the “Travels in Italy” page.
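Approaches 2 through 5 above can be sketched together in a few lines. This is only an illustration under stated assumptions, not the method the patent filing claims: the function name `candidate_topics`, the `min_count` and `top_k` parameters, the log-based rarity weight, and the toy Web frequency numbers are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def candidate_topics(page_text, web_doc_freq, total_docs, min_count=2, top_k=3):
    """Return the top-scoring terms on a page as candidate topics.

    web_doc_freq maps a term to the number of Web documents containing it
    (hypothetical data supplied by the caller).
    """
    terms = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(terms)

    scores = {}
    for term, count in counts.items():
        # Approach 2: ignore terms that appear fewer than min_count times.
        if count < min_count:
            continue
        # Assumption: unknown terms are treated as appearing in one document.
        doc_freq = max(web_doc_freq.get(term, 1), 1)
        # Approach 4: terms that are rare on the Web get a higher weight.
        rarity = math.log(total_docs / doc_freq)
        # Approach 3: terms that are frequent on the page get a higher weight.
        score = count * rarity
        if score > 0:
            scores[term] = score

    # Approach 5: keep only the top-scoring terms (ties broken alphabetically).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


page = "The chianti from the hills is the best chianti in the region"
web_counts = {"the": 1000, "chianti": 5}  # made-up frequencies
print(candidate_topics(page, web_counts, total_docs=1000))
```

In this toy example, “the” appears more often on the page than “chianti,” but because it appears in every document in the made-up collection its rarity weight is zero, so only “chianti” survives as a candidate topic.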

Takeaways

It’s possible that a number of the approaches in this patent filing might be used together to come up with topics or categories for pages, since any one signal on its own can be weak. For example, the links pointing to a page might be sparse or almost non-existent, or their anchor text might not be very descriptive (like “click here”), and so may not be helpful in determining a category for a page.

Or the words that appear most frequently on a page might tend to be stop words such as “the,” or “a,” or “this,” and just wouldn’t be very descriptive or helpful as topics.

Some sites don’t do a very good job of having descriptive titles for pages, and may have the same title on every page, or extremely short titles, or extremely long and not very relevant titles.

A page selling fresh steaks might have a link to a merchant that sells wine, to give visitors a recommendation for a great meal. It’s a great use of a link, but the site selling steaks has a very different topic than the site selling wine. The topics of those pages aren’t similar enough to use the topic for the steak site for the wine site.

But if you look at many of these signals (and others that could possibly be used as well), and use them together to see how much they might be in agreement with each other, that could potentially provide an effective way to provide a topic or category for a page.
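One way to picture signals “agreeing with each other” is a simple vote. The sketch below is an assumption made for illustration, not something described in the patent filing: the function name `combine_signals`, the signal names, the optional weights, and the `min_agreement` cutoff are all hypothetical.

```python
from collections import defaultdict


def combine_signals(signals, weights=None, min_agreement=2):
    """Keep topics that at least min_agreement signals suggest.

    signals maps a signal name (for example "title") to the topics
    that signal suggests for the page.
    """
    weights = weights or {}
    votes = defaultdict(float)   # weighted vote total per topic
    support = defaultdict(int)   # number of distinct signals per topic

    for name, topics in signals.items():
        for topic in set(topics):  # each signal votes once per topic
            votes[topic] += weights.get(name, 1.0)
            support[topic] += 1

    agreed = [topic for topic in votes if support[topic] >= min_agreement]
    # Strongest agreement first; ties broken alphabetically.
    return sorted(agreed, key=lambda topic: (-votes[topic], topic))


signals = {
    "title": ["wine"],
    "anchor_text": ["wine", "italy"],
    "on_page_terms": ["chianti", "wine", "italy"],
    "linking_page_topic": ["steak"],
}
print(combine_signals(signals))
```

Here the “steak” topic inherited from a linking page is dropped because no other signal supports it, which mirrors the steak-and-wine example above.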

When a human being looks at a web page, they probably wouldn’t have too much difficulty deciding upon a category for most pages. A search engine needs to rely upon clues that it can decide upon mathematically.

What are you telling search engines about the categories of your pages when you decide upon page titles, the anchor text pointing to those pages, your choice of words on those pages, and the words that you are trying to optimize those pages for?

Those decisions might affect more than the kinds of advertisements that Google shows on your pages if you run AdSense. They could potentially determine categories for your pages in sitelinks to those pages, or even in categories that Google might show in search results.


35 thoughts on “How Google Might Determine Categories for Web Pages”

  1. Pingback: How Google Might Determine Categories for Web Pages - Inbound.org
  2. Way number “3” sounds like old school keyword density tracking.

    Honestly, I really couldn’t think of a better way to do that other than keyword density and inbound link anchor text.

    I would think that Latent Semantic Indexing (LSI) would somehow be incorporated into that process, being that keyword density is all too often over-manipulated.

    Mark

  3. I can see this as being extremely beneficial to sites that have poor architecture and are hard to navigate. Google would be helping users to navigate to the most relevant content on your site for you. Interested to see how this progresses, but at the moment I don’t see it as a game changer… unless of course they do decide to roll this out for the organic results. It seems like a natural progression; one of the problems Google and thus users run into is different kinds of intent for the same or similar phrases. This would seem to be a sensible solution to that problem.

  4. Agree with Mark. The real sites experiencing boosts from this would be low-to-mid quality sites. On top of that, I could see how a TLD with oodles of authority would get better rankings for long tails.

  5. I got a Google session where I could see the experiment from France and made several queries to test it. Some big sites did not have the tabbed sitelinks, while some smaller ones would get them. Obviously the site architecture is important, but an average architecture coupled with big traffic can lead to these sitelinks (search query history used here).

    I did get tabbed sitelinks for one of the sites I worked on (aladom, a French website), and although the traffic is not huge (around 25k unique visitors per day), Google understood the site architecture very well – aaaaand I’m quite proud of it!

    Not sure if this type of sitelinks will be kept, but this definitely shows how Google tries to understand the silos in a website.

  6. It’s crazy to see how SERP are continuously evolving… As an online marketer, a lot of these adaptions require some really good out of the box thinking.. I guess at the same time, it’s a good thing for newer businesses such that SERPs are not entirely dominated allowing different upcoming websites the opportunity to maximize visibility and exposure.

  7. Hey Bill, is it possible they’ll use this technology to categorize posts on Google+? People have been begging for groups based on interest, and this might be a way to do it.

  8. Ouch!!! As per your #1 point, ‘EVERY term on a page could be identified as a possible topic’, that is just a bit too big for Google or any other search engine to possibly tackle. Although I respect that SERPs are always evolving in order to better serve people, putting everything into categories may prove to be simply not feasible or practical to implement.

  9. If they take the approach of #7, using the title to make a determination, then I could see this tactic being feasible. Anything other than that may cause a lot of confusion for SEO and even the search engines themselves.

  10. I like the sound of it… In many cases when doing a search, I am looking for more broad information. In a recent interview I remember Matt Cutts talking about parsing the content to really determine how to best deliver the best search results for that searcher… At least that is what I took away from it. (I wish I could remember who did the interview, but it slips my mind.)

  11. This is where it starts to get a bit crazy. I get the part about the relevant domain but Google really needs to come down off the mountain and tell us a better definition of relevant. Not for spamming but just basic planning purposes.

    For example, my site / blog is focused on word game analysis. Parsing this further:

    words | games | analytics

    So…when it comes to giving and sending link juice, who can I work with effectively? Because I don’t think I’m going to find many exact match sources… which means I can start link building with people who are into words, games, or analytics…so which of these is deemed relevant?

    Do I need to pick one topic for the entire blog or can it handle a split focus (which is effectively what I have)? What if I write two articles about adsense and one about SEO – how does that factor into Google’s opinion?

    The concept of link relevance is great but gets messy quickly for a real world site..

  12. Google is quite good at determining what the content is, what category it falls into, and what the site is actually about. Its analysis of the text, and especially its content readability measures and other features, is very good at determining the type of content, what it is about, and its quality in general.

  13. Forgot to mention one other idea in my last comment.

    How Google determines link relevance is important to another aspect of long term SEO strategy – managing site pivots.

    Going back to our word | game | analytics example, to what extent does the link juice I acquire from sites about “word games” help me rank for “analytics” terms? This is of particular relevance when you’re looking at a sharply trending topic – eg. ride the trend to get links, then pivot into supporting a money space.

    Eg. when the mobile word games craze dies down, can I pivot my link juice into a statistics site? What factors would limit this?

    I suppose a more sinister example might be paris | hilton | hotels… start by using Paris Hilton to build links during one of her many flare ups, then pivot that juice into promoting hilton hotels (assuming they have an affiliate program..)

    Any thoughts on how Google controls / facilitates this?

  14. I found some of my posts containing my blog homepage’s name appear different in search results. Instead of showing the page names (as written inside title tags), Google shows what I usually use for my homepage’s anchor texts. It’s unique, and some people linking to my blog also use the same anchor texts. Perhaps, Google “sees” something in it and decided to use the anchor text as my blog’s title shown in search snippets.

  15. Great find Bill.

    I can see #11 being an insight for ranking search results as well, as it is a strong representation of how valuable a piece of content is for a specific keyword.

    I have long thought (well since the Panda Update) that user behavior and user metrics based on a keyword or topic coupled with other value metrics have become a stronger indicator to Google as to which page should rank for which keywords.

  16. Interesting to learn about different complex connections between topics, words, etc. I mean things that go together most often. For example, yoga and vegetarianism, family and pets, real estate and moving services. How do they make such connections and do they have them at all?

  17. I see a problem if they will use the title to determine the category; It could be very misleading in many ways since not many websites (in my country) are optimized for search engines in that way which makes titles relevant for every page. Also, permalinks/subpages on websites are seldom relevant.

  18. I think that Google calculates a category by mixing anchor texts, titles, keyword density and maybe the neighborhood.

    Google should have a high error rate if they just use one way.

  19. It is best to stick to the proper optimization and site structure standards (optimizing for the best page titles, anchor text, and words for your niche). Great article, it is good to know what Google is up to so that you can be better prepared.

  20. John: “I suppose a more sinister example might be paris | hilton | hotels… start by using Paris Hilton to build links during one of her many flare ups, then pivot that juice into promoting hilton hotels (assuming they have an affiliate program). Any thoughts on how Google controls / facilitates this?”

    I’m thinking this is part of the long term thinking behind semantic web and microdata. Whether you get relevant/authoritative links for the term Paris Hilton or not, the sites that specify the itemtype identifiers as person or building will have the relevance advantage. In other words you can pivot from a keyword string, but you can’t pivot from a Schema Object, and semantically built object-oriented search would negate any value from the pivot.

  21. I think some sites don’t do a very good job of having descriptive titles for pages, and may have the same title on every page, or extremely short titles, or extremely long and not very relevant titles. Suppose I want to search for inverter batteries of different ratings in ampere hours, and I get a result page showing where the item is sold in shops in a particular city. If results were organized by category, I think it would be much easier for SEOs to find satisfying results. Let Google prove this fruitful.

  22. Hi,

    Great blog – what can we in the (ultra-competitive) finance industry, as a company with an ethical approach to SEO and marketing, do to make the most of these developments? Any references would be greatly appreciated.

    Best regards,
    Wendy

  23. Google is a funny beast – it looks like their process has moved closer to a “test & learn approach” based on some of the stuff I’ve seen come through webmaster tools. Look at the tail end of your search term list (eg. the lowest 30%) and you will see some wild topic swings where they decide to throw you up for 500 impressions worth of “Iraqi folk dishes” which have a couple of words in common with the terms you are targeting.

    I kid you not – I was shocked to see that one of my sites must have ranked for “scrambled eggs” for a period of about 5 minutes one afternoon (and somebody apparently clicked on it!)

  24. Hi Alan,

    Google is investing in new technologies on a daily basis:

    - Driverless cars, and a ton of acquired intellectual property that makes it likely to be a technology that gets developed.

    - Project Glass, and a recent steady stream of patents that doesn’t appear to be letting up at any point in time

    - Google Fiber, providing a much higher bandwidth of broadband in Kansas City at first than any other provider can or will ever begin to provide, using underused and underutilized technologies.

    - Search technologies that include social search and real time monitoring of the Web and the World, knowledge base guided search results, concept-based search that relies less upon links and more upon the actual content of pages, authorship markup that makes it much easier to attribute content with its creators.

    - Three new patent filings from Google that provide a different approach to solar energy, new data centers that set much higher standards for energy efficiency than those of any other technology company, exploration of kite-like wind turbines that produce more energy in less space, an electrical grid for the floor of the Atlantic Ocean that others exploring offshore wind turbines can use.

    - Smartphone technologies that innovate in location based services, in speech recognition (See Jellybean), in the use of faster and higher data throughput at lower energy costs (and provide longer battery lives).

    - Mapping technologies that allow for the mapping of large indoor spaces like airports, museums, transit stations, shopping malls, and others.

    Google has been working on spending its money, with acquisitions of other companies, through acquisitions of patents and intellectual property, through hiring of skilled and talented engineers. If I had a significant nest egg, and a growing list of competitors in an ever-widening range of pursuits, I would be careful in how I spent it and would look for opportunities to strike out and acquire companies like Motorola Mobility when it seemed necessary or a good idea.

    I think the people behind that post haven’t been paying enough attention.

  25. Interesting article. I saw a couple of posts about tabs showing up in search results but never saw any of the results firsthand. I think tabs could be very useful and would make sense if they used a combination of all 11 of your factors.

  26. Google would be helping users to navigate to the most relevant content on your site for you. Interested to see how this progresses, but at the moment I do not see it as a game changer… unless of course they do decide to roll this out for the organic results. It seems like a natural progression; one of the problems Google and thus users run into is different kinds of intent for the same or similar phrases. This would seem to be a sensible solution to that problem.

  27. I can see #11 being an insight for ranking search results as well, as it is a strong representation of how valuable a piece of content is for a specific keyword.

  28. Google is adding categories; I also saw my blog’s categories indexed by Google. Great info about what types of words get frequently added by Google.

  29. It makes sense that Google wants to make it easier to navigate websites – the more accurate/relevant sitelinks are, the better the chance people will actually use them. It seems like this will benefit sites that are difficult to navigate, and sites that have a more streamlined navigation won’t really be affected.

  30. Hi Bill
    I agree that Google started as a search engine and is now more of an analytics company. I tried Google ads once but I didn’t find them very helpful. Nice article once again about the anatomy of Google.
    Thanks
    Maria
