Better Document Classification
Last week, I wrote about a patent granted to Google which described how the search engine may use categories as a search ranking factor to decide whether or not to include some pages in search results for specific queries. The patent was originally filed back in 2004, and focused primarily upon document classification based upon things such as the contents of web pages and anchor text in links pointing to pages.
A few days ago, a new patent application was published by Google which focuses upon document classification using a wider range of information, including user behavior data. Instead of a simple matching of weighted classifications between web pages and queries, the patent filing describes a way of creating profiles for pages which include classification information and spreading that document classification information to unclassified pages through query profiles for queries which both types of pages rank for in search results.
This kind of user-data based profile information could be used along with more conventional ways of ranking pages to improve the quality of search results and to provide more personalized results to searchers. The patent application is:
Generating Improved Document Classification Data Using Historical Search Results
Invented by Bilgehan Uygar Oztekin and Pei-Wen Andy Chiu
US Patent Application 20100262615
Published October 14, 2010
Filed: April 8, 2009
When a search engine receives a query from a searcher, it may collect a fair amount of information about that query term or phrase. That can include:
- Words or Phrases used in the query
- The search results shown previously for the query
- Impression data
- One or more information retrieval (IR) scores of the search results
- Position data of the search results (indicating the order of the displayed search results)
- Click data of the search results (user selections of the search results)
- User navigation statistical data for the search results (the ratio between the user selections of the URL and the user selections of all the URLs in the search results for the same query during a particular time period, such as the week or month preceding submission of the query)
- Location information (e.g., city, state, country or region) for the searcher
- The language of the query
This information could be collected in a database for all searchers, or it could be partitioned for specific groups of users of the search engine, such as all searchers submitting queries:
- In a particular language (e.g., English, Japanese, Chinese, French, German, etc.)
- From a particular country or other jurisdiction
- From a certain range of IP addresses, or
- From some combination of the above
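As a rough illustration of two items above, the "user navigation statistical data" ratio and the partitioning of collected query data by group, here is a minimal Python sketch. The record layout, field names, and example URLs are hypothetical and are not taken from the patent filing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QueryLogEntry:
    """One logged search click (hypothetical field names)."""
    query: str
    language: str
    country: str
    clicked_url: str


def navigation_ratios(entries, query):
    """Share of clicks each URL received among all clicks for one query."""
    counts = defaultdict(int)
    for entry in entries:
        if entry.query == query:
            counts[entry.clicked_url] += 1
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()} if total else {}


def partition_by(entries, key):
    """Group log entries by an attribute such as 'language' or 'country'."""
    groups = defaultdict(list)
    for entry in entries:
        groups[getattr(entry, key)].append(entry)
    return dict(groups)


# Made-up log data for illustration only.
log = [
    QueryLogEntry("golf", "en", "US", "example.com/golf-clubs"),
    QueryLogEntry("golf", "en", "US", "example.com/golf-clubs"),
    QueryLogEntry("golf", "en", "GB", "example.org/pga-news"),
    QueryLogEntry("golf", "de", "DE", "example.net/golf-auto"),
]

ratios = navigation_ratios(log, "golf")
by_language = partition_by(log, "language")
english_ratios = navigation_ratios(by_language["en"], "golf")
```

Computing the ratio separately for each partition, as in the last line, shows how the same query could yield different statistics for English-language searchers than for all searchers combined.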
User Profiles
The search engine also collects information about the people submitting those queries, and stores the information in a “user profile database.”
Each user profile may include multiple sub-profiles which could be broken down into different interests. A user profile can cover a group of users, such as people who all access the search engine from a particular computer, or from a particular web site or web page. The user profile database may use the search history of a user to determine that user’s search interests.
A user profile record can include things such as favorite topics, and preferred ordering of search results.
Query Profiles
A query profile, created from the collected information, may include such information as:
- A particular query
- The set of corresponding query terms in the query, and
- A category list for classifying the query
Categories assigned to web pages are similar to the categories that I described in my post on the earlier patent from Google. For instance, a category might be a news item, it might involve sports or travel or finance. A page might be given a weight involving how much it actually fits into those categories. Categories are also assigned to queries, and they are also given a particular weight for those categories.
For example, the search term “golf” could have a high weight for the categories “sports” and “sporting goods,” and very low weight for the category of “information technology.”
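A query profile along these lines might be represented as a small record holding the query, its terms, and a weighted category list. The sketch below is only an illustration: the class name, the 0-to-1 weight scale, and the specific numbers for "golf" are assumptions, since the patent filing does not publish actual weights.

```python
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    """Hypothetical query profile: query, its terms, and category weights."""
    query: str
    terms: list
    categories: dict = field(default_factory=dict)  # category name -> weight

    def top_categories(self, min_weight=0.5):
        """Category names at or above min_weight, heaviest first."""
        kept = [(c, w) for c, w in self.categories.items() if w >= min_weight]
        return [c for c, _ in sorted(kept, key=lambda cw: cw[1], reverse=True)]


# Illustrative weights only; the real values are not disclosed.
golf = QueryProfile(
    query="golf",
    terms=["golf"],
    categories={
        "sports": 0.9,
        "sporting goods": 0.8,
        "information technology": 0.01,
    },
)

strong = golf.top_categories()
```

Here `strong` keeps "sports" and "sporting goods" and drops "information technology", mirroring the high and very low weights described above.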
Document Classification Profiles
Document profiles might be created for web pages or web sites, or other objects on the web such as videos or news items or blog posts.
Information collected about documents can include their URLs, attributes associated with the pages such as URL text, anchor text pointing to the page, content on the page, page rank, and others. A category list for classifying the document can also be included in the profile, which includes the category itself, and the category weight for the page.
Spreading Document Classification Data to Unclassified pages
When Google classifies a web page, it is actually creating a category list for that page. The category list may include more than one category and includes a weight for each category listed.
There are many pages within Google’s index that either haven’t been classified or may have been misclassified. One of the focuses of this patent application is upon “spreading” the classification data for classified pages to those unclassified pages and to create more accurate classification data for pages.
Spreading document classification data usually involves two steps.
The first step is to spread document classification data from classified web pages to queries that are related to both classified and unclassified web pages.
The second step is to then take the classification data from those queries and spread it to unclassified pages.
That is why creating a query profile is an important part of this process.
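The two steps can be sketched in Python as follows. This is a simplified illustration, not the method claimed in the patent filing: it assumes category lists are plain dictionaries of weights and that spreading is a simple unweighted average, and all function names and page identifiers are hypothetical.

```python
from collections import defaultdict


def average_category_lists(category_lists):
    """Average the weight of each category across several category lists."""
    totals = defaultdict(float)
    for categories in category_lists:
        for category, weight in categories.items():
            totals[category] += weight
    n = len(category_lists)
    return {c: w / n for c, w in totals.items()} if n else {}


def spread_to_queries(doc_categories, query_results):
    """Step 1: give each query the averaged categories of its classified results."""
    query_categories = {}
    for query, urls in query_results.items():
        classified = [doc_categories[u] for u in urls if u in doc_categories]
        if classified:
            query_categories[query] = average_category_lists(classified)
    return query_categories


def spread_to_documents(query_categories, query_results, doc_categories):
    """Step 2: give each unclassified page the averaged categories of its queries."""
    per_doc = defaultdict(list)
    for query, urls in query_results.items():
        if query not in query_categories:
            continue
        for url in urls:
            if url not in doc_categories:
                per_doc[url].append(query_categories[query])
    return {url: average_category_lists(lists) for url, lists in per_doc.items()}


# One classified page and one unclassified page that share two queries.
doc_categories = {"reds-official": {"baseball": 0.9, "sports": 0.8}}
query_results = {
    "cincinnati reds": ["reds-official", "reds-fan-blog"],
    "reds schedule": ["reds-official", "reds-fan-blog"],
}

query_categories = spread_to_queries(doc_categories, query_results)
inferred = spread_to_documents(query_categories, query_results, doc_categories)
```

In this toy data the unclassified "reds-fan-blog" page inherits the "baseball" and "sports" categories because it shares both queries with the classified page.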
Pages that originally have classifications have received them possibly from looking at initial estimates of web pages’ relevance to different subjects or topics or concept clusters. These estimates, or “sparse vectors” as they are referred to in the patent filing, can involve things such as an analysis of a web page’s content, key terms, and/or links.
Document Classification Example
A page about the Cincinnati Reds baseball team has been previously classified by Google after Google looked at the content of the page, key terms appearing on the page, and the links that point to it.
A document profile for the page shows how often it appears in different search results for specific queries, which position it tends to appear at for each of those queries, how often it is clicked upon, what its PageRank might be, which countries searchers who visit the page might be from, and what their preferred language might be, as well as other information.
Imagine that an unclassified page about the Cincinnati Reds also appears in a number of the same search results for the same queries as the classified page.
Document classification information from the classified page about the Cincinnati Reds may be spread to the query profiles created for those shared queries (as well as for other queries the page may rank for). The classification information in those query profiles from the original classified page may then spread to the unclassified page.
The weight for those classifications for the unclassified page, possibly such terms as “baseball,” “major league team,” and “Cincinnati Reds,” may be based upon how relevant the unclassified page might be seen to be for the queries used to find it, including search position data and click data.
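One way to picture how position and click data could influence those weights is a relevance-weighted average over the queries a page appears for. The formula below, an equal blend of reciprocal rank and click-through rate, is purely an assumption made for illustration; the patent filing does not give this formula, and the numbers are invented.

```python
def relevance(position, clicks, impressions):
    """Hypothetical relevance signal: half reciprocal rank, half click-through rate."""
    ctr = clicks / impressions if impressions else 0.0
    return 0.5 * (1.0 / position) + 0.5 * ctr


def spread_weights(observations):
    """Relevance-weighted average of query category weights for one page."""
    totals = {}
    total_relevance = 0.0
    for obs in observations:
        rel = relevance(obs["position"], obs["clicks"], obs["impressions"])
        total_relevance += rel
        for category, weight in obs["categories"].items():
            totals[category] = totals.get(category, 0.0) + rel * weight
    if total_relevance == 0:
        return {}
    return {c: w / total_relevance for c, w in totals.items()}


# Invented data: the page ranks well and gets clicks for one query,
# and ranks poorly with no clicks for an unrelated one.
observations = [
    {"query": "cincinnati reds", "position": 2, "clicks": 30,
     "impressions": 100, "categories": {"baseball": 0.9}},
    {"query": "red paint", "position": 10, "clicks": 0,
     "impressions": 100, "categories": {"home improvement": 0.8}},
]

page_weights = spread_weights(observations)
```

With these numbers the page ends up weighted far more heavily toward "baseball" than "home improvement", because it ranks higher and is clicked more often for the baseball query.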
Categories and Personalization
Information kept about a searcher’s web history, including browsing and searching, can be used to identify categories that a searcher might be interested in. The classification information in query and document profiles could be used together with that searcher’s profile to boost some search results.
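A boost like that could be sketched as a rescoring step, where a result's conventional score is multiplied by a factor that grows with the overlap between the searcher's interest categories and the page's categories. The multiplicative form, the `boost` parameter, and the example scores are assumptions for illustration and are not described in the patent filing.

```python
def interest_overlap(user_interests, doc_categories):
    """Sum of user interest weight times document weight over shared categories."""
    return sum(w * doc_categories.get(c, 0.0) for c, w in user_interests.items())


def personalize(results, user_interests, boost=0.5):
    """Rescore (url, base_score, categories) results and sort best first."""
    rescored = [
        (url, score * (1.0 + boost * interest_overlap(user_interests, cats)))
        for url, score, cats in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical results for the query "reds", ordered by conventional score.
results = [
    ("example.com/red-paint", 1.0, {"home improvement": 0.9}),
    ("example.com/reds", 0.9, {"baseball": 0.9}),
]
user_interests = {"baseball": 1.0}

reranked = personalize(results, user_interests)
```

For a searcher whose history suggests an interest in baseball, the baseball page moves ahead of the paint page even though its conventional score is lower. A searcher with no matching interests would see the original order.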
Conclusion
While I’ve shared some specific details about the kinds of information that might be used to create document and query profiles, the patent goes into considerably more depth about how this profile and classification process may work. I write “may,” because like most patent filings, it’s possible that the search engine might use somewhat different approaches than what they’ve published in a patent application.
What’s important about this patent filing is that it describes how actual user-based data might be used to help make decisions about how pages may be classified by the search engine. The older patent on the classification of pages from Google, that I mentioned at the start of this post, appears to have evolved to consider how people search and select pages, and how aggregated information about searcher behavior can be used to classify those pages to improve processes like personalized search.
Hey Bill, do you think Google could be using its toolbar and own Analytics to better establish user profiles?
That’s a lot of data to associate with each document, on top of the information they already store concerning words on the page, tags, links, etc. And then there’s the data for users and queries, which would be cross-referenced with the page data. If they’re capable of doing this now, I’m guessing the Caffeine upgrade was a necessary part of making it possible.
Hi,
Bob, they already have a lot of the information. Just take a look at how much data Webmaster Tools gives out these days.
Jeremy – I would imagine they use the toolbar (guess) but Matt Cutts has said they definitely do not use Analytics in any way for search, ranking or spam.
Bill, Really interesting post. There seems to be a big momentum shift towards local search and browser data. There have been lots of Google patents talking about it in the last few months. I think this kind of information is going to take over from inbound links as the main ranking factor eventually or at least to some degree.
Kind regards,
James.
Bill, thanks for the breakdown (as usual). As someone who grew up in Cincinnati and loves baseball, I always love when you use the Reds as an example. They exceeded expectations despite being swept by the Phils.
Keep up the great work!
Hi Jeremy,
I do think that Google is using things like toolbar information, web histories, and analysis of their query log files to improve the quality of the user profiles that they create. It’s not easy to get people to create explicit profiles, where they do things like list their interests, so Google focuses upon what they refer to as “implicit” profiles, where they try to learn about interests by looking at actual search and browsing history.
Hi Bob,
Those are very good points. We do know that Google has been collecting a lot of information about the way that people interact with search, and with the web, for a number of years, possibly more information than they collect about web pages themselves. I do think the Caffeine update was essential for them to continue collecting and maintaining this kind of user-behavior data, to be able to create individual profiles for pages, for queries, and for users.
The conversation between Kirk McKusick and Sean Quinlan on ACM Queue describes how Caffeine makes it possible to access and store considerably more information on the same storage devices by using distributed masters and smaller chunk file sizes (also discussed in Google’s patent – Large-scale data processing in a distributed and parallel processing environment).
Caffeine also allows Google to make incremental updates in its index (see the Google patent Document treadmilling system and method for updating documents in a document repository and recovering storage space from invalidated documents), which means that for instance, information about links pointing to a document, or other changes about a document can be updated without having to update all information about a specific page.
Hi Adam,
Thanks. I’m already looking forward to baseball next year. The Reds have developed a nice core of young talent, including what looks to be a pretty strong minor league system. I think they will be in the playoffs again next year, and maybe go a little further.
Thanks for bringing this one, Bill. G is indeed moving towards more personalized results. They brought that first to Google AdSense, and over time I think it will be totally visible in their search results.
I think we’re all very interested to see how they implement this into future search results. We could see a lot of changes that need to be made to SEO in the future to adapt.
Talk about taking it apart and putting it back together. Thanks for the info, Bill. I’m pretty sure I’ll be coming back.
Google not only appears to be moving in the direction of offering a more personalised search, but it is also increasingly focusing on providing localised search results. I presume that a personalised search would begin by offering the user ‘typical results’ and then allowing the user to refine information. The key for website owners would be to get a foot in the door, and it might turn out to be a case of ‘making a good impression’ the first time round.
This does leave some privacy issues unresolved (like most things online, right?). Nonetheless, user data aggregation has made search technology what it is today and I suppose there’s no turning back now.
I am consistently blown away by the possibilities which come from the patents filed by G. Thanks Bill!
Hi Ernest,
Thank you.
The thing that amazes me about this approach is the amount of information that Google may be collecting to create these profiles for queries, searchers, and web pages. It has to be many times the size of the web itself.
Hi Chris,
Chances are that a lot of what I’ve described about this patent filing is already in use in some manner by Google. It seems to fit into what Google appears to be doing with personalized search.
Hi TrueDAD,
Thanks.
Hi Ben,
Good points. There’s a good chance that Google reranks search results based upon both personalization and location, and that there’s some level of churn in what is shown in those results so that sites that aren’t being reranked based upon those factors have some chance of being seen and interacted with by searchers. Making a good impression, possibly through a compelling enough snippet to click upon, and having pages that people seem to find value in (either by spending some time on those pages, or seeming to start a new search, or perhaps bookmarking the pages) may be a key towards, as you note, “getting their foot in the door.”
Hi Dan,
The second that Google started collecting data to “improve the performance of our search engine” was the point where privacy concerns began to be an issue. Ideally, Google attempts to obscure the actual identities of searchers who perform searches, but it’s really impossible to tell.
A recent article noted that Google has been asked to turn over search histories to the US government more than 9,000 times in the past year. See:
Google CEO Would Like to Remind You How Important Your Disappearing Privacy Really Is
Hi Barry,
I am, too. You’re welcome.
Excellent Examples and a great overall article! Thanks, this will help me out a lot. I will be sure to share this one.
I think this new classification will have a big impact on the SEO world. I have a few sites affected by the latest Google changes. Almost all have problems in the SERPs. The long tail seems to be affected.
Thank you, Dave.
Hi Brown,
There are a few potential reasons that sites may have been impacted by the Google changes that have been happening since around May 1st. It’s possible that the kind of profiling of sites, queries, and users described in this patent filing may play a role, but there are other potential suspects as well, such as phrase-based indexing. Both are worth exploring.
I wish we had seen more of this once the patent was filed, because this could have been (and could be) integrated nicely with Google+’s social metrics. Query and user profiles would have helped classify search results, grouping things together, and in my opinion would have been a huge step for search. But, yet again, another great patent on an idea that never materialized…
The document profiles you wrote about are likely to be connected to the new “related” toolbar extension they’ve released for Chrome and IE.
Hi Jonathan,
I suspect that Google has been doing this kind of classification and reclassification of web pages, and it could account for some interesting fluctuations in rankings that I’ve seen reported at a few web forums.
It does seem like something that could be integrated nicely with social signals.
Chances are very good that Google is constructing profiles for users, queries, and pages, and the interactions and associations between all three. The social profiles from Google Plus (and the interactions between users of the service) provide Google with the opportunity to more fully develop user profiles beyond observing where they browse, and what they search for.
Hi Bob,
It’s possible that Google might use those profiles, but until I read your comment, I hadn’t connected the new Google Related with another patent that Google was recently granted, which I should maybe blog about if I have the time.
I wrote something about Google Related in the post Did Google Acquire the Tech (or Patents) for Google Related from Northbrook Digital LLC?
A couple of weeks ago, Google was granted another patent that seemed to describe “related” news stories in Google News, but the process described in the patent looks like it could potentially be used to identify related news stories and related websites for Google Related as well:
Document ranking using word relationships.
It would be really interesting if Google is using the process described in this patent, and it would potentially recommend pages that are more related than ones that might be tied together by a common query and category, and/or document profile. That might not apply to the “similar stores” aspect of Google Related, but it could for related news and related pages.