How a Search Engine Might Use Statistics to Identify New Ranking Features

I may have been a little unusual as an English major in my college days. I remember one professor asking me what I found interesting about a particular author we were studying, and my answer was about patterns in the language that he used, and how he frequently used certain words that are no longer much in fashion. He asked for an example, and I pointed out the use of the word “singular.” I could tell that he found my point a little odd, and I wish that the Google Books N-Gram Viewer had been around back in those days to back up my statement. As a side note, I wish I could have taken a class or two with HITS algorithm inventor Jon Kleinberg, who probably would have appreciated my response.

I point that out because I recall some unusual phrasings by search engineers at a large search conference I attended a few years back. Most of the search marketers were using the term “ranking factors,” while all of the search engineers who gave presentations and participated in question and answer sessions used the term “signals” instead. I wasn’t the only one who noticed the phrasing. Someone called one of the search engine representatives on his use of the term, and a Google representative responded, seconded by the Yahoo and Microsoft reps, that they preferred to use the term “signal” instead of “factor.”

Much like in my college days, I find myself a little obsessed with the language used in the search patents I read. If Google would point their N-Gram Viewer at the USPTO’s database of patents, that would be a great thing. There are a few terms that I keep seeing spring up in some Google patents that I’ve been finding pretty interesting lately.

One of those is “features,” which Google sometimes seems to use interchangeably with “signals” when discussing the data they collect. Another term I’ve been seeing increasingly in Google patents and papers is “large data sets,” which may refer to the amount of Gmail that Google processes, click-stream information on advertising clicks, user behavior mined from browsing histories or search query logs, or even different items that might be found on Web pages when it comes to creating a quality score for those pages.

A Google patent granted today describes how Google might manage information related to large data sets, but even more importantly discusses how statistics might be used to identify new features for the search engine to track.

The patent is:

Scaling machine learning using approximate counting
Invented by Simon Tong and Noam Shazeer
Assigned to Google
US Patent 8,019,704
Granted September 13, 2011
Filed: May 12, 2010

Abstract

A system may track statistics for a number of features using an approximate counting technique by: subjecting each feature to multiple, different hash functions to generate multiple, different hash values, where each of the hash values may identify a particular location in a memory, and storing statistics for each feature at the particular locations identified by the hash values. The system may generate rules for a model based on the tracked statistics.
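
To make the abstract a little more concrete, here is a minimal sketch in Python of counting features by hashing each one to several memory locations. It follows the general shape of a count-min-style sketch, which is my assumption about a similar well-known technique and not something the abstract confirms as the patent's actual implementation. The class name, table size, number of hash functions, and sample feature names are all made up for illustration.

```python
import hashlib


class ApproximateCounter:
    """Approximate per-feature counts kept in one fixed-size table.

    Hypothetical illustration only: a count-min-style sketch, not
    necessarily the method claimed in the patent.
    """

    def __init__(self, num_slots=1024, num_hashes=3):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.table = [0] * num_slots  # the "memory" the hash values point into

    def _locations(self, feature):
        # Salting the input with the hash index acts as several different
        # hash functions. A set avoids updating the same slot twice when
        # two of those functions happen to agree for one feature.
        locations = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{feature}".encode("utf-8")).digest()
            locations.add(int.from_bytes(digest[:8], "big") % self.num_slots)
        return locations

    def add(self, feature, amount=1):
        # Store the statistic at every location identified by the hashes.
        for loc in self._locations(feature):
            self.table[loc] += amount

    def estimate(self, feature):
        # Collisions with other features can only inflate a slot, so the
        # smallest slot is the best estimate and never undercounts.
        return min(self.table[loc] for loc in self._locations(feature))


counter = ApproximateCounter()
for _ in range(5):
    counter.add("subject_contains:free")
for _ in range(2):
    counter.add("sender_in_contacts")

print(counter.estimate("subject_contains:free"))
print(counter.estimate("sender_in_contacts"))
```

The appeal of an approach like this is that memory use stays fixed no matter how many distinct features show up, at the cost of counts that may occasionally run a little high.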

In the description’s overview of the patent, we’re told that the kind of data repository covered by this patent might involve over a million data elements, and could include information such as:

  • E-mail data (e-mails that users sent and/or received)
  • Advertisement data (advertisements presented to the users and/or selected by the users)
  • Any type or form of labeled data.

The data contained in one of these repositories might be used to create rules for a model. For instance, the repository might include spam and regular (non-spam) e-mails, which could be used to create rules for a model that predicts whether future e-mails are spam. User data involving advertisements might be used to predict whether or not someone might click upon a particular advertisement.

One example of the statistical information involved in this process is feature counting, where the number of times a certain type of feature shows up is counted. If a feature shows up more often than a certain threshold, it might then be promoted so that it could be used to form rules for a model. That feature might be added to the other features used in a model, or might replace a previously used feature.
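
Here is a rough sketch of that kind of threshold-based promotion, using exact counts for simplicity. The function name, the sample e-mails, and the threshold value are all hypothetical; the patent doesn't spell out specific numbers here.

```python
from collections import Counter


def promote_features(examples, threshold):
    """Return the features seen in more than `threshold` examples.

    Hypothetical helper for illustration; each example is a list of
    feature names, and a feature is counted once per example.
    """
    counts = Counter()
    for features in examples:
        counts.update(set(features))
    return {feature for feature, count in counts.items() if count > threshold}


# Made-up e-mails, each reduced to a handful of word features.
emails = [
    ["free", "winner", "meeting"],
    ["free", "invoice"],
    ["free", "winner"],
    ["meeting", "agenda"],
]

promoted = promote_features(emails, threshold=1)
print(sorted(promoted))  # ['free', 'meeting', 'winner']
```

Features that clear the threshold become candidates for rules in the model, while rare ones such as "invoice" and "agenda" are left out.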

That feature, used as a rule within a ruleset made up of a number of features, might then help assign a predicted label to e-mails, advertisements, or possibly even web pages.

Conclusion

Is this how Google’s Panda works?

It’s hard to tell for certain, though it does seem like the approach behind Panda is to identify positive and negative features (signals) that might predict how positive a user experience might be upon a particular page or site.

13 thoughts on “How a Search Engine Might Use Statistics to Identify New Ranking Features”

  1. Hi Bill,

    I’ve just had far too much fun with Google’s NGram viewer.

    I’m a bit of a language geek anyway. I’m guilty of slipping fricative alliteration into presentations when I’m trying to hurry along, or throwing in some plosives when I’m not happy with a service I receive.

    Aside from looking into the English vs. American spellings of words (or “International English” as described by Adobe, which I find particularly irritating), where the English spellings seem to be losing hands down, it’s interesting to look at the rise of specific keywords associated with technology trends.

    I can tell my productivity is going to decrease this afternoon.

    Looking into the patent though – this seems pretty significant.

    From my understanding, the association of data types is based upon categorisation as indicated by the features, signals or factors of each corresponding unit. This is the trend that seems to be growing, from the recipe organisation through to the patent covering the way in which Google handles large amounts of data.

    The categorisation is then sorted further into unique buckets, which suggests that the defining numerical values are placed against each corresponding unit, which is in turn categorised almost infinitely.

    The interesting point is perhaps the part that corresponds to the term “signal”:

    It would suggest that by cross-correlating the vector value against its peers, each unit can be reattributed or adjusted. Not necessarily recategorised, but the category itself can be adjusted. So rather than defining the factor involved in the categorisation, the data set’s indicative values (or signals) can change the interpretation of the data set and of units of data correlated with a similar indicative value or signal.

    So in the hypothetical instance that this is a blunt tool: suppose my correspondence was deemed dominant in a very small data set of people who use a large number of exclamation marks, and yet my correspondence was pertinent enough to command a high open rate. Theoretically, if the vector value against my correspondence correlated strongly with someone who emailed unsolicited communications and who also happened to use a large number of exclamation marks, the signal could be interpreted as exclamation marks correlating with desired communications, and the resulting action could be the prioritisation of people who use exclamation marks.

    This is purely my initial understanding and I’m certainly open to being corrected!

  2. Aside from my limited understanding, can anyone explain in SEO/layman’s terms what an N-Gram is? My Google searches show many pages with crazy math calculations!

    So in a line, what is an n-gram?

  3. WB, an “n-gram” is a sequence of N contiguous letters or characters selected from a larger sequence of contiguous letters or characters. So in the sentence I just wrote, a trigram (3-gram) might be “seq” or “uen” or “con” or “let” or “cha”. A quadgram (4-gram) might be “lett” or “sele” or “sequ”.

    Some technology papers and patents describe a process of “windowing” or “sliding” where the N-grams are viewed by starting with the first character in the entire document, then going to the 2nd character in the document, then going to the 3rd, etc.

    Some researchers suggest that a 9-gram viewing window may be optimal for many different kinds of pattern analyses. I’m not sure what current theory in the field teaches.

  4. Actually, the “grams” could also be entire words, not just characters. Or they could be groups of words. I should have made that clearer. Sorry.

    A “2-word-gram” from my original reply might be “is a” or “letters or” or “selected from”, etc.

  5. Hi Tom,

    The N-Gram viewer is a lot of fun.

    This patent does seem to fit in pretty well with the Google patent Ranking documents based on large data sets, which I wrote about in the post that I linked to above with the words “large data set.”

    This statistical model could use n-grams, but it’s just as likely to use other statistical models involving specific features that might be associated with different data sets, including the triplets of data from that other patent.

    The example that you’ve used of correspondence that includes multiple exclamation points (likely a negative feature), but provides a relevant and effective response could not only be labeled overall as positive, but could also help to redefine how this system might look at emails with many exclamation points.

    Of course the patent does go beyond emails to large data sets involving clicks on advertisements, or large data sets involving web pages, or even large data sets involving videos.

  6. Hi WB and Michael,

    Good answer from Michael.

    Just to add a little. The following line might appear in a web document, email, or other kind of document:

    “The quick brown fox jumps over the lazy dog”

    The search engine might be collecting n-grams that are 4 words long from the document. In collecting them, it would overlap the n-grams that it collects like this:

    – the quick brown fox
    – quick brown fox jumps
    – brown fox jumps over
    – fox jumps over the
    – jumps over the lazy
    – over the lazy dog
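
    That overlapping collection can be sketched in a few lines of Python. This assumes simple whitespace splitting and lowercasing, which are my simplifications (a real search engine would tokenize far more carefully), and the function name is just illustrative:

```python
def word_ngrams(text, n):
    """Return every overlapping n-word sequence in text, in order.

    Illustrative helper: lowercases the text and splits on whitespace.
    """
    words = text.lower().split()
    # Slide a window of n words across the text, one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "The quick brown fox jumps over the lazy dog"
four_grams = word_ngrams(sentence, 4)
for gram in four_grams:
    print(gram)
```

    Run on the sentence above, it prints the same six 4-grams listed here.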

    This blog post from the Google Research blog provides a little insight into how n-grams are being used by Google:

    http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

    It starts with the following snippet:

    Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others.

  7. Hi Michael,

    I believe that the patent has greater application than to just emails, even though the description provides a number of examples related to email.

    From the patent:

    For example, the repository may include e-mail data, advertisement data, and/or other data indicative of user behavior. In one implementation, the data in the repository may be obtained from monitoring user behavior (e.g., the e-mails that users sent and/or received, and/or the advertisements presented to the users and/or selected by the users). User behavior may be monitored with the users’ consent. In another implementation, the labeled data may include any type or form of labeled data.

    The large data set patent, which shares a couple of inventors with this one, explores a statistical approach that more explicitly notes an application to ranking web documents, but this one focuses upon how statistical data might be collected and used to change a model.

  8. It might only be about email now, but I’m sure Google (if they are not already) will start to use this type of technology to further improve the quality of the search index. Think about it, why wouldn’t they? There are all sorts of positive or negative ‘signals’ I get every time I look at a website. This comes from my years of experience looking at websites and being able to tell a spam site from a real site. It’s bound to be given only a small amount of weight overall so as not to inadvertently knock out legitimate websites.

    One thing I have noticed lately is how some of my sites get put in the sandbox and others don’t. The sites that get put in the sandbox are usually light on detail in the contact us page, with no Google Maps page, no privacy policy, etc. The sites that totally avoid the sandbox have a more legitimate business feel to them. They contain business registration details, phone numbers, copyright symbols, etc.

    Based on what I have seen I would say Google is already using this type of technology to search for ‘signals’ especially on new sites.

  9. Hi Simon,

    It’s really hard to say, because when we hear about things like these approaches, it’s often hard to pick out exactly what things have been influenced by something like them.

    For example, Google flatly denies the existence of a purposeful sandbox as something that might have been put into place to seemingly penalize new sites, but they have admitted that one of the algorithms that they released has a “sandbox-like” effect.

  10. Hi Bill

    Yes I’ve also read that Google denies a sandbox for new sites. I can confirm though with absolutely no doubt that they have been able to pick my sites with more spammy look and feel with very high accuracy. I guess you could say just stop making spammy sites. When I say spammy I don’t mean totally useless sites. I mean they are built to supplement an existing and successful site, perhaps targeting a more specialized niche area than the main site with the idea being to generate additional leads.

    I know there is some type of filter being applied because the only keyword the site will rank for is its domain name (including the .com). Without it, the site does not rank for anything, even if I select 20 words in a sentence and search for that (without the quotes).

    Sandbox-like effects definitely exist for some new sites. I start new sites quite often so I am trying to put my finger on what influences it one way or another.

  11. Hi Simon,

    The problem is that Google could come up with a change that looks like it might be penalizing new sites, and the impact of that change might be to penalize some new sites but not all of them.

    For example, let’s say that Google decided to give more weight to links from older sites, so that a link from a page in the Yahoo Directory might carry more weight than a link from a page in a directory that’s only 8 months old, even though the pages the links are on may have the same PageRank.

    Now let’s say that Google also decided that it would be a good idea to give less link value in links from one site to another if it could be determined that the sites were affiliated in some manner (things like the same owner or corporate owner, or involved in some type of partnership).

    That’s just an example, but not an unreasonable one given a few patents from some of the search engines granted in the past.

    If you publish a new site, you do go through a cold start stage, much like a physical storefront opening at a new location without many connections in the neighborhood it opens in. It may take a while to gain some links from older sites. If the owner links to it from sites that they control that might have been around for a while, those might not provide much PageRank either.

    The individual reasons for those patents weren’t to penalize people starting new sites. The first one tried to give more value to links from older sites on the assumption that those links were more trustworthy. The second one tried to keep links from older sites from passing along too much PageRank to other sites that they own so that other sites could compete as well. The combined effect, though, could keep sites that are new from ranking well for some period of time until they acquired links from some older sites that they weren’t affiliated with, even though that wasn’t the result that search engines may have necessarily wanted to see.

    Again, those are just two examples, and it’s possible that other things were the reason for a “sandbox-like” effect. I think that effect is an unintended consequence of other things, but one that Google doesn’t seem to want to back away from.
