Yahoo on Using Exceptional Changes in Snapshots of the Web to Ban, Penalize, or Flag Websites

There’s a body of what could be described as folklore surrounding how search engines work. These tales, or sometimes superstitions, may have a grounding in a comment made by a search engine representative presenting at a conference, or a statement made on a search engine blog, or just an assumption that a search engine has to work a certain way in order to do some of the things that it does.

One of these that many have taken for granted is that a search engine could notice large shifts or changes on the Web, such as a site suddenly gaining lots and lots of pages, or outgoing links, or incoming links that might increase its rankings in the search engines. I recall a Google representative at a conference I attended answering a question about how a search engine could notice such things, where he said that they could because they have “lots and lots of computers.”

A Yahoo patent application from last week, Using exceptional changes in webgraph snapshots over time for internet entity marking (US Patent Application 20070198603), provides some insight into how such changes could be flagged automatically, and also could “identify exceptional entities that exhibit abnormal attributes or characteristics due solely to their excellence and high quality.”

The abstract from the patent filing tells us:

Techniques are provided through which “suspicious” web pages may be identified automatically. A “suspicious” web page possesses characteristics that indicate some manipulation to artificially inflate the position of the web page within ranked search results.

Web pages may be represented as nodes within a graph. Links between web pages may be represented as directed edges between the nodes. “Snapshots” of the current state of a network of interlinked web pages may be automatically generated at different times. In the time interval between snapshots, the state of the network may change.

By comparing an earlier snapshot to a later snapshot, such changes can be identified. Extreme changes, which are deemed to vary significantly from the normal range of expected changes, can be detected automatically. Web pages relative to which these extreme changes have occurred may be marked as suspicious web pages which may merit further investigation or action.

The changes that might be tracked may apply to individual web pages, or to specific domains, or even hosts, and extreme changes could cause pages to “be marked as suspicious web pages which may merit further investigation or action.”

For example, a first snapshot of a network of interlinked web pages might indicate that a particular web page contains ten outgoing links. A second snapshot of the network, taken a mere week later, might indicate that the particular web page contains one thousand outgoing links. If the normal expected change in each web page’s number of outgoing links over a week is in the range of five links, then the particular web page may be marked as a suspicious web page.
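The comparison in that example can be sketched in a few lines of Python. This is only an illustration, not the method claimed in the filing: the snapshot representation (a dict mapping each page to the set of pages it links to), the function names, and the fixed `expected_change` threshold are all assumptions made for the example. A real system would presumably derive the "normal range" from the changes observed across the whole graph rather than hard-code it.

```python
def out_degree_changes(earlier, later):
    """Return {page: change in outgoing-link count} for pages in both snapshots."""
    return {
        page: len(later[page]) - len(earlier[page])
        for page in earlier.keys() & later.keys()
    }


def flag_suspicious_pages(earlier, later, expected_change=5):
    """Flag pages whose outgoing-link count moved by more than expected_change.

    expected_change is a hypothetical fixed threshold standing in for the
    filing's "normal range of expected changes".
    """
    changes = out_degree_changes(earlier, later)
    return {
        page: delta
        for page, delta in changes.items()
        if abs(delta) > expected_change
    }


# Two made-up snapshots taken a week apart.
week_1 = {
    "a.example/page": {f"x.example/{i}" for i in range(10)},
    "b.example/page": {f"y.example/{i}" for i in range(10)},
}
week_2 = {
    "a.example/page": {f"x.example/{i}" for i in range(1000)},
    "b.example/page": {f"y.example/{i}" for i in range(12)},
}

suspicious = flag_suspicious_pages(week_1, week_2)
```

With these inputs, only `a.example/page` is flagged, with a change of 990 outgoing links; `b.example/page` gained two links and stays within the expected range.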

When suspicious changes are noticed, there may be a number of results possible under the patent filing:

1. Pages involving those changes may be automatically eliminated from search results from the search engine.

2. The rankings of all references to those Web pages may be automatically reduced so that the pages appear lower within search results.

3. The pages are logged so that they can be manually reviewed by human inspectors, to see if the changes are the result of artificial manipulation.

The document describes the possible use of a white list of web pages, hosts, domains, and/or other entities, for sites that search engine administrators know are “popular and legitimate.” Those would be automatically excluded from identification as suspicious entities. The example given for such a site is “yahoo.com.”
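One way to picture how the three possible outcomes and the white list might fit together is the short Python sketch below. The `Action` names, the `decide` function, and the idea of matching the white list on the host portion of a URL are assumptions for illustration; the filing describes the outcomes and the white list but not how they would be wired up.

```python
from enum import Enum


class Action(Enum):
    REMOVE = "remove from search results"
    DEMOTE = "reduce ranking"
    REVIEW = "log for human review"


def host_of(page):
    """Return the host part of a scheme-less URL such as 'yahoo.com/news'."""
    return page.split("/", 1)[0]


def decide(flagged_pages, white_list, action=Action.REVIEW):
    """Assign an action to each flagged page, skipping white-listed hosts."""
    return {
        page: action
        for page in flagged_pages
        if host_of(page) not in white_list
    }


white_list = {"yahoo.com"}  # the filing's own example of a trusted site
flagged = ["yahoo.com/news", "spam.example/links"]
decisions = decide(flagged, white_list)
```

Here the white-listed `yahoo.com/news` is dropped from consideration, and `spam.example/links` is logged for human review, the default chosen for this sketch.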

A variation of this process would not automatically exclude pages from future lists of search results, nor adjust their rankings downward, but would instead result in a further evaluation based upon other criteria. The example given is that a web page deemed suspicious might be checked by a program to see if it can find words which are “usually found in artificially manipulated web pages” such as ones dealing with pornography or prescription drugs. This evaluation of content could be done in an automated manner.
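A very rough sketch of that kind of automated content check might look like the following. The term list, the function names, and the `min_hits` cutoff are all hypothetical; the filing only names pornography and prescription drugs as example topics and does not say how the words would be chosen or counted.

```python
import re

# Hypothetical term list, invented for this example.
SPAM_TERMS = {"viagra", "pills", "porn"}


def count_spam_terms(text, terms=SPAM_TERMS):
    """Count how many words in the text appear in the term list."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in terms)


def needs_action(text, min_hits=2):
    """Treat a flagged page as manipulated only if enough terms are found."""
    return count_spam_terms(text) >= min_hits
```

Under this approach a page flagged for a sudden change in links would only be penalized if its content also looked like spam, which is the point of the variation: the extreme change is a trigger for a closer look rather than a verdict.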

It’s also possible that pages identified as suspicious based upon extreme changes over time might be used as training data to find features that suspicious pages tend to share. Those features could then be used to help determine whether other pages are “suspicious.”

Those machine learning techniques may also be used on pages that are known to be legitimate, to help “prevent other entities that possess these features from being treated as suspicious entities.”

Thus, embodiments of the invention may implement machine-learning mechanisms to continuously refine definitions of high-quality web pages and other entities so that such high-quality web pages and other entities can be automatically identified with greater precision and accuracy. Such embodiments of the invention are useful even in the absence of the growth of suspicious entities.
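To make the training idea a little more concrete, here is a toy Python sketch that learns from both suspicious and known-legitimate examples. Nothing quoted here from the filing names a particular learning algorithm, so the simple log-odds scoring below (loosely in the spirit of naive Bayes) is a stand-in chosen for brevity, and the feature names are invented for the example.

```python
import math


def train(suspicious, legitimate):
    """Learn a log-odds weight per feature from two lists of examples.

    Each example is a set of feature names. Add-one smoothing keeps the
    weights finite for features seen in only one of the two groups.
    """
    features = set().union(*suspicious, *legitimate)
    weights = {}
    for feature in features:
        p_susp = (sum(feature in ex for ex in suspicious) + 1) / (len(suspicious) + 2)
        p_legit = (sum(feature in ex for ex in legitimate) + 1) / (len(legitimate) + 2)
        weights[feature] = math.log(p_susp / p_legit)
    return weights


def score(example, weights):
    """Sum the weights of the features present; positive leans suspicious."""
    return sum(weights.get(feature, 0.0) for feature in example)


# Invented training data: pages flagged for extreme changes, and pages
# known to be legitimate even though one of them also had a link spike.
suspicious_examples = [{"link_spike", "drug_terms"}, {"link_spike", "hidden_text"}]
legitimate_examples = [{"link_spike", "news_coverage"}, {"news_coverage"}]

weights = train(suspicious_examples, legitimate_examples)
spam_like = score({"link_spike", "drug_terms"}, weights)
news_like = score({"link_spike", "news_coverage"}, weights)
```

In this toy data, a link spike alone leans only mildly suspicious, and a page that pairs the spike with the feature learned from legitimate pages ends up with a negative score. That mirrors the filing's point about learning from high-quality pages so that they are not treated as suspicious.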

Conclusion

The process of tracking changes in things such as the number of links to and from pages, domains, and hosts, with a search engine automatically taking some kind of action upon extreme and suspicious changes, is something that many have probably assumed is happening at all of the major search engines. It feels a little less like folklore, and a little more like fact, after seeing the process described in a patent application.

14 thoughts on “Yahoo on Using Exceptional Changes in Snapshots of the Web to Ban, Penalize, or Flag Websites”

  1. Thanks, Lucas.

    That section had me wondering what else they might be looking at, once they identified suspicious pages. Combine a method like this with the analytical methods described in papers like Spam, Damn Spam, and Statistics (pdf) and Detecting Spam Web Pages through Content Analysis (pdf) and you could have some interesting impacts.

    Of course, those are Microsoft papers, but it’s difficult to claim too much right of ownership upon that kind of statistical analysis.

  2. Another nice find Bill. I love papers and patents that indicate how a search engine might detect outliers from the normal distribution.

    Once the machine-learning mechanism has “learned” the features that suspicious web pages or other entities tend to have, the machine-learning mechanism can evaluate additional entities to determine whether those entities also possess the features. The machine-learning entity can determine, based on whether other entities also possess the features, whether those other entities are also suspicious entities. Thus, the machine-learning entity becomes an “automatic classifier.” Based on whether those other entities also possess the features, the machine-learning entity can take appropriate action relative to those entities (e.g., excluding references to those entities from lists of search results, etc.)

    Automatically being classified as a suspicious website is a good reason to not associate yourself with the wrong neighbourhood :)

  3. Pingback: Personal Brand Recognition - The new form of compensation « Personal Branding Blog
  4. Anyone who thinks that patent is only concerned with “linking to the wrong neighborhood” has missed the boat.

    Suspicious pages don’t have to link to bad neighborhoods.

  5. Hi Michael,

    Yep it’s both simpler than that, and more complicated.

    The simple part is that the search engine is just looking at changes on the Web in terms of how pages are connected together, and the rate of change.

    The complicated part is how the search engine might react to those changes in an automated manner, with the possibility that certain levels of change in the number of links between pages (or domains, or hosts, or IP addresses) might trigger a ban, a penalty, or a human review, and how it might learn from looking at both suspicious and high quality sites.

  6. Very interesting find indeed. In terms of the evaluation process, assuming that an extreme change was detected, I wonder if there would/could be any correlation associated between the types of inbound/outbound link relationships – and if so, how that may affect/impact the response.

    I didn’t see any indication of the rate of time between page/domain snapshots – I suppose that still leaves it open as to an interpretation of what extreme change is in comparison to time.

    Thanks as always for sharing the information and analysis!

  7. This sounds pretty scary, but just because it’s patented doesn’t mean they’ll use it – they might just be stopping the competition from doing it! :D

  8. Related Question: If links are added to a page and it disappears from a search engine only to reappear 4-6 days later higher than ever is it likely that a human worker reviewed the page and found that it passed muster? Next, if this page was approved would that mean that a template of it could be used again without triggering the same problem? As regards your post, does Google already have something like this in place?

  9. Hi Derek,

    You’re welcome. It might be possible that a next step may be to look at the links themselves during an evaluation process. Rate of time might be related in some manner to rate of crawling.

    Hi Christina,

    It’s hard to interpret the whims and actions of search engines, and unusual behavior in rankings. There are a large number of reasons why a site might fall out of search rankings, including error on the part of the search engine (database errors, hardware problems, etc.).

    In Google’s case, what also could happen is a switch to a different ranking algorithm, and then a return to a previous one or a change in the new one. It’s also possible that a different datacenter was being called by your searches, for one reason or another, and then you were returned to the previous one.

    If the example applies to your pages, I’m glad to hear that your rankings returned stronger than before.

    I would guess, from the way that I’ve heard Google search engineers talk about their indexing methods, that they would have something like this in place.

  10. Thanks, Bill. Yes, this did happen to a new page I put up. It is a page that may have triggered a spam filter. In my field, simple is best, but sometimes it looks like spam even though it isn’t. I posted five lessons with quizzes and named them California Lesson 1, Quiz 1…California Lesson 2, Quiz 2…. all the way up to lesson and quiz 5. The page went off Google shortly after the change. I waited about 7 days and luckily it came back on. It’s hard to create an education site without using the old-fashioned numbering system for units and chapters; it’s easy, clear, logical, and sensible, but probably sets off alarms.

    What do you mean by datacenters? Are they search results that are designed for various locations?
