There are things that we just don’t know about search engines: things that aren’t shared with us in an official blog post, in a comment from a search engine representative speaking at a conference, or in a publicly published white paper. Often we do learn some aspects of how search engines work through patents, but the timing of those is controlled more by the US Patent and Trademark Office than by any of the search engines.
For example, back in 2003 Google was filing some of their first patents that identified changes to how their ranking algorithms worked, and among those was one with a name similar to the original Stanford PageRank patents filed by Lawrence Page. It has some hints about PageRank and Google’s link analysis that we haven’t officially seen before.
If you want a bit of a history lesson you can see the first couple of those PageRank patents at Method for scoring documents in a linked database (US Patent 6,799,176) and Method for node ranking in a linked database (US Patent 6,285,999).
With Google’s Penguin update, it appears that the search engine has been paying significantly more attention to link spam, such as attempts to manipulate links and the anchor text pointing to a page. The Penguin update was launched at Google on April 24th, 2012, and it was accompanied by a blog post on the Official Google Webmaster Central Blog titled Another step to reward high-quality sites.
The post tells us about efforts that Google is undertaking to decrease Web rankings for sites that violate Google’s Webmaster Guidelines. The post is written by Google’s Head of Web Spam, Matt Cutts, and in it Matt tells us that:
…we can’t divulge specific signals because we don’t want to give people a way to game our search results and worsen the experience for users, our advice for webmasters is to focus on creating high quality sites that create a good user experience and employ white hat SEO methods instead of engaging in aggressive webspam tactics.
This week, Google was awarded a patent that describes how they might score content on how much gibberish it contains, a score which could then be used to demote pages in search results. Gibberish content here refers to content that is likely to be spam.
The patent defines gibberish content as web pages that might contain a number of high value keywords, but might have been generated through:
- Using low-cost untrained labor (from places like Mechanical Turk)
- Scraping content and modifying and splicing it randomly
- Translating from a different language
Gibberish content also tends to include text sequences that are unlikely to represent either natural language text strings, such as those in conversational syntax, or text strings that, while not structured in conversational syntax, typically occur in resources such as web documents.
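One way to picture a score for "unlikely" text sequences is a small language model. The sketch below is only an illustration under stated assumptions: the class name `BigramModel`, the add-one smoothing, and the tiny reference corpus in the usage note are all hypothetical, and nothing here is taken from the patent's actual scoring method.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class BigramModel:
    """Hypothetical word-bigram model with add-one smoothing."""

    def __init__(self, reference_sentences):
        self.context_counts = Counter()  # how often a word precedes another
        self.bigram_counts = Counter()   # how often a specific pair occurs
        vocabulary = set()
        for sentence in reference_sentences:
            tokens = ["<s>"] + tokenize(sentence)
            vocabulary.update(tokens)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # Reserve one extra slot for words never seen in the reference text.
        self.vocab_size = len(vocabulary) + 1

    def avg_log_prob(self, text):
        """Average log-probability per word pair; lower means less natural."""
        tokens = ["<s>"] + tokenize(text)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return 0.0
        total = 0.0
        for previous, word in pairs:
            numerator = self.bigram_counts[(previous, word)] + 1
            denominator = self.context_counts[previous] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / len(pairs)
```

Trained on a few sentences such as "the hotel is near the park", a model like this gives the original word order a higher average log-probability than a randomly spliced version like "park near the is hotel the". A real system would need a far larger reference corpus and a threshold chosen from data.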
As long as there have been search engines, there have been people trying to take advantage of them to get pages to rank higher. It’s not unusual to see within many SEO site audits a section on negative practices that a search engine might frown upon, and Google lists a number of those practices in their Webmaster Guidelines. Linked from the Guidelines is a Google page on Hidden Text and Links, where Google tells us to be wary about doing things such as:
- Using white text on a white background
- Locating text behind an image
- Using CSS to position text off-screen
- Setting the font size to 0
- Hiding a link by only linking one small character—for example, a hyphen in the middle of a paragraph
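Some of the practices in that list can be spotted from a page's CSS alone. The following is a minimal sketch that inspects a single inline `style` string. The function names and the 1000-pixel off-screen cutoff are assumptions made for illustration, and it does not describe how Google actually detects hidden text, which would require looking at computed styles and rendered pages.

```python
import re


def parse_inline_style(style):
    """Turn 'color: #fff; font-size: 0' into a {property: value} dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def hidden_text_signals(style, offscreen_px=1000):
    """Return a list of reasons the styled text may be hidden from visitors."""
    css = parse_inline_style(style)
    signals = []

    # Same foreground and background color, e.g. white text on white.
    color = css.get("color")
    if color and color == css.get("background-color"):
        signals.append("text color matches background")

    # A font size of zero in any common unit.
    if re.fullmatch(r"0(\.0+)?(px|pt|em|rem|%)?", css.get("font-size", "")):
        signals.append("font size is zero")

    # A large negative offset that pushes the text out of the viewport.
    for prop in ("left", "top", "text-indent"):
        match = re.fullmatch(r"(-\d+(?:\.\d+)?)px", css.get(prop, ""))
        if match and abs(float(match.group(1))) >= offscreen_px:
            signals.append("positioned off-screen")
            break

    return signals
```

Calling `hidden_text_signals("position: absolute; left: -9999px")` flags the off-screen case, while an ordinary style such as `"color: #000; font-size: 14px"` returns an empty list. Text placed behind an image, or a link on a single hyphen, would need layout and link analysis that this sketch does not attempt.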
Google collects information about where you compute from, and provides location based services based upon where you travel. To keep that information private, and to use it to protect people from spam and scrapers, Google might follow certain processes to safeguard and analyze it.
Post a review from Germany about a restaurant, and then 15 minutes later from Hawaii about another restaurant, and it’s spam. Drive down a highway where the cell towers collecting information about your journey are located in the middle of Lake Michigan, and it’s likely spam. If GPS says you’re in NYC, and you then connect via Wi-Fi in Wisconsin a few minutes later, spam. This information may not even come from you, but rather from others that might impersonate you.
Google was granted a patent last week that explores how they could use location based data to identify spammers and scrapers. The process would also put user location information in a quarantine, and possibly hide the starting and/or ending points of journeys from mobile devices, both to protect users’ privacy and to explore whether or not the information is spam. The search engine could then use the results of that analysis to keep some location information from being used in location based services, or other services that Google might offer.
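The Germany-to-Hawaii example boils down to a speed check: how fast would someone have to travel to produce both reports? Here is a minimal sketch of that idea, assuming a hypothetical `LocationReport` record and an arbitrary 1000 km/h ceiling (roughly airliner speed). The patent's own tests are not spelled out here, so treat this as an illustration rather than Google's method.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0


@dataclass
class LocationReport:
    """Hypothetical record of where and when a device reported itself."""
    lat: float
    lon: float
    timestamp_s: float  # seconds since any fixed reference time


def haversine_km(a, b):
    """Great-circle distance between two reports, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_implausible_trip(a, b, max_speed_kmh=1000.0):
    """True if getting from a to b would require exceeding max_speed_kmh."""
    hours = abs(b.timestamp_s - a.timestamp_s) / 3600.0
    distance = haversine_km(a, b)
    if hours == 0:
        # Two different places at the same instant is never plausible.
        return distance > 0
    return distance / hours > max_speed_kmh
```

A report from Berlin followed 15 minutes later by one from Honolulu fails this check by a wide margin, while Berlin to nearby Potsdam over an hour passes. The threshold is a design choice: set it too low and ordinary flights look like spam, too high and impersonators slip through.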
Imagine the Earth broken down into a series of cells, and each of those cells broken down into a series of even smaller cells, and then into smaller cells again, and so on, in a spatial index. Each level represents increasingly narrow, and increasingly precise, areas or zoom levels of the surface of the Earth.
As these cells decrease in size, they increase in number, which has the impact of increasing the zoom level and the accuracy of the areas represented in such an index. That might work well in a place like China, where latitude and longitude are banned for export as munitions. Such a set of cells might be part of a geospatial analyzing module that links specific businesses and points of interest (parks, public regions, landmarks, etc.) to specific places on this model or index of the Earth. There might be one index for the businesses and one index for the points of interest, or a combined database that includes both.
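The nested cells described above behave a lot like a quadtree, where each level splits every cell into four. The sketch below assumes that scheme purely for illustration; the function names are hypothetical and the patent's actual cell layout may differ.

```python
def cell_key(lat, lon, level):
    """Return a string of quadrant digits locating a point at a zoom level.

    Each digit (0-3) picks one quarter of the current cell, so a longer
    key means a smaller cell, and a parent's key is a prefix of its
    children's keys.
    """
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    digits = []
    for _ in range(level):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        quadrant = 0
        if lat >= mid_lat:   # northern half of the current cell
            quadrant += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:   # eastern half of the current cell
            quadrant += 1
            west = mid_lon
        else:
            east = mid_lon
        digits.append(str(quadrant))
    return "".join(digits)


def cells_at_level(level):
    """Number of cells covering the Earth at a given level."""
    return 4 ** level
```

With keys like these, a business and a landmark fall in the same cell at a given level exactly when their keys at that level are equal, so comparing keys at several levels gives a quick way to ask how close two indexed places claim to be.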
Sometimes that index might include a business and a landmark within the same cell. While that could be correct in some instances, such as a shop appearing within the Empire State Building, often it’s an error, and sometimes even an intentional error. People will sometimes enter incorrect information into a geographic system like this to try to gain some kind of advantage.
If people search for something like a motel “near” a particular park, for instance, the motel that appears to be next to, or even within the boundaries of, that park might seem to have something of an advantage in terms of distance from that park when it comes to ranking the motel. And sometimes Google doesn’t seem to do the best job in the world at putting businesses in the right locations at Google Maps.
When Google ranks businesses at locations in Google Maps, they turn to a number of sources to find mentions of the name of the business coupled with some location data. They can look at the information that a site owner might have provided when verifying their business with Google and Bing and Yahoo. They may look at sources that include business location information, such as telecom directories like superpages.com or yellowpages.com, or business location databases such as Localeze. They likely also look at the website for the business itself, as well as other websites that might include the name of the business and some location data for the business, too.
What happens when the information from those sources doesn’t match? Even worse, what happens when one of these sources includes information that might be on the spammy side? A patent granted to Google this week describes a way that Google might police for such places. The patent warns against titles for business entities that include terms such as “cheap hotels,” “discounts,” and “Dr. ABC–555 777 8888.” It also might identify spam in categories for businesses that include things such as “City X,” “sale,” “City A B C D,” “Hotel X in City Y,” and “Luxury Hotel in City Y.”
In the context of a business entity, information that skews the identity of or does not accurately represent the business entity or both is considered spam.
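The title examples from the patent suggest a couple of simple checks: promotional phrases and phone numbers that don’t belong in a business name. This is a minimal sketch of that kind of filter. The term list only contains the two phrases quoted above, and the phone pattern and function name are assumptions for illustration, not the patent's actual rules.

```python
import re

# Matches ten-digit numbers such as "555 777 8888" or "555-777-8888".
PHONE_PATTERN = re.compile(r"\b\d{3}[\s.-]?\d{3}[\s.-]?\d{4}\b")

# Promotional phrases quoted from the patent's examples of spammy titles.
SPAM_TERMS = ("cheap hotels", "discounts")


def title_spam_signals(title, spam_terms=SPAM_TERMS):
    """Return a list of reasons a business title may not be a real name."""
    signals = []
    lowered = title.lower()
    for term in spam_terms:
        if term in lowered:
            signals.append(f"contains term: {term}")
    if PHONE_PATTERN.search(title):
        signals.append("contains phone number")
    return signals
```

A title like “Dr. ABC–555 777 8888” trips the phone number check, “Cheap Hotels Downtown” trips the term check, and an ordinary name returns an empty list. Substring matching like this is blunt, since a business could legitimately have “discounts” in its registered name, so any real use would need a way to confirm titles against other sources.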
Google’s Webmaster Guidelines highlight a number of practices that the search engine warns against, that someone might engage in if they were to try to boost their rankings in the search engine in ways intended to mislead it. The guidelines start with the following warning:
Even if you choose not to implement any of these suggestions, we strongly encourage you to pay very close attention to the “Quality Guidelines,” which outline some of the illicit practices that may lead to a site being removed entirely from the Google index or otherwise impacted by an algorithmic or manual spam action. If a site has been affected by a spam action, it may no longer show up in results on Google.com or on any of Google’s partner sites.
A Google patent granted this week describes a few ways in which the search engine might respond when it believes there’s a possibility that such practices might be taking place on a page, where they might lead to the rankings of pages being improved in those search results. The following image from the patent shows how search results might be reordered based upon such rank modifying spam: