When Google crawls the Web to collect information about objects or entities, it also collects facts about those entities. These facts are separated into different categories or attributes associated with those entities. For example, a book may have attributes such as an author, a publisher, a year published, a web site it can call home, a genre, and more.
Identifying Entities by their Attributes
A search that includes those attributes can be used to identify the entity the attributes might be associated with.
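To make the idea concrete, here is a minimal sketch of looking up an entity from its attributes. It is only an illustration of the general technique (an inverted index from attribute/value pairs to entity names), not the method claimed in any Google patent. The names `ENTITIES`, `build_attribute_index`, and `find_entities` are hypothetical, and the toy data is made up for the example.

```python
from collections import defaultdict

# Hypothetical toy data store: entity name -> {attribute: value}.
ENTITIES = {
    "The Hobbit": {"author": "J. R. R. Tolkien", "genre": "fantasy", "year": "1937"},
    "The Silmarillion": {"author": "J. R. R. Tolkien", "genre": "fantasy", "year": "1977"},
    "Dune": {"author": "Frank Herbert", "genre": "science fiction", "year": "1965"},
}


def build_attribute_index(entities):
    """Map each (attribute, lowercased value) pair to the entities that have it."""
    index = defaultdict(set)
    for name, attributes in entities.items():
        for attribute, value in attributes.items():
            index[(attribute, value.lower())].add(name)
    return index


def find_entities(index, query):
    """Return the entities that match every attribute/value pair in the query."""
    matches = None
    for attribute, value in query.items():
        candidates = set(index.get((attribute, value.lower()), set()))
        # Intersect so that each extra attribute narrows the result set.
        matches = candidates if matches is None else matches & candidates
    return matches or set()


INDEX = build_attribute_index(ENTITIES)
print(find_entities(INDEX, {"author": "J. R. R. Tolkien", "year": "1937"}))
```

A query on the author alone matches two books in this toy data; adding the year narrows it to one. That narrowing is the behavior the paragraph above describes: the more attributes a search includes, the more precisely it can point at a single entity.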
Google was granted a patent recently that describes how those attributes could be searched within an attribute data store to find the entity. The patent shows how the process described within it might be used to answer some complex queries, and some interactive Answerbox type queries. The issue that this patent addresses can be summed up in a single question:
Years ago, I started referring to search results as recommendations, seeing how they’ve been starting to look more and more like that part of a page at Amazon that says “people who viewed this book also looked at these books.”
When someone searches at a search engine, one of the things they look for in the search results they receive is trustworthy pages (or recommendations) that look (and are) legitimate. How does a search engine deliver pages that are trustworthy?
One way to do that might be to try to boost pages in search results that the search engine feels are more trustworthy – and Google developed a version of Trust Rank to do that with. The inventor of Google’s Trust Rank (which differs from the version that Yahoo invented) is Ramanathan Guha.
As part of the regular business analysis that I do on an ongoing basis, I like to keep an eye out for acquisitions made by search engines, and look at the technology that those companies being acquired have filed patents for.
When I heard about Google’s acquisition of Skybox, I jumped to the assumption that low-level orbiting satellites might be used in a manner similar to Google’s Project Loon to spread internet access to a wider audience across the globe. Or they might be used to make Google Maps a lot better with high resolution and frequently updated satellite images.
And then I looked at the patent filings assigned to Skybox Imaging, and quashed those assumptions, or put them off as secondary reasons why Google might have purchased the satellite company.
How much of an impact might high resolution and very frequently updated satellite images have upon a business analysis?
Most of us searchers and site owners and search engine optimizers are familiar with Google’s Link Graph, and how Google uses the connections between websites to help in ranking pages on the Web. In part, Google looks at the relevance of the content of a page compared to a query that a searcher enters at the search engine.
In addition to “relevance”, Google also uses the patented method of PageRank, in which the quality and quantity of links pointed to a page are used as a proxy for the quality of the page being linked to. The higher the quality of a page (and the higher PageRank it possesses), the more PageRank it likely passes along.
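The way PageRank flows through links can be sketched with the simplified textbook version of the algorithm. This is not Google's production system. The function name `pagerank`, the sample link graph, the iteration count, and the damping value of 0.85 (a commonly cited figure) are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" receives links from "a", "b", and "d".
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(GRAPH)
for page, score in sorted(RANKS.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page "c" ends up with the highest score because three pages link to it, and page "a" comes second even though only one page links to it, because that one link comes from the high-scoring "c". Page "d", which nothing links to, keeps only the baseline share.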
The link graph is one example of how Google ranks and measures and possibly sorts web pages. Another that Google might look at is the attention graph – how Google might use topics and concepts that may be searched upon frequently to change rankings of pages based upon freshness and hot topics.
When Google indexes the Web, it’s often been convenient to think about the search engine running two different methods or approaches that seem to run in parallel. One of those involves the crawling and indexing and ranking of pages on the web (and images, videos, news, podcasts, and other documents).
The other approach doesn’t look at pages as much as it indexes objects it finds on the Web, or what we often refer to as named entities, which are specific people, places, or things – real or fictional. We see this second kind of crawling often referred to as fact extraction and see the results of such extraction as Knowledge Panel results or even things like Google’s OneBox Question & Answer results.
When SEOs talk about Google and the programs it uses to crawl and index pages on the Web, we usually refer to those crawlers as robots or spiders or even Googlebot, and don’t differentiate these crawling programs much. Not the kind of robot above (which is a new twist from Google), but it’s probably time to start thinking of Googlebot differently.
There are things that we just don’t know about search engines. Things that aren’t shared with us in an official blog post, or in a search engine representative’s comment at a conference, or through a publicly published white paper. Often we do learn some aspects of how search engines work through patents, but the timing of those is controlled more by the US Patent and Trademark Office than by one of the search engines.
For example, back in 2003 Google was filing some of their first patents that identified changes to how their ranking algorithms worked, and among those was one with a name similar to the original Stanford PageRank patents filed by Lawrence Page. It has some hints about PageRank and Google’s link analysis that we haven’t officially seen before.
If you want a bit of a history lesson you can see the first couple of those PageRank patents at Method for scoring documents in a linked database (US Patent 6,799,176) and Method for node ranking in a linked database (US Patent 6,285,999).
Does Google’s newly granted patent co-invented by Navneet Panda describe Google’s Panda Update?
Search Quality vs. Web Spam
Many of the patent filings that I’ve written about from Google address Web Spam issues, and how the search engine may take steps or follow approaches to keep its search results from being manipulated. An early example of Google tackling such issues is their patent filed in 2003 titled Methods and systems for identifying manipulated articles.
But many of the patents I’ve written about involve ways that Google is trying to improve the quality of search results that searchers see.
One of the most impactful updates at Google was the Panda Update, released into the world in February of 2011, and affecting almost 12% of all search results. In a Wired interview of Google’s Amit Singhal and Matt Cutts, TED 2011: The ‘Panda’ That Hates Farms: A Q&A With Google’s Top Search Engineers, the name of the update was revealed to be taken from a Google engineer who played a significant role in its development:
Wired.com: What’s the code name of this update? Danny Sullivan of Search Engine Land has been calling it “Farmer” because its apparent target is content farms.
Amit Singhal: Well, we named it internally after an engineer, and his name is Panda. So internally we called a big Panda. He was one of the key guys. He basically came up with the breakthrough a few months back that made it possible.