How Google May Blend Information From Feeds and Extracted Data For Search Results
In Google’s search results, depending upon your query, when and where you are searching, and what your browser and search engine settings might be, you may receive a different set of search results than other folks performing a search using the same query terms.
And those results may include a mix of links and images from different data sources including Web results, images, advertisements, local businesses, books, products, and others.
Google’s Universal Search provides a blended mix of results, pulling from a number of different data repositories into a single set of search results.
While ads are usually segmented from other results, the remainder may be mixed together upon results pages. David Bailey, on the Official Google Blog, provided a glimpse of how those results came to be blended together in “Behind the scenes with universal search.” He provided an even more detailed view in a guest post at Search Engine Land titled “An Insider’s View Of Google Universal Search.”
The blending of results is interesting, and we see blending on a smaller scale in other places at Google. One example is Google’s product search, which mixes results from feeds provided by merchants with information extracted from the pages of ecommerce web sites.
A patent application published recently by Google explores how data from those feeds and data extracted from templated ecommerce sites might be merged together in product search results.
Towards the end of the patent application, we are given a hint that the process described in this patent application might have implications for other types of information and data. We’re not given much more detail than the following, but it’s something to keep in mind.
It should be appreciated from the foregoing description of exemplary embodiments of the invention that numerous modifications are possible in other embodiments. For example, the invention could be used with various types of search mechanisms, databases and item identifiers (not just vendor-related), such as news, people/social networks, classifieds, etc.
Furthermore, while exemplary methods for distinguishing between item identifiers obtained using various methods have been described, those skilled in the art will recognize from this description that various other methods may be employed without departing from the spirit and scope of the invention.
The processes described within this patent filing appear to show how data from different sources can be blended together into one type of repository rather than how data from different repositories might mix together into Universal Search results. But it does seem like it solves one potential part of the puzzle – how information from feeds and extracted from Web pages through different methods can be blended together.
Methods and systems for output of search results
Inventors: Craig Nevill-Manning and Pearl Renaker
US Patent Application 20070244854
Published October 18, 2007
Filed: January 26, 2004
Systems and methods that output search results are described. In one embodiment, a search engine implements a method comprising receiving a search query, identifying a plurality of item identifiers responsive to the search query, identifying a first group of item identifiers from the plurality of item identifiers, wherein the first group of item identifiers was obtained by a first method, identifying a second group of items from the plurality of item identifiers, wherein the second group of item identifiers was obtained by a second method, and causing the output of all or a plurality of the item identifiers, comprising providing a cue to distinguish between the item identifiers from the first group and the item identifiers from the second group.
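To make the abstract a little more concrete, here is a minimal Python sketch of the steps it lists: take a query, find the matching item identifiers, split them by how they were obtained, and output them with a cue that tells the two groups apart. Everything in it is an assumption made for illustration. The names (ItemIdentifier, search, render_with_cue), the "feed" and "extracted" labels, the simple term matching, and the bracketed text cue are hypothetical and do not come from the filing, which doesn't say what form the cue takes.

```python
from dataclasses import dataclass

# Hypothetical labels for how an item identifier was obtained. The filing
# only speaks of a "first method" and a "second method".
FEED = "feed"
EXTRACTED = "extracted"


@dataclass(frozen=True)
class ItemIdentifier:
    title: str
    vendor: str
    method: str  # FEED or EXTRACTED


def search(index, query):
    """Return item identifiers whose title contains every query term."""
    terms = query.lower().split()
    return [item for item in index
            if all(term in item.title.lower() for term in terms)]


def render_with_cue(items):
    """Output all matching items, marking which group each one came from."""
    first_group = [i for i in items if i.method == FEED]
    second_group = [i for i in items if i.method == EXTRACTED]
    lines = []
    for item in first_group:
        lines.append(f"[vendor feed] {item.title} - {item.vendor}")
    for item in second_group:
        lines.append(f"[from the web] {item.title} - {item.vendor}")
    return lines


# Made-up sample data, for illustration only.
index = [
    ItemIdentifier("Acme digital camera", "Acme Store", FEED),
    ItemIdentifier("Acme camera case", "Case Shop", EXTRACTED),
    ItemIdentifier("Bolt cordless drill", "Tool Barn", FEED),
]

for line in render_with_cue(search(index, "acme camera")):
    print(line)
```

A real system could just as easily use the cue to interleave the groups, or show it as a label, an icon, or a separate section of the page. The sketch only shows that the method of acquisition travels with each item identifier so it is available at output time.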
It’s difficult to tell how long a shelf life a patent or patent application might have, and whether or not the processes that we see described within a patent filing have been developed and used, or will be developed, or are just ideas that have been sketched out and may never be used. In this document, the Froogle product search is mentioned, and we know that has been replaced by Google Product Search.
Perhaps the biggest takeaway from this patent application is a hint of how Google plans to gather the information that it blends together in its product search. Information about products, including the identifiers and attributes associated with those products, can be obtained by the search engine in a number of ways:
1) The information can come directly from vendors offering the items for sale in the form of a vendor feed (perhaps via submission to Google Base), and is then stored in a products database.
2) The information about products can be automatically extracted from articles or pages offering the items for sale, again stored in a products database.
3) The information about products might be automatically extracted from articles offering the items for sale found on the Web at the time of a search.
4) The information might be extracted from pages on the Web using a template-based information extraction approach, which is described in a Google patent application that hasn’t been published yet (U.S. patent application Ser. No. 10/675,756 filed Sep. 30, 2003).
5) The information might also be extracted from pages that include products relevant to the search query by following another process that isn’t described in detail but is related to another unpublished patent application (U.S. patent application Ser. No. 10/731,916 [Attorney Docket No. GP-078-04-US, 53051/293400 filed] Dec. 10, 2003).
I’ll be keeping an eye open for those last two patent applications. The approach involving templates has me wondering how much it might be like a recent Yahoo patent application involving understanding whether a page is using a template or not, without looking at other pages on the same site to see if there is a pattern that would indicate the use of templates.
Why would a product search engine use a combination of information from both data feeds and from crawling pages to extract information? We get a sense of why in a Comparison Engines post from last year, “Former DoubleClick Execs Launch ShopWiki”:
Data feeds work for the sites that will send you a data feed. The problem is the long tail. If you want to be comprehensive, you have to crawl. The question then becomes should we crawl and take data feeds. If the crawling doesn’t work, we’re not going to be accurate. If the crawling works, why would we need data feeds?
I noted above that the ideas in this patent application, of taking information from multiple sources and blending it together, can apply to more than just product search. Microsoft has shared some ideas for extracting information from pages in a January 2007 paper that looks at both product search and an online academic library site: Object-level Vertical Search (pdf).
Data Extraction differs from the indexing of Web pages that we think of traditionally when we think of optimizing pages for Web search. Instead of indexing words which are associated with documents on the web, data is collected about objects, about named entities such as specific people or places or things or events. There are facts associated with these objects that have their own attributes and values.
For a product, we might have model numbers, sizes, colors, manufacturers, product ID numbers, a SKU (stock keeping unit), and other information that may be collected from a number of different sources online. We may see manufacturer descriptions and product reviews and ratings, and locations for those reviews and ratings and for the manufacturer.
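As a rough illustration of what collecting facts about one product from several sources could look like, here is a small Python sketch that folds records about the same product into a single object and remembers where each attribute value came from. It is a sketch under stated assumptions rather than anything described in the patent filing: the function name merge_product_records, the use of a SKU as the matching key, the sample data, and the rule that the first source to supply an attribute wins are all hypothetical.

```python
def merge_product_records(records):
    """Combine records about the same product into one object per SKU.

    Each record is a dict with a "sku", a "source", and attribute values.
    When two sources disagree, the earlier record wins. That precedence
    rule is an assumption made for this example.
    """
    products = {}
    for record in records:
        product = products.setdefault(
            record["sku"], {"attributes": {}, "sources": {}}
        )
        for name, value in record.items():
            if name in ("sku", "source"):
                continue
            if name not in product["attributes"]:
                product["attributes"][name] = value
                # Keep track of which source supplied this attribute.
                product["sources"][name] = record["source"]
    return products


# Made-up sample data: one record from a merchant feed, one extracted
# from a web page about the same product.
records = [
    {"sku": "CAM-100", "source": "feed",
     "manufacturer": "Acme", "color": "black"},
    {"sku": "CAM-100", "source": "extracted",
     "color": "charcoal", "rating": 4.5},
]

merged = merge_product_records(records)
print(merged["CAM-100"]["attributes"])
print(merged["CAM-100"]["sources"])
```

Keeping the source alongside each value is the design choice worth noticing here. It lets a search engine prefer one source over another later, or tell the groups apart when results are displayed.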
For a business at a location, we may have address information, hours of operation, days open, payment types accepted, reviews and ratings and more. This information may be collected from a number of different sources, including yellow pages sites, directory type sites, other web sites, including the homepage of the business. It may come from a feed submitted by the owner of a business that has multiple locations.
If you’re not paying attention to how search engines are taking information from feeds and extracted data from different sources, you’re missing one of the growth areas of how search engines operate.