How Does Vision-Based Document Segmentation Work?
Web pages can be messy. A single page may cover more than one topic, and the template surrounding those topics often adds little meaning to the meat of the content: it is filled with links and labels, advertising and boilerplate, copyright and other notices.
With such a diversity of topics, those pages may not be easy for search engines to crawl and index, or for searchers to find.
When we think about how search engines work, we often break what they do down into three main parts: discovering new pages and new content on old pages; indexing the content on those pages, following rules that show a preference for important pages and unique content; and presenting relevant and meaningful information that meets searchers' intents (or at least matches their keywords) in response to the queries they enter into a search box.
We usually don’t think of search engines as indexing parts of pages, chunks of information that might exist side-by-side with very different topics, and yet many pages are messy like that.
Continue reading “Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS)”
Related Information about Entities
You’re listening to a song on the radio, and you catch the title “Wonderwall,” but not the artist who performs it. You’d like to find out more about the song, and the performer.
You go to Yahoo, and type in the search query “Wonderwall.”
Imagine that instead of just receiving a list of pages and other results that are strict keyword matches for the song, the results you received include detailed information about Oasis, the band behind the song, as well as band members Liam Gallagher and Noel Gallagher. Those results include pictures, videos, biographic information, and more, and come from pages that don’t even mention “Wonderwall.”
In the search for “Wonderwall,” the search engine noticed that the names “Oasis,” “Liam Gallagher,” and “Noel Gallagher” all appeared frequently in the results returned for that search. Because of that, the search engine expanded the search results to include profile information for those three frequently occurring names.
This kind of expansion of search results, to include names of people, places, events, and things found in the results for an original search query, is described in a patent filing from Yahoo. While it doesn’t presently appear to be in use, it’s a possible approach from the search engine.
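To make the counting idea above a little more concrete, here is a minimal Python sketch. It is not the method from the Yahoo filing. It assumes we already have a list of known entity names (the hypothetical `known` list) and the text of each result, and it treats a name as "frequent" when it shows up in at least half of the results. The function name, the sample snippets, and the `min_share` threshold are all illustrative assumptions.

```python
from collections import Counter


def frequent_entities(results, known_entities, min_share=0.5):
    """Return known entities found in at least `min_share` of the results.

    Hypothetical helper: plain substring matching stands in for whatever
    entity recognition a real search engine might use.
    """
    counts = Counter()
    for text in results:
        lowered = text.lower()
        for entity in known_entities:
            # Count each entity at most once per result.
            if entity.lower() in lowered:
                counts[entity] += 1
    threshold = min_share * len(results)
    # most_common() lists the most frequently seen entities first.
    return [name for name, seen in counts.most_common() if seen >= threshold]


# Made-up result snippets for the query "Wonderwall".
results = [
    "Wonderwall is a song by Oasis, written by Noel Gallagher.",
    "Liam Gallagher sings lead vocals on Wonderwall by Oasis.",
    "Oasis brothers Liam Gallagher and Noel Gallagher on recording Wonderwall.",
    "Wonderwall chords and lyrics.",
]
known = ["Oasis", "Liam Gallagher", "Noel Gallagher", "Blur"]

expanded = frequent_entities(results, known)
```

With these sample snippets, the three names from the example clear the threshold and "Blur" does not, so a search engine following this sketch could look up profile information for just those three.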
Continue reading “How a Search Engine Might Add Related Information about People, Places, and Things into Search Results”
Come gather ’round people
Wherever you roam
And admit that the waters
Around you have grown
And accept it that soon
You’ll be drenched to the bone.
If your time to you
Is worth savin’
Then you better start swimmin’
Or you’ll sink like a stone
For the times they are a-changin’.
– Bob Dylan
Can the rate of change on web pages influence how Google might rank the pages of a site?
In part one of this series, I looked at how Google’s patent on Information retrieval based on historical data focused upon Freshness.
This second part of the series explores how Google might look at content changes on Web pages, and how the frequency of those changes might influence how those pages may rank at the search engine. Keep in mind that we don’t know for certain whether Google is even using the processes described in this patent. But it is a possibility.
Continue reading “Updating Google’s Historical Data Patent, Part 2 – Changing Content”
Before becoming a co-founder of the new search engine Cuil, Anna Lynn Patterson worked at Google on a way of looking at how often different phrases appear together on pages on the Web. That work is described in a series of patent applications that share a common description but have different claims sections, each itemizing a different part of that description.
I summarized the description from one of the patent filings in my post from December 29, 2006, Phrase Based Information Retrieval and Spam Detection.
One of the patent applications from that series, Automatic taxonomy generation in search results using phrases, which I hadn’t originally come across back in 2006, was granted today, and covers the idea of taking documents that share related phrases, and clustering them with the related phrases to provide search results that might cover a range of categories related to search queries.
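As a rough illustration of grouping documents by the related phrases they share, here is a small Python sketch. It is not the process claimed in the patent. It assumes the related phrases for a query are already known, and it uses simple substring matching. The function name, the sample documents, and the phrases are hypothetical.

```python
from collections import defaultdict


def cluster_by_phrase(documents, related_phrases):
    """Group document ids under each related phrase they contain.

    Hypothetical helper: `documents` maps an id to its text, and a
    document may land in more than one cluster.
    """
    clusters = defaultdict(list)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for phrase in related_phrases:
            if phrase.lower() in lowered:
                clusters[phrase].append(doc_id)
    return dict(clusters)


# Made-up documents returned for the query "jaguar".
documents = {
    "d1": "The jaguar is a big cat native to the Americas.",
    "d2": "The Jaguar sports car was unveiled at the show.",
    "d3": "A big cat such as the jaguar hunts at night.",
}
related = ["big cat", "sports car"]

clusters = cluster_by_phrase(documents, related)
```

Each phrase then acts as a category label, so results for one query could be presented as separate groups instead of a single mixed list.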
Continue reading “Google Phrase Based Indexing Patent Granted”
Images on a web page can provide a chance to express ideas in a visual way that can convey a considerable amount of information, and may also add to the attractiveness and perceived quality of a site.
When search engines rank pages in search results, images may have some impact in those rankings.
A search engine might look at the captions associated with pictures, or at alt text, which is provided as an alternative for people who browse the Web with images turned off or who use screen reading software.
Search engines might also look at text surrounding an image, especially text within the same HTML container, block, or segment.
Those indexing services could also associate other content on a page with an image, including the page’s title.
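The signals listed above can be sketched with Python's standard `html.parser` module. This is only an illustration of gathering alt text, a caption, and the page title for each image, not a description of how any search engine does it. It makes the simplifying assumption that a caption is a `figcaption` element that follows its image, and the class name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ImageTextCollector(HTMLParser):
    """Collect the page title plus alt text and caption for each image."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []  # one dict per <img> encountered
        self._in_title = False
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "caption": "",
            })
        elif tag == "figcaption":
            self._in_caption = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "figcaption":
            self._in_caption = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_caption and self.images:
            # Assumption: the caption belongs to the most recent image.
            self.images[-1]["caption"] += data.strip()


sample_html = """
<html><head><title>Mountain Hiking Guide</title></head>
<body><figure><img src="trail.jpg" alt="Hiker on a ridge trail">
<figcaption>The ridge trail at sunrise.</figcaption></figure></body></html>
"""

collector = ImageTextCollector()
collector.feed(sample_html)
first_image = collector.images[0]
```

After parsing, `collector.title` and the fields in `first_image` hold the text that could be associated with the image. Text from the surrounding container would need extra tracking of the enclosing block, which this sketch leaves out.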
Continue reading “How Search Engines May Use Images to Rank Web Pages”