Web pages can be messy; they can have more than one topic on a page, and use templates that surround those topics adding little meaning to the meat of the content, filled with links and labels, advertising and boilerplate, copyright and other notices.
With a diversity of topics, those pages may not be easily crawled and recorded and indexed and found, by search engines and searchers.
When we think of search engines and how they work, we often break what they do down into three main parts – discovering new pages and new content on old pages, indexing content on those pages following rules that show a preference for important pages and unique content, and presenting relevant and meaningful information to searchers and their intents (or at least matching their keywords) in response to queries that they enter into a search box.
We usually don’t think of search engines as indexing parts of pages, chunks of information that might exist side-by-side with very different topics, and yet many pages are messy like that.
But we’ve had signs from the white papers and patent filings that we see from search engineers, that they might try to segment and capture information about different topics found on the same page.
Continue reading Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS)
Recent Comments