How Search Engines Can Index Pages in Parts

Web pages can contain a lot of information about various types of objects such as products, people, papers, organizations, and so on. Information about those objects may be spread out on different pages, at different sites.

For example, a page may host a product review of a particular model of camera, and another page may present an ad offering to sell that model of camera at a certain price.

One page might display a journal article, and another page could be the homepage for the author of that article.

Someone searching for information about the camera, or about the author, may need information contained in both pages. They may have to use a search engine to locate multiple pages to find the information that they need.

If there were a way for a search engine to automatically identify when information on different web pages relates to the same object, that might be helpful to searchers in a number of ways.

Product searches would benefit by helping shoppers find information about, and prices for, goods that they might want to purchase. A repository of scientific papers could gain by providing additional information about the authors of those papers.

Extracting Object Blocks from Pages

A patent granted to Microsoft today explores the concept of extracting object blocks from pages, so that information that appears on pages about specific objects can be grouped together.

Often, information about a specific object is presented on a page with information about other objects. The Microsoft process would break information from a page into different blocks relating to the different objects found on that page.

It might look at visual features of a page, such as different font sizes and separating lines, to help identify object elements. It may search for elements within each object block that show that the block involves a particular object.

After a page is broken into object blocks, it might attempt to label elements of each object, and information about the objects may be stored in a database for objects, along with other information blocks from other pages that involve the same objects.
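The visual-cue step described above can be sketched in a few lines of Python. This is only an illustration, not the method from the patent: the `Line` record, the `heading_size` threshold, and the idea of treating a page as a flat list of rendered lines are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One rendered line of a page (hypothetical, simplified model)."""
    text: str
    font_size: int = 12
    is_rule: bool = False  # True for a horizontal separating line


def split_into_blocks(lines, heading_size=16):
    """Start a new block at each separating line or heading-sized line."""
    blocks, current = [], []
    for line in lines:
        if line.is_rule:
            # A separating line closes the current block and is not kept.
            if current:
                blocks.append(current)
                current = []
            continue
        if line.font_size >= heading_size and current:
            # A large-font line is treated as the start of a new block.
            blocks.append(current)
            current = []
        current.append(line.text)
    if current:
        blocks.append(current)
    return blocks


page = [
    Line("Camera A review", font_size=18),
    Line("Sharp images, good battery life."),
    Line("", is_rule=True),
    Line("Camera B for sale", font_size=18),
    Line("Price: $599"),
]
blocks = split_into_blocks(page)
```

Here the sample page falls into two blocks, one for the review and one for the ad. A real system would work from the rendered layout of the page rather than a hand-built list like this one.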

Method and system for identifying object information
Invented by Ji-Rong Wen, Wei-Ying Ma, Zaiqing Nie
Assigned to Microsoft
US Patent 7,383,254
Granted June 3, 2008
Filed April 13, 2005

Abstract

A method and system for identifying object information of an information page is provided. An information extraction system identifies the object blocks of an information page. The extraction system classifies the object blocks into object types.

Each object type has associated attributes that define a schema for the information of the object type. The extraction system identifies object elements within an object block that may represent an attribute value for the object.

After the object elements are identified, the extraction system attempts to identify which object elements correspond to which attributes of the object type in a process referred to as “labeling.” The extraction system uses an algorithm to determine the confidence that a certain object element corresponds to a certain attribute.

The extraction system then selects the set of labels with the highest confidence as being the labels for the object elements.
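To make the “labeling” idea concrete, here is a minimal Python sketch. The abstract does not say which algorithm computes the confidence, so the `confidence` function below is a toy stand-in built on hand-written rules, and `KNOWN_MAKERS`, the attribute list, and the sample values are hypothetical. The sketch tries every one-to-one assignment of attributes to elements and keeps the one with the highest combined confidence.

```python
import re
from itertools import permutations

ATTRIBUTES = ["manufacturer", "model", "price"]
KNOWN_MAKERS = {"Sony", "Nikon", "Canon"}  # hypothetical lookup list


def confidence(element, attribute):
    """Toy score between 0 and 1; an assumption, not the patent's model."""
    if attribute == "price":
        return 0.9 if re.fullmatch(r"\$\d+(\.\d{2})?", element) else 0.05
    if attribute == "manufacturer":
        return 0.9 if element in KNOWN_MAKERS else 0.1
    if attribute == "model":
        # Model names often mix letters and digits.
        has_alpha = any(c.isalpha() for c in element)
        has_digit = any(c.isdigit() for c in element)
        return 0.7 if has_alpha and has_digit else 0.2
    return 0.0


def best_labeling(elements, attributes=ATTRIBUTES):
    """Score every one-to-one assignment and keep the highest-scoring one.

    Assumes there are no more elements than attributes.
    """
    best_score, best = -1.0, None
    for perm in permutations(attributes, len(elements)):
        score = 1.0
        for element, attribute in zip(elements, perm):
            score *= confidence(element, attribute)
        if score > best_score:
            best_score, best = score, dict(zip(elements, perm))
    return best, best_score


labels, score = best_labeling(["$599", "Sony", "A100"])
```

With these toy rules, “$599” ends up labeled as the price, “Sony” as the manufacturer, and the made-up string “A100” as the model. Trying every assignment only works for a handful of elements; it is used here because it is easy to read.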

Identifying object information on a web page

An advertisement on a web page for a camera may be an object block and the matching object is the specific model of camera.

An object block that advertises a camera may be classified as a product type, and an object block relating to a journal paper may be classified as a paper type.

Each object type has associated attributes.

A product object type may have attributes of:

  • Manufacturer,
  • Model,
  • Price,
  • Description, and
  • so on.

A paper object type may have attributes of:

  • Title,
  • Author,
  • Publisher, and
  • so on.

When a page is broken into blocks, an information extraction system might attempt to associate attributes of an object with values from the block.

So, for a camera, “Sony” might be identified as a manufacturer attribute and “$599” as a price attribute.
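One way to picture an object type and its attributes is as a small record definition. The Python sketch below is an assumption about how such a schema could be modeled, not something taken from the patent: the class names, the choice of optional string fields, and the `attribute_names` helper are all invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Product:
    """Schema for the product object type (attributes from the lists above)."""
    manufacturer: Optional[str] = None
    model: Optional[str] = None
    price: Optional[str] = None
    description: Optional[str] = None


@dataclass
class Paper:
    """Schema for the paper object type."""
    title: Optional[str] = None
    author: Optional[str] = None
    publisher: Optional[str] = None


def attribute_names(object_type):
    """List the attributes that define an object type's schema."""
    return [f.name for f in fields(object_type)]


# Values found in a block fill in attributes; the rest stay empty.
camera = Product(manufacturer="Sony", price="$599")
```

Attributes that a block does not mention simply stay empty, which leaves room for a block from another page to supply them later.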

Blocks and Objects

About a month ago, I wrote about how Microsoft described breaking pages down into parts and deciding which parts of those pages were the most important, in Search Engines, Web Page Segmentation, and the Most Important Block.

Microsoft is exploring indexing information at an information block level, instead of a page level. This patent touches upon the idea of breaking apart the content of a page based upon the HTML code of the page and the visual aspects of how information on a page is presented, which it could do using VIPS: a Vision-based Page Segmentation Algorithm (pdf).

The ideas in this newly granted patent take that concept a step further. They discuss exploring different segments on pages for information that relates to different objects, extracting information from those segments about different attributes or aspects of those objects, and relating those objects to each other in the same data store, so that people can access them.
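As a rough illustration of gathering blocks from different pages into one data store, the Python sketch below merges labeled blocks that appear to describe the same object. The matching rule in `object_key` (same manufacturer and model, ignoring case), the example URLs, and the sample values are all assumptions; the patent does not describe this particular approach.

```python
from collections import defaultdict


def object_key(block):
    """Hypothetical identity rule: same manufacturer and model, same object."""
    return (block["manufacturer"].lower(), block["model"].lower())


def build_object_store(blocks):
    """Group labeled blocks by object, remembering where each came from."""
    store = defaultdict(lambda: {"attributes": {}, "sources": []})
    for block in blocks:
        entry = store[object_key(block)]
        entry["sources"].append(block["url"])
        for name, value in block.items():
            if name != "url":
                # Keep the first value seen for each attribute.
                entry["attributes"].setdefault(name, value)
    return dict(store)


labeled_blocks = [
    {"url": "https://example.com/review", "manufacturer": "Sony",
     "model": "A100", "description": "Review text"},
    {"url": "https://example.org/ad", "manufacturer": "sony",
     "model": "a100", "price": "$599"},
]
store = build_object_store(labeled_blocks)
```

The review and the ad end up under a single entry that carries both the description and the price, along with the two source pages. Deciding when two blocks really describe the same object is the hard part in practice, and a simple key like this one would miss many matches.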

There are a few white papers from Microsoft that explore these ideas more fully:

Microsoft has produced a couple of examples where they are clearly using this kind of object level searching at:

In 2006, Google described the use of visual segmentation of content on pages, and extraction of information from different segments, within the context of grabbing information from pages offering multiple reviews for local search, which I wrote about in Google and Document Segmentation Indexing for Local Search.

Yahoo also recently published a patent application that broke pages down into parts, and attempted to identify the most important parts of the page. More on that at: The Importance of Page Layout in SEO.

If you publish web pages, how might the search engines be breaking apart the content of your pages?

20 thoughts on “How Search Engines Can Index Pages in Parts”

  1. The SEs are slowly moving from scraping and listing pages by some relevance process to scraping and displaying x-page and x-site amalgamated data as their own information.

    This may be great for the ‘user’, and the SE, but is definitely bad for the ravaged sites.

    It already affects conversion funnel design, requiring careful thoughtful SE bot blocking procedures. Your patent posts continue to reinforce my desire to diminish SE traffic reliance.

  2. Thanks, iamlost.

    Great point, and one that I think that people should pay attention to – search engines aim to answer questions and to provide information to searchers, and delivering people to Web pages is only part of that – a part that may be diminishing as the search engines get better at extracting and aggregating information.

    Hopefully, as search engines move forward in providing information, without searchers necessarily needing to visit pages to get meaningful results to their queries, the search engines will continue to provide attribution links to the sources of that information.

    Finding ways to get visitors to your pages in significant numbers, without relying upon search engines to deliver those visitors has always been a wise move. It’s going to become more important in the future.

  3. Bill,

    This could eventually have real significance for how pages are written and structured. The smart-search of trying to determine “importance” could be overshadowed by this even smarter search.

    Writing for the search engines is already such a content dilemma for some folks. Making sure you break it up properly, too? Whew.

    Some people write/design better with constraints like that, and some people write mechanical junk. I’d say you’ll see a lot more on either end if object blocks become very important in search results.

    Fascinating.

    Regards,

    Kelly

  4. Hi Kelly.

    One of the good things I see coming out of this is that well-written, and well-structured information on pages would benefit both readers and search engines.

    Those who have troubles with writing content might be best served by hiring someone who is comfortable with writing comprehensive and meaningful pages, whether professional copy writer or graduate level English student, or long time blogger.

    While the types of searches mentioned in the patent filing, and in the Microsoft papers are narrow vertical searches, such as product search and academic papers search, the future might bring us more “question answering” type search requests and answers in more general web searches.

    We may start seeing more factual “answers” to queries above web page listings in search results, similar to the news results and product results that we sometimes see. Those “answers” may have links to their sources. Being displayed as a source of answers on certain queries might be a positive result for people who want their web pages to be more visible in search engine results.

    People who can write informational pages about “objects” that are comprehensive, and easy for the search engines to extract information from, may find those pages at the tops of search results based upon how well the pages can provide information about those objects.

  5. Bill,

    I agree that it should encourage some great content, but I think it might have the opposite effect at the very same time. Misunderstood SEO already does that. The more I think about this object block idea, the more I imagine some helpful, and some awful pages coming from different people trying to write to the search engines.

    Until later,

    Kelly

  6. Leave it to a bunch of engineers to find even more ways to actually *increase* difficulty for non-technical people and businesses who just want/need to publish content that is useful and descriptive to their interests and objectives.

    It never ceases to amaze me when technically smart people come up with lazy solutions that put most of the responsibility on end-users. Rather than making their own indexing and reading systems smarter, these developers actually think the better solution is to ask some small business owner to write content, and split it up into various bits and pieces using essentially a technical language they don’t (and never will) speak – just so a stupid search engine doesn’t have to become smarter.

    This is much the same as has happened with XML/RSS, CSS and other codes which are perfectly easy for engineers to use, but which have only served to make the simple act of publishing content and creating a website much more difficult for non-technical (the majority of) users out there.

    I’ve also seen recent proposals from engineering and technical standards groups, who think that a website’s usability and interface can and should be based on a specific set of standards. That is something that site and content producers don’t have the time, interest or resources to learn, and it ignores the fact that usability merges not just content, but also nuances of language, interface and interaction, virtually every aspect of visual design, and user behaviours, which can change *completely* depending on the topical focus or business objectives of the site owners.

    Engineers should make their systems smarter, rather than proposing and implementing even more barriers and obstacles on the backs of the vast majority of average, non-technical people out there who often have important and useful information to share with the world (which is, like, the whole point of the web).

    So, that’s my two cents (from an annoyed, non-technical user of the “World Wide Web”)

  7. Hi Kelly,

    I’m not convinced that most people will amend the way that they write content for search engines based upon this block level approach. Frankly, writing strong content, using language that people would use who would search for what you offer on your site, and organizing it well is the best approach to use in light of this patent.

  8. Hi Thomas,

    I think you may have misunderstood my description of the patent. It shouldn’t really add any additional onus upon most web site designs, and it is actually an intelligent approach to the problems that it attempts to address.

    There’s no need for you to learn any new technical jargon, or break up pages in small bits and pieces.

    There’s nothing in the patent that requires that you change around the way you design web pages. No new difficulties. The process in this patent should benefit pages that are reasonably well written and organized.

  9. Having search engines present their results to visitors in a way that makes it unnecessary for the user to visit the original source page, sounds a little like copyright infringement to me. While it may arguably be a better “user experience”, it also seems as if the engines are overstepping their bounds. It’s their purpose to help us find information, not actually be the deliverer of that said information. Hasn’t Google already been in enough legal trouble from websites too lazy to use a robots.txt file?

  10. Hi Christopher,

    Some good questions.

    It is something that we are seeing more of, from question answering, to showing definitions, to providing images in image search – search engines are answering questions directly with less emphasis on delivering people to the pages that those answers come from.

    Is that the future of search engines? I think that there will always be an emphasis on bringing people to web pages elsewhere. But sometimes those are going to be so that people can check the source of facts, instead of delivering them to pages where they can find what they are looking for firsthand…

  11. Search engines can be really unhelpful, and extremely helpful. It all depends on the user, and what they type in. If they wanted to purchase a camera, they should type in buy Nikon D80 SLR. This would tell the search engine the user is looking to buy a specific item. If you were researching and comparing products, you would more than likely type in digital SLR cameras. This would bring up more reviews and general information about a range of cameras for the user to compare. Search engines are extremely clever, but I think the user also needs to know how they work to really get the full potential.

    A good example: when I was back at school in the old days, computers were just in and Google had started competing with Yahoo and Ask. I remember all my friends used Yahoo, but I always used Google. Anyway, that’s beside the point. If we wanted a photo of a dog, say, we would type in ‘photo of a dog’, whereas now I would type in dog or, even more specifically, ‘labrador dog’.

    No wonder I could never find what I wanted!!!

  12. I must admit that whenever search engines re-configure content and reproduce it according to what they see as the most appropriate format, I get very worried. Taking the example of the google image search. Some innocent searches such as “priceless” will return some really horrendous content that bears little or no relevance to the search term. However we are at the mercy of these engines and really have no alternative.

  13. This could be great as the information on pages would be better structured and well written, great post, if this was combined with search engine optimisation, sites would have a better chance when working with search engines.

    good post,

    Ricky

  14. Hi John,

    I do try not to leave too much to chance or to the search engines if I can help it. That’s one of the reasons why I look at all of the search engine patent filings and papers that I do. :)

    Hi Ricky,

    Thanks. If the word gets out that search engines do care about well-written content, and semantically well organized layouts, we might see designers and developers paying more attention to the way that pages are coded. I don’t think that’s a bad result at all, either.

  15. Interesting. Do you know if the search engines would rely upon meta tag information to accomplish this process or if they would be using a newer technology to gather and group their results?

  16. Hi SEOGuy,

    Interesting question. Meta data created by the site owner/designer may not be too much of a factor.

    The idea is to break down pages by analyzing the underlying HTML of the page and using an actual visual analysis of the white space on pages, to break pages down into parts. Once that is done, the search engine would try to understand the topics, or objects, associated with each part.

    The meta description and meta keywords for the page may not be too helpful in making that determination. But looking at other segments on other pages that may seem related to the same topic or object as the segments on the page might. So, if you have ten pages each with multiple reviews of multiple cameras on them, and each of the pages has a review of one particular model of camera, then the segments on all the pages dealing with that one particular camera are what is important, and not the meta descriptions for each of those pages.
