Google on Crawling Websites

When I talk with someone about how a search engine works, I find it convenient to break the process down into three parts, because there are three primary functions that a search engine performs.

These three parts are Crawling, Indexing, and Serving Results. I like using this three-part breakdown because it makes it easier to explain how each of those parts works by itself, and together with the other parts.

A patent granted to Google today, and originally filed in 2000, explores the first of those parts – crawling websites.

This is an interesting area, because having some knowledge of it might help to explain why some pages on the Web get indexed, and why some other pages might not. There are a couple of links that I like to point people towards when I talk about Google and crawling websites.

Continue reading “Google on Crawling Websites”

Microsoft on Index Partitioning

What’s a good way to organize the index of a search engine?

A way that is fast and returns a lot of relevant results? Maybe one that doesn’t need to search the whole index to find results?

A newly granted patent from Microsoft provides some interesting insights into indexing by document, and how static ranking factors may influence whether a document is in a main index partition, or if it might be found in a later partition acting like an extended index.

In a recent post, Why Google Can’t Just “Dump” PageRank, Dan Thies discusses the importance of PageRank as a mechanism that a search engine can use to decide which pages to put in its main index, and which ones to put in its extended index.
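To make the idea of a main index and an extended index a little more concrete, here is a minimal sketch in Python. It is not the method from the Microsoft patent or anything Google has confirmed; it only assumes that every document has some query-independent static rank score, that documents are sorted by that score and cut into fixed-size partitions, and that a search stops as soon as it has enough results. The names build_partitions, search, static_rank, and partition_size are all made up for illustration.

```python
from collections import defaultdict


def build_partitions(docs, static_rank, partition_size):
    """Split documents into index partitions ordered by static rank.

    docs: dict mapping a document id to its text (hypothetical input format).
    static_rank: dict mapping a document id to a query-independent score.
    partition_size: assumed fixed number of documents per partition.
    """
    # Highest static rank first, so the first partition acts like a main index.
    ordered = sorted(docs, key=lambda doc_id: static_rank[doc_id], reverse=True)
    partitions = []
    for start in range(0, len(ordered), partition_size):
        index = defaultdict(list)
        for doc_id in ordered[start:start + partition_size]:
            # set() avoids listing a document twice for a repeated term.
            for term in set(docs[doc_id].lower().split()):
                index[term].append(doc_id)
        partitions.append(index)
    return partitions


def search(partitions, term, wanted):
    """Search partitions in order and stop once enough results are found.

    Returns the matching document ids and the number of partitions searched.
    """
    results = []
    searched = 0
    for index in partitions:
        searched += 1
        results.extend(index.get(term, []))
        if len(results) >= wanted:
            break  # later partitions (the "extended" index) are never touched
    return results[:wanted], searched
```

With a layout like this, a query that finds enough matches among the highest ranked documents never has to look at the later partitions, which is one way a search engine might avoid searching its whole index.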

Continue reading “Microsoft on Index Partitioning”

Microsoft Playing with Blocks to Understand How Images Might be Related

What is the most important part of a page? If a page has images on it, what images are the most important ones?

If a search engine were to try to understand whether or not any images on the pages of a site were related to each other, how would it go about figuring that out?

The first two questions are easy to answer – the most important part of a page is the part that visitors focus upon when they look at it. The most important images are the ones that people look at and pay attention to when they are on that page.

A newly granted patent from Microsoft tries to answer all three questions in an automated manner. It breaks a page down into blocks and assigns each block a level of importance relative to the others: the probability that a user will focus upon that block (or upon images within it) when looking at the page.

It might consider the importance of one block to another on the same page and on other pages within the same site by looking at links between the blocks on those pages. It might view whether images are within the same blocks or related blocks, and also look for links to images from different blocks to see if and how images might be related.

Continue reading “Microsoft Playing with Blocks to Understand How Images Might be Related”

Measuring User Engagement, with Examples from Yahoo

If you own a web site, how do you measure the way that people interact with your site? What data do you look at, how do you analyze it, and what do you do with that analysis?

The topic is becoming a popular one on the Web, and I have some links below to some articles on the subject that I thought were pretty interesting. I was inspired to collect those links after looking at a patent filing from Yahoo that describes some of the methods that it might use to try to understand how engaged people are with its web properties.

The patent application is Techniques for measuring user engagement, and the listed inventors are Francesca M. Soito and Nitin Sharma (who appears to have now moved to Google).

User Engagement Variables

Continue reading “Measuring User Engagement, with Examples from Yahoo”

Google on Using Named Entity Disambiguation to Make Searches Smarter

When a search at a search engine includes a person’s name, or the name of a particular place, or a book, or a band, or an album, there might be some confusion as to which person (or place or thing) is being searched for.

Danny Sullivan, Race Car Driver

Case in point, there’s a well-known race car driver by the name of Danny Sullivan. There’s also a well-known journalist who writes about the search industry by the name of Danny Sullivan.

Danny Sullivan, Journalist and Technologist

Continue reading “Google on Using Named Entity Disambiguation to Make Searches Smarter”