Not too long ago, if you entered the phrase (without quotation marks) “a room with a view” into Google, you might have received a warning that your query contained “stop words.”
Stop words are words that appear so frequently in documents and on web pages that search engines would often ignore them when indexing the words on pages. These could be words like: a, and, is, on, of, or, the, was, with.
Goodbye to stop words?
In that search for “a room with a view,” you might have received results like “a room for a view,” or “room to view,” or other phrases that replaced some stop words with others. That made it less likely that you would find exactly what you were looking for when searching for a phrase containing stop words.
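A minimal sketch of why those odd matches happened: if an indexer drops stop words, “a room with a view,” “a room for a view,” and “room to view” all collapse to the same tokens. The stop list here is illustrative (the words from above, plus a couple of other common ones like “to” and “for”), not any search engine’s actual list:

```python
# Illustrative stop list; real engines used larger, engine-specific lists.
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the",
              "was", "with", "to", "for"}

def strip_stop_words(query):
    """Drop stop words from a query, the way older indexers might have."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

# "a room with a view" and "room to view" become indistinguishable:
print(strip_stop_words("a room with a view"))
print(strip_stop_words("room to view"))
```

Both queries reduce to the same two content words, which is exactly the loss of precision described above.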
Continue reading New Google Approach to Indexing and Stopwords
Choosing the right character set for your web page might mean that it is easier for a search engine to understand what language your page is in, though there are also other ways that it might be able to determine that.
But, what about when someone types in a query?
- How does a search engine know what language a search query might be in?
- How does it handle queries in different languages made on devices that might not be capable of creating some special characters outside of the Latin alphabet?
Also, do webpages that use a certain character set (something that webmasters can choose in the HTML for a page) stand a better chance of having the language that they use be identified more easily by a search engine?
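To make the question concrete, here is one very rough way the characters in a query can hint at its language. The Unicode ranges and labels below are my own simplification for illustration, not the character-mapping method the patent describes:

```python
def script_hint(text):
    """Guess a writing-script hint for a query from the Unicode code
    points it uses. A real system would combine many more signals;
    these ranges and labels are a simplified assumption."""
    for ch in text:
        cp = ord(ch)
        if 0x0400 <= cp <= 0x04FF:
            return "Cyrillic"
        if 0x3040 <= cp <= 0x30FF:
            return "Japanese kana"
        if 0x4E00 <= cp <= 0x9FFF:
            return "CJK ideographs"
    return "Latin (ambiguous: could be many languages)"

print(script_hint("поиск"))   # a Cyrillic query
print(script_hint("search"))  # Latin script alone narrows little
```

Notice how the Latin case stays ambiguous, which is part of why a query typed on a device limited to Latin characters is hard to classify.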
Continue reading How Does a Search Engine Know the Language of A Query? Google Explores Character Mapping
A new Microsoft patent application has some interesting statements within it about blogs. First, it tells us about the value of blogs and blogging:
Blogging has grown rapidly on the internet over the last few years. Weblogs, referred to as blogs, span a wide range, from personal journals read by a few people, to niche sites for small communities, to widely popular blogs frequented by millions of visitors, for example.
Collectively, these blogs form a distinct subset of the internet known as blogspace, which is increasingly valuable as a source of information for everyday users.
Then it goes on to tell us that search engines work to limit results from blogs in searches, and the difficulties that search engines sometimes have in identifying blogs:
Continue reading Do Search Engines Hate Blogs? Microsoft Explores an Algorithm to Identify Blog Pages
There are often three pieces of information displayed to searchers about each page in search results:
- The page title,
- The URL where that page can be found, and
- A summary of the page in the form of a snippet or snippets, taken from a meta description tag, from a directory listing such as DMOZ, or from actual text on the page itself.
One mystery is how a search engine generates a snippet when it takes one from the text of a page.
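One plausible approach, sketched here as an assumption rather than as Google’s actual algorithm, is to slide a fixed-size window over the page text and keep the window that contains the most query terms:

```python
def pick_snippet(page_text, query, window=8):
    """Choose the window of `window` consecutive words that contains
    the most query terms. A toy stand-in for real snippet generation."""
    words = page_text.split()
    terms = set(query.lower().split())
    best_start, best_score = 0, -1
    for i in range(max(1, len(words) - window + 1)):
        score = sum(1 for w in words[i:i + window]
                    if w.lower().strip(".,") in terms)
        if score > best_score:
            best_start, best_score = i, score
    return " ".join(words[best_start:best_start + window])

page = ("The hotel offers a quiet room with a view "
        "of the bay and free parking.")
print(pick_snippet(page, "room with a view"))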
Continue reading How does Google Pick Search Snippets for Your Pages to Show in Results?
How does a search engine use information from anchor text in links pointed to pages?
Why and how do some pages get crawled more frequently than others?
How might links that use permanent and temporary redirects be treated differently by a search engine?
A newly granted patent from Google, originally filed in 2003, explores these topics, and provides some interesting answers, and even some surprising ones.
Of course, this is a patent, and it may not describe the actual processes in use at Google. The processes described may be in use, or may have been at one point, but there has been plenty of time since the patent was filed for them to change.
It has long been observed and understood that different pages on the web get indexed at different rates, and that anchor text in hyperlinks pointing to pages can influence what a page may rank for in search results.
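As a concrete illustration of different crawling rates, here is one common adaptive-revisit heuristic. To be clear, this is an assumed textbook-style policy, not the scheme from the Google patent: revisit a page sooner if it changed since the last crawl, and back off when it did not.

```python
def next_crawl_interval(current_interval, changed,
                        min_iv=1.0, max_iv=64.0):
    """Adaptive revisit policy (an assumed heuristic, not the patent's):
    halve the interval (in days) when the page changed since the last
    crawl, double it when it did not, within fixed bounds."""
    iv = current_interval / 2 if changed else current_interval * 2
    return max(min_iv, min(max_iv, iv))

print(next_crawl_interval(8, changed=True))    # crawl more often
print(next_crawl_interval(8, changed=False))   # back off
```

Under a rule like this, frequently updated pages naturally settle into short revisit intervals while static pages drift toward the maximum.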
Continue reading Google Patent on Anchor Text and Different Crawling Rates
When I talk with someone about how a search engine works, I find it convenient to break the process down into three parts, because there are three primary functions that a search engine performs.
These three parts are Crawling, Indexing, and Serving Results. I like using this three-part breakdown because I find that it makes it easier to explain how each of those parts works by itself, and together with the other parts.
A patent granted to Google today, and originally filed in 2000, explores the first of those parts – the crawling of web pages.
This is an interesting area, because having some knowledge of it might help to explain why some pages on the Web get indexed, and why some other pages might not. There are a couple of links that I like to point people towards when I talk about Google and crawling web pages.
Continue reading Google on the Crawling of Web Sites
What’s a good way to organize the index of a search engine?
A way that is fast and returns a lot of relevant results? Maybe one that doesn’t need to search the whole index to find results?
A newly granted patent from Microsoft provides some interesting insights into indexing by document, and how static ranking factors may influence whether a document is in a main index partition, or if it might be found in a later partition acting like an extended index.
In a recent post from Dan Thies, Why Google Can’t Just “Dump” PageRank, he discusses the importance of PageRank as a mechanism a search engine can use to decide which pages to put in its main index, and which ones to put in its extended index.
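A rough illustration of the idea, with the threshold, scores, and fallback rule invented for this sketch: documents whose static score clears a bar land in the main partition, which is searched first, and the extended partition is consulted only when the main one comes up short.

```python
def partition_index(docs, threshold=0.5):
    """Split documents into main and extended partitions by a static
    rank score. The threshold and scores are illustrative."""
    main = {d for d, score in docs.items() if score >= threshold}
    extended = set(docs) - main
    return main, extended

def search(query_matches, main, extended, want=2):
    """Search the main partition first; fall back to the extended
    partition only if too few results were found."""
    hits = [d for d in query_matches if d in main]
    if len(hits) < want:
        hits += [d for d in query_matches if d in extended]
    return hits

main, extended = partition_index({"a": 0.9, "b": 0.6, "c": 0.2})
print(search(["a", "c"], main, extended))
```

The appeal is speed: most queries are answered from the small, high-quality main partition without touching the rest of the index.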
Continue reading Microsoft on Index Partitioning
What is the most important part of a page? If a page has images on it, what images are the most important ones?
If a search engine were to try to understand whether or not any images on the pages of a site were related to each other, how would it go about figuring that out?
The first two questions are easy to answer – the most important part of a page is the part that visitors focus upon when they look at it. The most important images are the ones that people look at and pay attention to when they are on that page.
A newly granted patent from Microsoft tries to answer all three questions in an automated manner, breaking a page down into blocks and assigning a level of importance to those blocks relative to each other – the probability that a user will focus upon each block (or upon images within it) when looking at the page.
It might consider the importance of one block to another on the same page, and on other pages within the same site, by looking at links between the blocks on those pages. It might check whether images appear within the same or related blocks, and also look for links to images from different blocks, to see if and how images might be related.
Continue reading Microsoft Playing with Blocks to Understand How Images Might be Related