Web pages can be messy. A single page can cover more than one topic, and the templates that surround those topics often add little meaning to the meat of the content, filling the page with links and labels, advertising and boilerplate, copyright and other notices.
With such a diversity of topics, those pages may not be easy for search engines to crawl, record, and index, or for searchers to find.
When we think of search engines and how they work, we often break what they do down into three main parts: discovering new pages and new content on old pages, indexing content on those pages following rules that show a preference for important pages and unique content, and presenting relevant and meaningful information that meets searchers’ intents (or at least matches their keywords) in response to the queries they enter into a search box.
We usually don’t think of search engines as indexing parts of pages, chunks of information that might exist side-by-side with very different topics, and yet many pages are messy like that.
But white papers and patent filings from search engineers have shown signs that search engines might try to segment pages and capture information about the different topics found on the same page.
Continue reading Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS)
Come gather ’round people
Wherever you roam
And admit that the waters
Around you have grown
And accept it that soon
You’ll be drenched to the bone.
If your time to you
Is worth savin’
Then you better start swimmin’
Or you’ll sink like a stone
For the times they are a-changin’.
– Bob Dylan
Can the rate of change on web pages influence how Google might rank the pages of a site?
In part one of this series, I looked at how Google’s patent on Information retrieval based on historical data focused upon Freshness.
This second part of the series explores how Google might look at content changes on Web pages, and how the frequency of those changes might influence how those pages may rank at the search engine. Keep in mind that we don’t know for certain whether Google is even using the processes described in this patent. But it is a possibility.
Continue reading Updating Google’s Historical Data Patent, Part 2 – Changing Content
Before becoming a co-founder of the new search engine Cuil, Anna Lynn Patterson worked at Google on a way of looking at how often different phrases appear together on pages on the Web. That work is described in a series of patent applications which share a common description, with different claims sections that itemize different parts of that description.
I summarized the description from one of the patent filings in my post from December 29, 2006, Phrase Based Information Retrieval and Spam Detection.
One of the patent applications from that series, Automatic taxonomy generation in search results using phrases, which I hadn’t originally come across back in 2006, was granted today, and covers the idea of taking documents that share related phrases, and clustering them with the related phrases to provide search results that might cover a range of categories related to search queries.
Continue reading Google Phrase Based Indexing Patent Granted
Images on a web page can provide a chance to express ideas in a visual way that can convey a considerable amount of information, and may also add to the attractiveness and perceived quality of a site.
When search engines rank pages in search results, images may have some impact in those rankings.
A search engine might look at the captions associated with pictures, or at alt text provided as an alternative for people who browse the Web with images turned off or who use screen reading software.
Search engines might also look at text surrounding an image, especially within the same HTML container, or block or segment.
Those indexing services could also associate other content on a page with an image, including the page’s title.
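As a rough illustration of the text signals mentioned above, the sketch below gathers alt text, a caption, and the page title for each image on a page. It is only an assumption about how such signals could be collected, not a description of how any search engine does it. The class name `ImageTextCollector`, the helper `image_text`, and the choice to treat a `figcaption` that follows an image inside the same `figure` as its caption are all hypothetical.

```python
from html.parser import HTMLParser


class ImageTextCollector(HTMLParser):
    """Hypothetical collector of text that could be associated with images."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []             # one dict per <img> found
        self._in_title = False
        self._in_caption = False
        self._figure_images = None   # images seen inside the open <figure>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "figure":
            self._figure_images = []
        elif tag == "figcaption":
            self._in_caption = True
        elif tag == "img":
            image = {
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "caption": "",
            }
            self.images.append(image)
            if self._figure_images is not None:
                self._figure_images.append(image)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "figcaption":
            self._in_caption = False
        elif tag == "figure":
            self._figure_images = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_caption and self._figure_images:
            # Simplification: only captions that come after the image are attached.
            for image in self._figure_images:
                image["caption"] = (image["caption"] + " " + data.strip()).strip()


def image_text(html):
    """Return each image with its alt text, caption, and the page title."""
    collector = ImageTextCollector()
    collector.feed(html)
    collector.close()
    page_title = collector.title.strip()
    return [dict(image, page_title=page_title) for image in collector.images]
```

An image outside a `figure` simply ends up with an empty caption here. Text that surrounds an image in the same block, which is also mentioned above, would need a segmentation step that this sketch does not attempt.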
Continue reading How Search Engines May Use Images to Rank Web Pages
In March, one of the more interesting patent filings from Google was granted, Information retrieval based on historical data.
I had discussed it on forums when the original patent application came out in March of 2005, but didn’t provide a write up of the document here. I realized a few weeks ago that I probably should.
The historical data patent is important because it discusses a large number of techniques that a search engine might use in fighting “spamming techniques” that might artificially “inflate” the rankings of web sites, and it works to identify “stale” sites that may be ranked higher than fresher sites that might contain more recently updated information.
I’ll be writing a few posts over the next few weeks about the patent, and will try to include some updates that have happened since it was first published. This first post looks at how the “freshness” of a page or document might influence its rankings in search results.
Continue reading Updating Google’s Historical Data Patent, Part 1 – Freshness
The Web is filled with page after page after page of data. That data is usually organized differently from one site and one page to another, and contained in text, in pictures, in videos, in audio, in columns, in rows, in frames, and many other formats.
When a search engine spider comes to a page on the Web, it will try to go through all of the text it finds, make note of links to other pages, consider alt text for images, and view meta data tags.
Search engine spiders will decide whether or not the content of pages should be indexed by the search engine, and determine which links to follow next.
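The kind of pass a spider makes over a page can be sketched with Python’s standard `html.parser` module. This is a minimal illustration under my own assumptions, not how any real crawler is built. The names `PageSignals` and `extract_signals` are hypothetical, and the sketch only collects visible text, link targets, image alt text, and named meta tags. It makes no decision about indexing or about which link to follow next.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Hypothetical collector for text, links, alt text, and meta tags."""

    def __init__(self):
        super().__init__()
        self.text = []        # visible text fragments, in document order
        self.links = []       # href values of <a> elements
        self.alt_texts = []   # alt attributes of <img> elements
        self.meta = {}        # meta name -> content
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies and whitespace-only fragments.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def extract_signals(html):
    """Parse one HTML document and return the populated collector."""
    parser = PageSignals()
    parser.feed(html)
    parser.close()
    return parser
```

A real spider would also resolve relative links against the page URL and respect robots directives, both of which are left out here.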
Continue reading Search Engines Extracting Table Data on the Web
When you’re searching for something on the Web, does it matter whether you use the singular or plural version of a word in your search?
For example, let’s say that you are looking for a new pair of sneakers to go jogging in, and you want to find the right combination of comfort and support, so you decide to look into the best sneakers for running. Does it make a difference in search results when you type in running shoes or running shoe in a search box?
If a search engine just returned results to you based upon your choice of a singular or plural search term, would you get the best results? Should a search engine explore both versions, and try to provide you with a mix of results based upon what it believes are the best results, after looking at results from the singular and plural versions?
A quick look at the top ten results at Yahoo and at Google for both “running shoe” and “running shoes” (both searches without the quotation marks) showed some overlap in pages returned for singular and plural versions at each search engine, but the vast majority of search results seem to focus upon returning results for the plural version of the word, instead of the singular version.
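One way to picture a search engine exploring both versions of a query is to run each variant and interleave the ranked results. The sketch below is only an illustration under stated assumptions: the pluralization rule is a naive add-or-drop of a trailing “s” on the last word, `search` is a hypothetical function that returns a ranked list of page identifiers for a query string, and nothing here reflects how Yahoo or Google actually blend results.

```python
def variants(term):
    """Return the term plus a naive singular or plural counterpart.

    Assumption: only the trailing-"s" rule is handled, so irregular nouns
    and words that merely end in "s" will be wrong.
    """
    if len(term) > 1 and term.endswith("s"):
        return [term, term[:-1]]
    return [term, term + "s"]


def query_variants(query):
    """Build query strings that differ only in the number of the last word."""
    words = query.lower().split()
    if not words:
        return []
    head, last = words[:-1], words[-1]
    return [" ".join(head + [form]) for form in variants(last)]


def merged_results(query, search, limit=10):
    """Interleave ranked results from each variant, dropping duplicates."""
    ranked_lists = [search(q) for q in query_variants(query)]
    longest = max((len(results) for results in ranked_lists), default=0)
    merged, seen = [], set()
    for rank in range(longest):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:limit]
```

Interleaving by rank treats both versions equally. A search engine that favored the plural version, as the quick comparison above seems to suggest, could instead weight one list more heavily than the other.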
Continue reading How a Search Engine Might Handle Singular and Plural Queries
When someone searches at one of the major search engines, they often type in keyword phrases, composed as if they were written for human readers. Those phrases may contain words or phrases that show up very frequently in pages on the web, and have little to do with the information being sought by the searcher.
Search engines that focus upon retrieving search results based upon keywords found in queries have often ignored those frequently appearing and irrelevant words contained within search query terms.
Those words have been referred to in the past by Google as “stopwords,” and could be words like: a, and, is, on, of, or, the, was, with. Similar groups of words that appear very commonly on web pages, and are also unconnected to an actual search, could be referred to as “stop-phrases.”
The word “a” in the query “a London hotel” is a stopword.
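The simplest form of ignoring stopwords can be sketched in a few lines. This is a naive filter written for illustration, not the method described in the patent. The stopword list is just the set of example words given above, and the function name `strip_stopwords` and the fallback for all-stopword queries are my own assumptions.

```python
# Example stopwords taken from the list above; a real list would be longer.
STOPWORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}


def strip_stopwords(query):
    """Return the query terms with stopwords removed.

    Assumption: if every term is a stopword, the original terms are kept
    so that the query is not emptied completely.
    """
    terms = query.lower().split()
    kept = [term for term in terms if term not in STOPWORDS]
    return kept or terms
```

For the query “a London hotel”, this keeps “london” and “hotel” and drops “a”. A filter this blunt would also strip words that matter to the meaning of some queries, which is the weakness of ignoring stopwords outright.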
Continue reading Google Stopwords Patent