A new patent application on near duplicate content from Google explores using a combination of document similarity techniques to keep searchers from finding redundant content in search results.
The Web makes it easy for words to be copied and spread from one page to another, and the same content may be found at more than one web site, regardless of whether its author intended it to be or not.
How do search engines cope with finding the same words, the same content, at different places?
How might they recognize that the pages they would want to show in response to a search query contain the same information, and show only one version so that searchers don’t get buried in redundant results?
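To make the idea of "the same words at different places" a little more concrete, here is a minimal sketch of one well-known document similarity technique: word shingling compared with Jaccard similarity. It is not taken from the Google patent filing. The function names, the shingle size of 3 words, and the 0.8 threshold are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing, if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.8):
    """Hypothetical helper: flag two documents as near duplicates."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

A search engine working at Web scale would need something far more efficient than comparing every pair of pages this way, but the sketch shows the basic comparison: two pages that share most of their word sequences score close to 1.0, and only one of them needs to appear in the results.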
A lot of web pages and documents reuse the same text in sidebars and in footers at the bottoms of pages, like copyright notices and navigation sidebars.
Computer programmers will sometimes use the term “boilerplate” code to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what a contract is about.
It might be a good step for a search engine to ignore boilerplate text when it indexes pages, or when it uses the content of pages to create query suggestions for someone using a personalized desktop search. Ignoring boilerplate in those documents could also be helpful when using them to rerank search results in personalized search.
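One simple way to picture boilerplate detection is to treat any line that shows up on most pages of a site as boilerplate. The sketch below does that. It is an illustration under my own assumptions, not a method described in any patent filing: the function names and the 80% cutoff are hypothetical, and pages are assumed to be plain text with one block of content per line.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return lines that appear on at least min_share of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = min_share * len(pages)
    return {line for line, n in counts.items() if n >= cutoff}


def strip_boilerplate(page, boilerplate):
    """Keep only the lines of a page that are not flagged as boilerplate."""
    kept = [line for line in page.splitlines()
            if line.strip() and line.strip() not in boilerplate]
    return "\n".join(kept)
```

Run over a set of pages from one site, the navigation and copyright lines fall out, and what remains is the text that is unique to each page.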
New York Times Boilerplate
Search engines are getting smarter about the phrases that they see and understand online, and Yahoo recently published a patent application that describes a number of the ways that they learn about and understand the use of phrases in documents on the Web.
Exploring how Yahoo might use phrases to rerank search results may show how they may try to understand data from published documents on the Web, and from log files that collect information about the queries that people use when they search for information about different concepts.
From Keyword Matching to Phrase-Based Indexing
A page’s placement in search results for certain queries can involve looking at ranking criteria and algorithms applied to documents involving keywords in search queries for things like:
Many patent filings and papers from the search engines discuss ways that they might shuffle around search results to try to provide more relevant responses to people’s searches.
Imagine a search engine changing around the results that you see, not based upon the time that a page is published, but rather on some estimate of the importance of a page to you, and how that importance might vary with time and your personal calendar. Sounds like a tricky proposition, doesn’t it?
Microsoft adds an element of time to search results by introducing a search system that pays more attention to what is happening on your computer and within your company intranet.
This more complex search system could be used to index both search results and information found on a person’s desktop and local network. The search system would pay more attention to the context of searches and add personalization to those searches by building a user profile to distinguish how important different information might be to each individual searcher.
When you start typing a query into the search box at Yahoo, you’ll see a dropdown appear under the search box with suggestions predicting the queries you may want to see Web search results for, even before you finish typing.
But presently you only see those suggestions for Web search results. I wrote about those Yahoo search suggestions in Predictive Queries versus Unique Searches.
It would be interesting to see suggestions from some of Yahoo’s other databases appearing, such as image search or local search.
A couple of recent patent applications from Yahoo, related to the “predictive queries” patent filing, explore how the context of a search and historic search patterns may trigger suggestions from other search databases.
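For readers who want to see the general idea behind a suggestion dropdown, here is a minimal sketch that ranks past queries matching what has been typed so far by how often they were searched. It is not Yahoo's method. The function name and the shape of the query log (a dictionary of query text to search count) are assumptions for illustration.

```python
def suggest(prefix, query_counts, limit=3):
    """Return up to `limit` logged queries starting with prefix, most popular first."""
    prefix = prefix.lower().strip()
    if not prefix:
        return []
    matches = [(q, n) for q, n in query_counts.items() if q.startswith(prefix)]
    # Sort by count (highest first), then alphabetically to break ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]
```

Drawing suggestions from other databases, such as image or local search, could be pictured as running the same lookup against a separate log for each database and merging the results.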
I write a lot about patents and white papers from search engines, and sometimes the subjects covered in those documents can get technical pretty quickly.
I encourage people who are just starting out in SEO to leave comments, and ask questions, but I know that sometimes a closer look at some of the basics may be what visitors here might be looking for.
Fortunately, there are a lot of flavors of blogs focusing upon search engine optimization and internet marketing, and my blog roll is filled with blogs from people taking many different approaches, from different perspectives.
A friend of mine, Kimberly Bock, has a compelling blog focusing primarily upon Responsible Networking (no longer available).
Some recent recommended posts from Kim include:
Not too long ago, if you entered in Google the phrase (without quotation marks) “a room with a view,” you might have received some warnings that your query contained “Stop Words.”
Stop words are words that appear so frequently in documents and on web pages that search engines would often ignore them when indexing the words on pages. These could be words like: a, and, is, on, of, or, the, was, with.
Good bye to stop words?
In that search for “a room with a view,” you might have received results like “a room for a view,” or “room to view,” or other phrases that replaced some stop words with others. That made it less likely to find exactly what you were looking for when you searched for a phrase with stop words in it.
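A small sketch shows why dropping stop words makes phrases harder to find. It uses the example stop words listed above; the function name is hypothetical, and real search engines have used longer and more carefully built lists.

```python
# Example stop words from the list above.
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}


def content_terms(query):
    """Return the words of a query that survive stop word removal."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(content_terms("a room with a view"))
print(content_terms("the room of a view"))
```

Both queries reduce to the same two words, "room" and "view", so an index built this way has no means of telling the two phrases apart.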
Choosing the right character set for your web page might mean that it is easier for a search engine to understand what language your page is in, though there are also other ways that it might be able to determine that.
But, what about when someone types in a query?
- How does a search engine know what language a search query might be in?
- How does it handle queries in different languages made on devices that might not be capable of creating some special characters outside of the Latin alphabet?
Also, do webpages that use a certain character set (something that webmasters can choose in their HTML for a page) stand a better chance of having the language that they use be identified more easily by a search engine?
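One piece of the special characters question can be sketched in code: matching a query typed without accents against text that contains them. The example below uses Unicode normalization from Python's standard library to fold accented letters to their base letters. This is one common approach and an assumption on my part, not a description of how any particular search engine handles it. The function name is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Lowercase text and strip combining accent marks from its letters."""
    # NFKD splits a letter like "ñ" into "n" plus a combining tilde.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_accents("España") == fold_accents("espana"))
```

Folding both the query and the indexed text this way lets "espana" match "España", though it also erases distinctions that matter in some languages, which is part of why identifying the query language is useful.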