How does a search engine decide when to show news items in web search results, and when not to?
If you live in Bealton, Virginia, chances are you're not too interested in news of a car crash in Brooklyn, New York, when searching for information about Brooklyn. If you're from Brooklyn, and want to find vacation information about the parks in Wisconsin, you may not be very concerned about the latest winning numbers in the Wisconsin lottery. Yet someone searching for information about one of the states bordering the Gulf of Mexico these days might well want to see news about the oil spill in the region.
A Yahoo patent filing published recently describes how the search engine might use a prediction system based upon its query logs to decide whether or not to show news results. The prediction system uses a mix of geographic information related to queries and to searchers, as well as information about how "newsworthy" a location might be, to make that determination. The patent tells us that similar prediction models might be created to decide whether or not to show other types of results as well. The patent application is:
System and Method of Geo-Based Prediction in Search Result Selection
Invented by Rosie Jones, Fernando Diaz, and Ahmed Hassan Awadallah
US Patent Application 20100161591
Published June 24, 2010
Filed December 22, 2008
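To make the idea concrete, here is a toy sketch of how such a geo-based decision might combine signals. The feature names, weights, and threshold below are all invented for illustration; the patent describes a model trained on query logs, not this hand-written rule.

```python
# Hypothetical geo-based "show news?" decision. Every weight and
# threshold here is made up for illustration purposes only.

def should_show_news(query_names_place: bool,
                     newsworthiness: float,
                     searcher_near_place: bool) -> bool:
    """Combine simple geographic signals into a yes/no decision.

    newsworthiness: a 0..1 score for how news-heavy the queried
    location currently is (e.g. a region with an ongoing oil spill).
    """
    score = 0.0
    if query_names_place:      # the query refers to a specific place
        score += 0.3
    if searcher_near_place:    # the searcher is in or near that place
        score += 0.3
    score += 0.4 * newsworthiness
    return score >= 0.5

# A searcher far from Brooklyn, with no major Brooklyn news story:
print(should_show_news(True, 0.1, False))   # -> False
# A searcher in the Gulf region during a major spill:
print(should_show_news(True, 0.9, True))    # -> True
```

A trained model would learn these weights from logged clicks on news results rather than fixing them by hand, but the shape of the decision is the same.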
Continue reading How Search Engines May Use Geography and Population Info in Deciding to Show News in Web Searches
Optical Character Recognition, or OCR, is a technology that can enable a computer to look at pictures that include text, and translate those visual representations of text into actual text. If you have words within images on your web pages, there’s a good chance that search engines are ignoring those words, when it comes to indexing your pages.
But that might change sometime in the future.
While OCR has been around for a while, search engines haven’t been using the technology when crawling and indexing the content of Web pages. Google’s webmaster guidelines tell us:
Try to use text instead of images to display important names, content, or links. The Google crawler doesn’t recognize text contained in images. If you must use images for textual content, consider using the “ALT” attribute to include a few words of descriptive text.
Yahoo’s page, How to Improve the Position of Your Website in Yahoo! Search Results provides the following tip:
Continue reading Teaching Computers to Read Newspapers: How a Search Engine Might Use OCR to Index Complex Printed Pages
Earlier this month I wrote about a granted Google patent, and a continuation of that patent filed earlier this year, that describe How Google Might Suggest Topics for You to Write About, by providing information to web publishers on queries and topics that are either under-represented in search results or where there’s more demand for information about those topics or queries than there are search results to meet that demand.
The topic struck home with a number of people, especially journalists, and I had a chance to have a conversation with Financial Times (FT.com) reporter Kenneth Li about Google's patents. The Financial Times ran with two different stories on the topic (Google shadow over new media groups, and Google eyes Demand Media's way with words), focusing primarily on how the technology involved in the patents could bring Google into competition with companies such as Demand Media, Associated Content, and AOL.
While searching through patent filings this morning, I came across an interesting newly published patent application from Demand Media. In the FT.com article on Demand Media, we’re told that:
Continue reading How Demand Media May Target Keywords for Profitability
When a search engine shows you results for a search, the pages shown are likely in order based upon a mix of relevance and importance.
But a search engine doesn’t usually stop there. It may look at other things to filter and reorder search results.
In 2006, I wrote 20 Ways Search Engines May Rerank Search Results, which described a number of ways that search engines may rerank pages. I followed that up in 2007 with 20 More Ways that Search Engines May Rerank Search Results.
I decided that it was time for a sequel or two in this series. I came up with another 25 reranking methods, but decided to stop at 10 in this post.
Many of the following are described in patents, and some of those patents were originally filed years ago – prehistoric times in Web years. The search engines may have incorporated ideas from those patents into what they are doing now, adopted those methods and since moved on to something new, or put them in a filing cabinet somewhere and forgot about them (I’d like the key to that filing cabinet).
Continue reading Another 10 Ways Search Engines May Rerank Search Results
This post may get you thinking about the benefits of using heading elements and lists on web pages for SEO purposes from a slightly different perspective than you may be used to.
Google uses a large number of signals to decide upon the order of pages shown in search results. Some of those signals measure the quality or importance of a web page, while others may indicate how relevant a page is for a particular search query entered into a search engine’s search box.
One fairly obvious relevancy signal is whether or not the words in a query actually appear upon a page that might be a search result for that query. If those words appear on the page more than once, the page might be considered even more relevant for that particular query than other web pages where the terms only appear once, or not at all.
Another factor that might indicate how relevant a page is for a particular set of terms is how close those terms might be on a page. While you could easily count the number of words between individual query terms to determine how close they are to each other, the formatting of web pages presents some challenges to the approach of simply counting words between terms, such as in a list like the following:
Continue reading Google Defines Semantic Closeness as a Ranking Signal
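The word-distance idea above can be sketched with a simple minimum-window calculation: how small is the smallest span of words that contains every query term? This is a simplified illustration, not Google's actual algorithm, and it deliberately ignores the list-formatting problem the post describes.

```python
# Simplified proximity scoring: find the smallest window of words
# containing all query terms. Not Google's actual method.

def min_window(words, terms):
    """Return the length of the smallest span of `words` containing
    every term in `terms`, or None if some term never appears."""
    terms = set(terms)
    best = None
    for i, w in enumerate(words):
        if w not in terms:
            continue
        seen = set()
        for j in range(i, len(words)):
            if words[j] in terms:
                seen.add(words[j])
            if seen == terms:
                span = j - i + 1
                if best is None or span < best:
                    best = span
                break
    return best

doc = "the quick brown fox jumps over the lazy dog".split()
print(min_window(doc, ["quick", "fox"]))   # -> 3 ("quick brown fox")
```

Raw word counting like this breaks down when terms sit in adjacent list items or table cells, which is exactly the challenge the patent's notion of semantic closeness is meant to address.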
Not every link from a page in a link-based ranking system is equal, and a search engine might look at a wide range of factors to determine how much weight each link on a page may pass along.
One of the signals used by Google to rank web pages looks at the links to and from those pages, to see which pages are linked to by others. Links from “important” pages carry more weight than links from less important pages. An important page under this system is one that is linked to by other important pages, or by a large number of less important pages, or a combination of the two. This signal is known as PageRank, and it is only one of a large number of Google ranking signals used to rank web pages and determine how highly those pages show up in search results in response to a query from a searcher.
An early paper by the founders of Google, The Anatomy of a Large-Scale Hypertextual Web Search Engine, tells us:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
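The random-surfer model translates into a simple iterative computation: repeatedly redistribute each page's rank across its outgoing links, with a damping factor for the surfer getting bored. The sketch below is a toy power-iteration version, not Google's production system.

```python
# Minimal power-iteration sketch of the random-surfer model.
# `links` maps each page to the list of pages it links to.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:            # dangling page: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outlinks:
                    new[q] += damping * rank[p] / len(outlinks)
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# "C" ends up with the most rank: it is linked to by both A and B.
```

Note that this toy model treats every link on a page equally; the "reasonable surfer" idea in the post is precisely about replacing that uniform split with weights based on link and document features.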
Continue reading Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data
Would search engines be better if they started web crawls from sites like Twitter or Facebook? Wikipedia or Mahalo? DMOZ or the Yahoo Directory?
The Web refreshes at an incredible rate, with new pages added, old pages removed, and words pouring out from blogs, news sites, and other genres of pages. Ecommerce sites showcase new products and eliminate old ones. New sites launch and old domains expire.
Search engines attempt to keep their indexes of the Web as fresh as possible, sending out crawling programs to find new pages, pick up changes, and note disappearances. Failure to do so means outdated search engines that deliver people to deleted pages, overwritten content, and stale indexes that miss out on new sites.
When a search engine starts crawling the Web, it often begins by following URLs from chosen seed sites to explore other pages and other domains. But how does a search engine choose those seed sites?
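One reason seed choice matters is reach: a well-connected seed lets a crawler discover more of the Web within the same crawl depth. The sketch below is a toy breadth-first "crawl" over an invented in-memory link graph; a real crawler fetches URLs over the network, and the domain names here are placeholders.

```python
# Toy illustration of seed selection: breadth-first traversal of an
# in-memory link graph, comparing coverage from different seeds.
from collections import deque

def crawl_coverage(graph, seed, max_depth=2):
    """Return the set of pages reachable from `seed` within max_depth hops."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        page, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

web = {
    "directory.example": ["a.example", "b.example", "c.example"],
    "a.example": ["b.example"],
    "b.example": [],
    "c.example": ["d.example"],
    "lone.example": ["a.example"],
}
# A hub-like seed reaches more pages in the same number of hops:
print(len(crawl_coverage(web, "directory.example")))   # -> 5
print(len(crawl_coverage(web, "lone.example")))        # -> 3
```

Reach is only one consideration; freshness, trustworthiness, and how quickly a seed's outlinks change all plausibly factor into a good seed set as well.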
Continue reading What Makes a Good Seed Site for Search Engine Web Crawls?
Would it surprise you to learn that searches on the Web make up around 10 percent of all pageviews, and indirectly lead to more than 21 percent of the pages viewed online? It surprised a couple of researchers from Yahoo.
That’s the result of a study conducted by Ravi Kumar and Andrew Tomkins from a sample of over 50 million user pageviews collected over eight days in March 2009. The information was captured through the Yahoo toolbar from people who agreed to the collection of data for this kind of analysis, and was supplemented by Yahoo’s search logs.
While the data is limited to users of the Yahoo toolbar who agreed to the use of the data, and doesn’t include mobile searches or searches that used AJAX to display results, it does capture how people browse the Web and search at a number of search engines as well as searches at sites like eBay and Amazon.
The study is described in a paper titled A Characterization of Online Search Behavior (pdf), and is being presented tomorrow at the WWW2010 Conference in a session dedicated to User Models on the Web.
Continue reading Yahoo Study Shows Search Responsible for 1 in 5 Pageviews Online