Sometimes, when it comes to intellectual property in the world of search engines, you see an idea appear to repeat itself.
Imagine that you could take a body of queries and classify them to get a sense of the searchers' intent. Also consider the notion that you could then split a large database into a number of smaller specialized databases, so that when someone did a search, only some of those databases needed to be looked at to deliver results, based upon the classification of the query, with results from more than one database merged together.
Would this method result in more efficient and relevant search results, with less costly processing?
Routing based upon such classifications could bring other databases or search processes into play, through a look at the query submitted to the search engine and patterns noticed in query phrases.
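The idea above can be made concrete with a small sketch. Everything in it, from the topic keyword sets to the per-topic indexes, is a hypothetical illustration of query classification and routing, not anything taken from a specific patent:

```python
# A minimal sketch of query classification and database routing.
# TOPIC_KEYWORDS and SPECIALIZED_INDEXES are invented for illustration.

# Keyword-based classifier: map a query to the topics it appears to match.
TOPIC_KEYWORDS = {
    "shopping": {"buy", "price", "cheap", "deal"},
    "travel":   {"flight", "hotel", "vacation"},
    "health":   {"symptom", "treatment", "doctor"},
}

# One small specialized index per topic: term -> list of (doc, score).
SPECIALIZED_INDEXES = {
    "shopping": {"cheap": [("store-a.example", 0.9)]},
    "travel":   {"hotel": [("travel-b.example", 0.8)]},
    "health":   {"doctor": [("clinic-c.example", 0.7)]},
}

def classify_query(query):
    """Return the topics whose keyword sets overlap the query's terms."""
    terms = set(query.lower().split())
    return [t for t, kw in TOPIC_KEYWORDS.items() if terms & kw]

def routed_search(query):
    """Search only the indexes matching the query's classification,
    then merge the partial result lists by score."""
    results = []
    for topic in classify_query(query):
        index = SPECIALIZED_INDEXES[topic]
        for term in query.lower().split():
            results.extend(index.get(term, []))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

A query like "cheap hotel" would touch only the shopping and travel indexes, leaving the health database untouched, which is where the efficiency gain would come from.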
Continue reading What Do You Do With a Database of AOL User Queries?
The folks at Marketing Shift have issued a Search Engine Football Challenge, and started a fantasy football league (U.S. football) at web 2.0 fantasy football site FleaFlicker.com.
I just signed up, and added the Delaware Bay Picaroons to the league. Barry Schwartz (RustyBrick), Garrett French (Search Engine Lowdown), and Thomas Shaffer (MSN) are some of the other folks playing. It looks like there’s a division for search engine marketers, and another one for search engine employees.
If you are interested in playing, contact Evan. Contact and other information is included here.
How do you estimate the size of a search engine’s index? And how do you grab a random page from that search engine?
A new Google employee, Ziv Bar-Yossef, gave a presentation at Google on August 17th answering those questions, which is available on a Google Techtalk video: Random Sampling from a Search Engine’s Index (video).
Ziv Bar-Yossef was most recently at Technion – Israel Institute of Technology, Israel, and as noted in the video, became a Google employee a couple of weeks ago. Before Technion, he was a researcher at the IBM Almaden Research Center.
The presentation is based upon a paper which won the 2006 International World-Wide Web Conference Best Paper Award: Random Sampling from a Search Engine’s Index
Being able to grab random pages from a search engine’s index can provide some interesting information about that search engine. The presentation compares things such as the number of dead pages in Google, MSN, and Yahoo, as well as the freshness of text on each, and what percentage of dynamic pages they have indexed.
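The core trick behind samplers like the one in that paper is rejection sampling: documents that match many queries get drawn more often, so a draw is accepted with probability inversely proportional to how heavily the query pool weights that document. Here is a toy sketch of that idea; the tiny query pool and simulated index are made up, and the 0.5 acceptance constant is just chosen to keep the probability at or below one for this toy data:

```python
import random

# Toy sketch of the rejection-sampling idea behind pool-based samplers:
# pick a random query from a pool, pick a random result for it, then
# accept the document with probability inversely proportional to its
# draw weight so the final sample is near-uniform over the index.
# QUERY_POOL and FAKE_INDEX are invented for illustration.

QUERY_POOL = ["apple", "banana", "apple pie"]

# Simulated search engine: query -> matching documents.
FAKE_INDEX = {
    "apple":     ["doc1", "doc2"],
    "banana":    ["doc3"],
    "apple pie": ["doc1"],
}

def weight(doc):
    """Probability mass with which a query-then-result draw hits this doc
    (summed over pool queries, each result list sampled uniformly)."""
    return sum(1.0 / len(FAKE_INDEX[q]) for q in QUERY_POOL if doc in FAKE_INDEX[q])

def sample_document(rng=random):
    """Draw one near-uniform document via rejection sampling."""
    while True:
        query = rng.choice(QUERY_POOL)        # random query from the pool
        doc = rng.choice(FAKE_INDEX[query])   # random result for that query
        # Accept inversely to the doc's draw weight; the 0.5 constant
        # keeps the acceptance probability <= 1 for this toy data.
        if rng.random() < 0.5 / weight(doc):
            return doc
```

With enough accepted samples, the fraction of dead links, dynamic URLs, or stale text among them estimates those same fractions across the whole index, which is how the cross-engine comparisons in the talk are made.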
Continue reading How Do You Estimate the Size of A Search Engine?
Assignments of query themes, favored and non-favored pages, ranking based upon editorial opinion – a new patent from Google provides an interesting way of ranking search results in response to queries. Here’s a quick summary of the processes described in this patent granted today to Google.
(1) A method that provides search results which includes:
(a) receiving a search query,
(b) retrieving one or more pages in response to the search query,
(c) determining whether the search query corresponds to at least one query theme of a group of query themes,
(d) ranking the one or more pages based on a result of the determination; and
(e) serving those ranked pages.
(2) A method for determining an editorial opinion parameter for use in ranking search results, which includes:
(a) developing one or more query themes,
(b) identifying, for each query theme, a set of favored pages,
(c) identifying, for each query theme, a set of non-favored pages; and
(d) determining an editorial opinion parameter for all of the pages in those sets.
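Putting the two processes together, a sketch of how the re-ranking might work could look like the following. The themes, favored and non-favored page sets, and the simple +1/-1 opinion values are all invented for illustration; the patent does not specify these particulars:

```python
# Hypothetical sketch of theme-based re-ranking with an editorial
# opinion parameter. All data below is invented for illustration.

QUERY_THEMES = {
    "mortgage rates": "finance",
    "heart disease":  "medical",
}

FAVORED     = {"finance": {"bank.example"},        "medical": {"clinic.example"}}
NON_FAVORED = {"finance": {"spam-loans.example"},  "medical": set()}

def editorial_opinion(page, theme):
    """+1 for favored pages, -1 for non-favored, 0 otherwise."""
    if page in FAVORED.get(theme, set()):
        return 1.0
    if page in NON_FAVORED.get(theme, set()):
        return -1.0
    return 0.0

def rerank(query, scored_pages):
    """scored_pages: list of (page, base_score) pairs. When the query
    matches a theme, re-rank using the editorial opinion parameter;
    otherwise fall back to the base scores."""
    theme = QUERY_THEMES.get(query)
    if theme is None:
        return sorted(scored_pages, key=lambda p: p[1], reverse=True)
    return sorted(scored_pages,
                  key=lambda p: p[1] + editorial_opinion(p[0], theme),
                  reverse=True)
```

In this toy version, a favored page with a modest base score can outrank a non-favored page with a high one, which captures the "ranking based upon editorial opinion" flavor of the claims.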
Continue reading Google looks at Query Themes and Reranking Based upon Editorial Opinion
Somehow I missed this video tour of Yahoo’s headquarters when it came out on the Yahoo Corporate blog.
The purple cow in the front lobby is a nice touch, and the trip inside the data center is intriguing, too.
Can looking at web traffic flowing through internet access points from Internet Service Providers help a search engine crawl the web more effectively?
A patent originally developed by the folks at Fast Search and Transfer, and assigned to Overture, was granted last week on the topic of improving the crawling of web pages by looking at that traffic, and it lays out the framework for doing so in fine detail. It also points out some of the limitations of crawling without such a practice, while explaining many of the benefits.
Some of these limitations include problems with:
- Starting to crawl the web from seed pages,
- The limited amount of access time crawlers have to servers,
- Difficulties crawlers have in retrieving dynamic objects, and
- Link topology as a source of relevance.
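One way to picture the benefit is that URLs observed in ISP traffic can seed and prioritize a crawl frontier directly, surfacing popular and dynamic pages that link-following from seed pages might never reach. The log format and the popularity-based priority rule below are my assumptions, not details from the patent:

```python
import heapq
from collections import Counter

# Rough sketch: URLs seen in traffic at an ISP access point seed a
# crawl frontier ordered by observed request count, so popular and
# dynamic pages are crawled first. The log format is assumed.

def build_frontier(traffic_log):
    """traffic_log: iterable of requested URLs seen at the access point.
    Returns a priority frontier, most-requested URLs first."""
    counts = Counter(traffic_log)
    # heapq is a min-heap, so store negative counts for max-first order.
    frontier = [(-n, url) for url, n in counts.items()]
    heapq.heapify(frontier)
    return frontier

def next_to_crawl(frontier):
    """Pop the most-requested URL still awaiting a crawl."""
    neg_count, url = heapq.heappop(frontier)
    return url

# Example: a dynamic URL with a query string appears in the log even
# though no seed page may link to it.
log = ["a.example/", "b.example/?id=7", "a.example/", "a.example/"]
frontier = build_frontier(log)
```

Note how the dynamic URL `b.example/?id=7` enters the frontier at all, which addresses the third limitation above, while the request counts address the first: the crawl no longer depends on seed pages and link topology alone.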
Continue reading How a Search Engine Might Use Information from an ISP While Capturing Traffic Flows
I had the good fortune to be able to meet Jim Hedger at the San Jose SES a little over a week ago. While we didn’t have the opportunity to talk at great length, it was nice to meet him. I’ve been reading his blog posts and articles for a few years now. I really enjoyed one of his latest.
On the Tuesday during the four day conference, I ran into Jill Whalen, who had just finished an interview with someone outside of the press room in the conference hall. It was good to be able to say hi, though I caught Jill going to another interview. Seems like she had a pretty full day of interviews. One of them was with Jim – Jill Whalen Interviewed at SES San Jose. Jill makes some pretty astute observations. Definitely worth a read.
Jill talks about the growth and maturation of the Search Marketing Industry, a larger focus on in-house SEO, more women in the search sector, the importance of educating clients, and the next High Rankings Seminar in Texas in October. I’ve been a guest at a couple of those seminars, and I’d highly recommend them to people interested in learning more about search engine marketing.
Nice interview, Jim and Jill.
A new patent application from Microsoft looks at content generated to spam search engines. Here’s the problem, as noted in the patent filing:
In the best case, search engine optimizers help web site designers generate content that is well-structured, topical, and rich in relevant keywords or query terms. Unfortunately, some search engine optimizers go well beyond producing relevant pages: they try to boost the ratings of a web site by loading pages with a wide variety of popular query terms, whether relevant or not. In fact, some SEOs go one step further: Instead of manually creating pages that include unrelated but popular query terms, they machine-generate many such pages, each of which contains some monetizable keywords (i.e., keywords that have a high advertising value, such as the name of a pharmaceutical, credit cards, mortgages, etc.). Many small endorsements from these machine-generated pages result in a sizable page rank for the target page. In a further escalation, SEOs have started to set up DNS servers that will resolve any host name within their domain, and typically map it to a single IP address.
Most if not all of the SEO-generated pages exist solely to mislead a search engine into directing traffic towards the “optimized” site; in other words, the SEO-generated pages are intended only for the search engine, and are completely useless to human visitors.
I recognized this quote, which is taken from an interesting research paper from Microsoft, Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. If you are interested in how search engines are attempting to fight web spam, it’s a “must read” paper.
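The flavor of that statistical approach can be sketched in a few lines: compute simple content features of a page and flag pages whose values fall far outside what normal prose produces. The feature set and threshold below are illustrative guesses of mine, not the paper's actual features or classifier:

```python
# Simplified sketch of content-analysis spam detection: compute
# statistical features of a page's text and flag outliers. The
# features and the 0.3 threshold are illustrative, not from the paper.

def content_features(text):
    """Return a few simple statistics about a page's visible text."""
    words = text.lower().split()
    unique = set(words)
    return {
        "word_count": len(words),
        # Machine-generated, keyword-stuffed pages often repeat a small
        # vocabulary, giving an unusually low unique-word fraction.
        "unique_fraction": len(unique) / len(words) if words else 0.0,
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
    }

def looks_like_spam(text, min_unique_fraction=0.3):
    """Flag pages whose word repetition is far outside typical prose."""
    feats = content_features(text)
    return feats["word_count"] > 10 and feats["unique_fraction"] < min_unique_fraction
```

A page that repeats "cheap mortgage" fifty times gets flagged, while ordinary prose with a varied vocabulary passes, which is the statistical-outlier intuition the paper builds on at much larger scale.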
Continue reading Page Quality and Web Spam: Using Content Analysis to Detect Spam Pages