Unfamiliar with a topic, and want to find a simple page on the subject, one that doesn’t require background reading or knowledge to understand?
More familiar with that subject, and want to find an advanced page on the web?
Could a search engine help you find pages and rerank them based upon how familiar you may indicate that you are with the topic related to your query? It’s possible.
A search engine might pay attention to the following when indexing pages:
- Reading levels for the page,
- Word lengths of sentences and other features of text on the page,
- How simple or complex the stopwords* used upon a page may be.
*Stopwords are the most frequently appearing words in a search engine’s index, and they often aren’t indexed because they appear so frequently. Some stopwords are more complex than others. Stopwords in U.S. English that indicate a page is simple and informal might include: “so, enough, just, in, needs, help, each, away.” Stopwords in U.S. English that might show a page to be more complicated and formal may include: “if, cause, while, way, though, which, us.”
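To make the stopword idea concrete, here is a minimal sketch in Python. It is not the method from the patent filing. The two word lists are copied from the examples above, and the function name `stopword_score` and its scoring formula are assumptions made for illustration only.

```python
import re

# Word lists copied from the examples in this post; a real classifier
# would presumably learn its lists and weights from labeled pages.
INTRODUCTORY = {"so", "enough", "just", "in", "needs", "help", "each", "away"}
ADVANCED = {"if", "cause", "while", "way", "though", "which", "us"}


def stopword_score(text):
    """Return a value in [-1, 1] (hypothetical scale).

    Negative values lean introductory, positive values lean advanced,
    and 0.0 means no signal or an even split.
    """
    words = re.findall(r"[a-z']+", text.lower())
    intro = sum(w in INTRODUCTORY for w in words)
    adv = sum(w in ADVANCED for w in words)
    total = intro + adv
    if total == 0:
        return 0.0
    return (adv - intro) / total
```

A page that uses only words from the first list would score -1.0, and a page that uses only words from the second list would score 1.0. Real pages would fall somewhere in between.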
Yahoo and Topic Familiarity
You may have seen the Yahoo! Mindset page (no longer available), which reranks search result pages based upon whether those pages are more commercial or more informational. After you enter a search and see a results page in Yahoo! Mindset, you also see a slider bar at the top of the results that has the word “shopping” on one side and “researching” on the other, with a ball on the line in the middle that you can slide towards either end. If you slide that ball either way, the search results under it change. Sliding towards “shopping” brings back more commercial sites. Sliding towards “researching” returns more informational sites.
Imagine that instead of “shopping” and “researching,” one side said “introductory” and the other side said “advanced.” That’s similar to the idea behind a new patent application that appears to be based upon research conducted at Yahoo!
A paper from 2005, Biasing Web Search Results for Topic Familiarity, explores this topic. It was written by Giridhar Kumaran of the University of Massachusetts, and Rosie Jones and Omid Madani of Yahoo. The paper was presented last year at the 14th ACM International Conference on Information and Knowledge Management in Bremen, Germany, and appears in the conference proceedings.
A patent application published last week seems to cover the same ground, and shares the same authors:
System and method for biasing search results based on topic familiarity
Invented by Rosie Jones, Giridhar Kumaran, and Omid Madani
US Patent Application 20060212423
Published September 21, 2006
Filed: March 16, 2006
Abstract
A familiarity level classifier comprises a stopwords engine for conducting a stopwords analysis of stopwords, e.g., introductory-level stopwords and advanced level stopwords, in a document, e.g., a website; and a familiarity level classifier module for generating a document familiarity level based on the stopwords analysis. The classifier may be in an indexing module, a search engine, a user computer, or elsewhere in a computer network. The classifier may also include a reading level engine for conducting a reading level analysis of the document, and wherein the familiarity level classifier module is configured to generate the familiarity level also based on the reading level analysis. The classifier may also include a document features engine for conducting a feature analysis of the document, and wherein the familiarity level classifier module is configured to generate the document familiarity level also based on the feature analysis.
Classification of a Familiarity Level
When indexing pages, the familiarity level classifier looks at three types of things:
- The distribution of stopwords in the text,
- Document reading level, and;
- Document features such as average line-length.
The topic searched for, the queries used, and information about the searcher aren’t considered at all.
Some reading level measures that might be used:
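As a rough illustration of the third item, document features, the sketch below counts a few layout features of a plain-text page. The function name `document_features` and the choice of features are assumptions. The filing mentions average line length, and the other counts are simple companions to it.

```python
def document_features(text):
    """Compute simple layout features of a plain-text document.

    A paragraph is treated here as a run of consecutive non-blank lines,
    which is an assumption made for this sketch.
    """
    lines = text.split("\n")
    text_lines = [ln for ln in lines if ln.strip()]
    blank_lines = len(lines) - len(text_lines)

    paragraphs = 0
    in_paragraph = False
    for ln in lines:
        if ln.strip():
            if not in_paragraph:
                paragraphs += 1
            in_paragraph = True
        else:
            in_paragraph = False

    if text_lines:
        average_line_length = sum(len(ln) for ln in text_lines) / len(text_lines)
    else:
        average_line_length = 0.0

    return {
        "characters": len(text),
        "text_lines": len(text_lines),
        "blank_lines": blank_lines,
        "paragraphs": paragraphs,
        "average_line_length": average_line_length,
    }
```

Numbers like these only become useful once a classifier has been trained on pages already labeled as introductory or advanced. On their own they say little about a page.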
- The Gunning Fog measure,
- The Flesch measure,
- The Kincaid measure,
- Number of characters,
- Number of words,
- Percentage of complex words,
- Number of sentences,
- Number of text lines,
- Number of blank lines,
- Number of paragraphs,
- Number of syllables per word,
- Number of words per sentence, and/or;
- Others
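Several of the measures above have well-known published formulas. The sketch below computes three of them, assuming that “the Kincaid measure” refers to the Flesch-Kincaid grade level. The syllable counter is a crude vowel-group heuristic chosen to keep the example short, so the scores are approximate. The function names are also only illustrative.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def reading_levels(text):
    """Return approximate readability measures for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    # "Complex" words are approximated as words of three or more syllables.
    complex_words = sum(count_syllables(w) >= 3 for w in words)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    percent_complex = 100 * complex_words / len(words)

    return {
        "gunning_fog": 0.4 * (words_per_sentence + percent_complex),
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        "words_per_sentence": words_per_sentence,
        "syllables_per_word": syllables_per_word,
        "percent_complex_words": percent_complex,
    }
```

Short sentences of one-syllable words produce a low Gunning Fog score, while a sentence packed with long words produces a high one. A high score alone cannot tell careful technical writing from clumsy writing.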
Pages are classified as introductory or advanced based upon factors like those above. A slider like the one in Yahoo! Mindset, or something similar, can be used by a searcher to choose how they want pages to be reranked, towards introductory or towards advanced.
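One way a slider could bias results is sketched below. This post does not spell out how the filing combines relevance with familiarity, so the additive blend, the -1 to 1 scales, and the name `rerank` are all assumptions for illustration.

```python
def rerank(results, slider):
    """Reorder search results by relevance biased by familiarity level.

    results: list of (url, relevance, familiarity) tuples, where
             familiarity runs from -1 (introductory) to +1 (advanced).
    slider:  value from -1 (introductory) to +1 (advanced);
             0 orders by relevance alone.
    Both scales are hypothetical.
    """
    def score(item):
        _, relevance, familiarity = item
        return relevance + slider * familiarity

    return sorted(results, key=score, reverse=True)


if __name__ == "__main__":
    pages = [
        ("a.example/advanced", 0.9, 0.8),
        ("b.example/intro", 0.8, -0.7),
        ("c.example/middle", 0.7, 0.1),
    ]
    for position in (-1.0, 0.0, 1.0):
        print(position, [url for url, _, _ in rerank(pages, position)])
```

With the slider in the middle, the pages keep their relevance order. Moving it towards introductory lifts the introductory page to the top, and moving it towards advanced pushes that page to the bottom.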
Conclusion
This is an interesting idea, but would Yahoo use something like it in their search engine, or would it be a novelty or toy, like Yahoo! Mindset, hidden away somewhere on a research page that not many people visit?
If you build informational web pages, would this patent application convince you of the wisdom of building both introductory pages and pages for advanced searchers who are more familiar with the topic that you are writing about?
Hi Stephen,
Good questions, and points worth considering carefully.
They do specifically state in the patent application that this indexing and reranking is done independently of looking at specific topics, queries, and user behavior.
They are focusing upon content that they find on the pages of sites, without looking at many of the things that we are seeing in other papers and patent filings that involve personalized search.
There’s nothing in the patent that indicates a desire to look at a searcher’s past search history, or what pages large groups of users click upon and stay upon when entering certain queries, or a look at a user’s profile – whether one the user creates themself or one that the search engine might construct behind the scenes.
Could it be a building block for some type of personalization? Maybe, but one based upon understanding a little better what kind of content exists upon pages that people visit.
The potential danger in using something like this for personalization is that while some people may not have familiarity with many topics, those same folks may have a deep understanding and expertise in others.
The same person seeking introductory level pages in cooking and accounting and car repair may have an advanced familiarity in folksongs or linear algebra, for example. 🙂
The search engine thinking that it knows what people want regardless of what they really want is a frightening concept. I’ll definitely agree with that.
Bill,
Do you think these may have anything to do with the onset of personalized search? More specifically, could these techniques be used to provide users with what the engine thinks the user wants? Or possibly just the next level?
I am all for conforming results to best fit user intention; however, I think that this should be an opt-in experience. As an online marketer, user intention scares and delights me. I think it is based more upon a “fear of the unknown.”
Are we essentially giving up the freedom that the web provides by being given the information that the engine thought we wanted, even if it is based upon our history on the web?
Deep Thoughts… by Steve Pitts
One problem that could occur with Gunning-Fog-like algos is that content with a high score can mean two things: it is very technical/literate, or it is very badly written.
Hi Sébastien,
I believe that you may be right. I guess the reason why they would use more than just a readability score would be to try to get a sense of the difference between the two.