When a judge looks at evidence entered into court, he weighs a number of factors. One of them is whether the evidence offered is relevant to the case at hand.
The other is how important that evidence might be.
Now, a piece of evidence doesn’t have to be groundbreaking to be important. But testimony about the character of a 40-year-old defendant in a criminal proceeding, given by his third-grade teacher, may be somewhat relevant yet probably not all that important.
Importance Scores in Search Engines
When a search engine ranks pages for a set of search results, it also usually relies on two distinct types of calculations, which it combines to serve pages to searchers. Those scores likewise focus upon how relevant a result might be to the query entered, and how important that page, picture, or video might be.
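As a rough illustration of that combination, the sketch below blends a query-dependent relevance score with a query-independent importance score using a simple weighted sum. The weighted-sum formula, the 0.7 weight, the function names, and the sample scores are all assumptions made for illustration; search engines do not publish how they actually combine these signals.

```python
from typing import NamedTuple


class Page(NamedTuple):
    url: str
    relevance: float   # query-dependent score, assumed to lie in [0, 1]
    importance: float  # query-independent ("static") score, assumed to lie in [0, 1]


def blended_score(page: Page, relevance_weight: float = 0.7) -> float:
    """Weighted sum of relevance and importance (an assumed formula)."""
    return relevance_weight * page.relevance + (1 - relevance_weight) * page.importance


def rank(pages: list[Page], relevance_weight: float = 0.7) -> list[Page]:
    """Order pages from highest to lowest blended score."""
    return sorted(pages, key=lambda p: blended_score(p, relevance_weight), reverse=True)


# Hypothetical pages with made-up scores.
pages = [
    Page("example.com/a", relevance=0.9, importance=0.1),
    Page("example.com/b", relevance=0.6, importance=0.9),
    Page("example.com/c", relevance=0.3, importance=0.5),
]
ranked = rank(pages)
```

With these made-up numbers, the page that is less relevant but far more important edges out the most relevant page. Setting `relevance_weight` to 1.0 ignores importance entirely and puts the most relevant page first.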
A recent patent application from Microsoft takes a look at the way that “importance” is determined, and comes up with a variation that differs somewhat from what we might usually think of in an importance score.
One importance score that most people might be familiar with is PageRank, which is sometimes referred to as a “static” rank because it doesn’t change from query to query.
Expanding Importance Ranking Factors
Would it be faster and easier, if when we perform a search, we could choose to search amongst the most important “sports” pages, or “news” pages or “recipe” pages for the results of our search? Would we get more relevant results?
One place to start looking at this topic is the Microsoft patent application:
Providing and using search index enabling searching based on a targeted content of documents (Appl. No. 20070203891)
Invented by John A. Solaro and Keith D. Senzel
Assigned to Microsoft
Published August 30, 2007
Filed: February 28, 2006
The authors of this patent make some good points. For instance:
Many new search engines, and new features for existing search engines, are being developed that focus on one specific “vertical” subject matter domain to provide shopping searches, blog searches, research searches, and the like.
However, the static rank of the documents in the index only takes into account generic PageRank attributes, not attributes related to a specific vertical that targets specific subject matter.
Therefore, the static rank is not useful for filtering the index for particular attributes of the vertical in question, which critically limits the effectiveness and utility of these vertical search engines for users.
While using something like PageRank might be an easy method to determine the importance of a page, it may not be the best when comparing things like shopping sites and educational sites.
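One way to picture the alternative is an index that stores a separate query-independent score for each vertical, computed ahead of time, alongside a single generic static rank. The sketch below is hypothetical: the URLs, vertical names, scores, and the `top_for_vertical` helper are invented for illustration and are not taken from the patent filing.

```python
# Hypothetical precomputed importance scores: one generic score plus one per vertical.
INDEX = {
    "shop.example/widgets": {"generic": 0.4, "shopping": 0.9, "education": 0.1},
    "uni.example/physics": {"generic": 0.8, "shopping": 0.0, "education": 0.95},
    "news.example/today": {"generic": 0.7, "shopping": 0.2, "education": 0.3},
}


def top_for_vertical(index, vertical, limit=2):
    """Return the URLs with the highest precomputed score for one vertical.

    A document with no score for the vertical is treated as scoring 0.0.
    """
    ordered = sorted(index, key=lambda url: index[url].get(vertical, 0.0), reverse=True)
    return ordered[:limit]


generic_top = top_for_vertical(INDEX, "generic")
shopping_top = top_for_vertical(INDEX, "shopping")
```

In this toy index the shopping page ranks last on the generic score but first on the shopping score, which is the kind of difference a single static rank cannot express.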
Different Ranking Factors
The Microsoft patent filing doesn’t go into a lot of detail regarding how it might rank documents in one topic differently than documents in another, though the “claims” section of the document does mention calculating a “readability score” for documents. We might look at an example from another search engine to see how that could be done.
It wasn’t too long ago that a Google patent filing showed how Google might rank blogs based upon a “quality” score, which would be an importance score rather than a relevance score.
So, if a search engine were to go through different websites, determine topics or categories for them, and then come up with different kinds of “quality” or “importance” scores for those different topics, would we see more relevant results in our searches? It’s possible.
Blending Results Found with Different Ranking Algorithms
When reading this patent application, I can’t help but think of the universal and blended search results that search engines are serving these days, where different types of results, ranked by different types of importance factors, are mixed into one set of search results.
How does a search engine determine which video or image or news result is the most important when it mixes those into a result set? A timely news result isn’t ranked in importance based upon the number of links to it – it wouldn’t be timely if it were.
Importance Factors Conclusion
How might a search engine decide whether one type of shopping site was more important than another? Or sports sites? Or news sites? Definitely something to think about.
Well that was interesting. So once again, establishing the context or relevance of a document, this time to a given theme (news, sports, shopping – etc) is the gig?
‘wherein the targeted content indication describes a relevance of the document to a specific search topic ‘
I would suppose defining ‘topics’ would be important from an infrastructure stand-point since it could get bogged down in a hurry in a ‘rules creating rules’ type of system. Would one ultimately create scores for ‘sub-topics’? – such as Sports-Baseball? If that went on too far it would be a tough layer on the ranking/re-ranking system to manage. Would the value (minor ranking mechanism?) outweigh the overhead of it? (end justifying the means)
..and when they are talking about the ‘static pagerank’ are they implying various pagerankings for the same document depending on the channel (topic)?
Anyways, I’ll have a read through it more later… tnx.. Dave.
Oh yeah… I noticed the ‘Law’ is creeping into the posts this week… Ghosts from the past methinks :0)
Cya
It’s difficult to let go of that legal training, and it oddly seemed to fit, though I understand more now, after writing the blog post, why.
I was working in the Courts for a few years and learning about ecommerce and web promotion while doing so, and the concepts of relevance and materiality are essential to a legal system, as is access to information, finding facts, determining what law may apply to a case, and so on.
So when it comes to a concept like breaking down a body of information, like the web, into specialized topics, and having importance within those topics use the appropriate metrics, I have some experience with understanding how that might fit well into retrieving information within the context (or topic) of the legal field. My searches for legal information don’t value an algorithm like PageRank much.
I would guess that breaking topics down into subtopics and then even smaller subsubtopics might make sense if it could be done in some automated manner, especially for areas that receive a lot of searches. Looking at what people are querying might help with that.
“Static” implies applying some kind of ranking independently of a query, which would include PageRank or HITS, or other methods that could be done before anyone even performed a search.
Applying a static rank that enabled documents to be broken down into topics could make sense, and could possibly be applied to tagging (automated tagging – a term used in the patent filing) documents with more than one topic.
How much each document fit into a topic could be weighted, and the patent describes assigning positive, neutral, and negative weights to documents for different topics.
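To make the idea of signed topic weights a little more concrete, here is a minimal sketch. The three-valued weight follows the positive, neutral, and negative weights mentioned above; everything else (the `TopicWeight` enum, the document names, the tags, and the `filter_by_topic` helper) is a hypothetical illustration rather than the data model in the patent filing.

```python
from enum import IntEnum


class TopicWeight(IntEnum):
    NEGATIVE = -1
    NEUTRAL = 0
    POSITIVE = 1


# Hypothetical automated tags: a document may carry weights for several topics.
TAGS = {
    "scores.example/baseball": {
        "sports": TopicWeight.POSITIVE,
        "news": TopicWeight.POSITIVE,
        "recipes": TopicWeight.NEGATIVE,
    },
    "kitchen.example/chili": {
        "recipes": TopicWeight.POSITIVE,
        "sports": TopicWeight.NEGATIVE,
    },
    "daily.example/frontpage": {
        "news": TopicWeight.POSITIVE,
    },
}


def filter_by_topic(tags, topic):
    """Keep documents tagged positive for the topic; an untagged topic counts as neutral."""
    return [
        doc for doc, weights in tags.items()
        if weights.get(topic, TopicWeight.NEUTRAL) > TopicWeight.NEUTRAL
    ]


news_docs = filter_by_topic(TAGS, "news")
sports_docs = filter_by_topic(TAGS, "sports")
```

Because the tags are assigned before any query arrives, a filter like this could run at index time, which fits the “static” sense described above.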
Thanks – that’s an interesting observation. I’m going to have to pay some more attention to search results at live.com.
Don’t know if that would have helped today. He had you by three hours. 🙂
“I can’t help but think of universal and blended search results being served by the search engines these days”
Bill, I would agree with that, but oddly I am seeing it applied more in Live. I am leaning towards themes having more influence; actually, I shot Dave some info about my observation prior to coming over here. And that reminds me, can you put a 60 second delay on his posts so I can beat him out?
Hey Bill,
You know, I love these kinds of things. What says it for me though, in a vague way, is that the engines are applying legal data in a way that is very difficult to personify in a way meaningful to the spiders… Bear with me here. Just for a sec 🙂
A ‘reasonable man’ would be able to assess the relative importance of the variables to determine which weighed the most heavily in favor of a page being relevant. The algorithms are still struggling with that. No worries there, it’s a growth process.
When they finally figure out how best to display their results to the benefit of the mind-set of the ‘reasonable man’s’ search query intent, we might see some real progress.
To that end, ‘static’ has some meaning, but only as a part of a far greater whole. ‘Themes’ likewise depend on the search query intent as illustrated by the ‘Tiger’ ‘Woods’ example of yesteryear.
The closer the algos can get to ‘reasonable assertion’ the better off the user will be, but then again, the inevitable rash of spam/manipulation will become a real threat as understanding of the core artificial knowledge becomes known.
Quality scores are good, but are again limited by algo parameters. I think they are all doing a great job. I am very eager to see more about Ask.
In the long run it’s not a case of understanding how man thinks and applying it to the algos in a positive manner, it simply reverts back to the age old issue of the more sophisticated and complete a system becomes, the more it requires protection. It’s almost reverse engineering before the engineer has started.
I am also interested in the amalgamation of video and imagery into the results of the Google algo. It sounds great, and in many cases may be highly applicable, but there are cases where it wouldn’t be, without age appropriate settings to name one concern. And they aren’t as obvious always as may seem.
Relevancy and being ‘material’ are both inherently necessary in the algos, but due to the nature of the programs, makes them also inherently open to attack, even if it is short lived, and every short-lived attack is followed by another as new ‘loop-holes’ are found.
The solution? I haven’t one. The engines are doing a great job in their own ways to combat what they can and to promote what they think best. I personally don’t think we can yet justify or expect to impose the application of legal ramifications applicable to humans on to spiders (not that you were suggesting that), in the broadest sense possible. It’s not like we’re planning on putting GoogleBot in jail for falling asleep under a hairdryer, but you get my drift. 🙂
(yes that is a Florida Law – or was….)
More please! I love your viewpoints on the patents as I very seldom get the chance to review them myself, and I greatly appreciate both your insight and your opinions on them. You are far more intelligent than I could ever hope to be, so thanks Mr Slawski 🙂
Thank you, f-lops-y. Great points, and I think that you underestimate yourself.
Your point on algorithms and attack is a really good one. You’ve probably seen this, but if you haven’t, you might enjoy it:
The Cost of Attack of PageRank (pdf)
The weighing of evidence does require some standard that a search engine could struggle to ever achieve, and I think that you’re right that it’s not a matter of approximating how a person thinks.
Don’t worry. More patent analysis to follow…
Hello,
This discussion really has given me an appreciation for how far the net still has to evolve.
I think, as vertical neighborhoods become densely overcrowded, rankings will be assigned either by mob rule or will be personalized somehow. I’m afraid that ‘popular’ may become synonymous with trustworthy though.
Do you think Microsoft’s dabbling with behavioral data could be expanded into calculating personalized rankings based off of people’s personality type? I mean, directing methodical personality types to one kind of site and amiable types to another?
Popularity is an interesting path to follow, and to some degree is in use.
For example, more popular queries are cached at the search engines rather than being performed over and over.
User data from search sessions is being used in places like spell correction.
“Trustworthy” is a slightly different matter. Perhaps in the use of social networks, and relationships that you indicate between others of your social network, the element of trustworthiness might play a role.
I hope that personality type doesn’t enter into what search engines show people. I could see how viewing habits might enter into providing search results, however.