Patent Documents Mentioning the Open Directory Project in 2006

In the recent predictions for 2007 post, a commenter asked about the impact of the Open Directory Project and its data on research in 2006, and what the information might be used for in 2007. I was also asked what I thought was the most interesting use.

Those were some great questions, and I really hadn’t tracked carefully the appearance of the Open Directory Project in published patent applications and granted patents, though I had seen it mentioned in a number of papers in 2006. I thought it might be interesting to find mentions of the DMOZ in the patent filings that were either granted, or published publicly as applications during the year, and see how the Directory was referred to, or the data from the directory was used.

Below are links to those documents, who they are assigned to, and a brief synopsis of why the DMOZ was mentioned within them. In the conclusion, I list some of the more interesting of those, and the one that I liked the most.

Patents granted in 2006

Heck.com

Virtual directory (7,149,743)
A virtual directory that overlays a number of other directories, including DMOZ, and would allow users to access the information in those other directories. Note: a visit to the URL results in a “site under construction” message.

Xerox

Meta-document management system with transit triggered enrichment (7,117,432)
System with user directed enrichment and import/export control (7,133,862)
These two patents appear to be related, and use the same text to describe how they would use DMOZ information – context information, from a source such as DMOZ, could be used to enrich query results.

Sun Microsystems

System and method for distributed real-time search (7,099,871)
DMOZ categories as query refinement options

System and method for multiple data sources to plug into a standardized interface for distributed deep search (7,013,303)
Would use categories from a source like DMOZ to offer to searchers looking for information.

Microsoft

System and method for evaluating and enhancing source anonymity for encrypted web traffic (7,096,200)
DMOZ was used as a source of information for an experiment to test the effectiveness of the process defined in this patent:

To demonstrate the effectiveness of source identification based on a statistical comparison of traffic signatures of encrypted Web traffic, the procedure of an actual study and its results are described below. In that particular study, traffic signature information is collected on a sample of just under 100,000 Web pages, from a wide range of different sites. The pages were obtained from the DMOZ Open Directory Project link database (http://dmoz.org), half of them chosen from various categories of “sensitive” site to which an adversary might be interested in spotting visitors, and the other half chosen randomly.

Google

Identifying navigation bars and objectionable navigation bars (7,089,490)
An Open Directory Project page was used as one of three examples to show how the process involved in rewriting and reconfiguring pages for smaller screens would work.

Interface and system for providing persistent contextual relevance for commerce activities in a networked environment (7,089,237)
Used as an example, structure-wise, of a user-specific meta catalog that could be generated by the system described in this patent.

E-Nvent USA Inc.

Method and system for mapping and searching the Internet and displaying the results in a visual form (7,085,753)
Mentioned as the directory that Google provides to users, rather than a directory generated from the search data that the search engine collects as it crawls the web.

Thebrain Technologies Corp.

Method and apparatus for sharing many thought databases among many clients (7,076,736)
An example of a source of information that could be reorganized in a meaningful manner by using the process described within the patent:

These matrix-sharing methods could be ideal for an Internet search engine which is updated and kept current automatically by massively distributed “spider” computer programs each updating a common shared source of Brain data items of its findings. As of this writing, a service permitting Internet users to search for, and navigate amongst websites using the thought-link structure of the present invention is available to the public at the URL www.webbrain.com. That type of service can be created, in large part, by applying the automatic “spidering” techniques for matrix creation (described above) to a publicly available Internet directories such as the “Open Directory Project” (www.dmoz.com).

America Online

Searching content on web pages (7,047,229)
Category searching (7,007,008)
Category searching could be done by using category identifiers taken from a source like the DMOZ.

CNET Networks, Inc.

Apparatus and method for delivering information over a network (7,016,892)
Using content from a source like the Open Directory Project to serve along with advertisements in response to a query.

Patent Applications Published in 2006

Google

Categorizing objects, such as documents and/or clusters, with respect to a taxonomy and data structures derived from such categorization (20060242147)
DMOZ as an example of a taxonomy that could be used in a probabilistic hierarchical inferential learner

Systems and methods for providing a graphical display of search activity (20060224938)
Systems and methods for managing multiple user accounts (20060224624)
Systems and methods for providing subscription-based personalization (20060224615)
Systems and methods for combining sets of favorites (20060224608)
Systems and methods for modifying search results based on a user’s history (20060224587)
Systems and methods for analyzing a user’s web history (20060224583)
In a user profile for a personalized search system, described in a number of related patent applications, levels of interest might be assigned to topic categories:

For example, by examining one or more of the various events a user profile may be created indicating levels of interest in various topic categories (e.g., a weighted set of Open Directory Project (http://dmoz.org) topics).
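To make the idea of “a weighted set of Open Directory Project topics” a little more concrete, here is a minimal sketch in Python. The applications don’t spell out how the weights would be calculated, so everything here is an assumption for illustration: the `URL_TOPICS` lookup table, the `(url, weight)` event format, and the choice to normalize the weights so they sum to 1.

```python
from collections import Counter

# Hypothetical mapping from visited sites to ODP-style topics (not real ODP data).
URL_TOPICS = {
    "espn.com": "Sports",
    "nba.com": "Sports/Basketball",
    "imdb.com": "Arts/Movies",
}

def build_profile(events, url_topics=URL_TOPICS):
    """Turn a list of (url, weight) events into topic weights that sum to 1."""
    totals = Counter()
    for url, weight in events:
        topic = url_topics.get(url)
        if topic is not None:  # events for unlisted sites are ignored
            totals[topic] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {topic: value / grand_total for topic, value in totals.items()}

# Two visits to a sports site and one longer visit to a movie site.
events = [("espn.com", 1.0), ("espn.com", 1.0), ("imdb.com", 2.0)]
profile = build_profile(events)
```

In this toy example the profile ends up split evenly between Sports and Arts/Movies. A real system would presumably weigh different kinds of events (searches, clicks, bookmarks) differently.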

Microsoft

Associating supplementary information with network-based content locations (20060224662)
DMOZ is used as an analogy for associating a page with a category.

Dispersing search engine results by using page category information (20060004717)
Used as a source to help categorize web pages:

[0036] A category tool 218 is linked to index 116, and identifies one or more categories for each of the retrieved web pages as a function of the parsed content (i.e., identified document data) and one or more external data sources such as the Open Directory Project (ODP) or index data regarding previously categorized web pages.

As known to those skilled in the art, the ODP is one of the most widely distributed databases of content classified by humans. For example, assume the ODP has categorized a web page having the URL www.gs.com under Business → Finance, and has categorized a web having the URL www.gs.com/venturecapital/ under Business → Finance → Entrepreneurship.

If the category tool 218 is given the URL www.gs.com/venturecapital/foo/bar.html to categorize, it will initially query the external data sources and/or index for a matching URL. If the URL www.gs.com/venturecapital/foo/bar.html is not found, the category tool 218 will then query the external data sources and/or index for www.gs.com/venturecapital/foo. Finally, if www.gs.com/venturecapital/foo is not found, the category tool 218 will check for www.gs.com/venturecapital and will assign the web page (i.e., www.gs.com/venturecapital/foo/bar.html) with the category Business → Finance → Entrepreneurship.
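The lookup described in that passage can be sketched in a few lines of Python. This is only my reading of the quoted text, not code from the application: the `DIRECTORY` dictionary stands in for the ODP data and holds just the two listings from the example, the function names are made up, and category paths are written with slashes.

```python
from urllib.parse import urlsplit

# Hypothetical directory data, mirroring the two listings in the quoted example.
DIRECTORY = {
    "www.gs.com": "Business/Finance",
    "www.gs.com/venturecapital": "Business/Finance/Entrepreneurship",
}

def normalize(url):
    """Drop the scheme and any trailing slash so lookups compare like with like."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return (parts.netloc + parts.path).rstrip("/")

def categorize(url, directory=DIRECTORY):
    """Return the category of the closest listed ancestor of url, or None."""
    candidate = normalize(url)
    while candidate:
        if candidate in directory:
            return directory[candidate]
        if "/" not in candidate:
            break  # already down to the bare host name, nothing left to strip
        candidate = candidate.rsplit("/", 1)[0]  # remove the last path segment
    return None

category = categorize("www.gs.com/venturecapital/foo/bar.html")
```

Here `category` comes back as `Business/Finance/Entrepreneurship`, because the first listed ancestor the loop reaches is www.gs.com/venturecapital. A page elsewhere on the site would fall back to the listing for www.gs.com, and a URL on an unlisted host returns `None`.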

McAfee, Inc.

Reputation of an entity associated with a content item (20060253584)
Indicating website reputations based on website handling of personal information (20060253583)
Indicating website reputations within search results (20060253582)
Indicating website reputations during website manipulation of user information (20060253581)
Website reputation product architecture (20060253580)
Indicating website reputations during an electronic commerce transaction (20060253579)
Indicating website reputations during user interactions (20060253578)
Determining website reputations using automatic testing (20060253458)
When someone attempts to access unsafe content, they may be given a choice of alternatives based on category information from sources such as DMOZ, the Yahoo directory, and other places.

Plain Sight Systems, Inc.

System and method for document analysis, processing and information extraction (20060155751)
Building a contextualized search by starting the crawl from certain DMOZ categories.

Utopy, Inc.

Automatic, personalized online information and product services (20060136589)
DMOZ used as an example of a topic tree that could be generated from a user’s interests.

Xerox

Method and apparatus for explaining categorization decisions (20060136410)
A categorizing program may assign documents topics such as found in the ontology used by the DMOZ.

Oculus Info Inc.

System and method for interactive multi-dimensional visual representation of information content and properties (20060116994)
Filtering of sites listed in shopping categories within the Open Directory so that more informational results are shown:

[0104] The results views 209 also allow the analyst to annotate documents. Annotated documents appear with a yellow pin 217. The analyst can also hide categories and documents (e.g. hide all documents from sites categorized as “Shopping” in the open directory project web site classification dimension). When one or more documents are hidden they will disappear from their position in all the result views. Instead they preferably appear in a separate hidden category (not shown). In this way the analyst can remove documents from consideration, though with the opportunity to restore them if necessary. The hidden category can also be set to avoid displaying its aggregate results, instead showing a single line indicating the number of results it contains

Topixa, Inc., DBA Rawsugar

User interface for conducting a search directed by a hierarchy-free set of topics (20060059143)
Conducting a search directed by a hierarchy-free set of topics (20060059135)
Creating attachments and ranking users and attachments for conducting a search directed by a hierarchy-free set of topics (20060059134)
DMOZ top level categories used as a source of topics for a search methodology.

[0106] As mentioned above, according to one aspect of the invention, a registered user can add a topic to the set of topics, and can add one or more URLs attached to that topic. Therefore, in one embodiment, an initial set of topics is pre-defined. In one version, this set consists of the top level topics from the Open Directory Project (ODP), also known as DMOZ, run by Netscape Communication Corporation, Mountain View, Calif. For further information, see www.dmoz.org, and http://dmoz.org/about.html.

Accenture Global Services GMBH

Active relationship management (20060036461)
As a source in attempting to determine whether organizations are related to each other.

IBM

Methods, apparatus and computer programs for characterizing web resources (20060026496)
As an example of a resource that classifies websites rather than just web pages.

Icosystem Corporation

Methods and systems for interactive search (20060010117)
DMOZ as an example of a web directory.

Technorati

Ecosystem method of aggregation and search and related techniques (20060004691)
Used to help assign topics to blog posts, based upon whether their outbound links point to pages that appear in a directory like DMOZ:

[0033] According to a specific embodiment, content may be classified based on links to an established topic directory or ontology, e.g., by looking at each piece of content and identifying outbound links and unusual phrases. An outbound link is then checked against an ontology (e.g., DMOZ (see http://dmoz.org/) or any other suitable ontology) and based on the link pattern, the content is automatically tagged as inside of that particular category. Then, a relevance weight may be assigned to the document with reference to the author’s relative authority inside of that category (see below) as well as inbound links to that document inside of that category. This weight may further incorporate self-categorization, (e.g. “tags”) of blogs and posts.
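A rough sketch of the first step, checking outbound links against a directory and tagging the post with a category, might look like the Python below. The application only says the tag is based on “the link pattern,” so the majority vote used here is my assumption, as are the `ONTOLOGY` table and the function names. The relevance weighting by author authority and inbound links is left out.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical slice of a directory: host name -> category.
ONTOLOGY = {
    "nasa.gov": "Science/Astronomy",
    "space.com": "Science/Astronomy",
    "allrecipes.com": "Home/Cooking",
}

def host_of(link):
    """Return the lower-cased host of a link, without a leading 'www.'."""
    host = urlsplit(link).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def tag_post(outbound_links, ontology=ONTOLOGY):
    """Return the most common category among a post's outbound links, or None."""
    votes = Counter()
    for link in outbound_links:
        category = ontology.get(host_of(link))
        if category:
            votes[category] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

tag = tag_post([
    "http://www.nasa.gov/mission",
    "http://space.com/news",
    "http://allrecipes.com/pie",
])
```

With two astronomy links against one cooking link, the post is tagged `Science/Astronomy`. A post whose links aren’t in the directory at all gets no tag from this step.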

Unassigned

Computer Graphic Display Visualization System and Method (20060288023)
Mentions the use of thebrain.com’s search service for DMOZ.

An alternate hierarchal representation of information is provided by TheBrain.com, Santa Monica, Calif. 90404, www.thebrain.com, which has developed a dynamic information presentation applet showing hierarchal links between data elements, which may include hyperlinks to associated resources. More recently, TheBrain.com has developed an Open Directory Project search service for presenting search results within their applet framework.

Large-scale metasearch engine (20060184514)
Used as a source of data to find pages for an experiment.

[0041] An experiment was carried out to evaluate the Search Engine Discovery Component of the instant invention. The experiment included the following steps. [0042] 1. The RDF dump from http://dmoz.org, was downloaded. DMOZ is said to be the largest human-edited directory, containing millions of Webpages. A total of 519 Webpages are collected as a result of random selection, each having at least one form. [0043] 2. A manual check revealed that 307 of the 519 pages contain at least one search engine form. [0044] 3. The discovery program reported 286 search pages from the same collection of 519 Webpages. [0045] 4. 286 URLs appeared in both the manual check and the report from the discovery program. 21 URLs were listed only in the manual check, meaning that the search engine discovery component missed 21 search engines. There was no misclassification. The discovery success rate is 93% (286/307).
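The numbers in that experiment are easy to check. The short Python snippet below just redoes the arithmetic from the quoted paragraphs; the variable names and the “recall” and “precision” labels are mine, not the application’s.

```python
manual_positives = 307  # pages a manual check found to contain a search engine form
reported = 286          # pages the discovery program reported as search pages
true_positives = 286    # pages that appear in both lists

missed = manual_positives - true_positives   # search engines the program missed
recall = true_positives / manual_positives   # what the application calls the success rate
precision = true_positives / reported        # 1.0, since nothing was misclassified

print(f"missed={missed} recall={recall:.1%} precision={precision:.1%}")
# prints: missed=21 recall=93.2% precision=100.0%
```

So the 93% figure is the share of real search pages that the program found, while “no misclassification” means that everything it did report was correct.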

Method for search result clustering (20060117002)
As fill-in categories for clustering:

[0030] Flexible combinations of keyword classes and other class identifiers may be used. For example, document classes from a conventional classification system (such as a web page directory like the Open Directory Project, http://www.dmoz.com) can be used as the KWAC classes of a document associated with some index keyword(s) when there are no appropriate keywords that are related to the index keyword(s) in the document.

Process for matching vendors and users of search engines so that more valuable leads are generated for vendors (20060074890)
In searching for vendors in response to a search query, DMOZ might be one of the resources used to find those merchants.

[0061] This method can be applied not only to search-engines that index the entire web, but also to engines that cover only a portion of the web, and even engines that are restricted to just one web-domain. It can also be used in directories (such as dmoz) that contain vendor descriptions in addition to other general non-commercial content.

Conclusion

The patent filings mentioning the Open Directory Project range from using the DMOZ as an example of a directory or ontology on the web, to a source for experiments, to some even more interesting uses. Here’s a quick summary of some of those.

1. Microsoft used DMOZ data in an experiment regarding encryption of web traffic.

2. Thebrain Technologies Corp. created a search interface for the Open Directory Project that allows the data within it to be organized differently.

3. McAfee, in a family of patent applications, would use content like that from the DMOZ as a safe alternative to unsafe sites.

4. Plain Sight Systems, Inc., writes of using the DMOZ as a seed starting point for web crawling for a contextualized search.

5. Oculus Info Inc. would consider filtering pages that might be found in shopping categories at DMOZ out of the results of informational searches.

6. Technorati would look at shared outbound links from blog posts and the DMOZ to help provide categories to posts.

I’m not sure how effective it might be, but the Technorati use of DMOZ information was the most interesting to me. I imagine that the intent behind an outbound link may not match the DMOZ category for a link to that page, but it’s a creative use, and I’d like to see how effective it might be.


8 thoughts on “Patent Documents Mentioning the Open Directory Project in 2006”

  1. Wow, this is a great overview! Thank you very much, Bill :-)

    There are several old acquaintances on the list who are using ODP data pretty regularly for research, especially the big search engines. McAfee is a bit of a surprise.
    That Technorati might use ODP data is new to me, too – and yes, this is rather an interesting patent! It would be interesting to know how the technique described works against various types of spam blogs.

    Of the unassigned patents, “Large-scale metasearch engine (20060184514)” reminds me of http://www.incywincy.com/ – it allows search via the search forms on result sites.

    Lots of interesting stuff… I guess I’ll be busy digging through the list all next weekend ;-)

  2. You’re welcome, Chris.

    Incywincy looks interesting, and I see how you might think that unassigned patent application could possibly be related. Looking at the names of the people involved, I’d guess that if it were to be assigned to any company, it would go to Webscalers LLC. Incywincy is pretty interesting, though. Thanks for pointing it out.

    Would love to hear your thoughts about any of those, if something jumps out at you.

  3. Phew, this was a big pile of “homework”! There’s a Microsoft and two Xerox patents that I found rather interesting.

    Dispersing search engine results by using page category information, Microsoft (20060004717)

    This one offers, in passing, some insights into how a user of ODP data might try to balance notorious problems of classical hierarchic categorization:

    1) The authors want to use a directory to assign a category to specific URLs. But directories, in contrast to search engines, don’t try to list each single URL, the majority of listings are sites or large chunks of sites. Therefore, in many cases, the specific URL will not be listed in the directory. But the next higher level or the domain might be – so the problem is to find the closest related listing. What the system does – you already quoted the text above – is shorten the URL step by step, and check back if the modified URL is listed in the directory. (0036).

    2) Usually documents are listed only in one category of a directory: if you miss the category while browsing in the directory, or if your query doesn’t fit, you may not find the document. On the other hand, if a document is listed in several categories, which category is the better fit for a given query? And, not to forget, the categorization of a document might be wrong or outdated. The confidence levels which are used by the system in the patent to describe how well a listing actually fits into its category can address these problems: “For example, a travel page on Hawaii might have category “Recreation\Travel” with confidence of 80% and a category of US\States\Hawaii with a confidence of 75%.” (0036) Unfortunately the patent does not explain how the confidence level is defined… link structures, or comparisons of page content and category path or content, are options that come to mind.

    3) Often the best results for a given search term are not to be found in one single directly-corresponding category, but either in this corresponding category and its subcategories, or spread over categories in various branches. The system described uses not only the content of the directly-corresponding category, but the content of the subcategories, content of categories in other branches, or it might even offer filtering options. (0043 and 0044)

    Meta-document management system with transit triggered enrichment, Xerox (7,117,432)
    System with user directed enrichment and import/export control, Xerox (7,133,862)

    The topic of these patents is personalization of documents – not so very new anymore, but the way the authors describe their invention makes for a rather exciting read. Traditionally, we think of a document as a passive object that is static, except if a human being actively modifies it. In these patents, that paradigm is turned on its head: the documents are described as entities that seem to act on their own. E.g. a document might initiate its enrichment with additional data, or launch search processes and display the results, before the user starts working or while he’s working. Also the document’s behaviour might depend on who the user is. The authors go as far as using the terms “personality” and “soul”… For speculations on how far the limits of personalization could be pushed – or for what good or evil personalization could be used – this is a rather thought-provoking approach.

  4. Wow,

    Surprising to see your analysis of these patents this morning, but much appreciated. Thanks, Chris.

    I had spent some time with the Microsoft patent in trying to get a sense of what they were trying to do, but I think you’ve done a great job of capturing its essence.

    I’m going to have to revisit those Xerox patents based upon what you’ve written. The living document approach you describe makes them sound like they deserve much more attention than I gave them when I originally created this post.

  5. Hm… the essence of the Microsoft paper… I think they try to solve the Jaguar problem: the user types “Jaguar” and the search engine doesn’t know if he’s looking for a car or a cat ;-)
    One possible solution is to identify the various topics that the user might have had in mind, and then make sure that the result list offers results for each of these topics. E.g. if the first page of the result list contains ten links, these wouldn’t be automatically filled with the ten links that rank best for the keyword “Jaguar”, but with let’s say the five best pages for the car, the three best pages for the cat, and two for other sorts of “Jaguars”.

    “Thus, the need exists for a search engine that displays search results related to various topics or categories on a single page of search results independent of conventional rankings. By displaying such dispersed search results, the user is able to view a variety of results on the first page of results.” (0006)

    The directory is used to identify the possible topics, and match documents and topics: clearly separating topics like “Jaguar the car” and “Jaguar the cat”, and sorting documents properly into these topics, is one of its biggest strengths. Only to make efficient use of the directory, one needs to balance some of its weaknesses – that’s the part of the patent that I described above.

  6. Pingback: u1amo01
  7. What a wonderful response to Chris’s query! There is no-one to touch you in the search patents department, Bill.

    Sorry that I have only just seen this. I have your feed, but I was frantically busy over December/January.

  8. Thanks, Jean.

    It’s good to see you. I hope that your busy winter was also a good one.

    Chris asked a few great questions, and I really enjoyed doing some research to find a few answers.
