What Makes a Good Seed Site for Search Engine Web Crawls?

Would search engines be better if they started web crawls from sites like Twitter or Facebook? Wikipedia or Mahalo? DMOZ or the Yahoo Directory?

The Web refreshes at an incredible rate, with new pages added, old pages removed, and words pouring out from blogs, news sites, and other genres of pages. Ecommerce sites showcase new products and eliminate old ones. New sites launch and old domains expire.

Search engines attempt to keep their indexes of the Web as fresh as possible, and send out crawling programs to find the new, update changes, and explore disappearances. Failure to do so means outdated search engines that deliver people to deleted pages, overwritten content, and stale indexes that miss out on new sites.

When a search engine starts crawling the Web, it often begins by following URLs from chosen seed sites to explore other pages and other domains. But how does a search engine choose those seed sites?

Seed sites might be domains like the Open Directory Project or the Yahoo Directory, which link to a wide range of sites across different topics and regions, with editorial control over the selection of the pages they contain.

But a search engine doesn’t necessarily have to use those particular sites as places to begin, and may choose others.

The choice of seed sites can have a dramatic impact upon the quality of a search engine and how it covers different topics and geographical areas in its index. Poorly chosen seed sites could mean low quality search results, and even more web spam showing up in response to searches.

A Yahoo patent application describes how the search engine might choose amongst sites to use as seed sites to discover URLs to other pages on the Web.

Host-Based Seed Selection Algorithm for Web Crawlers
Invented by Pavel Dmitriev
Assigned to Yahoo
US Patent Application 20100114858
Published May 6, 2010
Filed October 27, 2008

Abstract

A host-based seed selection process considers factors such as quality, importance and potential yield of hosts in a decision to use a document of a host as a seed.

A subset of a plurality of hosts is determined, including some but not all of the plurality of the hosts, according to an indication of importance of the hosts, according to an expected yield of new documents for the hosts, and according to preferences for the markets the hosts belong to.

At least one seed is generated for each host of the determined subset of hosts, wherein each generated at least one seed includes an indication of a document in the linked database of documents. The generated seeds are provided to be accessible by a database crawler.

Revisiting the same seed sites on a regular basis may not result in discovering a large number of new URLs. The Yahoo pending patent provides a glimpse at how they may compare and choose amongst potential seed sites.

It tells us that the seed site selection process can be improved if the choice of particular selected seeds results in:

  1. A relatively large number of previously undiscovered documents being discovered and processed.
  2. The crawling of relatively more of more important hosts and documents, and fewer of less important hosts and documents.
  3. A desirable distribution among markets or categories of sites.

Candidate seed sites may be judged based upon measures of:

  • Quality
  • Importance
  • Potential yield of hosts

Quality (or lack of quality) of a site as a potential seed site could be based upon things such as the site:

  • Having few outlinks,
  • Being a spam page or having outlinks pointing to spam pages,
  • Containing pornography content.

The patent filing tells us that high quality sites are chosen as potential seed sites because, as the starting point of a crawl, a low quality domain would likely lead the crawler to many more low quality pages.
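To make the idea concrete, here's a minimal sketch of such a quality screen in Python. All of the field names and threshold values are invented for illustration; the patent doesn't specify any of them.

```python
# Hypothetical quality screen for candidate seed hosts, following the
# disqualifying signals listed above (few outlinks, spam, adult content).
# Field names and thresholds are illustrative, not taken from the patent.
from dataclasses import dataclass

@dataclass
class HostStats:
    outlink_count: int
    is_spam: bool
    spam_outlink_ratio: float  # fraction of outlinks pointing to known spam
    has_adult_content: bool

def passes_quality_screen(host: HostStats,
                          min_outlinks: int = 10,
                          max_spam_ratio: float = 0.05) -> bool:
    """Return True if the host is not disqualified as a seed candidate."""
    if host.outlink_count < min_outlinks:
        return False  # too few outlinks to be a useful starting point
    if host.is_spam or host.spam_outlink_ratio > max_spam_ratio:
        return False  # a spam host seeds the crawl into more spam
    if host.has_adult_content:
        return False
    return True

print(passes_quality_screen(HostStats(250, False, 0.01, False)))  # True
print(passes_quality_screen(HostStats(3, False, 0.0, False)))     # False
```

A crawler would presumably run every candidate host through a screen like this before weighing importance or yield at all, since a disqualified host is cheap to reject early.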

Importance of a seed site might be based upon a “host trust” score or rating or other attribute associated with that host, which generally provides an indication of:

  • Popularity
  • Trustworthiness
  • Reliability
  • Quality
  • Other characteristics of a host

PageRank could be considered one type of host trust score, but other factors could be used as well.

Potential yield of documents, or the potential for the discovery of new URLs, for a host could be calculated based upon statistics gathered from past crawls of that host.
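As a rough sketch (with invented names, since the patent doesn't give a formula), that estimate could be as simple as averaging the new-URL counts from past crawls of the host:

```python
# Illustrative yield estimate: the average count of previously unseen
# URLs discovered per past crawl of a host. A real system would likely
# weight recent crawls more heavily; this shows only the basic idea.
def expected_yield(crawl_new_url_counts: list[int]) -> float:
    """Mean number of new URLs found per crawl; 0.0 with no crawl history."""
    if not crawl_new_url_counts:
        return 0.0
    return sum(crawl_new_url_counts) / len(crawl_new_url_counts)

# A host whose last four crawls found 120, 95, 140, and 110 new URLs:
print(expected_yield([120, 95, 140, 110]))  # 116.25
```

Under this kind of estimate, a directory whose pages rarely change would score a low expected yield no matter how trusted it is, which is exactly the concern raised about revisiting the same seed sites.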

We’re told that markets are typically distinguished by geography, so a seed site selection process aiming to yield many new URLs may look for different seed sites in different geographic areas, helping the search engine find URLs from different countries and regions.

Different thresholds may be chosen for seed sites in different markets (or according to some other characterization), because some markets are less dominant and may have fewer hosts and fewer “important” hosts. This can keep relatively dominant markets from having so much influence that “few or no seeds are selected for hosts in less dominant markets.”
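A toy version of that per-market selection step might look like the following; the scoring formula, threshold values, and host names are purely hypothetical:

```python
# Hypothetical per-market seed selection. Each candidate carries a host
# trust score and an expected yield; markets with fewer strong hosts get
# a lower score threshold so they are not crowded out by dominant markets.
def select_seeds(candidates, market_thresholds, default_threshold=0.5):
    """candidates: list of (host, market, trust, yield_) tuples.
    Returns hosts whose combined score clears their market's threshold."""
    seeds = []
    for host, market, trust, yield_ in candidates:
        score = trust * yield_  # toy combination of importance and yield
        threshold = market_thresholds.get(market, default_threshold)
        if score >= threshold:
            seeds.append(host)
    return seeds

candidates = [
    ("bighost.example", "us", 0.9, 0.8),    # score 0.72
    ("smallhost.example", "se", 0.5, 0.5),  # score 0.25
]
# The smaller Swedish market gets a lower bar than the US market, so a
# modest Swedish host still contributes a seed:
print(select_seeds(candidates, {"us": 0.6, "se": 0.2}))
# ['bighost.example', 'smallhost.example']
```

With a single global threshold of 0.6, only the dominant-market host would survive, which illustrates why the patent argues for market-specific thresholds.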

Conclusion

I’m not sure that I’ve seen a detailed discussion before in a patent or white paper from one of the search engines on what they might look for in choosing a seed site for their crawling process.

Discussions about web crawling by the search engines often provide examples of sites like the Yahoo Directory or DMOZ as entrance points for the crawling and discovery of new pages on the Web.

Because of that, it’s interesting to see some of the criteria discussed that a search engine might use to identify a seed site other than those directories. Would Wikipedia make a good seed site? Possibly. How about Twitter or Facebook? I’m not quite as sure.

We know that the search engines have been placing more emphasis on quickly including content from sites like Twitter in their indexes to give us the feeling of being delivered very timely information. Are they also following links from those services, treating them as seed sites, to discover new pages and new content upon old pages? What does it mean if they are?


66 thoughts on “What Makes a Good Seed Site for Search Engine Web Crawls?”

  1. From my experience, ping tools do this job very well. Instead of seed sites, pinged links are published on seed pages (e.g., last-ping and last-update pages) which match freshness and high Google PageRank.
    In my opinion, the Yahoo Directory and DMOZ are has-beens. Directories can’t compete in terms of responsiveness.
    Renaud JOLY – SEO Manager (Lille – France)

  2. Hi Renaud,

    Thank you. Those are some good points.

    Reading through the patent filing, I was wondering how likely it is that any of the search engines would find a high percentage of new URLs to crawl if they used Yahoo and DMOZ as seed sites. I know that new URLs are added to those directories, but I imagine the rate isn’t close to what we see at places like Facebook and Twitter. But are Twitter and Facebook good candidate seed sites? I’m not sure.

    Ping tools are a way of letting search engines know about new content, but how much do they help in the discovery of new URLs?

  3. An easy way for search engines to find new sites would be to crawl hosting companies’ servers. I have to agree that any good directory that takes time to review sites is not a good place to find new URLs. Ping tools are a great way to get new pages and content to the search engines, but I don’t think that is the best for new sites.

  4. I do not think Yahoo or DMOZ are being used as seed sites; I still see some deep categories in them which are not cached and indexed…

    I think fresh sites can be found anywhere on the web, but as you said, the quality of seed resources matters in providing quality results

  5. I have always been told that if you register a web domain for more than just a year, say three to five or more, then that tells the search engines that you are in it for the long haul. I am assuming that this theory means that they will consider you a more trusted resource and allow you better ranking.

  6. I’ve just learned about seed sites now. I know how crucial it is for search engines to begin with up-to-date seed sites; however, these seed sites must be chosen carefully. I am also not so sure about Twitter and Facebook as good candidates, because these sites are used to socialize, a way to communicate with family and friends; they’re not really there to provide information.

  7. Hi Mike,

    Good points. Buying user-data information from hosting companies would be another way to find out about new URLs, and if that information also included data about how many times people visited sites, it could also be useful in other ways by the search engines as well. The search engines could also possibly learn about new URLs from toolbar data as well. But would those be the most efficient way to learn about new URLs?

    Ping data might be helpful in discovering new URLs where plugins have been developed that make it easy for the sites those are on to ping search engines. Blogging software like WordPress does that well, but many sites don’t have ways to ping search engines easily, so that would potentially be a limited solution.

  8. Hi Mike,

    We really don’t know for certain which sites the search engines might be using as seed sites to begin crawls with. Both Yahoo and DMOZ may have been good sites to use a number of years ago because they do cover a wide range of topics and regions, but it is possible that they aren’t being used in that manner now, or that they aren’t relied upon as much as they may have been.

    Your point about some deep categories within those directories not being cached and indexed is a good one, though it’s still possible that the search engines may still be using those pages as launching points for visiting other URLs without indexing those pages.

  9. Hi Mark,

    That’s a little off topic from the subject of this post, but something that I’ve seen as well.

    It most likely originated with a Google patent filing from March of 2005 on how historical data might be used by Google. There was a statement in the patent application mentioning that a search engine could look at whois data to see how long a site was registered for. Sites that were only registered for a year would possibly be more likely to be spam than ones registered for longer, because a spammer would likely try to spend as little money as possible on registering domains.

    Unfortunately, many people with absolutely no intention of spamming search engines also are likely to only register a domain for a year in advance instead of longer, because there really isn’t much benefit in registering for longer most of the time. As a signal of whether or not a site is spam, it’s a very weak one. I know that not too long after that patent application was published, and a number of people wrote about it, including me, some domain registrars started citing it as a reason to register for longer than a year at a time, but the truth is that we have no idea if Google looks at that information or even cares much about it.

  10. Hi Andrew,

    Thanks – it can be really helpful to the search engines to choose the sites carefully that they use as seeds for their crawls.

    I agree with you that twitter and facebook may not be good candidates, but there are a lot of people who use those services to link to new sites that they’ve discovered that they want to share with others. While that may be helpful in allowing the search engines to find lots of new URLs, in some ways those services are a free-for-all in that anyone can post a link to almost anywhere regardless of how good or bad the resource they are pointing to might be.

    So the quality of pages being pointed to might be questionable, unless the search engines look at other potential factors as well in deciding which URLs to consider, like the history of links that specific individuals have posted in the past. I don’t know if the search engines want to go into that kind of reputation analysis on social networks for the sake of finding and crawling new URLs.

  11. Great article, Bill. I agree with Andrew. Sites like Twitter and Facebook wouldn’t be good candidates as seed sites. Anyway, don’t you think choosing a seed site would be a little biased against the chosen site’s competitors?

  12. I think that the topic of a page is also important for the selection of a seed site. To crawl the web in order to find high quality content, you have to start with all topics, including manually controlled directories like Yahoo or DMOZ, or well known thematic websites, for example TechCrunch as a seed site for technology. Search engines will use both types of seed sites: nodes that include all topics, and thematic authorities…

    Gretus

  13. The seed site needs to be always content rich, with original updates on a regular basis.

  14. Using Facebook and Twitter is not a good idea in my opinion, since they are social communities and not literally meant for advertising. A lot of companies use social communities to increase awareness of their presence in a certain market, by using Facebook Ads for example, and a lot of companies also host contests and other various campaigns in their Facebook Groups. There are a lot of benefits and pitfalls with this thought: you can get a lot of decent information about various companies around the world from Facebook, but at the same time, most data would be about regular people… and that is not important for a crawler…

  15. Facebook and Twitter both suffer from having only one kind of link. You only rarely spread the URL of a great mechanic, no matter how good he or his site is. On the other hand, I dislike DMOZ as it is more or less dead in several languages; Swedish, for example (my language), seems to have been totally forgotten lately.

  16. …”I have always been told that if you register a web domain for more than just a year, say three to five or more, then that tells the search engines that you are in it for the long haul.”

    Mark – I have put up a lot of sites and I have found this to be a small factor, if any, in overall rankings. Of Google’s 200 factors, I think this one is low in the algorithm. I’m not saying that it is not a factor, but I think it just is not a heavy influence on your rankings.

  17. Hi dominic,

    I do think that there’s some potential bias in the choice of a seed site, though I also expect that the majority of pages are likely not going to be chosen as seed sites for search engines to begin their web crawls. Twitter and Facebook may be better choices than many other sites because they do introduce the possibility of helping the search engines find a high return of new URLs, though the quality of many of those new URLs may be questionable. Yahoo and DMOZ may provide links to a higher quality of URLs because of their editorial control and guidelines, as well as a wider range of topics and regions, but I don’t know if they would provide as much in the way of new URLs to visit.

    It’s possible that Twitter and Facebook may be better choices than Yahoo or DMOZ because of that higher yield of new URLs, even though it may mean that a search engine would have to spend more time gauging the quality of pages linked to by Twitter or Facebook.

    Which sites would you consider to be competitors of Twitter or Facebook, and would those sites potentially be better seed sites?

  18. Hi Gretus,

    Very good points. Thanks.

    It would be interesting coming up with a way to explore different sites to gauge their potential as a seed site for search crawls. Would Techcrunch be a good seed site for search engines to use to identify new URLs in tech related topics? They do link out regularly to new technology related sites, and there is a level of editorial control in the sites they choose to link to. Would other tech related sites be a better choice?

  19. Hi Green Laser Geek,

    How content rich would a seed site need to be, though? Would something like Delicious or Google Bookmarks be useful as seed sites, even though they may not provide much more than URLs and tags? Imagine the search engines using services like those to discover new URLs, possibly following only URLs that have been bookmarked more than a certain number of times, by a certain diversity of people?

  20. Hi Rakeback,

    The web is more than commercial sites and advertisements, and search engines want to index pages that aren’t commercial as well as sites that are commercial, so the social aspect of twitter or facebook doesn’t necessarily preclude them from being used as seed sites. As a matter of fact, the search engines may want to make sure they include seed sites that focus upon non-commercial topics, and even regular people.

  21. Hi Charlotte,

    Those are good points. One of the things that the patent raises is that it is important to identify a number of seed sites instead of just a few, so that they make sure that they crawl new URLs in different regions, in different languages, and for topics and markets that may not be dominant. DMOZ wouldn’t be helpful for identifying new pages in Swedish, so it’s likely that a search engine would attempt to find a seed site other than DMOZ that would help them find pages in Swedish.

  22. Hi Cam,

    I agree. I also think that it’s impossible to tell if the length of registration of a web site domain is a factor that search engines even consider at all.

    There are so many other things that a search engine might consider, including the actual content of a page, link data associated with a page, user data collected about how people browse a page or search for it, and more.

    The patent filing that cited the length of domain registration, did so as a possible signal that a site might be engaging in spam, along with a very wide range of other signals, and it’s possible that it would take a number of those signals together to have an influence in the ranking of a page in search results. Again, the only place I’m aware of that any of the search engines have discussed using domain registration length as a possible signal has been in that patent filing and there’s no guarantee that Google has used the process described in that patent, or that specific part of that patent filing.

  23. Really informative article, Bill. I know you talked a little bit about outbound links, but do inbound links to certain pages affect which seed directory is selected?

    It is truly amazing how massive the web has become, and yet there are engines like Google and Yahoo that can still keep up with it.

  24. Hi Keith,

    I’m not sure that the search engines would be too concerned with which pages link to potential seed sites. The importance of a seed site is that it can be a good starting point for a web crawl by a search engine, and help the search engines find new URLs to include in their index.

    Of course, since the search engines may look at things like PageRank to help gauge whether or not a site might be a good seed site, links to the pages of potential seed sites are important, but the point behind the selection of a seed site is in helping the search engines crawl pages on the web.

    Note that seed sites don’t necessarily have to be directories.

  25. Nice read as ever, I need to make a point of visiting here more often even when busy.

    I have read a lot that DMOZ and the like are good starting points, and although they have a historic place in the web, I would think it’s a terrible place to start. The information in DMOZ is so out of date it’s not even funny, and to make matters worse, sites like FreshDrop are selling domains in a way that a DMOZ listing actually adds value to the sales price.

    I would think that a seed site needs to be selected manually, and trying to use calculations to do so would be a disaster. One thing is certain, though: if you were a webmaster looking after what was classed as a regularly used seed site, you would be able to make a lot of money very quickly – which is against the Google ethic of selling links, but it would happen, so in turn the seed site would need to be removed from the seed list… again, I don’t know that this can be done automatically. I have not really answered any question here but just thrown a few thoughts out on seed sites – if anyone has any thoughts on my ramblings, please let me know!

  26. Hi Jimmy,

    Thanks.

    I’m not sure how many new sites get added to the DMOZ these days, but I’ve seen some fairly stale information there as well. I’m not sure how often links are checked on the site, especially in areas that don’t have editors. I also wonder how many new URLs are added to the Open Directory Project on a regular basis.

    The best way to decide upon a seed site might be to select some that might be good potential candidates based upon a number of rules or guidelines, and do some test runs on them and examine the results for quality, and how many new URLs using them as seed sites might yield. You would want to do that a few times to see how frequently new URLs tend to be found on the sites.

    I’m not sure that any of the search engines would divulge that they might be using specific sites as seed sites beyond some of the older directories that we sometimes see referred to as possible seed sites like DMOZ or the Yahoo Directory. Looking through your log files or analytics, I don’t think that you would be able to tell whether a search engine was starting off a web crawl on your site, or in the middle of one after starting elsewhere.

  27. There is a category in DMOZ which I’ve been trying to get a specific site into for about 4 years now, still nothing. That category has completely outdated info, and a couple of sites are no longer even related any more, so I can’t imagine any of the engines actually use it as a seed site any more.

  28. Hi Steve,

    Getting a site into the DMOZ can be pretty hard for many categories, and I’ve seen outdated information as well.

    Have you reported the change to the sites listed that are no longer related?

    Would the site you are trying to get in fit as well in a different category? If so, and you haven’t submitted the site in a while, that might be worth trying.

    What the DMOZ does have going for it is that it does have a category structure that search researchers seem to like, covering a wide range of topics. But I think you’re right that it may not be fresh enough to work well as a seed site.

  29. I agree that DMOZ is a bit archaic these days. Perhaps a combined algorithm using the major social sites, in addition to quality, informative, unique content with few outbound links, is, in my humble opinion, a good start. Bill has some very valid points and poses some good questions. It will be interesting to watch as the major SEs develop and to see the results that come from their changes.

  30. Yes, the role of social media sites will continue to have serious implications for search and SEO, and will drive the relevancy of social media marketing. Cheers

  31. Hi SEOtools,

    I’m not sure if I would be surprised to see the search engines combine some kind of social network reputation algorithm with a new URL discovery approach to learn about new URLs on the Web. Web crawls might be better informed by paying more attention to social activity and browsing activity rather than just by following links from directories.

  32. Hi Jeff,

    Social sites have been around the Web for a fair amount of time, from forums to collaborative efforts like wikipedia. Even DMOZ, with its myriad of editors from all walks of life could be said to be a social site. Korea’s Naver is a hybrid of search engine and social site, with users providing information on topics that isn’t easily found on Korean sites. The Web is evolving, and social sites will definitely play a role in that evolution.

  33. Bill,

    I would say that the sites with the most trust are the ones that would make a good seed site. I’m thinking Yahoo, .edu, .gov, etc.

  34. Hi Lori,

    Trust is definitely one aspect of selecting a good seed site. The patent filing also emphasizes attempting to be more efficient in crawling the Web and finding new URLs, as well as making sure that they cover as wide a range of different kinds of sites as possible, so while trust is important, those other issues may have them looking at more than just the Yahoo directory or .edu and .gov sites.

    One potential problem with using .gov sites is that they may not be good choices as seed sites for non-governmental topics. But, a site like usa.gov might be a great choice as a seed site for a search engine trying to find new US government URLs.

  35. Hi Lori,

    And that’s why selecting seed sites to begin crawls with is a challenge – balancing trustworthiness with the ability to yield new URLs on topics that might otherwise be missed. For instance, the best site to start web crawls about technology might be a blog that links to new technology sites on a regular basis, rather than the Yahoo Directory or DMOZ or wikipedia, or a .gov or .edu site.

  36. The best starting point would be Digg-like sites. Social, updated frequently, but not as polluted as Facebook and Twitter (at least the homepage or category homepages).

  37. Hi RNasty,

    It’s possible that sites like a Digg or a Hacker News might make reasonable seed sites, and could possibly be good starting points for web crawls at least involving some topics. I suspect that they may not be the best starting points for finding URLs about other types of topics, such as history or literature or commercial activities on the Web.

  38. Hi Bill,

    Nice post. Well when we talk about seed sites we are either talking about seed sites for trust purposes, i.e. the closer you are in the link graph to the top seed sites the more trust your own domain will accrue, or we are talking about seed sites for “discovery” purposes.

    I think that the social web is an obvious possibility and I see no reason why they would not be using this. Also the Google toolbar, Gmail and other “google properties” will be used.

    DMOZ? I don’t personally think so. Google likes speed and freshness and for Google to rely on crawling of new sites added to DMOZ goes against everything that Google is trying to achieve.

    They could be using regional or country specific seed sites and this would be logical and also niche related seed sites, tech, financial, news etc.

    Just my 2 cents!

  39. Hi Rob,

    Thank you.

    Under this patent, the use of the term seed site is in reference to the discovery of new URLs to index rather than some accrual of domain trust. There’s no mention of Yahoo’s TrustRank (or even Dual TrustRank), and the mentions of authority and trustworthiness are used to describe potential seed sites, without any mention of a transference of authority based upon distance from one of these seed sites.

    It’s likely that Google uses a similar process, and may use its own services such as a toolbar to identify URLs to crawl.

    One thing that DMOZ does have going for it is a decent taxonomy of topics and regions to cover, but I think you’re right about its possible lack of freshness. Using sites that focus upon geography and niche topics may yield a greater return for their effort in discovering new URLs to crawl.

  40. I think that people should still make an effort to get a DMOZ listing, though it would be a poor start for a crawl; that said, I am sure that it is featured early, since it generates trust for Google despite being so stale.

    For me it’s about new information; sites that refresh and ping are where it’s at.

    .edu and .gov will only supply stale data, but again G factors in trust, so they are already high on the list. On my other blog (shaunparker) I get a mainstream listing in under 30 secs from the time a post is published; my new site (this one) is still at 3 hours, but nevertheless I have a good landing point in SERPs. Google already has its crawl order, which is a combination of newness and trust.

  41. Hi Shaun,

    I don’t think there’s any harm in trying to be listed in DMOZ.

    There are .edu and .gov sites that feature new data, but often they may be best used as seed sites for other .gov and .edu pages.

    Good point about pinging the search engines when pages are updated. I think there’s more and more of a movement towards letting a search engine know explicitly that there’s something new on a site.

  42. IMHO, a good seed site would be a well known social media site like Twitter. It has everything that would make a good seed site. It’s very current. It is large enough to provide a wide mix of links from a number of websites. Being a very popular website, Twitter is also the gateway to a large number of websites.

  43. Hi Ron,

    Thanks. Twitter may be an ideal place for a search engine to find links to new URLs, so it definitely has that going for it as a potential seed site.

    The patent filing describes how a search engine might choose seed sites to start web crawls based on a number of factors, including some amount of authority or trustworthiness. While twitter might be a site that would provide an entry way to a large number of new pages, there’s not a lot of control over what kind of pages people might post in their tweets. Regardless of that, the work of sorting through those new URLs and determining which ones the search engine might index, and which ones it might decide not to might be worth the effort since, as you note, it is a “gateway to a large number of websites.”

  44. Hi all,

  44. I would not be at all surprised if search engines began using social media sites such as Facebook as ‘seed’ sites; however, I totally agree with Bill that there would have to be some level of trust factoring involved in the process. As Bill said, with the likes of Twitter there is very little quality control over the links which people share. That said, blogs provide a huge link seeding resource for search engine spiders, and there is even less control over what people publish on their blogs!

    Search engine spiders are certainly beginning to realise the power behind social media; as I’m sure you are all aware, Google has recently begun factoring real-time search into its ranking algorithm, therefore using these sites as seeds may well be a possibility for providing the most up to date search results. However, surely it is far more difficult to apply this to social media sites such as Facebook due to access restrictions – the data is not publicly available as it is with Twitter?

    Chris

  45. Thank you, Chris.

    All very good points.

    It does look like Facebook is presenting more and more information outside the confines of Facebook, making content from the site more accessible to people not logged into the site.

  46. Hi Lee,

    Thanks very much for sharing your experience.

    We can’t tell for certain whether Google is using sites like StumbleUpon or Digg as seed sites, but it does appear that they revisit those sites on a very frequent basis and follow links from them to discover new pages. Regardless of whether or not those are sites where search engines start their web crawls, they are places that the search engines visit frequently enough that you may find your pages indexed quickly if they appear upon them.

  47. Found this today and thought I would add my two penneth..

    Last week I was working on a site that sells art. The guys had a lovely looking, cool site built, but the designer had not even added meta description tags, and the title tag was generic across the whole site. Google had indexed it, but only a couple of pages, because of the generic title tag and the pages being mainly image based (artwork). In effect it was pretty much a new site to the SEs.

    I re-tagged the whole site, added a few image alt tags and title tags on links, and submitted the URL to StumbleUpon and Digg. Within 3 hours Google had re-spidered the top pages of the site, and it also jumped onto page 4 for its main term, which has half-decent competition.

    From that experience I would say sharing / bookmark sites (just the top 3 or 4) are good places to start seeding a site to be indexed.

  48. I agree with Mark’s comment regarding Yahoo and DMOZ. Let’s face it, DMOZ is hardly an ideal place to look for new content given that it often takes them two years to approve a listing! Additionally, Yahoo’s directory is paid inclusion. The last site I launched myself was indexed within a few hours of going live with just one inbound link from TechCrunch. Personally, I think that any well established and trusted site that is regularly updated makes an ideal candidate for a search engine seed site. You must consider that websites are topic related, and that Google’s spiders will visit a regularly updated site more frequently than a site that is rarely modified. Therefore I would speculate that a domain’s history, theme, trust, and popularity are all considered when evaluating whether to use it as a seed site.

  49. Hi Nick,

    I agree with you. The patent filing pinpoints those items, as well as how likely it might be that a search engine will find new URLs on a site, as factors that determine whether or not a site might be a good seed site to start a crawl with.
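
    For illustration, the kind of host-level scoring the patent abstract hints at (combining quality, importance, and potential yield) could be sketched roughly like this. The factor names, weights, site names, and numbers below are my own assumptions for the sake of the example, not anything taken from Yahoo’s actual implementation:

    ```python
    # Hypothetical sketch of host-based seed scoring, loosely following the
    # factors named in the patent abstract: quality, importance, and the
    # potential yield of new URLs. Weights and data are purely illustrative.

    def seed_score(host, w_quality=0.4, w_importance=0.3, w_yield=0.3):
        """Combine quality, importance, and potential yield into one score."""
        # Potential yield: fraction of a host's outlinks pointing to URLs
        # the crawler has not seen before.
        potential_yield = host["new_outlinks"] / max(host["outlinks"], 1)
        return (w_quality * host["quality"]
                + w_importance * host["importance"]
                + w_yield * potential_yield)

    hosts = [
        {"name": "directory.example", "quality": 0.9, "importance": 0.8,
         "outlinks": 10000, "new_outlinks": 500},
        {"name": "news-site.example", "quality": 0.8, "importance": 0.7,
         "outlinks": 2000, "new_outlinks": 1400},
    ]

    # Rank candidate hosts; the highest scorers become seeds for the crawl.
    seeds = sorted(hosts, key=seed_score, reverse=True)
    ```

    In this toy example the directory scores well on quality and importance, but the news site wins overall because most of its outlinks lead to previously unseen URLs, which is exactly the trade-off the patent seems concerned with.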

  50. I would disagree with the use of Digg as a seed site, in my opinion, it would be a poor choice, certainly with regards to the trust part of it.

    I personally like Digg, but it’s full of what most logical people would consider spam links. Although those links never make it anywhere, so it’s likely that search engines ignore them, this raises another issue.

    If you look at what goes popular on Digg (the most likely choice to use as seeds), you’ll notice that the same names appear over and over again. What goes popular on Digg is controlled by a very small portion of the Digg community. Therefore, if search engines used it as a seed, they would be relying upon the opinion of a very small number of people, and if you look at what goes popular anyway, I would hardly consider it good authoritative web content… unless the search engines really want to be filled with videos of dogs farting LOL!!!!11111 and pandas sneezing *sigh*. The same goes for things like Twitter and Facebook: they’re great meme trackers as people share links, but hardly what I would consider a good source of quality content.

    I know other people here have dismissed DMOZ, but I think it’s perfectly likely it’s still used as a potential seed. OK, it’s not perfect, and it’s outdated, but there is at least some sort of editorial control, which means that for a site to be in there, someone must have considered it worthy by comparing it to a set of guidelines (something not present on Digg). And let’s be honest, search engine spiders are sophisticated enough now to be able to dismiss the outdated stuff. Remember, these are seeds; it’s not that the spiders will follow the links from DMOZ and then stop.

    Other likely contenders for seed sites, in my opinion, would be respected media outlets such as Wired, the BBC, NBC, CNET, etc., universities, and maybe even one or two very large company sites.

    Remember, all they’re looking for is a quality source of links to start the crawl, not an exhaustive list of all the links on the web.

  51. Hi Jim,

    Thanks for sharing your thoughts on this topic.

    The patent filing does provide a kind of cost/benefit analysis to decide which sites might be good ones to use as starting points for Web crawls. I think many of the limitations that you point out about Digg can limit its usefulness as a seed site – unless perhaps, as you note, the search engines really perceive a need to index those dog and panda videos. :)

    There are definitely some benefits to using DMOZ, but the concern there is: how likely is it that pages linked to there will help a search engine find fresh URLs to crawl?

    Media outlets may be a good choice, depending upon their linking policies. If a news site shows a tendency not to link out to sources of stories, it may not be a very good choice either.

    One of the good things about the approach behind the patent is that a search engine could test a number of potential seed sites and compare them against each other, to see how helpful they might be in uncovering information about different topics and markets. The usefulness of selected seed sites could be tracked over time, and choices could be updated every so often as well.

    You make a good point about these seed sites being a quality starting point. As the inventors noted in the patent filing, the higher the quality of a seed site, the higher the quality of the pages they were uncovering in their crawls.
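
    To make that testing idea concrete, here’s a minimal sketch of how a crawler operator might track the yield of candidate seed sites over repeated crawls and keep the best performers. The class, the site names, and the numbers are all hypothetical, invented purely for illustration; the patent itself doesn’t spell out an implementation like this:

    ```python
    # Illustrative sketch: compare candidate seed sites by recording how many
    # previously unseen URLs each seed led to per crawl, then keep the seeds
    # with the best average yield. All names and figures are made up.

    from collections import defaultdict

    class SeedTracker:
        def __init__(self):
            self.discovered = defaultdict(int)  # seed -> total new URLs found
            self.crawls = defaultdict(int)      # seed -> number of crawls run

        def record_crawl(self, seed, new_urls_found):
            """Log the outcome of one crawl started from this seed."""
            self.discovered[seed] += new_urls_found
            self.crawls[seed] += 1

        def average_yield(self, seed):
            """New URLs discovered per crawl, on average."""
            return self.discovered[seed] / max(self.crawls[seed], 1)

        def top_seeds(self, n):
            """The n seeds with the best average yield so far."""
            return sorted(self.discovered, key=self.average_yield,
                          reverse=True)[:n]

    tracker = SeedTracker()
    tracker.record_crawl("slow-directory.example", 40)
    tracker.record_crawl("slow-directory.example", 35)
    tracker.record_crawl("bookmarking-site.example", 300)
    tracker.record_crawl("bookmarking-site.example", 280)
    ```

    Re-running this kind of comparison every so often is what would let a search engine update its seed choices as sites rise and fall in usefulness.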

  52. I don’t know.
    Why do search engines trust the DMOZ directory?
    Some DMOZ directory editors publish their own sites only to get traffic, and reject their competitors’ sites.
    I do not trust them.

  53. Hi SEO Dart,

    I think one of the attractive aspects of DMOZ isn’t so much the sites listed as it is the organization of the directory into topical and regional categories.

    As you note, it’s possible that some of the sites listed in the directory may be influenced by editor bias or even intentional manipulation. My concern about using DMOZ as a seed site goes more to the likelihood that a search engine might not discover many new URLs if it starts crawls at DMOZ on a regular basis.

    I’d love to see some kind of data visualization involving how many new sites are added to the directory, how many are removed, and how many are edited to point to new domains or URLs.

  54. Interesting post!

    DMOZ is not great, as it takes too long to get into, but Twitter and Facebook change too frequently and don’t capture age…

  55. Hi Jonathan,

    Thanks. DMOZ isn’t scaling and growing at a rate comparable with the rest of the Web.

    Twitter and Facebook do change very frequently, but it’s possible that they can be used to capture age. One of the things attached to status updates and tweets is a post date, which indicates when something was published. That’s something that a search engine crawling program might find pretty useful. The same is true with updates on Wikipedia, where you can discover when something was added to a Wikipedia article.

  56. Hi Bill,

    I agree; Twitter and Facebook definitely have a part to play! The web moves quickly, and many of the searches that Google receives are for topical things, things of the moment. Therefore the social media giants have a large part to play in sifting through what’s topical and hot.

  57. Hi James,

    News is circulating of Yahoo closing down the social bookmarking site Delicious. One of the reasons I find that sad is because there’s a good chance that it’s used as a seed site for some searches as well.

  58. Yahoo is planning to close Delicious? That’s one of the few companies/services owned by Yahoo that is actually popular. That would be a real shame as I agree, I’m sure Delicious features heavily in Google’s seed database.

  59. Hi Martin,

    Yahoo reportedly laid off about 600 employees before the holidays, and there were statements that they would be consolidating some services that duplicated other services, as well as closing some services that weren’t part of their “core” offerings. Delicious was speculated to be amongst those services, though it’s hard to tell for certain at this point whether it is something that they will close.

    I’d hate to see it go as well, and I think it is probably useful as a seed site for any of the major search engines.

  60. @Bill I think Yahoo has enough power with their services, e.g. Yahoo BOSS site search and the Yahoo Developer Center. They’re just cutting the unuseful projects that they earn no money from. And this is the right way :-)

  61. Hi Sascha,

    I was happy to see that Yahoo decided to sell Delicious to the YouTube founders recently, because that means that it will be more likely to stick around and to be built upon and improved in positive ways. Yahoo seems lost, and I hope they do find their way.

  62. Hi Bill,

    Yes, Yahoo does seem lost. I think they are trimming some of the fat, though, and putting their resources into their key profit centers. Personally I think that Yahoo needs to streamline their “look” like Google. Google has a very bare-bones look to it, a web 2.0 look if you will. Yahoo’s interface, on the other hand, is very busy and discombobulated. I think this needs to be remedied, along with their search algorithms, to compete with big G.

  63. Hi Dennis,

    I think the differences between Google’s look and feel and Yahoo’s stem all the way back to their roots: Google as a search engine, and Yahoo as a directory. Google has long held that spare look, focused upon being a resource that people would leave quickly but come back to over and over again. Yahoo, on the other hand, as a directory and portal, has always attempted to attract people to their pages and keep them there.

    Yahoo likely won’t be modifying too much of their search algorithms in the near future with Microsoft now running their search database.

Comments are closed.