Some patents from the search engines provide detailed looks at how those engines might perform some of their core functions. By “core functions,” I mean some of the basics such as crawling pages, indexing those pages, and displaying the results to searchers.
For example, last December I wrote a post titled Google Patent on Anchor Text and Different Crawling Rates, about a Google patent filed in 2003 that gave us a look at how the search engine crawled web pages and collected the web addresses, or URLs, of the pages it came across.
The patent the post covered was Anchor tag indexing in a web crawler system, and it revealed how Google might determine how frequently to visit or revisit certain pages, crawling some pages daily and others on a real-time or near real-time basis – every few minutes in some cases. While there has been a lot of discussion online in the past few months about real-time indexing of web pages, it’s interesting to note that the patent was originally filed in 2003.
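To make the idea of different crawling rates a little more concrete, here’s a minimal Python sketch of how a scheduler might assign pages to revisit tiers based on how often they appear to change. The tier names, thresholds, and intervals are my own illustrative assumptions, not values taken from the patent.

```python
from datetime import timedelta

# Hypothetical crawl tiers: a base layer revisited infrequently, a daily
# layer, and a real-time layer revisited every few minutes. The names and
# numbers here are made up for illustration.
CRAWL_TIERS = {
    "real_time": timedelta(minutes=5),
    "daily": timedelta(days=1),
    "base": timedelta(days=10),
}

def choose_tier(changes_per_day: float) -> str:
    """Pick a revisit tier from how often a page's content has been
    observed to change (changes per day). Cutoffs are illustrative."""
    if changes_per_day >= 1.0:
        return "real_time"
    if changes_per_day >= 0.1:
        return "daily"
    return "base"

def next_crawl_delay(changes_per_day: float) -> timedelta:
    return CRAWL_TIERS[choose_tier(changes_per_day)]

# A page that changes several times a day lands in the real-time tier and
# would be revisited every few minutes; a rarely changing page waits days.
print(next_crawl_delay(3.0))   # 0:05:00
print(next_crawl_delay(0.02))  # 10 days, 0:00:00
```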
That older patent also covered topics such as how a search engine crawler might handle temporary (302) redirects differently from permanent (301) redirects: noting and sometimes following the temporary redirects immediately, to decide which page to show in search results, while collecting the URLs associated with permanent redirects and placing them in a queue where they might be addressed later – sometimes a week or more later.
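That distinction might look something like the following Python sketch, in which a crawler looks at the target of a temporary (302) redirect right away but only queues the target of a permanent (301) redirect for a later pass. The queue, helper names, and one-hop handling are simplified assumptions on my part, not mechanics spelled out in the patent.

```python
import urllib.error
import urllib.request
from collections import deque

# Targets of permanent redirects wait here for a later crawl pass.
pending_urls = deque()


class NoFollow(urllib.request.HTTPRedirectHandler):
    """Surface redirects to the caller instead of silently following them."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # causes an HTTPError to be raised for 3xx responses


opener = urllib.request.build_opener(NoFollow)


def handle_url(url: str) -> str:
    """Return the URL to associate with this fetch, following a 302
    immediately and deferring the target of a 301 to a later pass."""
    try:
        opener.open(url)
        return url  # no redirect: index the page as fetched
    except urllib.error.HTTPError as err:
        target = err.headers.get("Location", "")
        if err.code == 302 and target:
            # Temporary redirect: look at the destination now so a decision
            # can be made about which page to show in search results.
            return target
        if err.code == 301 and target:
            # Permanent redirect: queue the new URL to be crawled in a
            # later pass, possibly much later.
            pending_urls.append(target)
        # Any other status: keep the original URL for this simple sketch.
        return url
```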