As a webmaster, when you put a page up on the Web, there may be parts of that page that you don't want indexed by a search engine.
Many web pages contain information that isn't unique to each page, such as a site's navigation, copyright notices, advertising, links to other sites such as blogrolls, and other sections that may not contain information about the main topic of the page itself.
Yahoo’s Robots-Noindex Classes
In May 2007, Yahoo published a post on the Yahoo Search Blog, titled Introducing Robots-Nocontent for Page Sections, about how webmasters could let the search engine know that content in certain sections of their pages shouldn't be returned in search results to searchers.
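Yahoo's announcement described marking a section with a `robots-nocontent` class value (for example, `<div class="robots-nocontent">`). As a rough sketch of how an indexer might honor that markup, here is a Python example using the standard library's html.parser to drop marked sections before indexing; the page snippet is invented for illustration, and the sketch ignores complications such as void elements:

```python
from html.parser import HTMLParser

class NocontentStripper(HTMLParser):
    """Collects page text while skipping any element
    marked with class="robots-nocontent"."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a nocontent element
        self.text = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text.append(data)

# Hypothetical page: navigation marked nocontent, review text left indexable.
page = ('<div class="robots-nocontent"><p>Site navigation</p></div>'
        '<p>Camera review text</p>')
parser = NocontentStripper()
parser.feed(page)
print("".join(parser.text).strip())  # -> Camera review text
```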
Continue reading Which Sections of Your Web Pages Might Search Engines Ignore?
Yahoo was granted a patent this week that describes how the anchor text in links may be used to increase the relevance ranking of the page that anchor text points to. The patent was originally filed in 2002, and it discusses how anchor text might work while naming the Altavista search engine as a possible place where the methods it describes might be implemented. Yahoo acquired the company that owned Altavista, so the technology is theirs.
While the patent is fairly old, it provides some details about how anchor text might be used by a search engine in a search index that may not be widely known.
It's fairly common knowledge that the major commercial search engines pay attention to the anchor text of links pointing to a page, and may consider that page even more relevant for a query term if the term appears not only on the page but also in the anchor text of links pointing to it. Some pages may even be determined to be relevant for words that they don't contain, but which show up in links to those pages.
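One way to picture that is a toy inverted index in Python, where terms from inbound anchor text are credited to the target page with an assumed extra weight; the pages, links, and boost value below are all hypothetical:

```python
from collections import defaultdict

# Hypothetical page bodies and inbound links: (source, target, anchor text).
pages = {
    "example.com/widgets": "We sell many products here.",
}
links = [
    ("blog.example.org/post", "example.com/widgets", "blue widgets"),
    ("news.example.net/item", "example.com/widgets", "widget store"),
]

# Inverted index: term -> {url: score}; anchor-text terms get an assumed boost.
index = defaultdict(lambda: defaultdict(float))
for url, body in pages.items():
    for term in body.lower().split():
        index[term.strip(".")][url] += 1.0
for _source, target, anchor in links:
    for term in anchor.lower().split():
        index[term][target] += 2.0  # anchor terms weighted more than body terms

# The page can now rank for "widgets" even though that word
# never appears in its own body text.
print(index["widgets"]["example.com/widgets"])
```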
Continue reading Yahoo Patents Anchor Text Relevance in Search Indexing
Search engine optimization is an ever-growing and ever-changing field, and as search engines and the Web change, so does SEO.
There are no classrooms or college courses, and no single site, conference series, or book that can help you keep up with those changes.
Paying attention to a lot of blogs, news reports, press releases, and other sources of information can help provide some insights about changes in SEO, and discussions at forums, conferences, and social sites can present plenty of signals, and plenty of noise, about what might be new in search. It's not always easy, and sometimes not even possible, to distinguish between the signal and the noise.
I look at a lot of patent filings and papers from the search engines here because they can provide views of how search engines may work from the perspective of the search engines. I consider them primary sources because they come directly from the search engines, but even those sources often only provide glimpses of possibilities rather than actual insights into how search engines function.
Perhaps the best value that may be taken from search engine patent filings isn’t so much the processes that they describe, but rather the hints of assumptions behind some of the methods and systems that they present.
Continue reading How a Search Engine Might Rank Bookmark Sets, Playlists, Directory Pages, and other Collection Items
When you do a search for some terms over at Google, you might get a mix of results from different types of searches, including Web pages, news stories, images, videos, book listings, and others.
While we’ve been seeing results like this for over a year, we really haven’t heard much from Google on how they go about deciding what to show us where within search results.
We now have some ideas on how those results are blended together, straight from Google, through a patent application published this week at the US Patent and Trademark Office.
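As a hedged illustration of one blending strategy, not necessarily Google's actual method, results from different verticals might be put on a comparable footing by normalizing each vertical's scores and then interleaving by normalized score; the verticals and raw scores below are invented:

```python
# Hypothetical per-vertical results as (title, raw score) lists.
# Raw scores are on different scales, so they can't be compared directly.
verticals = {
    "web":    [("Web page A", 9.0), ("Web page B", 7.5)],
    "news":   [("News story C", 120.0)],
    "images": [("Image D", 0.8), ("Image E", 0.6)],
}

def blend(verticals):
    """Normalize each vertical's scores to [0, 1] against its own
    top result, then interleave everything by normalized score."""
    merged = []
    for name, results in verticals.items():
        top = max(score for _title, score in results)
        for title, score in results:
            merged.append((score / top, name, title))
    merged.sort(reverse=True)
    return [(name, title) for _score, name, title in merged]

for name, title in blend(verticals):
    print(f"{name:7s} {title}")
```

A real system would also have to decide how aggressively to promote each vertical for a given query, which is a harder problem than the normalization shown here.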
David Bailey, one of the inventors listed on the patent, gave us a look Behind the scenes with universal search at the Official Google Blog last year, where he told us of one of the challenges behind Universal Search:
Continue reading How Google Universal Search and Blended Results May Work
Web pages can contain a lot of information about various types of objects such as products, people, papers, organizations, and so on. Information about those objects may be spread out on different pages, at different sites.
For example, a page may host a product review of a particular model of camera, and another page may present an ad offering to sell that model of camera at a certain price.
One page might display a journal article, and another page could be the homepage for the author of that article.
Someone searching for information about the camera, or about the author, may need information contained in both pages, and may have to use a search engine to locate multiple pages to find everything they need.
If there were a way for a search engine to automatically identify when information on different web pages relates to the same object, that might be helpful to searchers in a number of ways.
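A minimal sketch of that idea in Python: extract a record from each page, normalize a key such as a product model name, and group pages that describe the same object. The records and the normalization rule are assumptions for illustration only:

```python
from collections import defaultdict

# Hypothetical records extracted from different pages about objects.
records = [
    {"url": "reviews.example.com/1", "type": "review", "model": "Acme X-100"},
    {"url": "shop.example.com/2",    "type": "offer",  "model": "ACME X100"},
    {"url": "blog.example.com/3",    "type": "review", "model": "Acme Z-5"},
]

def object_key(model):
    """Normalize a model name so superficially different
    strings ("Acme X-100" vs. "ACME X100") match."""
    return "".join(ch for ch in model.lower() if ch.isalnum())

# Group page URLs by the object they appear to describe.
groups = defaultdict(list)
for rec in records:
    groups[object_key(rec["model"])].append(rec["url"])

print(groups["acmex100"])  # the review page and the offer page, together
```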
Continue reading How Search Engines Can Index Pages in Parts
Many web pages contain more than one topical section, or block, which may make it difficult for a search engine to tell what a page is about when it is trying to index that page.
These blocks may include such things as a main content area, navigation bars, headings, footers, advertisements, and other content that may refer to other pages on a site, or on other sites.
The Value of Knowing the Most Important Block
Being able to identify a block within a web page that represents the primary topic of that page may help a search engine decide which words are the most important ones on the page when it tries to associate the page with keywords that someone might search with to find that page.
Identifying that content might also help the search engine decide what topic is most relevant to any ads that they might show on the page if they are an advertising partner with the publisher of the page.
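One simple heuristic, sketched here in Python, scores each block by its text length discounted by link density, on the assumption that main content tends to be long and link-sparse while navigation and ads are short and link-heavy; the blocks and the formula are illustrative, not the method from any particular patent:

```python
# Hypothetical page blocks: (name, text length in words, number of links).
blocks = [
    ("navigation", 30, 25),
    ("main",       400, 5),
    ("footer",     20, 10),
    ("ads",        40, 8),
]

def block_score(words, links):
    """Long, link-sparse blocks are likelier to be the main content:
    score by word count, discounted by the fraction of linked text."""
    link_density = links / max(words, 1)
    return words * (1.0 - min(link_density, 1.0))

best = max(blocks, key=lambda b: block_score(b[1], b[2]))
print(best[0])  # -> main
```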
Continue reading Search Engines, Web Page Segmentation, and the Most Important Block
One of the technical issues that can cause problems with a search engine crawling a site to index its pages is when the content of pages on that site appears more than once on the site at different URLs (Uniform Resource Locators, or web page addresses).
Unfortunately, this problem happens more frequently than it should.
A new patent application from Yahoo explores how they might handle dynamic URLs to avoid this problem. What is nice about the patent application is that it identifies a number of the problems that might arise because of duplicate content at different web addresses on the same site, and some approaches that they might use to solve the problem.
While search engines like Yahoo can resolve some of the issues around duplicate content, it's often in the best interest of site owners not to rely upon search engines to fix this problem on their own.
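A sketch of one such approach in Python: canonicalize dynamic URLs by sorting their query parameters and dropping ones assumed to carry no content, such as session and tracking IDs, so duplicate pages at different addresses map to one canonical URL. The ignored parameter names are illustrative assumptions:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters assumed (for illustration) to carry no page content.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source"}

def canonicalize(url):
    """Sort the query parameters and drop session/tracking ones, so
    duplicate pages at different dynamic URLs share one address."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query)
              if k.lower() not in IGNORED_PARAMS]
    query = urlencode(sorted(params))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

a = canonicalize("http://example.com/item?id=7&sessionid=abc123")
b = canonicalize("http://example.com/item?sessionid=xyz789&id=7")
print(a == b, a)  # True http://example.com/item?id=7
```

A crawler keeping one entry per canonical URL would then fetch this page once instead of once per session ID.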
Avoiding the Crawling of Duplicate Pages
Continue reading Same-Site Duplicate Pages at Different URLs
The order in which pages appear in a search engine's results may be influenced by the number of pages that link to each page, and by the rankings of the pages providing those links.
When a site is linked to by a popular and trusted domain, that link might provide more value (and a higher ranking) than a link from a site that is less popular and trusted.
Ages of Linking Domains
A new patent application from Microsoft adds another twist, by also ranking domains based upon the ages of domains which link to those domains.
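As a toy Python sketch of that twist, each inbound link could be weighted by the age of its linking domain; the linear-in-years weighting, the domains, and the dates are all assumptions made purely for illustration:

```python
from datetime import date

# Hypothetical dates when each linking domain was first seen.
domain_first_seen = {
    "old-trusted.example": date(1998, 3, 1),
    "brand-new.example":   date(2008, 1, 15),
}

def link_weight(source_domain, today=date(2008, 6, 1)):
    """Older linking domains contribute more weight
    (assumed to grow linearly with age in years)."""
    age_days = (today - domain_first_seen[source_domain]).days
    return 1.0 + age_days / 365.0

# A link from a decade-old domain outweighs one from a months-old domain.
print(round(link_weight("old-trusted.example"), 1))
print(round(link_weight("brand-new.example"), 1))
```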
Continue reading Do Domain Ages Affect Search Rankings?