What concepts does your website cover?
A search engine might look at phrases that you use on your pages to get an idea of the concepts covered by your site.
The search engine might try to decide that certain phrases you use are the “top phrases” that describe topics or concepts about your site.
But what if the search engine is wrong?
What if those top phrases don’t reflect the content of your site accurately? What if some other phrases more meaningfully indicate what your site is about?
If a search engine assigned phrases to your site, and those phrases affected how your pages are presented to searchers in response to queries, would you want the chance to change the phrases that the search engine thinks your site is about?
New Google Phrase-Based Indexing Patent Filing
A new patent filing from Google describes a way for web site owners and site administrators to view the top phrases assigned to their sites by a phrase-based indexing system developed by Google, and to add related phrases to help in the indexing of their pages. It’s one of a number of Google patent filings involving phrase-based indexing.
Most search engines tend to index web pages based upon individual terms found within those pages rather than upon the concepts contained in them. Concepts are often expressed in phrases, and when certain phrases appear together on the same page, that may be able to tell us a fair amount about the topic of that page.
A few years ago, Google published a series of patent filings that explored a phrase-based indexing system. That system looks at how related phrases are used in the content of pages in order to index those pages, understand their topics, provide personalized search to searchers, locate duplicate content, and identify web spam.
I’ve written a few earlier posts about this phrase-based indexing:
- Google Phrase Based Indexing Patent Granted
- Phrase Based Information Retrieval and Spam Detection
- Google Aiming at 100 Billion Pages?
- Move over pagerank: Google’s looking at phrases?
It is possible that Google is using this phrase-based indexing system or something very similar to it, but we don’t know that for certain. If they are, it’s possible that at some point they might tell us what they believe are the top phrases for our web sites, and let us suggest changes to those phrases.
The Google patent application is:
Integrating External Related Phrase Information into a Phrase-based Indexing Information Retrieval System
Invented by Anna L. Patterson
Assigned to Google
US Patent Application 20090070312
Published March 12, 2009
Filed September 7, 2007
An information retrieval system uses phrases to index, retrieve, organize and describe documents, analyzing documents and storing the results of the analysis as phrase data.
Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified.
Changes to existing phrase data about a document collection submitted by a user is captured and analyzed, and the existing phrase data is updated to reflect the additional knowledge gained through the analysis.
Top Phrases for a Site
When the phrase-based indexing system explores phrases that show up on your pages, it will also look for related phrases that show up on other pages and other sites that use your phrases as well. The appearance of phrases and related phrases on your site and other sites is an important part of how this system indexes pages.
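As a rough illustration of phrases showing up together, here is a minimal Python sketch that counts how many pages contain each pair of phrases. The sample pages, the phrase list, and the function names are made up for this example, and simple co-occurrence counting is only a stand-in for illustration. It is not the method described in the patent filing.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample pages and candidate phrases, made up for illustration.
PAGES = {
    "page-a": "The baseball stadiums here have a huge scoreboard near home plate.",
    "page-b": "Old baseball stadiums often kept a manual scoreboard.",
    "page-c": "Football stadiums rarely have a pitcher's mound.",
}
PHRASES = ["baseball stadiums", "scoreboard", "home plate", "football stadiums"]


def phrases_on_page(text, phrases):
    """Return the set of candidate phrases found in the text (case-insensitive)."""
    lowered = text.lower()
    return {p for p in phrases if p in lowered}


def cooccurrence_counts(pages, phrases):
    """Count, for each pair of phrases, how many pages contain both."""
    counts = Counter()
    for text in pages.values():
        # Sort so each pair is always stored in the same order.
        found = sorted(phrases_on_page(text, phrases))
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(PAGES, PHRASES)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs that turn up together on many pages are candidates for being "related" in this simplified sense. A real system would need something more careful than raw counts.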
In addition to seeing the pages on which particular phrases and related phrases occur, the indexing system can also determine a “set of representative or significant phrases for a particular web site.”
The “top phrases” for a web site might be considered indications “of the queries for which the website is likely to be relevant.”
On a web site, the phrase-based indexing system might look at each page on the site to decide upon the top phrases for each page based upon methods involving phrase-based indexing, and then aggregate those top page phrases to determine the top phrases for the site as a whole.
Phrases on pages that are closer to the root directory of the site (the directory where the home page exists) might be given more weight than phrases on pages that are deeper in the page hierarchy of the site.
So, if the phrase “baseball stadiums” is found at a page with the URL “http://www.example.com/baseball-stadiums.htm,” and the phrase “football stadiums” is found on a page with the URL “http://www.example.com/football/football-stadiums.htm,” the phrase “baseball stadiums” might be given a higher score than “football stadiums.”
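The aggregation and depth weighting described above might be sketched along these lines in Python. The per-page scores, the function names, and the halving of weight per directory level (the `decay` parameter) are all assumptions made for illustration. The patent filing doesn't give a formula in the passages quoted here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def directory_depth(url):
    """Number of directories between the site root and the page."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if path.endswith("/"):
        # A path such as /football/ names a directory, so every segment counts.
        return len(segments)
    # Otherwise the last segment is the page file itself, not a directory.
    return max(len(segments) - 1, 0)


def site_top_phrases(page_phrases, decay=0.5):
    """Aggregate per-page phrase scores into site-level scores.

    page_phrases maps a URL to a {phrase: score} dict.
    Each page's scores are multiplied by decay ** depth (an assumed weighting),
    so pages nearer the root count for more.
    """
    totals = defaultdict(float)
    for url, phrases in page_phrases.items():
        weight = decay ** directory_depth(url)
        for phrase, score in phrases.items():
            totals[phrase] += weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-page scores for the two example URLs.
pages = {
    "http://www.example.com/baseball-stadiums.htm": {"baseball stadiums": 1.0},
    "http://www.example.com/football/football-stadiums.htm": {"football stadiums": 1.0},
}
ranked = site_top_phrases(pages)
print(ranked)
```

With equal page-level scores, “baseball stadiums” ends up ahead of “football stadiums” only because its page sits one directory closer to the root.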
The patent application covers how a site owner might change the “top phrases” for a site, how Google might look at those suggested changes, and what impact the changes might have on the phrase-based indexing system.
If Google starts letting us look at the “top phrases” for our pages, and gives us a chance to change those phrases, it might be worth looking more deeply at the patent filing.
What might be helpful for site owners now is to look at the pages of your site and see how well the topics and concepts that you want those pages to express are understood by readers, and how they might be indexed by search engines.
For example, if you do have a page that is intended to describe “baseball stadiums,” do other “related” phrases appear on your page that people might expect to see on a page about those stadiums, such as “ball park,” “bleachers,” “scoreboard,” “big leagues,” “playing field,” “pitcher’s mound,” “home plate,” “infield,” “outfield,” “first base,” “center field,” “dugouts,” and others? What “related” phrases appear on other pages about baseball stadiums?
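A quick way to run that kind of check yourself is sketched below in Python. The list of related phrases is a hand-picked subset of the examples above, the sample page text is made up, and the function name is hypothetical. This is a simple substring check for your own review, not anything a search engine is known to do.

```python
# Hand-picked related phrases for a page about baseball stadiums (illustrative only).
RELATED = [
    "ball park", "bleachers", "scoreboard", "home plate",
    "infield", "outfield", "dugouts",
]


def related_phrase_coverage(page_text, related_phrases):
    """Split the related phrases into those present in and missing from the text.

    Matching is a case-insensitive substring test, so "infield" would also
    match inside "infielder". That is good enough for a rough review.
    """
    lowered = page_text.lower()
    present = [p for p in related_phrases if p in lowered]
    missing = [p for p in related_phrases if p not in lowered]
    return present, missing


# Made-up page copy for the example.
sample = (
    "Our guide to baseball stadiums covers the bleachers, the scoreboard, "
    "and the view from behind home plate."
)
present, missing = related_phrase_coverage(sample, RELATED)
print("Present:", present)
print("Missing:", missing)
```

The "missing" list isn't a to-do list of words to stuff into a page. It is a prompt to ask whether the page really covers what readers would expect it to.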
In a phrase-based indexing system, how well your pages use phrases and “related” phrases, and how those phrases are used on other sites, might determine how well your pages get indexed in the search engine. The “top phrases” for your pages may be the ones that a search engine decides will be most relevant for queries people use to find your pages.
If Google isn’t using a phrase-based indexing system, the use of related phrases might still expand the number of queries that your pages show up for in search results.