SEO is Undead Again (Profiles, Phrases, Entities, and Language Models)
SEO and Keyword Matching
I don’t recall clearly when I first started calling what I do SEO, and I really didn’t have an official title at my first in-house SEO position back in 1996. I thought of that role as webmaster, marketing manager, IT department, and technical consultant, and I did whatever else needed to be done. One day, a friend’s sister who worked at Digital Corp sent us an email about a new service they had started called Alta Vista.
That’s probably when we first started thinking seriously about search engines, and their potential to help or to harm businesses. When Google came along, we became a lot more serious about search.
Back in the days just before Google started gaining any popularity, when the leading search engines counted amongst their ranks Alta Vista, Excite, Infoseek and Lycos, a paper titled What is a tall poppy among web pages? by Glen Pringle, Lloyd Allison and David L. Dowe explored the possible decision trees that those search engines might have used to decide how to rank pages.
They gave us the following list of possible ranking signals:
- Number of times the keyword occurs in the URL.
- Number of times the keyword occurs in the document title.
- Number of words in the document title.
- Number of times the keyword occurs in meta fields – typically keyword list and description.
- Number of times keyword occurs in the first heading tag <H?>.
- Number of words in the first heading tag.
- Total number of times the keyword occurs in the document including title, meta, etc.
- Length of the document.
That list doesn’t vary too much from the kind of SEO analysis that many these days refer to as on-page SEO. But if you look carefully at the list, its focus is upon matching keywords in a document to the keywords used in a query.
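To make that list concrete, here is a minimal sketch of how those eight signals might be counted for one page. It is not the method from the paper: the function and field names are my own, it assumes a single-word keyword, and it uses only Python's standard-library HTML parser.

```python
import re
from html.parser import HTMLParser


class SignalParser(HTMLParser):
    """Collects title text, first-heading text, meta fields, and all text."""

    def __init__(self):
        super().__init__()
        self.title, self.first_heading, self.meta, self.body = [], [], [], []
        self._current = None        # "title", "heading", or None
        self._heading_done = False  # only the first <h1>..<h6> is tracked

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif re.fullmatch(r"h[1-6]", tag) and not self._heading_done:
            self._current = "heading"
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.meta.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title" and self._current == "title":
            self._current = None
        elif re.fullmatch(r"h[1-6]", tag) and self._current == "heading":
            self._current = None
            self._heading_done = True

    def handle_data(self, data):
        if self._current == "title":
            self.title.append(data)
        elif self._current == "heading":
            self.first_heading.append(data)
        self.body.append(data)  # title and heading text also count toward the total


def words(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count(keyword, text):
    return words(text).count(keyword.lower())


def keyword_signals(keyword, url, html):
    """Return the eight keyword-matching signals for one page (hypothetical names)."""
    parser = SignalParser()
    parser.feed(html)
    title = " ".join(parser.title)
    heading = " ".join(parser.first_heading)
    meta = " ".join(parser.meta)
    everything = " ".join(parser.body + parser.meta)
    return {
        "keyword_in_url": count(keyword, url),
        "keyword_in_title": count(keyword, title),
        "title_words": len(words(title)),
        "keyword_in_meta": count(keyword, meta),
        "keyword_in_first_heading": count(keyword, heading),
        "first_heading_words": len(words(heading)),
        "keyword_total": count(keyword, everything),
        "document_words": len(words(everything)),
    }
```

Every one of these numbers comes from comparing the query word with the words on the page, which is the point: nothing here looks at links, searchers, or meaning.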
When Google brought PageRank to search, we started thinking more about the importance of links pointing to sites, and about the anchor text in those links. But SEO’s focus still seemed to be upon whether or not the keywords used in queries were also used on a page, or in the links pointing to it.
Reranking Search Results
In November of 2003, something at Google changed, and the comfortable rankings that many websites had attained suddenly shifted. The change was referred to as the Florida update, following a practice developed at the Webmaster World forum of naming different Google updates, which seemed to happen every 4-5 weeks, like you would a hurricane. There have been a lot of theories about what might have caused that change in rankings.
One that I like, but won’t insist was the cause of the massive upheaval in rankings, is explored in a Google patent entitled Ranking search results by reranking the results based on local inter-connectivity.
What’s interesting about the patent is that it describes a way to take a certain number of the top search results from Google for a specific query, and rerank them based upon how they link amongst each other. For example, if you look at the top 100 pages that show up in the search results, and see which pages link to other pages in those results, the pages that are most linked to may be boosted in the rankings for that query.
The Local Interconnectivity approach may or may not have been the cause of the changes from the Florida update, but the kind of reranking that it describes could cause the kinds of changes seen.
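As a rough illustration of that kind of reranking, the sketch below takes an already ranked list of URLs and counts how many other pages in the top set link to each one. The way the original position and the local link count are combined, along with the names `rerank_by_local_links`, `outlinks`, and `weight`, are my own assumptions and are not taken from the patent.

```python
def rerank_by_local_links(ranked_urls, outlinks, top_n=100, weight=1.0):
    """Rerank the top_n results by how often other top results link to them.

    ranked_urls: URLs in their original ranking order (assumed unique).
    outlinks:    dict mapping a URL to the URLs it links to.
    weight:      how much the local link score counts (an assumed knob).
    """
    top = ranked_urls[:top_n]
    top_set = set(top)

    # Count links that come from *other* pages inside the top set only.
    votes = {url: 0 for url in top}
    for source in top:
        for target in set(outlinks.get(source, ())):
            if target in top_set and target != source:
                votes[target] += 1

    n = len(top)

    def score(item):
        position, url = item
        original = (n - position) / n          # 1.0 for the first result
        local = votes[url] / max(1, n - 1)     # share of other top pages linking in
        return original + weight * local

    reranked = [url for _, url in sorted(enumerate(top), key=score, reverse=True)]
    # Results below the cutoff keep their original order.
    return reranked + ranked_urls[top_n:]


ranked = ["a", "b", "c", "d"]
links = {"a": ["d"], "b": ["d"], "c": ["d", "a"], "d": []}
print(rerank_by_local_links(ranked, links))
```

In this toy data, page "d" starts in last place, but every other page in the set links to it, so it climbs past "b" and "c". A page could see that kind of jump, or drop, without any change to its own content.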
I’ve written a number of posts about other possible approaches that the search engines may use to rerank and filter results, and many of those filters are part of the evolution of search as well. A few of those posts compiled many of the reranking approaches, and show some of the many ways that we’ve moved on since the early days of keyword matching:
- 20 Ways Search Engines May Rerank Search Results
- 20 More Ways that Search Engines May Rerank Search Results
- Another 10 Ways Search Engines May Rerank Search Results
Looking Beyond Web Publishers and Keyword Matching
Many of the methods described in those posts still rely upon a certain ranking approach that looks at the kind of on-page factors that I listed above combined with information about links from web publishers. But they mostly ignore one of the more interesting kinds of data that the search engines have all been collecting for years – how searchers actually use the web, how they:
- Perform searches
- Browse the Web
- Refine queries during search sessions
- Click upon certain results
- Pass over other search results
- Spend more or less time upon pages
- Bookmark, save, or print other pages
- Interact in other ways with pages browsed and search results seen.
I wrote a post earlier this year, Improved Web Page Classification from Google for Rankings and Personalized Search, about a patent that describes how Google might classify pages based upon profiles that they create for websites, queries, and users. The basic ideas behind it aren’t so different from a post I wrote back in 2007 about a Microsoft patent that describes a similar approach – Personalization Through Tracking Triplets of Users, Queries, and Web Pages.
In a way, both describe how a search engine might transform from displaying pages based upon keyword matching to one that recommends pages based upon actual user behavior and the possible intent behind searches.
A snippet from that post:
Imagine a search engine keeping track of each user (u), as they perform a query (q) on the search engine, and seeing which pages (p) they click upon, and collecting those selections in what they call “triplets” of data, represented like this – (u,q,p).
Then consider that the search engine might map and compare those triplets of information against each other to see what kinds of relationships and associations exist between people making the same searches, faced with similar results. That information could then be used to personalize results shown to individuals.
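Here is a minimal sketch of what comparing those triplets could look like. It is my own illustration, not the method from the Microsoft patent: it scores the pages that other people clicked for the same query, weighted by how much each person's overall click history overlaps with the current searcher's (a simple Jaccard overlap). The users and page names are made up.

```python
from collections import Counter, defaultdict


def personalized_pages(triplets, user, query):
    """Order pages for (user, query) using clicks from similar users.

    triplets: iterable of (user, query, page) click records.
    """
    # Every page each user has clicked, across all queries.
    pages_by_user = defaultdict(set)
    for u, _q, p in triplets:
        pages_by_user[u].add(p)

    mine = pages_by_user.get(user, set())
    scores = Counter()
    for u, q, p in triplets:
        if q != query or u == user:
            continue
        theirs = pages_by_user[u]
        union = mine | theirs
        # Jaccard overlap between the two click histories (assumed similarity measure).
        similarity = len(mine & theirs) / len(union) if union else 0.0
        scores[p] += similarity
    return [page for page, _ in scores.most_common()]


# Hypothetical click log.
log = [
    ("ann", "jaguar", "cars.example"),
    ("ann", "mustang", "cars.example/mustang"),
    ("bob", "jaguar", "zoo.example"),
    ("bob", "lion", "zoo.example/lion"),
    ("cat", "mustang", "cars.example/mustang"),
    ("dan", "lion", "zoo.example/lion"),
]
print(personalized_pages(log, "cat", "jaguar"))
print(personalized_pages(log, "dan", "jaguar"))
```

Both "cat" and "dan" type the same word, but the car page comes first for the searcher whose history looks like a car fan's, and the zoo page comes first for the other. The keyword is identical; the ordering comes from behavior.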
Imagine that instead of just showing pages in response to that type of user data, the search engines also started spending more time presenting possible query refinements to get at the heart of the intent behind a search – whether there was a certain informational need or situational task that a searcher might be trying to address.
When people talking about search engines throw the word “semantics” into a description of search, there’s often a discussion of mathematical models that might be used to try to understand the meanings behind words. There’s sometimes a reference to the company Applied Semantics, which Google acquired back in 2003, but it seemed on the surface that the methods that Applied Semantics brought to Google were being used with Google’s advertising.
A Google patent published earlier this year, which I wrote about in Search Based upon Concepts: Applied Semantics and Google, describes how the Applied Semantics method could be used in Web search, providing a more interactive approach to search that involves showing suggested query refinements that may expand original queries to go beyond keyword matching to get at the meaning behind a query.
The globalization of search also means that search engines need to understand multiple languages, and return results to searchers around the globe. Google has hinted at becoming a multilingual search engine with the development of language models. These language models are also a key to looking beyond keywords found upon a page.
For example, if you were to take a phrase like “auto mechanic,” and translate it into French, and then translate it back into English, there might be a few reasonable results found in that translation back, such as “car mechanic,” and “automobile mechanic,” and “auto mechanic.” In my post Google Synonyms Update, I described some of the approaches to synonyms that Google might be using to expand queries for searchers that broaden search results in reasonable ways.
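The round trip in that example can be sketched with two tiny lookup tables. The tables below are hand-made stand-ins for a real translation model, and the function name is my own; the point is only to show how translating out and back can surface candidate synonyms.

```python
# Hypothetical, hand-made translation tables (not output from any real system).
EN_TO_FR = {
    "auto mechanic": ["mécanicien automobile"],
}
FR_TO_EN = {
    "mécanicien automobile": ["car mechanic", "automobile mechanic", "auto mechanic"],
}


def round_trip_candidates(phrase, forward, backward):
    """Translate a phrase out and back, collecting every distinct result in order."""
    candidates = []
    for translated in forward.get(phrase, []):
        for back in backward.get(translated, []):
            if back not in candidates:
                candidates.append(back)
    return candidates


print(round_trip_candidates("auto mechanic", EN_TO_FR, FR_TO_EN))
```

A search engine could then treat the phrases that come back as possible expansions of the original query, presumably after checking that they really are used interchangeably.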
Phrases and Named Entities
A number of patent filings from Google look beyond matching individual keywords in documents to identifying and discovering phrases that have unique meanings.
Search engines sometimes took a shortcut in indexing, by leaving some very frequently occurring words out of searches. Terms like “a,” or “the,” or “on,” or “of,” would be passed over. But sometimes those frequently occurring words added meaning in the right context.
Google published a patent about meaningful stopwords and stop-phrases a couple of years ago. If you search for the word “matrix” you tend to see results about mathematics. If you search for “the matrix,” you tend to see results about a movie of the same name.
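A small sketch shows the difference between dropping stop words blindly and keeping them when they belong to a meaningful phrase. How a search engine would actually decide which stop-phrases are meaningful isn't covered here, so the `MEANINGFUL_PHRASES` set below is simply an assumed, hand-written list.

```python
STOP_WORDS = {"a", "the", "on", "of"}
MEANINGFUL_PHRASES = {"the matrix"}  # assumed list, for illustration only


def query_terms(query):
    """Split a query into terms, dropping stop words unless they are
    part of a known meaningful phrase."""
    tokens = query.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Look for the longest meaningful phrase (2+ words) starting at token i.
        for j in range(len(tokens), i + 1, -1):
            phrase = " ".join(tokens[i:j])
            if phrase in MEANINGFUL_PHRASES:
                terms.append(phrase)
                i = j
                break
        else:
            # No phrase matched here: keep the token only if it is not a stop word.
            if tokens[i] not in STOP_WORDS:
                terms.append(tokens[i])
            i += 1
    return terms


print(query_terms("matrix"))
print(query_terms("the matrix"))
print(query_terms("the history of a matrix"))
```

With plain stop word removal, "matrix" and "the matrix" would collapse into the same query. Here the second one survives as a phrase of its own, so it can be matched against different pages.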
Another set of patent filings from Google described how the search engine might distinguish between “good” phrases and “bad” phrases, and rerank search results based upon whether or not a certain number of good phrases appeared in a top number of search results. For example, if you searched for “baseball stadium,” the search engine might look at the top 100 results, and calculate which other “good” phrases appeared in those results. Pages with more of those co-occurring phrases (up to a certain point) might be boosted in search results and rank higher.
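The sketch below is one way to picture that kind of reranking. It assumes the list of "good" phrases is already known, treats a phrase as related to the query when it shows up in at least half of the top results, and caps the boost a page can earn. Those thresholds, and all of the names, are my own assumptions and are not taken from the patent filings.

```python
def rerank_by_related_phrases(results, good_phrases, top_n=100, cap=3, min_share=0.5):
    """Boost top results that contain more of the co-occurring good phrases.

    results:      list of (url, text) pairs in their original ranking order.
    good_phrases: phrases already judged to be "good" (assumed given).
    cap:          maximum number of phrases that count toward a boost.
    min_share:    share of top results a phrase must appear in to count.
    """
    top = results[:top_n]

    # Which good phrases does each top page contain?
    found = [{p for p in good_phrases if p in text.lower()} for _, text in top]

    # Phrases that co-occur across enough of the top results.
    related = {
        p for p in good_phrases
        if sum(p in f for f in found) >= min_share * len(top)
    }

    def boost(index):
        return min(len(found[index] & related), cap)

    # Stable sort: pages with equal boosts keep their original order.
    order = sorted(range(len(top)), key=lambda i: -boost(i))
    return [top[i][0] for i in order] + [url for url, _ in results[top_n:]]


pages = [
    ("a", "Baseball stadium tickets on sale"),
    ("b", "Baseball stadium guide: home plate, outfield wall, and the pitcher's mound"),
    ("c", "Our baseball stadium has a new outfield wall and home plate"),
    ("d", "Stadium parking: home plate gate"),
]
phrases = ["home plate", "outfield wall", "pitcher's mound", "season tickets"]
print(rerank_by_related_phrases(pages, phrases))
```

The page that only repeats the query words drops to the bottom, while the pages that also talk about home plate and the outfield wall move up. The cap keeps a page from winning just by stuffing in every phrase it can find.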
I wrote a post this year, Phrasification and Phrase Posting Lists, about a second generation of patents from Google that explored how the search engine might implement a phrase-based indexing system to rerank search results by creating posting lists that indicate which good phrases pages within its index might contain. Again, the idea is that if a page tends to use many of the same phrases that other pages on the same subject use, it’s more likely to be about that topic.
The phrase-based indexing system also looks at anchor text pointing to pages, and may give more weight to links that use words and phrases that co-occur, or are related phrases.
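A posting list is just a record of which documents contain a given term, and a phrase posting list does the same thing for whole phrases. The sketch below is a toy version with made-up documents and my own function names; a real index would be far more compact and would store much more than document IDs.

```python
from collections import defaultdict


def build_phrase_posting_lists(documents, good_phrases):
    """Map each good phrase to the sorted IDs of the documents containing it.

    documents: dict of doc_id -> text.
    """
    postings = defaultdict(list)
    for doc_id in sorted(documents):
        text = documents[doc_id].lower()
        for phrase in good_phrases:
            if phrase in text:
                postings[phrase].append(doc_id)
    return dict(postings)


def documents_with_all(postings, phrases):
    """Return the IDs of documents that appear in every listed phrase's posting list."""
    lists = [set(postings.get(p, ())) for p in phrases]
    return sorted(set.intersection(*lists)) if lists else []


docs = {
    1: "Baseball stadium with a new outfield wall",
    2: "Home plate and outfield wall repairs",
    3: "Stadium parking",
}
index = build_phrase_posting_lists(docs, ["home plate", "outfield wall"])
print(index)
print(documents_with_all(index, ["home plate", "outfield wall"]))
```

Once the lists exist, finding the pages that share several related phrases is a matter of intersecting lists rather than rereading every page.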
Taking the idea of looking at phrases one step further, some words and phrases can be said to be “entities,” in that they refer to specific people, places, and things. John Wayne is an entity, as is the Empire State Building. A brand is an entity, and an idea is an entity.
A search engine might look at pages on the web, and extract information about those specific “named entities,” and collect that information to provide answers to questions, such as “when was John Wayne born.”
It might also attempt to identify entities in queries, and associate those entities with specific web sites when there seems to be a certain level of confidence about the relationship between a site and an entity. The following posts provide some examples of how named entities might impact search:
- Not Brands but Entities: The Influence of Named Entities on Google and Yahoo Search Results
- How a Search Engine Might Interpret Ambiguous Queries through Entity Tags
- Google and Metaweb: Named Entities and Mashup Search Results?
Search engines are evolving along a number of paths from their early days of keyword matching, which include approaches such as incorporating user-behavior data into ranking pages, creating statistical language models, using semantic ontologies like that from Applied Semantics to become more interactive, understanding phrases better, understanding when phrases may refer to a specific person, place or thing, and more. I’m really just scratching the surface with this post on the many directions that they are taking.
SEO is becoming more complex, but the ultimate goal is still to try to find useful and meaningful results for people trying to fulfill informational and situational needs. Search is changing, and the way that people search is changing as well, whether they try to use a conventional search engine, or even attempt to have a network of friends and associates provide answers on social sites.
If your idea of basic SEO echoes the list above from the 1998 Tall Poppies paper, with back links and PageRank thrown in for good measure, you’re going to find yourself mystified by things like the Google MayDay ranking changes earlier this year, which reduced visits based upon long tail queries for many sites, while increasing them for others.
This is the last of the “SEO is Undead” series.
The first post in the series, SEO is Undead 1 (Links and Keyword Proximity) explored how ideas about linking and how close a keyword might be to other keywords on a page might have transformed since the early days of SEO.
The second post in the series, Son of SEO is Undead (Google Caffeine and New Product Refinements), explored a little more deeply how changes sometimes take place in search and SEO.
What changes have really captured your attention over the years?