Term Frequency and Inverse Document Frequency at Google


Learning About SEO and How It Works With Language on Pages

Besides the inverted index at Google, a couple of concepts you come across when learning SEO are how often words appear on a page and how often they appear in Google’s index of the Web.

Term Frequency

Term Frequency is a measure of how often a term appears on a page. Some terms are common on most pages, for instance articles like “the,” which might be the most common word in the English language. Less common words can also appear frequently, especially if they are the page’s main topic.
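As a rough illustration of counting term frequency on a single page, here is a minimal Python sketch. The page text and the function name `term_frequency` are made up for this example, and the tokenization (lowercasing and pulling out runs of letters) is a simplifying assumption, not how any search engine is known to do it.

```python
import re
from collections import Counter

def term_frequency(text):
    """Count how many times each lowercased word appears in one page's text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical page text, invented for this example.
page = "The cat sat on the mat. The mat was red."
tf = term_frequency(page)

print(tf.most_common(2))  # [('the', 3), ('mat', 2)]
```

Even in two short sentences, “the” comes out on top, which is why it tells you so little about the page.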

“The” is often treated as a stop word, one of a group of words that are so common that they don’t tell you very much about the page they appear on. I wrote about stop words in Google Stopwords and Stop-Phrases.

It’s not unusual for a search engine to know the frequency of words on a page. The idea of looking at term frequency on pages dates from the 1950s.

Inverse Document Frequency

Almost 20 years later, in the 1970s, a related concept started to appear. This concept is Inverse Document Frequency.

It can tell you whether a term is common or rare in a corpus of documents.

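You can get it by dividing the total number of documents in the corpus by the number of documents containing the term. In practice, the logarithm of that ratio is usually taken, so a term that appears in every document gets a weight of zero.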
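As a minimal sketch of that calculation, assuming a tiny made-up corpus and naive whitespace tokenization, the ratio can be computed and passed through a logarithm, which is the common convention. The function name and the choice to return 0.0 for a term that appears nowhere are assumptions for illustration only.

```python
import math

def inverse_document_frequency(term, corpus):
    """Return log(N / df): N documents in the corpus, df of them containing the term."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0  # assumption: a term found nowhere gets no weight
    return math.log(n_docs / doc_freq)

# Hypothetical four-document corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the indeterminacy of translation",
    "the weather is mild",
]

for term in ("the", "cat", "indeterminacy"):
    print(term, round(inverse_document_frequency(term, corpus), 3))
```

Here “the” appears in all four documents and scores 0, “cat” appears in two and scores about 0.693, and “indeterminacy” appears in only one and scores about 1.386.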

Term Frequency and Inverse Document Frequency

You can look at Term Frequency joined together with Inverse Document Frequency. Together, they can tell you whether a page is likely about a certain term. That term would show up a lot on the page, and it could be either common or rare in the index of the Web.
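One common way to join the two is to multiply a term’s count on a page by the logarithm of the corpus ratio. The sketch below does that for a small made-up corpus. It is one textbook variant among several, and the function name, the documents, and the whitespace tokenization are all assumptions for illustration, not a description of what Google runs.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: score} dict per document, score = count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return scores

# Hypothetical corpus, invented for this example.
corpus = [
    "the indeterminacy of translation and the indeterminacy thesis",
    "the cat sat on the mat",
    "the dog chased the cat",
]
scores = tf_idf(corpus)
top_term = max(scores[0], key=scores[0].get)
print(top_term)  # indeterminacy
```

In the first document “the” and “indeterminacy” both appear twice, but “the” is in every document and scores 0, so the rarer word ends up as the term the page looks to be about.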

This approach to term frequency fits in well with understanding where all the words are on the web in an inverted index. Both are very important to search engines and to SEO.
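To make the connection to an inverted index concrete, here is a small sketch that maps each word to the documents containing it. The document ids, texts, and function name are hypothetical. The number of documents listed for a word is the document count that Inverse Document Frequency divides by.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents keyed by made-up ids.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "the indeterminacy of translation",
}
index = build_inverted_index(docs)

print(index["cat"])       # [1, 2]
print(len(index["the"]))  # 3
```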

Some pages are about a specific term because that term appears on that page frequently. That term may be more common or rarer in the Web corpus, depending on how many pages of the Web it appears on. A term such as “indeterminacy” has a specific meaning and appears on fewer pages in Google’s index of the Web. It is a rare word.

As an SEO, you can perform keyword research and create text for a page. You can decide what a page may be about. You are placing that page in the web corpus, and it becomes a document that contains that word. A rarer term may have less competition within that corpus. But it may also be searched for less often by someone who might become a customer of the site it is placed on.

Abbreviating Term Frequency-Inverse Document Frequency

Term Frequency – Inverse Document Frequency is often presented as TF-IDF to shorten the name. Those are concepts search engines know about, and they often appear together because they are so closely related. When I search the USPTO.gov site for patents assigned to Google that mention either concept, I get a little over 350 for each of them. Often the same patent mentions both concepts.

TF-IDF has been part of many algorithms used at Google for a wide range of purposes. Consider that words are a large and important part of the Web index. I remember Term Frequency and Inverse Document Frequency being used in the creation of query refinements that appear at the bottoms of pages of search results at Google. It’s worth seeing where else they appear.

TF-IDF at the USPTO Last Week

Sometimes you will see statements about Term Frequency and Inverse Document Frequency appear on patents in passages such as this one:

In some implementations, the statistical metric may represent an information content of the matching semantic criteria (e.g., based on a term frequency-inverse document frequency (“tf-idf”) where documents correspond to queries). In an illustrative implementation, if a new piece of information is true for 90% of queries, then the new piece of information may not be useful. The tf-idf may include a numerical statistic reflecting how important a word is to a query in a collection or corpus of queries. The tf-idf value may increase (e.g., proportionally) to the number of times a word appears in the corpus of queries but may be offset by the frequency of the word in the corpus.

Term Frequency and Inverse Document Frequency is Appearing in Patents About Entity Properties on the Web

That quote is from the following patent, granted July 6, 2021.

Selecting content using entity properties
Inventors: Henrik Jacobsson
Assignee: Google LLC
US Patent: 11,055,312
Granted: July 6, 2021
Filed: October 19, 2016

Abstract

Systems and methods of the disclosure relate to selecting content via a computer network. The system can receive a query to generate content selection criteria. The system can identify an entity of the query and a query graph based on the entity. The system can access a database to identify a template corresponding to the query graph. The template can include a topology and a named variable. The system can determine multiple semantic criteria corresponding to the named variable that matches the query graph. The system can use a statistical metric of each of the matching semantic criteria to select candidate content selection criteria.

Both information retrieval concepts are still in use today, even though SEO is changing to be more about entities than it was before. This patent focuses on finding the properties of entities.

So Term Frequency and Inverse Document Frequency have both been around for more than 50 years as part of information retrieval. Both are still part of modern algorithms at Google, as recently as last week. In the Wikipedia page on TF-IDF, they tell us that “Term Frequency and Inverse Document Frequency is one of the most popular term-weighting schemes today.”

Term Frequency and Inverse Document Frequency Conclusion

The ability to use TF-IDF in many algorithms about the words in an index makes it an important tool to understand when it comes to search. When you search an inverted index for specific words, some will be more common and some will be rarer. This isn’t keyword density. It does not calculate the frequency of a word compared to all the words in a document. If you understand what term frequency and inverse document frequency both are, and how they could work together on an inverted index, you have an idea of how search and SEO both work.


13 thoughts on “Term Frequency and Inverse Document Frequency at Google”

  1. Good stuff.
    In my ‘Humantics’ presentation I discussed ‘entity attributes’ additional elements of an entity to differentiate it, in the ultimate effort of disambiguation for both users and (dumber and smarter than human) search engines.
    Easy to see where TF-IDF can be and is used to help search engines identify and disambiguate the entities they think they find based on the massive corpus they have to mine.
    Thanks for finding and surfacing this example Bill. Cheers

  2. Hi Grant,

    I was happy to see that patent show up on a search for Term Frequency and then for Inverse Document Frequency, and was also excited by how recent it was and that it focused on finding out more about entities. We’ve been doing so much with keywords in the past with SEO that it was good to see Google using it in a patent on entity properties.

    There are some information retrieval concepts that haven’t been looked at enough and are being turned into SEO Tools. But that isn’t the only reason to understand concepts like these again. 🙂

    Bill

  3. Hey Grant,

    Top notch info, always keep an eye for another post. Not going to lie, took me a couple-rereads (well, more than a couple) to actually grasp the idea. It obviously makes sense, but I can’t help but think that now I’ve got to change my writing strategy to use more topic-specific terminology as opposed to making it an easy read for anyone visiting to understand. Do you think this would be counter-productive?

    Greg

  4. So in your experienced opinion what is the red-meat of a deeper understanding of TF-IDF as used in these patents? What actionable guidance can be issued once informed by this?

  5. Hi Jeremy,

    Thanks for asking. Web pages mostly contain a lot of words, which search engines index. The words that a site owner chooses, or that a search engine might expect to see on a site, tell us a lot about that site. Understanding the frequency of words, and how common or rare they may be, is something we should keep in mind when we add words to a page, or choose the keywords that a page might be found for.

    When we place pages within hierarchies or ontologies that tell us about the classification of words we have chosen, and how they are related to each other, we are getting a glimpse at the building blocks of the ideas behind a page. How well do the words on those pages connect together? How original is the content on a page or a site? Are there frequently co-occurring complete and meaningful phrases that are on other pages that rank highly for a term that we want to try to optimize one of our pages for? Will adding those phrases potentially help our pages be found by more people? Understanding term frequency of different words can give us more literacy to make good decisions.

  6. Hi Bill. Thank you for the post. I read it a couple of times and I started wondering if implementing TF-IDF would actually help. Honestly, I was watching one of the podcasts where you and Jason Barnard were talking about ranking signals covering , when you said that you’re simply focusing on topics, since TF-IDF was a thing of the past. Since then I have not been using TF-IDF, but now I’m gonna think about it again 🙂

  7. I am old enough to be in the field but I have noted every point you mentioned in the write-up. Is there any way to subscribe to your mailing list?

  8. i found your blog really helpful. I am always in need of such blogs. Good work from the writer. Keep cherishing your writing. cheers

  9. This was an excellent read! I have recently launched a masonry business and this has made some things clear for me thank thank you for the information.
