There was a park in the town in Virginia where I used to live. It was a railroad track that had become a walking path. At one place on that path was a historic turntable, where cargo trains might be unloaded so that their cars could join later trains or trains headed in the opposite direction. This is a technology that is no longer used, but it is an example of how technology changes and evolves.
Latent Semantic Indexing is Old Technology
Some people claim that Google uses Latent Semantic Indexing. By saying that, they believe they are saying that Google uses synonyms and semantically related words. They are not correct. LSI is just one type of language model based on semantics. It even has the word “semantic” in its name, but that does not mean that LSI stands for all semantics. LSI is a particular semantic indexing method that Bell Labs patented. Details of the patent follow below.
Google is likely looking for synonyms and semantically related words on pages. That doesn’t mean that using a toolmaker’s tools that have the initialism LSI in their names will help pages rank higher in search results. Latent Semantic Indexing is an old patented technology, but that doesn’t mean that Google works with synonyms and semantically related words the way that LSI does. Google does like synonyms and semantics, but it doesn’t call that Latent Semantic Indexing. An SEO’s use of those terms can be misleading and confusing to clients who look up Latent Semantic Indexing and see something very different. There is no Wikipedia entry on LSI Keywords, no information about how LSI Keywords might use LSI, and no patent that explains how LSI Keywords work, because they have never been patented.
I thought it might be helpful to explore Latent Semantic Indexing and its sources in more detail. It is a technology invented before the Web was around, and it was built to index the contents of document collections that don’t change much. Latent Semantic Indexing (LSI) might be like the railroad turntables that used to be used on railroad lines.
There Is a Website for LSI Keywords, but No Patent for LSI Keywords
A website offers “LSI keywords” to site owners but doesn’t explain how it generates those keywords or how it uses Latent Semantic Indexing (LSI) technology to do so. It also doesn’t provide any proof that they make a difference in how a search engine such as Google might index content that contains those keywords. I came across a page from the makers of LSI Keywords that sounds more like it describes the technology behind Phrase-Based Indexing than Latent Semantic Indexing. It links to the Wikipedia article on LSI, but there is no mention of “LSI Keywords” on that Wikipedia page.
Where Does Latent Semantic Indexing (LSI) Come From?
One of Microsoft’s researchers and search engineers, Susan Dumais, was an inventor behind the technology referred to as Latent Semantic Indexing. She worked on developing LSI at Bell Labs. Links on her home page provide access to many of the technologies that she worked on while performing research at Microsoft. Her papers are very informative and provide many insights into how search engines perform different tasks. Spending time with them is highly recommended.
Before joining Microsoft, she performed earlier research at Bell Labs, including writing about Indexing by Latent Semantic Analysis. She was also granted a patent as a co-inventor on the Latent Semantic Indexing process. Note that this patent was filed in September of 1988 and granted in June of 1989. The World Wide Web didn’t go live until August 1991. The Latent Semantic Indexing (LSI) patent is:
Computer information retrieval using latent semantic structure
Inventors: Scott C. Deerwester, Susan T. Dumais, George W. Furnas, Richard A. Harshman, Thomas K. Landauer, Karen E. Lochbaum, and Lynn A. Streeter
Assigned to: Bell Communications Research, Inc.
US Patent: 4,839,853
Granted: June 13, 1989
Filed: September 15, 1988
A methodology for retrieving textual data objects is disclosed. The information is treated in the statistical domain by presuming an underlying, latent semantic structure in the usage of words in the data objects. Estimates of this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.
Problems That Latent Semantic Indexing (LSI) Was to Solve
Because human word use includes extensive synonymy and polysemy, straightforward term-matching schemes have serious shortcomings–relevant material gets missed because different people describe the same topic using different words and, because the same word can have different meanings, the irrelevant material will get retrieved. The basic problem may be stated that people want to access information based on meaning, but the words they select do not adequately express the intended meaning. Previous attempts to improve standard word searching and overcome the diversity in human word usage have involved: restricting the allowable vocabulary and training intermediaries to generate indexing and search keys; hand-crafting thesauri to provide synonyms, or constructing explicit models of the relevant domain knowledge. Not only are these methods expert-labor intensive, but they are often not very successful.
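To make those two shortcomings concrete, here is a minimal sketch of plain term matching in Python. The three documents and the `term_match` helper are hypothetical examples of mine, not anything taken from the patent.

```python
# Hypothetical mini-collection (not from the patent).
docs = {
    "d1": "used car for sale",
    "d2": "automobile dealership with low prices",
    "d3": "jaguar habitat in the rainforest",
}

def term_match(query, documents):
    """Return the ids of documents that share at least one word with the query."""
    query_words = set(query.lower().split())
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if query_words & set(text.lower().split())
    )

# Synonymy: d2 is about cars too, but it says "automobile", so it is missed.
synonymy_hits = term_match("car", docs)

# Polysemy: someone looking for the car brand gets the page about the animal.
polysemy_hits = term_match("jaguar", docs)

print(synonymy_hits)  # ['d1']
print(polysemy_hits)  # ['d3']
```

The first query misses relevant material because of a different word choice, and the second retrieves irrelevant material because one word has two meanings.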
The summary section of the patent tells us that there is a potential solution to this problem. Keep in mind that Latent Semantic Indexing was developed before the World Wide Web grew to become the huge source of information that it is today:
These shortcomings, as well as other deficiencies and limitations of information retrieval, are obviated, following the present invention, by automatically constructing a semantic space for retrieval. This treats the unreliability of observed word-to-text object association data as a statistical problem. The basic postulate is that there is an underlying latent semantic structure in word usage data that is partially hidden or obscured by the variability of word choice. A statistical approach estimates this latent structure and uncovers the latent meaning. Words, text objects, and, later, user queries are processed to extract this underlying meaning, and the new, latent semantic structure domain is then used to represent and retrieve information.
How Latent Semantic Indexing (LSI) Works
To illustrate how Latent Semantic Indexing (LSI) works, the patent provides a simple example, using a set of 9 documents (much smaller than the web as it exists today). The example includes documents that are about human/computer interaction topics. It doesn’t discuss how a process such as this could handle something the size of the Web, because nothing of that size existed yet. The Web contains a lot of information and changes frequently, so an approach created to index a known document collection might not be ideal. The patent tells us that an analysis of terms needs to occur “each time there is a significant update in the storage files.”
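For readers who want to see the idea in miniature, here is a rough sketch in Python. LSI is usually implemented with a truncated singular value decomposition of a term-document matrix. To keep the sketch free of dependencies, I approximate the top singular directions with a simple power iteration (in practice a library routine such as `numpy.linalg.svd` would do this job). The three-document corpus, the helper names, and the choice of two latent dimensions are all assumptions of mine, not the patent’s nine-document example, and I skip the singular-value scaling that fuller descriptions of LSI apply to queries.

```python
import math

# Hypothetical toy corpus (not the nine documents from the patent).
docs = ["car engine", "automobile engine", "flower petal"]
terms = sorted({word for doc in docs for word in doc.split()})

def to_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    words = text.split()
    return [float(words.count(term)) for term in terms]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def latent_directions(vectors, k, iterations=200):
    """Approximate the top-k left singular vectors of the term-document
    matrix A by power iteration on A * A^T with deflation (an assumption
    made for this sketch; a library SVD is the usual choice)."""
    directions = []
    for _rank in range(k):
        v = normalize([float(i + 1) for i in range(len(terms))])
        for _step in range(iterations):
            # w = (A * A^T) v, computed as the sum over documents of (d . v) d
            w = [0.0] * len(terms)
            for d in vectors:
                weight = dot(d, v)
                w = [wi + weight * di for wi, di in zip(w, d)]
            # Remove what the directions found so far already explain.
            for u in directions:
                overlap = dot(w, u)
                w = [wi - overlap * ui for wi, ui in zip(w, u)]
            v = normalize(w)
        directions.append(v)
    return directions

def project(vector, directions):
    """Map a term-count vector into the low-dimensional latent space."""
    return [dot(vector, u) for u in directions]

doc_vectors = [to_vector(doc) for doc in docs]
directions = latent_directions(doc_vectors, k=2)

# Plain term overlap: "car" shares no word with "automobile engine".
raw_overlap = cosine(to_vector("car"), doc_vectors[1])

# Latent space: the query is projected the same way the documents are.
query = project(to_vector("car"), directions)
similarities = [cosine(query, project(d, directions)) for d in doc_vectors]

print(raw_overlap)                         # 0.0
print([round(s, 3) for s in similarities])  # both vehicle documents near 1, the flower document near 0
```

Because “car” and “automobile” both co-occur with “engine”, they collapse onto the same latent direction, so the query matches the “automobile engine” document even though the two share no words. Notice that adding or changing documents means recomputing the directions, which fits the patent’s remark about redoing the analysis after a significant update.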
Google is Using More Modern Language Models
There has been a lot of research and a lot of technology development aimed at working with a set of documents the size of the Web. We learned from Google that they are using a Word Vector approach developed by the Google Brain team, described in a patent granted in 2017. I wrote about that patent and linked to resources used in the post: Citations behind the Google Brain Word Vector Approach. If you want to get a sense of the technologies that Google may use to index content and understand words in that content, they have advanced a lot since the days just before the Web started. That post includes links to papers cited by the inventors of that patent. Some of those relate in some ways to Latent Semantic Indexing, since it could be called their ancestor. The LSI technology from 1988 contains some interesting approaches. If you want to learn a lot more about it, this paper is insightful: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. There are mentions of Latent Semantic Indexing in patents from Google, where it is used as an example indexing method:
Text classification techniques can be used to classify text into one or more subject matter categories. Text classification/categorization is a research area in information science concerned with assigning text to one or more categories based on its contents. Typical text classification techniques are based on naive Bayes classifiers, tf-idf, latent semantic indexing, support vector machines, and artificial neural networks.
I was inspired to blog on a similar topic in the post: What are LSI Keywords and What I Use Instead of Them?