The town in Virginia where I used to live had a park built on a former railroad track that had been turned into a walking path. At one spot near that track was a historic turntable, where cargo trains might be unloaded so that their cargo could be added to later trains or to trains headed in the opposite direction. This technology is no longer used, but it is an example of how technology changes and evolves over time.
Some people who write about SEO have insisted that Google uses a technology called Latent Semantic Indexing (LSI) to index content on the Web, but they make those claims without any proof to back them up. I thought it might be helpful to explore that technology and its sources in more detail. It was invented before the Web was around, to index the contents of document collections that don’t change much. LSI might be like the railroad turntables that used to be used on railroad lines.
There is also a website that offers “LSI keywords” to searchers but doesn’t explain how it generates those keywords or how it uses LSI technology to do so, and it offers no proof that they make a difference in how a search engine such as Google might index content that contains those keywords. How is using “LSI keywords” different from the keyword stuffing that Google tells us not to do? Google tells us that we should:
Focus on creating useful, information-rich content that uses keywords appropriately and in context.
Where does LSI come from?
Susan Dumais, one of Microsoft’s researchers and search engineers, was an inventor behind a technology referred to as Latent Semantic Indexing, which she worked on developing at Bell Labs. Links on her home page provide access to many of the technologies that she has worked on while performing research at Microsoft; they are very informative and provide many insights into how search engines perform different tasks. Spending time with them is highly recommended.
Before joining Microsoft, she performed research at Bell Labs, including writing about Indexing by Latent Semantic Analysis. She was also granted a patent as a co-inventor on the process. Note that this patent was filed in September of 1988, and was granted in June of 1989. The World Wide Web didn’t go live until August 1991. The LSI patent is:
Computer information retrieval using latent semantic structure
Inventors: Scott C. Deerwester, Susan T. Dumais, George W. Furnas, Richard A. Harshman, Thomas K. Landauer, Karen E. Lochbaum, and Lynn A. Streeter
Assigned to: Bell Communications Research, Inc.
US Patent: 4,839,853
Granted: June 13, 1989
Filed: September 15, 1988
A methodology for retrieving textual data objects is disclosed. The information is treated in the statistical domain by presuming that there is an underlying, latent semantic structure in the usage of words in the data objects. Estimates to this latent structure are utilized to represent and retrieve objects. A user query is recouched in the new statistical domain and then processed in the computer system to extract the underlying meaning to respond to the query.
The problem that LSI was intended to solve:
Because human word use is characterized by extensive synonymy and polysemy, straightforward term-matching schemes have serious shortcomings–relevant materials will be missed because different people describe the same topic using different words and, because the same word can have different meanings, irrelevant material will be retrieved. The basic problem may be simply summarized by stating that people want to access information based on meaning, but the words they select do not adequately express intended meaning. Previous attempts to improve standard word searching and overcome the diversity in human word usage have involved: restricting the allowable vocabulary and training intermediaries to generate indexing and search keys; hand-crafting thesauri to provide synonyms; or constructing explicit models of the relevant domain knowledge. Not only are these methods expert-labor intensive, but they are often not very successful.
The summary section of the patent tells us that there is a potential solution to this problem. Keep in mind that this was developed before the World Wide Web grew to become the very large source of information that it is today:
These shortcomings, as well as other deficiencies and limitations of information retrieval, are obviated, in accordance with the present invention, by automatically constructing a semantic space for retrieval. This is effected by treating the unreliability of observed word-to-text object association data as a statistical problem. The basic postulate is that there is an underlying latent semantic structure in word usage data that is partially hidden or obscured by the variability of word choice. A statistical approach is utilized to estimate this latent structure and uncover the latent meaning. Words, the text objects and, later, user queries are processed to extract this underlying meaning and the new, latent semantic structure domain is then used to represent and retrieve information.
To illustrate how LSI works, the patent provides a simple example using a set of 9 documents (much smaller than the Web as it exists today). The example includes documents about human/computer interaction topics. The patent doesn’t discuss how a process such as this could handle something the size of the Web, because nothing that size existed yet. The Web contains a lot of information and changes frequently, so an approach that was created to index a known document collection might not be ideal. The patent tells us that an analysis of terms needs to take place “each time there is a significant update in the storage files.”
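A small sketch in Python may help make the idea of a latent semantic space more concrete. This is not the patent's implementation or its nine-document example: the six short documents below are invented for illustration, and the choices to use NumPy's singular value decomposition, raw word counts, and two retained dimensions (k = 2) are my assumptions, loosely following the truncated SVD approach described in the Latent Semantic Analysis literature.

```python
import numpy as np

# Invented toy corpus (an assumption, NOT the patent's nine documents).
docs = [
    "human computer interface",
    "user interface system",
    "computer user system response",
    "graph trees",
    "graph minors trees",
    "graph minors survey",
]

# Build a term-by-document matrix of raw word counts.
vocab = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row[word], j] += 1

# Truncated SVD: keep only the k largest singular values (k = 2 is arbitrary).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T  # one k-dimensional row per document


def fold_in(text):
    """Project a bag of words into the k-dimensional latent space."""
    q = np.zeros(len(vocab))
    for word in text.split():
        if word in row:  # words outside the vocabulary are ignored
            q[row[word]] += 1
    return (q @ U_k) / s_k


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Score every document against a query in the latent space.
query_vec = fold_in("human computer")
scores = [cosine(query_vec, d) for d in doc_vecs]
for doc, score in zip(docs, scores):
    print(f"{score:+.2f}  {doc}")
```

In this toy setup, “user interface system” should score close to the query “human computer” even though the two share no words, while the graph documents should score near zero. The decomposition is computed over the whole collection, so a significant change to the documents calls for recomputing it, which is in line with the patent's remark about updates.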
There has been a lot of research and a lot of development of technology that can be applied to a set of documents the size of the Web. We learned from Google that they are using a Word Vector approach developed by the Google Brain team, which was described in a patent that was granted in 2017. I wrote about that patent and linked to resources that it used in the post: Citations behind the Google Brain Word Vector Approach. If you want to get a sense of the technologies that Google may be using to index content and understand words in that content, they have advanced a lot since the days just before the Web started. There are links to papers cited by the inventors of that patent within it. Some of those may be related in some ways to Latent Semantic Indexing, since it could be called their ancestor.
The LSI technology that was invented in 1988 contains some interesting approaches, and if you want to learn a lot more about it, this paper is really insightful: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge. There are mentions of Latent Semantic Indexing in patents from Google, where it is used as an example indexing method:
Text classification techniques can be used to classify text into one or more subject matter categories. Text classification/categorization is a research area in information science that is concerned with assigning text to one or more categories based on its contents. Typical text classification techniques are based on naive Bayes classifiers, tf-idf, latent semantic indexing, support vector machines and artificial neural networks, for example.