Latent Semantic Indexing Explained
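Latent Semantic Indexing (LSI) is a technique for extracting information from text documents. It was developed by researchers at Bellcore (Bell Communications Research) in the late 1980s. The idea behind LSI is to build a mathematical model (a singular value decomposition of a term-document matrix) of which words occur in which documents, so that relevant information can be identified even when a document does not use the exact keywords being searched for.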
LSI extracts meaning from unstructured text. This means that it can be applied to any collection of written documents, such as emails, web pages or even social media posts.
The video below demonstrates how LSI works. It takes alphabetized words (on the Y axis) found across a series of documents (on the X axis), shows how often each word occurs in each document (the shade of the square: darker means a higher count), and then reorganizes the grid. From the reorganized grid we can see which words appear most often together, and under which subjects.
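The same reorganization can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the video: the toy corpus, the choice of two latent dimensions (`k = 2`) and the helper names `cosine` and `term_vec` are all assumptions made for this example. It builds the word-by-document count grid, applies a truncated singular value decomposition with NumPy, and regroups the words by the latent dimension they load on most.

```python
import numpy as np

# Toy corpus, made up for illustration: three art documents, two cooking ones.
docs = [
    "mona lisa louvre painting",
    "mona lisa louvre museum",
    "mona lisa painting leonardo",
    "pasta sauce tomato recipe",
    "pasta sauce tomato basil",
]

# Term-document matrix: rows are alphabetized terms, columns are documents,
# and each cell counts how often the term occurs in that document.
terms = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(term) for doc in docs] for term in terms],
    dtype=float,
)

# Truncated SVD: keep only the k strongest latent dimensions ("topics").
k = 2  # assumption: two topics, chosen to match the toy corpus
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per term

# Reorganize the rows: group terms by the latent dimension they load on most.
dominant = np.argmax(np.abs(term_vectors), axis=1)
order = np.argsort(dominant, kind="stable")
reordered_terms = [terms[i] for i in order]
reordered_counts = counts[order]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def term_vec(term):
    """Look up the latent vector for a term in the toy vocabulary."""
    return term_vectors[terms.index(term)]


print(reordered_terms)
print(cosine(term_vec("louvre"), term_vec("leonardo")))
print(cosine(term_vec("louvre"), term_vec("pasta")))
```

In this toy corpus, "louvre" and "leonardo" never appear in the same document, yet they should come out close together in the latent space because both co-occur with "mona" and "lisa". "louvre" and "pasta" should come out unrelated. Real collections are far noisier, and implementations usually weight the counts (for example with TF-IDF) before the decomposition.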
Search engines are likely to use variations of LSI in their models to help them understand which words can be expected in articles on a specific subject. For example, talking about the “cultural significance of the Mona Lisa” without mentioning “the Louvre” could be a sign that the content does not cover the topic as broadly as it could.
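The Mona Lisa example can be turned into a rough coverage check. The sketch below is not LSI, and it is not how any search engine is known to work: it uses a plain document-frequency heuristic over a handful of made-up reference sentences. The names `reference_docs`, `STOPWORDS`, `expected_terms` and `missing_terms`, and the 0.6 threshold, are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical reference articles on one subject (made up for illustration).
reference_docs = [
    "the mona lisa hangs in the louvre in paris",
    "leonardo painted the mona lisa now kept at the louvre",
    "visitors to the louvre queue to see the mona lisa",
]

# Small hand-picked stopword list, just enough for this toy example.
STOPWORDS = {"the", "in", "at", "to", "now", "see"}


def expected_terms(docs, min_share=0.6):
    """Return terms that appear in at least `min_share` of the reference docs."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, ignoring stopwords.
        doc_freq.update(set(doc.split()) - STOPWORDS)
    return {term for term, n in doc_freq.items() if n / len(docs) >= min_share}


def missing_terms(draft, docs):
    """List expected terms that the draft never mentions."""
    return sorted(expected_terms(docs) - set(draft.lower().split()))


draft = "the cultural significance of the mona lisa"
print(missing_terms(draft, reference_docs))  # prints ['louvre']
```

A check like this only says that a commonly associated word is absent. Whether the draft actually needs it is still an editorial judgment.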