The paper is a PDF, so it’s hard to pull quotes from. Even though it’s written in the usual academic style, this paper by Susan Dumais makes it easy to extract some basic, usable concepts and definitions for, as someone recently called it, “making it simple enough for a 3 year old to understand.”

Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval

Related to search, a helpful but almost childishly simple concept is that people don’t always use the same words to refer to the same things. Searchers can easily miss relevant documents because the words they use to describe and search for something may not be the words the authors of those documents used on their sites.

Along with some related issues, the use of language in search has been a topic of personal fascination for a long time, as this discussion at WebmasterWorld from way back in 2001 shows:

Stemming and Keyword Families

What was particularly interesting is that Brett Tabke hunted the thread down and bumped it a few months after the heated uproar over the Google Florida update, which kicked in back in November of 2003.

Getting back to the paper just found, it uses several expressions that raise a couple of questions: term frequency, local weighting and global weighting.

Term frequency: the number of times a term appears in a particular document.

Local weighting: can be construed as the weight given to a term based on its number of raw occurrences at a specific level, presumably within a single document.

Global weighting: the term’s relative importance across the whole collection.
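
To make the three definitions concrete, here is a minimal sketch in Python. The toy documents, the function names, the log-dampened local weight and the IDF-style global weight are all assumptions picked for illustration. They are common choices in the retrieval literature, not necessarily the exact formulas the paper settles on.

```python
import math
from collections import Counter

# Hypothetical toy collection, made up for illustration.
docs = [
    "car engine repair and car maintenance",
    "automobile engine service",
    "garden repair tips",
]

def term_frequency(doc):
    """Term frequency: raw count of each term in one document."""
    return Counter(doc.split())

def local_weight(tf):
    """Local weight (assumed form): dampen the raw count with log2(tf + 1)."""
    return math.log2(tf + 1)

def global_weight(term, collection):
    """Global weight (assumed IDF-style form): log2(N / n) + 1.

    N is the number of documents in the collection and n is the number
    of documents containing the term. The term must occur at least once
    in the collection, otherwise n is zero.
    """
    n_docs = len(collection)
    doc_freq = sum(1 for d in collection if term in d.split())
    return math.log2(n_docs / doc_freq) + 1

# Combined weight of every term in the first document.
tf = term_frequency(docs[0])
weights = {t: local_weight(c) * global_weight(t, docs) for t, c in tf.items()}

for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} tf={tf[term]} weight={weight:.3f}")
```

In this toy collection “car” ends up weighted well above “engine”: it appears twice in the document (higher local weight) and in only one document of the three (higher global weight).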

Two things come to mind, and always have, in relation to this topic:

The expression “corpus of documents” is often used, and a corpus is essentially a collection. One question for search on a given site is whether a term’s weight would be determined using only the “collection” of documents within that site, or relative to a larger collection in a database that includes other relevant documents.
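
Why the choice of collection matters can be shown with a small sketch. The documents, the `idf` helper and its formula (log2(N / n) + 1) are hypothetical, chosen only to illustrate the question. Nothing here is taken from the paper or from how any search engine actually works.

```python
import math

def idf(term, collection):
    """Assumed IDF-style global weight: log2(N / n) + 1."""
    n_docs = len(collection)
    doc_freq = sum(1 for doc in collection if term in doc.split())
    if doc_freq == 0:
        raise ValueError(f"{term!r} does not occur in the collection")
    return math.log2(n_docs / doc_freq) + 1

# Hypothetical site where every page uses the word "semantic".
site_docs = [
    "latent semantic indexing overview",
    "semantic indexing and retrieval",
    "indexing tips for semantic search",
]

# Hypothetical larger collection: the same site plus unrelated pages.
web_docs = site_docs + [
    "chocolate cake recipe",
    "city marathon results",
    "used car price guide",
    "garden tools review",
    "holiday travel tips",
]

site_weight = idf("semantic", site_docs)  # term is in every site document
web_weight = idf("semantic", web_docs)    # term is in 3 of 8 documents

print(f"within the site only:     {site_weight:.3f}")
print(f"within the larger corpus: {web_weight:.3f}")
```

Measured against the site alone, “semantic” appears everywhere and so carries the minimum weight. Measured against the larger collection, the same term is comparatively rare and is weighted higher.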

On a global level, another issue is whether determining relevancy is expanded to include sites that don’t use the same terms to refer to the same things, but use synonyms instead.
