Saturday, January 8, 2011

SharePoint Search 101: How does an indexer work? What is Lemmatization? What is Tokenization?

What happens when you index content?

Before content is indexed, it goes through a process that normalizes and tokenizes it. First, the language of the content is identified. Normalization then finds accented characters and maps them to their non-accented equivalents to enable phonetic searches. Tokenization treats characters such as spaces, question marks and commas as word delimiters and splits the content into the actual words that should be indexed and processed further. Lemmatization then adds morphological and spelling variations of those words to the content, as well as misspelled or phonetic variants, so that a search query can match any of those forms. Phonetic search for names is available out of the box (OOTB) with SharePoint Search 2010; more advanced linguistic features are available with FAST Search for SharePoint 2010.
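To make the normalization and tokenization steps a bit more concrete, here is a minimal Python sketch. It is only an illustration and not how SharePoint or FAST implement it: the function names `normalize` and `tokenize` are made up for this example, accents are stripped with Unicode decomposition, and any run of non-word characters is assumed to be a delimiter.

```python
import re
import unicodedata


def normalize(text):
    """Map accented characters to their non-accented equivalents."""
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split text into words, treating spaces and punctuation as delimiters."""
    # Assumption: any run of non-word characters separates two tokens.
    return [tok for tok in re.split(r"\W+", text) if tok]


tokens = tokenize(normalize("Café owners, are you résumé-ready?"))
print(tokens)  # ['Cafe', 'owners', 'are', 'you', 'resume', 'ready']
```

A real indexer would pick language-specific rules here, which is why language detection has to happen first.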

There are several types of lemmatization:

1. Stemming - where the stem of a word is identified and applied to it, e.g. “doing” has “do” as its stem

2. Lemma - where the base form of a word is identified even when stemming would miss it, e.g. stemming cannot reduce “worst” to its base form, but its lemma “bad” would be identified

3. Verbs - “greeting” may be a noun or a form of the verb “to greet”; verb or noun dictionaries can provide the right level of lemmatization.
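The three cases above can be sketched in Python with a few tiny hand-made dictionaries. Everything here is hypothetical and for illustration only: the suffix list, the dictionaries and the `stem` and `lemmatize` functions are assumptions of this sketch, and a real linguistic engine uses much larger, language-specific resources.

```python
# Toy, hand-made data (assumptions for this example only).
SUFFIXES = ("ing", "ed", "s")
IRREGULAR_LEMMAS = {"worst": "bad", "worse": "bad"}
VERB_LEMMAS = {"greeting": "greet"}
NOUN_LEMMAS = {"greeting": "greeting"}


def stem(word):
    """Case 1: strip a known suffix to get the stem."""
    for suffix in SUFFIXES:
        # Keep at least two characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def lemmatize(word, pos=None):
    """Cases 2 and 3: try dictionaries first, then fall back to stemming."""
    if pos == "verb" and word in VERB_LEMMAS:
        return VERB_LEMMAS[word]
    if pos == "noun" and word in NOUN_LEMMAS:
        return NOUN_LEMMAS[word]
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    return stem(word)


print(stem("doing"))                      # do
print(stem("worst"))                      # worst (stemming misses it)
print(lemmatize("worst"))                 # bad
print(lemmatize("greeting", pos="verb"))  # greet
print(lemmatize("greeting", pos="noun"))  # greeting
```

The dictionary lookups run before the suffix rule on purpose, so irregular forms such as “worst” are never handed to the stemmer.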

Lemmatization is language specific. It helps return more relevant search results to the end user and improves the recall of search results.

Enjoy :)