Friday, April 24, 2009

FAST search engine for SharePoint PART 1 (Dictionaries)

In this post I’d like to introduce you to FAST ESP dictionaries and their benefits.

FAST ESP (Enterprise Search Platform) ships with many prebuilt dictionaries. Some support basic entity extraction, such as locations, people’s names, and company names. Others help with context recognition through lemmatization, synonyms, spelling variations, stop-word elimination, etc.

Lemmatization: generally speaking, lemmatization maps a word to its base form and to its other inflectional forms. Ex: searching for “to do” will also bring back “doing” and “done”.
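
To make the idea concrete, here is a toy sketch in Python. The lemma table and the expand_term function are made up for illustration; this is not FAST's dictionary format or API, just the general mechanism of expanding a query term to all of its known forms.

```python
# Toy lemma dictionary: base form -> known inflected forms.
# Hypothetical data for illustration only, not a FAST dictionary.
LEMMAS = {
    "do": ["do", "does", "doing", "did", "done"],
    "search": ["search", "searches", "searching", "searched"],
}

# Reverse lookup so that any inflected form leads back to its base form.
FORM_TO_LEMMA = {form: lemma for lemma, forms in LEMMAS.items() for form in forms}


def expand_term(term):
    """Return every known form of a query term, or just the term itself."""
    lemma = FORM_TO_LEMMA.get(term.lower())
    if lemma is None:
        return [term.lower()]
    return LEMMAS[lemma]


print(expand_term("doing"))
```

A query for any one form can then be matched against documents containing any of the other forms.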

Dictionaries also support advanced phrase recognition, but this feature and lemmatization are mutually exclusive: you cannot use both together. Phrase recognition gives you the following: if I execute a query for “SharePoint for Squirrels”, FAST will immediately recognize that “SharePoint for Squirrels” is a PHRASE and will not bring back irrelevant results whose source documents merely mention “Count of squirrels in SharePoint” or “maintaining squirrels database in SharePoint”.
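
The difference between loose keyword matching and phrase matching can be sketched in a few lines of Python. The function names here are my own and the logic is deliberately naive; it only illustrates the behaviour described above, not how FAST implements it.

```python
import re


def tokens(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_any_term(query, document):
    """Loose matching: the document shares at least one word with the query."""
    return bool(set(tokens(query)) & set(tokens(document)))


def matches_phrase(query, document):
    """Phrase matching: the query words appear consecutively and in order."""
    q, d = tokens(query), tokens(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


query = "SharePoint for Squirrels"
for doc in [
    "SharePoint for Squirrels: a field guide",
    "Count of squirrels in SharePoint",
    "maintaining squirrels database in SharePoint",
]:
    print(doc, "|", matches_any_term(query, doc), "|", matches_phrase(query, doc))
```

All three documents pass the loose test, but only the first one contains the query as a phrase.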

Dictionaries, like almost any feature of FAST, are a big topic that could fill a small book, but I’ll try to cover at a high level some of the OOTB (out-of-the-box) dictionaries (index and/or query side):

  1. Phonetic search
    1. Used for detection of sound-alike words
  2. Spelling
    1. Within the spelling dictionary each word has a weight value, which may be used to influence ranking. The weight is based on the frequency of the word in a specific language; lower values mean the word is used more frequently.
  3. Synonyms
  4. Tokenization
    1. Detection of characters that might be irrelevant in the query, e.g. “-”, white space, etc.
  5. Character normalization
    1. Used for characters with accents that are not commonly supplied by users
  6. Locations and Proper Name recognition
    1. Used to identify locations, addresses, and names of people and companies.
  7. Anti-phrasing and stop words
    1. Used to eliminate irrelevant words or often repeated words
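
A few of these steps (tokenization, character normalization, and stop-word removal) are easy to picture with a small Python sketch. The stop-word list and the function names are assumptions of mine for illustration; FAST's own dictionaries and processing stages are far richer than this.

```python
import re
import unicodedata

# Hypothetical stop-word list; a real one is language specific and much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "for"}


def strip_accents(text):
    """Character normalization: decompose accented characters and drop the accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query):
    """Strip accents, tokenize on anything that is not a letter or digit, drop stop words."""
    text = strip_accents(query).lower()
    words = re.findall(r"[a-z0-9]+", text)  # "-" and white space act as separators
    return [word for word in words if word not in STOP_WORDS]


print(normalize_query("The café-owner of Zürich"))
```

With this, a user who types “cafe” without the accent still matches content that spells it “café”.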

FAST also provides automatic language detection in 70 languages, and dictionaries are language specific.

Custom FAST Dictionaries

Nowadays almost every company uses its own terminology, industry- or department-specific jargon, and abbreviations. When searching for information, people use keywords that make sense to them. Ex: if I search for “PTO”, I’ll get documents where “PTO” is mentioned, but not ones that mention “vacation” or “personal day”, even though all I wanted was to look up my company’s vacation days policy. It is very important for a search engine to understand what people are REALLY looking for in the context of their keywords.
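
As a rough illustration of what a custom synonym dictionary buys you, here is a minimal Python sketch of query expansion. The SYNONYMS table and expand_query are hypothetical names of mine, and the code is a stand-in for the concept rather than anything FAST ships.

```python
# Hypothetical company-specific synonym dictionary.
SYNONYMS = {
    "pto": ["vacation", "personal day"],
}


def expand_query(query):
    """Return the query words plus any synonyms found in the custom dictionary."""
    expanded = []
    for word in query.lower().split():
        expanded.append(word)
        expanded.extend(SYNONYMS.get(word, []))
    return expanded


print(expand_query("PTO policy"))
```

Documents that only talk about “vacation” or “personal day” now have a chance to match a query for “PTO”.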

Dictionaries can be used on the content processing side (document pipelines), before a document even gets indexed, to supply more information about the document to the index.

This contributes to index growth, but it makes serving query results a bit faster, since less work has to be done at query time.

They can also be used on the query processing side, which might directly affect query performance.

Keep in mind that almost all updates to index-side dictionaries will require re-indexing of the content.

Just an idea on building a custom dictionary, which I heard at the first FAST Forward Conference 2008 in Orlando.

What if we start using the Internet to build our dictionaries? The Internet is the biggest data repository in the world, and data on the Internet is often contained in tables. Table structure can be easily recognized in the FAST document pipeline.

Example of FAST indexing a web page with a table of cars:

Make    Model    Body type
BMW     325 XI   Sedan
Acura   MDX      SUV
Acura   RL       Sedan
Acura   TL       Sedan
Acura   RSX      SUV
Acura   TSX      SUV

By utilizing the GetAttributes() method in a document processor, FAST can return this document as a dictionary and populate your custom dictionary with the values from the table above. Just think about all the opportunities the WWW opens for you :-)
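
To show the kind of structure you end up with, here is a plain Python sketch that turns the rows of the cars table into a dictionary keyed by body type. It does not use the FAST document processor API at all; ROWS, build_dictionary, and CAR_DICTIONARY are names I made up, and the rows are simply typed in by hand from the table.

```python
from collections import defaultdict

# Rows of the example table: (make, model, body type).
ROWS = [
    ("BMW", "325 XI", "Sedan"),
    ("Acura", "MDX", "SUV"),
    ("Acura", "RL", "Sedan"),
    ("Acura", "TL", "Sedan"),
    ("Acura", "RSX", "SUV"),
    ("Acura", "TSX", "SUV"),
]


def build_dictionary(rows):
    """Group (make, model) pairs under their lowercased body type."""
    dictionary = defaultdict(list)
    for make, model, body_type in rows:
        dictionary[body_type.lower()].append((make, model))
    return dict(dictionary)


CAR_DICTIONARY = build_dictionary(ROWS)
print(CAR_DICTIONARY["sedan"])
```

Each body type becomes a dictionary key, and the makes and models listed under it become the terms that key stands for.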

(When I say “document”, do not confuse it with an MS Word document. In FAST vernacular, a document is any type of information entity: DB record, web page, file, etc.)

How will it help? Let’s say that someone goes to my FAST search engine and decides to find a car by entering “Acura sedan” into the search box. The documents in my index do not contain the keyword “sedan”, but they do contain make and model names. Without a dictionary based on the table above, the search would return everything on “Acura”. With this dictionary, through the matching stage in my document pipeline, FAST will identify the relevant documents in my index by matching the “RL” and “TL” models to “sedan” in the dictionary, and return both sedan models in the result set instead of a whole bunch of irrelevant content where the keyword “Acura” is mentioned.
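
Here is a small Python sketch of that scenario. The document titles are invented, the dictionary is typed in by hand from the cars table, and the matching is done at query time purely to keep the example short (in FAST it would happen in a pipeline stage), so treat it as an illustration of the idea and nothing more.

```python
# Dictionary built from the cars table: body type -> (make, model) pairs.
CAR_TYPES = {
    "sedan": [("BMW", "325 XI"), ("Acura", "RL"), ("Acura", "TL")],
    "suv": [("Acura", "MDX"), ("Acura", "RSX"), ("Acura", "TSX")],
}

# Hypothetical indexed documents; none of them contains the word "sedan".
DOCUMENTS = [
    "Acura RL maintenance schedule",
    "Acura TL brake recall",
    "Acura MDX towing guide",
    "Acura dealership opening hours",
]


def mentions(doc_words, make, model):
    """True if the document contains the make and every word of the model name."""
    return make.lower() in doc_words and all(
        part in doc_words for part in model.lower().split()
    )


def search(query, documents, dictionary):
    """Match plain keywords literally and dictionary keys through their entries."""
    words = query.lower().split()
    categories = [w for w in words if w in dictionary]
    keywords = [w for w in words if w not in dictionary]
    results = []
    for doc in documents:
        doc_words = doc.lower().split()
        if not all(k in doc_words for k in keywords):
            continue
        if all(
            any(mentions(doc_words, make, model) for make, model in dictionary[c])
            for c in categories
        ):
            results.append(doc)
    return results


print(search("Acura sedan", DOCUMENTS, CAR_TYPES))
```

A plain search for “Acura” returns all four documents, while “Acura sedan” narrows the result down to the RL and TL documents.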

FAST dictionaries are created, populated, and viewed using the Dictionary Manager, through the command-line script dictman. You can use this utility in interactive mode to populate a dictionary manually, or in non-interactive mode to populate a dictionary through a batch file.

Dictionaries are used to improve relevance and can drastically improve search results if properly deployed.

There are pros and cons to deploying different dictionaries, so some pre-planning has to take place. Consider which dictionaries to deploy based on the content you want to serve and your business requirements, and when to deploy them on the index side versus the query side. This is more of a “Best Practices” talk; I’ll try to cover it in some other posts.

Enjoy :-)
