The right way to introduce FAST is to give an overview of how indexing and search are really done and what processes take place.
I’m sure it’s not the best graphic you have ever seen, but I’m not a designer for a reason.
This picture provides a VERY simplified overview of the processes, but it is a start.
FAST terminology: a document is not a Microsoft Word document per se (even though it can be); a document is a data entity that is subject to indexing. A document can be a file, a db record, a web page, etc. XML docs are a bit of a different story, but I’ll talk about that later.
- Content: Content can come in many formats, structured and unstructured. Content sources are defined at the “collection” level. Creating a collection is the very first step you take when defining a logical grouping of searchable content.
- A connector is assigned to a “Collection” and is used to feed documents in batches to the document processing pipelines.
- Document processing pipelines refine your documents and apply entity extractors, matching processes and business logic. Only one pipeline can be associated with a “Collection”. A document goes through many stages within the pipeline, but they can all be split into three groups: pre-processing, document manipulation and post-processing.
- If the document is not discarded within the pipeline, it gets indexed with all the attributes that were assigned to it in the previous step. The original content is also stored in FXML (FastXML) format.
- The query processing pipeline comes into play when the end user submits a query. It performs additional analysis on the content retrieved from the index and further refines the result set.
- Finally, the end-user interface renders the results and provides further ways of drilling into them.
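
To make the flow above a bit more concrete, here is a toy sketch in Python. To be clear, this is not the FAST API: every name in it (`Document`, `Collection`, `batches`, the three stage functions) is made up for illustration, and the "index" is just a dictionary. It only mirrors the shape of the steps: a connector feeds documents in batches, one pipeline per collection runs its stages, a stage may discard a document, and whatever survives gets indexed and can be queried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Optional


@dataclass
class Document:
    """Any data entity subject to indexing: a file, a db record, a web page."""
    doc_id: str
    content: str
    attributes: Dict[str, str] = field(default_factory=dict)


# A stage returns the (possibly modified) document, or None to discard it.
Stage = Callable[[Document], Optional[Document]]


def batches(docs: Iterable[Document], size: int) -> Iterator[List[Document]]:
    """Hypothetical connector: hands documents over in fixed-size batches."""
    batch: List[Document] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def drop_empty(doc: Document) -> Optional[Document]:
    """Pre-processing: discard documents with no usable content."""
    return doc if doc.content.strip() else None


def tag_word_count(doc: Document) -> Optional[Document]:
    """Document manipulation: attach an attribute derived from the content."""
    doc.attributes["word_count"] = str(len(doc.content.split()))
    return doc


def lowercase(doc: Document) -> Optional[Document]:
    """Post-processing: normalize the content before it is indexed."""
    doc.content = doc.content.lower()
    return doc


@dataclass
class Collection:
    """Logical grouping of searchable content with exactly one pipeline."""
    name: str
    pipeline: List[Stage]
    index: Dict[str, Document] = field(default_factory=dict)

    def feed(self, batch: List[Document]) -> None:
        for doc in batch:
            current: Optional[Document] = doc
            for stage in self.pipeline:
                current = stage(current)
                if current is None:
                    break  # discarded inside the pipeline, never indexed
            if current is not None:
                self.index[current.doc_id] = current

    def search(self, query: str) -> List[Document]:
        """Stand-in for query processing: normalize, match, order results."""
        term = query.strip().lower()
        hits = [d for d in self.index.values() if term in d.content]
        return sorted(hits, key=lambda d: d.doc_id)


if __name__ == "__main__":
    collection = Collection("demo", [drop_empty, tag_word_count, lowercase])
    source = [
        Document("1", "FAST indexes documents"),
        Document("2", "   "),
        Document("3", "A web page about Search"),
    ]
    for batch in batches(source, size=2):
        collection.feed(batch)
    for hit in collection.search("search"):
        print(hit.doc_id, hit.attributes)
```

The stages are plain functions in a list, so the order of the list is the order of the pipeline, and returning `None` is how a stage says "drop this document". A real pipeline obviously does far more (and stores the original content as well), but the control flow is the same idea.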
In the next post I’ll cover content sources and connectors in more detail.