What is Linguistic Indexing?
Linguistic indexing augments basic token storage with linguistic signals: lemmas or stems, phrases (bigrams/trigrams), part-of-speech cues, entities and attributes, and locale rules. The goal is higher recall (inflected and variant forms still match) and better precision (phrase and entity matches outrank loose token overlap), both achieved through language-aware fields.
How It Works (quick)
- Pre-processing: Tokenize, normalize case/diacritics, split compounds.
- Morphology: Store lemma fields alongside surface forms; optional stems.
- Phrases: Build bigram/phrase fields for common multi-word units.
- Entities & attributes: Extract brand, model, material, size into structured fields.
- Per-field analyzers: Titles vs descriptions vs attributes use different analyzers.
- Hybrid ready: Works with lexical scoring and feeds features to re-rankers and learning-to-rank (LTR) models.
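The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production analyzer: the lemma table, the field names (title_surface, title_lemma, title_bigram), and the ASCII-only tokenizer are all assumptions made for the example. A real system would rely on a proper morphological analyzer and locale-aware tokenization.

```python
import re
import unicodedata

# Toy lemma table (hypothetical). A real pipeline would use a
# morphological analyzer or lemmatizer for each locale.
LEMMAS = {"jackets": "jacket", "shoes": "shoe", "running": "run"}


def normalize(text: str) -> str:
    """Lowercase and strip diacritics, e.g. 'Café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Split normalized text into ASCII alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", normalize(text))


def bigrams(tokens: list[str]) -> list[str]:
    """Join each pair of adjacent tokens into a phrase term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_fields(title: str) -> dict[str, list[str]]:
    """Produce surface, lemma, and bigram fields for one product title."""
    surface = tokenize(title)
    return {
        "title_surface": surface,
        "title_lemma": [LEMMAS.get(t, t) for t in surface],
        "title_bigram": bigrams(surface),
    }


fields = build_fields("Nike Air Max 270 Running Shoes")
```

Here the bigram field contains "air max" and "max 270" as single terms, while the lemma field maps "shoes" to "shoe", so a query for either form can match the same document.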
Why It Matters in E-commerce
- Intent capture: Matches natural phrases (“air max 270”) and inflections.
- Quality facets: Reliable attributes enable filters and clean URLs.
- Explainability: Highlights come from the linguistic fields that matched, so shoppers can see why a result was returned, which aids trust and CTR.
Best Practices
- Design a field schema: exact (SKU/brand), lemma, phrase, attributes.
- Use locale-specific analyzers and vocabularies.
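- Keep synonyms late-bound (expanded at query time) to avoid index bloat; at index time, store only safe normalizations (e.g., hyphen variants).
- Log field contributions for debugging; reindex after analyzer updates, since existing documents keep the tokens produced by the old analyzer.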
- Pair with vector recall but keep lexical as the fast baseline.
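As a rough sketch of the late-bound synonym practice, the snippet below normalizes hyphen variants in a way that could be shared by indexing and querying, and expands synonyms only at query time. The synonym table and the function names are hypothetical, and the expansion strategy (one variant per substitution) is just one simple option.

```python
import re

# Hypothetical synonym table, consulted at query time only,
# so changing it does not require rebuilding the index.
SYNONYMS = {"sneakers": ["trainers"], "tee": ["t shirt"]}


def normalize_hyphens(text: str) -> str:
    """Safe normalization: 'GORE-TEX' and 'gore tex' both become 'gore tex'."""
    spaced = re.sub(r"[-\u2010\u2011]", " ", text.lower())
    return re.sub(r"\s+", " ", spaced).strip()


def expand_query(query: str) -> list[str]:
    """Return the normalized query plus one variant per synonym substitution."""
    base = normalize_hyphens(query)
    variants = [base]
    for term, alternatives in SYNONYMS.items():
        pattern = rf"\b{re.escape(term)}\b"
        if re.search(pattern, base):
            for alt in alternatives:
                variants.append(re.sub(pattern, alt, base))
    return variants


variants = expand_query("White Sneakers")
```

Because the synonym table is read when the query arrives, editing it takes effect immediately, whereas a change to normalize_hyphens would alter indexed tokens and call for a reindex.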
Challenges
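- Maintenance: Analyzers, vocabularies, and rules must be kept current for every supported language.
- Out-of-vocabulary (OOV) brands/models: New names are missing from dictionaries and can be tokenized or lemmatized incorrectly.
- Over-indexing: Each additional field increases index size and merge cost.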
Examples
- A title bigram field lifts exact matches for “air max”; a lemma field connects “jackets” and “jacket”.
- Entity extraction writes GORE-TEX, EU 45, and trail into attribute fields that power facets.
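The entity-extraction example can be illustrated with a small rule-based sketch. The vocabularies, the attribute names (material, usage, size), and the size pattern are assumptions chosen for this example; production systems typically combine curated dictionaries with trained extractors.

```python
import re

# Hypothetical vocabularies; real catalogs would load these from curated lists.
MATERIALS = ["GORE-TEX", "leather", "mesh"]
USAGES = ["trail", "road", "hiking"]

# Matches sizes such as 'EU 45', 'UK 9.5', or 'us10' (illustrative pattern).
SIZE_RE = re.compile(r"\b(EU|UK|US)\s?(\d{1,2}(?:\.5)?)\b", re.IGNORECASE)


def extract_attributes(text: str) -> dict[str, str]:
    """Pull material, usage, and size out of free text into structured fields."""
    attrs: dict[str, str] = {}
    lowered = text.lower()
    for material in MATERIALS:
        if re.search(rf"\b{re.escape(material.lower())}\b", lowered):
            attrs["material"] = material  # keep the canonical spelling
            break
    for usage in USAGES:
        if re.search(rf"\b{re.escape(usage)}\b", lowered):
            attrs["usage"] = usage
            break
    size = SIZE_RE.search(text)
    if size:
        attrs["size"] = f"{size.group(1).upper()} {size.group(2)}"
    return attrs


attrs = extract_attributes("Men's GORE-TEX trail running shoe, EU 45")
```

Storing the canonical spelling (for instance "GORE-TEX" rather than whatever casing appeared in the text) keeps facet values consistent across products.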
Summary
Linguistic indexing layers lemmas, phrases, and entities onto your index so that lexical retrieval becomes robust, explainable, and multilingual, which makes it well suited to product search.