Linguistic Indexing

Linguistic indexing enriches the index with language knowledge such as lemmas, phrases, and entities. It makes keyword search smarter without relying solely on vectors.

What is Linguistic Indexing?

Linguistic indexing augments basic token storage with linguistic signals: lemmas/stems, phrases (bigrams/trigrams), part-of-speech cues, entities/attributes, and locale rules. The goal is higher recall and better precision through language-aware fields.

How It Works (quick)

Pre-processing: Tokenize, normalize case/diacritics, split compounds.
Morphology: Store lemma fields alongside surface forms; optional stems.
Phrases: Build bigram/phrase fields for common multi-word units.
Entities & attributes: Extract brand, model, material, size into structured fields.
Per-field analyzers: Titles vs descriptions vs attributes use different analyzers.
Hybrid ready: Works with lexical scoring and feeds features to re-rankers and learning-to-rank (LTR) models.
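
The first three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production analyzer: the LEMMAS table, the field names (surface, lemma, bigram), and the ASCII-only tokenizer are assumptions made for the example, and a real system would use a locale-specific morphological analyzer instead.

```python
import re
import unicodedata

# Hypothetical hand-written lemma table (assumption for this sketch);
# a real pipeline would use a dictionary or morphological analyzer per locale.
LEMMAS = {"jackets": "jacket", "shoes": "shoe"}


def normalize(text: str) -> str:
    """Lowercase and strip diacritics via NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Keep runs of ASCII letters and digits; everything else is a separator."""
    return re.findall(r"[a-z0-9]+", normalize(text))


def analyze(title: str) -> dict[str, list[str]]:
    """Produce the surface, lemma, and bigram fields for one title."""
    tokens = tokenize(title)
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return {"surface": tokens, "lemma": lemmas, "bigram": bigrams}


fields = analyze("Nike Air Max 270 Jackets")
print(fields["lemma"])   # ['nike', 'air', 'max', '270', 'jacket']
print(fields["bigram"])  # ['nike air', 'air max', 'max 270', '270 jackets']
```

Keeping the surface tokens next to the lemmas means an exact-form match can still be scored higher than a lemma-only match.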

Why It Matters in E-commerce

Intent capture: Matches natural phrases (“air max 270”) and inflections.
Quality facets: Reliable attributes enable filters and clean URLs.
Explainability: Highlights come from the linguistic fields that matched, so shoppers can see why a result appeared, which supports trust and click-through rate (CTR).

Best Practices

Design a field schema: exact (SKU/brand), lemma, phrase, attributes.
Use locale-specific analyzers and vocabularies.
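Keep synonyms late-bound (applied at query time) to avoid index bloat; index only safe normalizations (e.g., hyphen variants).
Log field contributions for debugging; reindex after analyzer updates, because already-indexed documents keep the tokens produced by the old analyzer.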
Pair with vector recall but keep lexical as the fast baseline.
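
To make the field schema and the logging advice above concrete, here is a small sketch of per-field scoring that reports how much each field contributed. The field names, the FIELD_WEIGHTS values, and the match-count scoring are all assumptions for illustration; a real engine would use its own scoring function (for example BM25) and tuned or learned weights.

```python
# Hypothetical field weights (assumption): exact matches count most,
# phrase matches next, lemma matches least.
FIELD_WEIGHTS = {"exact": 5.0, "phrase": 3.0, "lemma": 1.0}


def score_with_contributions(query_terms, doc_fields):
    """Score one document and return (total, per-field contributions).

    query_terms: dict mapping field name -> list of query terms for that field
    doc_fields:  dict mapping field name -> set of terms indexed in that field
    """
    contributions = {}
    for field, weight in FIELD_WEIGHTS.items():
        indexed = doc_fields.get(field, set())
        matched = [term for term in query_terms.get(field, []) if term in indexed]
        contributions[field] = weight * len(matched)
    return sum(contributions.values()), contributions


doc = {
    "exact": {"nike"},
    "phrase": {"air max"},
    "lemma": {"air", "max", "jacket"},
}
query = {"exact": ["nike"], "phrase": ["air max"], "lemma": ["air", "max"]}

total, per_field = score_with_contributions(query, doc)
print(total)      # 10.0
print(per_field)  # {'exact': 5.0, 'phrase': 3.0, 'lemma': 2.0}
```

Writing the per-field dictionary to a log alongside the query makes it easier to see which field is responsible when a result ranks unexpectedly.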

Challenges

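Analyzers and vocabularies must be maintained for every language; out-of-vocabulary (OOV) brand and model names are easy to mis-analyze; and over-indexing increases index size and merge cost.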

Examples

A title bigram field boosts matches for “air max”; a lemma field connects “jackets” and “jacket”.
Entity extraction writes GORE-TEX, EU 45, and trail into attribute fields that power facets.
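
The entity extraction example can be sketched with simple dictionary and pattern matching. The vocabularies, the attribute names (material, activity, size), and the size pattern below are assumptions chosen for this illustration; production systems typically rely on curated catalogs or trained extractors.

```python
import re

# Hypothetical vocabularies (assumptions for this sketch).
MATERIALS = ["GORE-TEX", "leather", "mesh"]
ACTIVITIES = ["trail", "road", "gym"]

# Matches sizes such as "EU 45", "US 9.5", or "UK 8".
SIZE_PATTERN = re.compile(r"\b(EU|US|UK)\s?(\d{1,2}(?:\.5)?)\b", re.IGNORECASE)


def extract_attributes(text: str) -> dict[str, str]:
    """Pull material, activity, and size out of free text into structured fields."""
    attrs = {}
    lowered = text.lower()

    for material in MATERIALS:
        if material.lower() in lowered:
            attrs["material"] = material
            break

    for activity in ACTIVITIES:
        # Word boundaries avoid matching "road" inside "broad".
        if re.search(rf"\b{re.escape(activity)}\b", lowered):
            attrs["activity"] = activity
            break

    size = SIZE_PATTERN.search(text)
    if size:
        attrs["size"] = f"{size.group(1).upper()} {size.group(2)}"

    return attrs


print(extract_attributes("Trail running shoe, GORE-TEX membrane, EU 45"))
# {'material': 'GORE-TEX', 'activity': 'trail', 'size': 'EU 45'}
```

Because the values land in dedicated fields rather than in the free-text body, they can be used directly as filters.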

Summary

Linguistic indexing layers lemmas, phrases, and entities onto your index so that lexical retrieval becomes more robust, explainable, and multilingual, which makes it a good fit for product search.

FAQ

Is this the same as semantic search? No. This is language-aware lexical search. It pairs well with vectors but stands on its own.

Do I need part-of-speech (POS) tags at query time? Not always; store the signals at index time when they help.

Won’t this bloat storage? It can. Scope fields carefully, compress the index, and keep synonyms at query time.
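
As a rough sketch of what keeping synonyms at query time can look like, the snippet below expands each query term into a group of alternatives while the index itself stays unchanged. The SYNONYMS map and the group-per-term output shape are assumptions for the example, not a specific engine's API.

```python
# Hypothetical synonym map (assumption); it lives outside the index,
# so it can be edited without reindexing any documents.
SYNONYMS = {
    "sneaker": ["trainer", "running shoe"],
    "tee": ["t-shirt"],
}


def expand_query(terms: list[str]) -> list[list[str]]:
    """Return one group of alternatives per query term.

    A search engine would typically treat each group as an OR clause
    and combine the groups with AND.
    """
    return [[term] + SYNONYMS.get(term, []) for term in terms]


print(expand_query(["red", "sneaker"]))
# [['red'], ['sneaker', 'trainer', 'running shoe']]
```

The trade-off is slightly more work per query in exchange for a smaller index and synonym changes that take effect immediately.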