Inverted Index

An inverted index maps each term to the documents that contain it. In e-commerce, it lets a query find matching products quickly by looking up the words that appear in titles, attributes, and descriptions.

What is an Inverted Index?

An inverted index is the core retrieval structure for text search: a term → postings list mapping. Each postings list stores doc IDs (and often positions) for documents containing the term, enabling fast lookups and phrase/proximity queries.
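
To make the term → postings mapping concrete, here is a minimal sketch in Python. The catalog `DOCS` and the helper `build_index` are hypothetical names chosen for illustration, and tokenization is reduced to lowercasing and whitespace splitting; a production engine would use a real analyzer and compressed postings.

```python
from collections import defaultdict

# Hypothetical toy catalog: doc ID -> product title.
DOCS = {
    1: "nike air max 270",
    2: "air jordan 1 mid",
    3: "max cushion running sock",
}

def build_index(docs):
    """Map each term to its postings: {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

index = build_index(DOCS)
print(sorted(index["air"]))  # doc IDs whose title contains "air"
print(index["max"])          # postings for "max", including positions
```

Looking up a term is a single dictionary access, which is why retrieval cost depends on the length of the postings lists touched rather than on the size of the whole catalog.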

How It Works (quick)

Tokenize & normalize: Lowercase, fold accents, handle hyphens; create unigrams and (optionally) bigrams.
Build postings: For each term and field, store doc IDs, term frequency, and positions.
Query time: Look up postings for query terms → intersect/union → score (BM25/TF-IDF) → re-rank.
Fields & boosts: Index title, attributes, and description as separate fields and weight them at scoring time, for example with BM25F.
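
The steps above can be sketched end to end in a few lines of Python. This is an illustration under simplifying assumptions, not how any particular engine implements it: `analyze`, `build`, and `search` are hypothetical helpers, there is a single field (so no BM25F weighting), bigrams and the re-ranking stage are omitted, and the BM25 parameters `k1=1.2` and `b=0.75` are just commonly used defaults.

```python
import math
import re
import unicodedata
from collections import Counter, defaultdict

def analyze(text):
    """Lowercase, fold accents, and split on non-alphanumerics (hyphens split)."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", folded)

def build(docs):
    """Return (postings, lengths); postings[term][doc_id] = term frequency."""
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = analyze(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, lengths

def search(query, postings, lengths, k1=1.2, b=0.75):
    """AND-match all query terms, then rank the candidates with BM25."""
    terms = analyze(query)
    if not terms or any(t not in postings for t in terms):
        return []
    candidates = set.intersection(*(set(postings[t]) for t in terms))
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = {}
    for doc_id in candidates:
        score = 0.0
        for t in terms:
            df = len(postings[t])  # number of docs containing the term
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = postings[t][doc_id]
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy catalog.
docs = {
    1: "Nike Air Max 270",
    2: "Nike Air Max 270 running shoe for daily training",
    3: "Air Jordan 1 Mid",
}
postings, lengths = build(docs)
print(analyze("Café Gore-Tex"))
print(search("Air Max", postings, lengths))
```

Both matching documents contain each query term once, so BM25's length normalization is what ranks the shorter title first in this toy example.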

Why It Matters in E-commerce

Millisecond recall: A lookup touches only the postings for the query terms, so candidates are found quickly before re-ranking or semantic steps.
Explainability: Highlights and snippets come from positions/offsets.
Cost-effective: Lexical retrieval is cheap and robust at scale.

Best Practices

Field design: Exact fields for SKU/MPN, analyzers for titles, bigrams for common phrases.
Locale analyzers: Language-aware tokenization/stemming per market.
Stopwords & synonyms: Filter obvious stopwords; apply synonyms at query time (late binding).
Hybrid pipeline: Use inverted index for recall, vectors for semantics, then re-rank/LTR.
Maintenance: Reindex after mapping or analyzer changes, because existing postings were produced by the old analysis; monitor index health.
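
As an illustration of query-time synonyms (late binding), the sketch below expands each query term into a group of variants, takes the union of postings within a group, and intersects across groups. The `SYNONYMS` table, the `POSTINGS` data, and the helper names are all hypothetical; the point is only that the index stays untouched when the synonym table changes.

```python
# Hypothetical synonym table, consulted only at query time.
SYNONYMS = {
    "sneakers": {"trainers"},
    "tee": {"tshirt"},
}

# Toy postings: term -> set of doc IDs (assumed to be built elsewhere).
POSTINGS = {
    "sneakers": {1, 4},
    "trainers": {2},
    "red": {1, 2, 3},
    "tshirt": {3},
}

def expand(term):
    """Return the term together with its query-time synonyms."""
    return {term} | SYNONYMS.get(term, set())

def search(query_terms):
    """OR within each synonym group, AND across groups."""
    result = None
    for term in query_terms:
        group = set()
        for variant in expand(term):
            group |= POSTINGS.get(variant, set())
        result = group if result is None else result & group
    return sorted(result or [])

print(search(["red", "sneakers"]))  # docs matching red AND (sneakers OR trainers)
```

Because expansion happens per query, adding or removing a synonym takes effect immediately and requires no reindexing, at the cost of reading more postings lists per query.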

Challenges

Vocabulary drift: New brands/models require analyzer/vocabulary upkeep.
Phrase sensitivity: Without stored positions, phrase and proximity queries cannot be evaluated.
Multi-locale: Consistency across analyzers is non-trivial.

Examples

Query “air max 270” → intersect postings for air, max, 270, then favor documents where the terms appear adjacent and in order (bigram/phrase match).
Query “gore-tex jacket” → normalized tokens map to the right postings despite hyphenation.
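
Both examples can be reproduced with a small positional index. The sketch below is a simplified illustration: `analyze`, `build_positional_index`, and `phrase_match` are hypothetical helpers, the catalog is invented, and the phrase check is strict adjacency, whereas the text above describes phrase matches as a ranking signal rather than a hard filter.

```python
import re
from collections import defaultdict

def analyze(text):
    """Lowercase and split on non-alphanumerics, so "Gore-Tex" -> gore, tex."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """index[term][doc_id] = list of token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term][doc_id].append(pos)
    return index

def phrase_match(query, index):
    """Return doc IDs where the query terms occur consecutively, in order."""
    terms = analyze(query)
    if not terms or any(t not in index for t in terms):
        return []
    # Step 1: intersect postings to get docs containing every term.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = []
    for doc_id in sorted(candidates):
        # Step 2: require term i+1 to sit exactly i+1 positions after the first term.
        later = [set(index[t][doc_id]) for t in terms[1:]]
        if any(all(start + i + 1 in s for i, s in enumerate(later))
               for start in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical toy catalog.
docs = {
    1: "Nike Air Max 270 sneaker",
    2: "Max Air 270 cushion insole",
    3: "Gore-Tex jacket, waterproof",
}
index = build_positional_index(docs)
print(phrase_match("air max 270", index))
print(phrase_match("gore tex jacket", index))
```

Document 2 contains all three terms of the first query but in a different order, so it survives the intersection and is rejected only by the position check; the second query matches despite the missing hyphen because query and document pass through the same analyzer.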

Summary

The inverted index is the backbone of fast, explainable text retrieval. With clean analyzers, smart fielding, and a hybrid pipeline, it powers accurate product search at scale.

FAQ

Inverted index vs database index?

A conventional database index (typically a B-tree) speeds up equality and range lookups on column values; an inverted index is built for full-text retrieval over terms.

Do I still need vectors?

Vectors improve semantic recall, but keep the inverted index for precision, speed, and filtering.

Are positions necessary?

Yes for phrase queries, proximity queries, and highlighting; optional for pure bag-of-words matching.