GLOSSARY

Inverted Index

An inverted index maps each term to the documents that contain it. In e-commerce, it’s how queries instantly find matching products by words in titles, attributes, and descriptions.

What is an Inverted Index?

An inverted index is the core retrieval structure for text search: a term → postings list mapping. Each postings list stores doc IDs (and often positions) for documents containing the term, enabling fast lookups and phrase/proximity queries.
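The term → postings mapping can be sketched in a few lines of Python. This is a minimal illustration, not a production structure: the toy catalog, the whitespace tokenizer, and the `postings` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy catalog: doc ID -> product title.
docs = {
    1: "nike air max 270",
    2: "air fryer max capacity",
    3: "nike running shoes",
}

# Inverted index: term -> {doc_id: [positions of the term in that doc]}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].setdefault(doc_id, []).append(pos)

def postings(term):
    """Return the sorted doc IDs whose text contains the term."""
    return sorted(index.get(term, {}))

print(postings("air"))   # doc IDs containing "air"
print(index["max"])      # per-document positions of "max"
```

Storing positions alongside doc IDs is what later makes phrase and proximity checks possible; dropping them shrinks the index but limits it to bag-of-words matching.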

How It Works (quick)

  • Tokenize & normalize: Lowercase, fold accents, handle hyphens; create unigrams and (optionally) bigrams.
  • Build postings: For each term and field, store doc IDs, term frequency, and positions.
  • Query time: Look up postings for query terms → intersect/union → score (BM25/TF-IDF) → re-rank.
  • Fields & boosts: Index title, attributes, and description as separate fields, and weight them with BM25F.
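The steps above can be sketched end to end. This is a simplified, single-field illustration under stated assumptions: the `analyze`, `bm25`, and `search` functions and the toy catalog are hypothetical, the scoring uses one common BM25 variant with k1 = 1.2 and b = 0.75, and per-field BM25F weighting and re-ranking are left out.

```python
import math
import re
import unicodedata
from collections import Counter, defaultdict

def analyze(text):
    """Lowercase, fold accents, and split on non-alphanumerics (hyphens separate tokens)."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", folded)

# Hypothetical toy catalog.
docs = {
    1: "Nike Air Max 270",
    2: "Café-style Air Fryer, Max Capacity",
    3: "Nike Running Shoes",
}

# Build postings: term -> {doc_id: term frequency}.
index = defaultdict(dict)
doc_len = {}
for doc_id, text in docs.items():
    tokens = analyze(text)
    doc_len[doc_id] = len(tokens)
    for term, tf in Counter(tokens).items():
        index[term][doc_id] = tf

avg_len = sum(doc_len.values()) / len(doc_len)

def bm25(term, doc_id, k1=1.2, b=0.75):
    """Score one term for one document with a common BM25 variant."""
    tf = index[term][doc_id]
    df = len(index[term])
    idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
    return idf * tf * (k1 + 1) / norm

def search(query):
    """AND-match all query terms, then rank the candidates by summed BM25."""
    terms = analyze(query)
    if not terms:
        return []
    candidates = set(docs)
    for term in terms:
        candidates &= set(index.get(term, {}))
    scored = [(sum(bm25(t, d) for t in terms), d) for d in candidates]
    return [d for _, d in sorted(scored, reverse=True)]

print(search("air max"))  # shorter matching titles rank first
```

Both matching documents contain each query term once, so the length normalization in BM25 decides the order: the shorter title scores higher.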

Why It Matters in E-commerce

  • Millisecond recall: Fetches candidate products in milliseconds, before re-ranking or semantic steps run.
  • Explainability: Highlights and snippets come from positions/offsets.
  • Cost-effective: Lexical retrieval is cheap and robust at scale.

Best Practices

  • Field design: Exact fields for SKU/MPN, analyzers for titles, bigrams for common phrases.
  • Locale analyzers: Language-aware tokenization/stemming per market.
  • Stopwords & synonyms: Filter obvious stopwords; apply synonyms at query time (late binding).
  • Hybrid pipeline: Use inverted index for recall, vectors for semantics, then re-rank/LTR.
  • Maintenance: Rebuild after mapping/analyzer changes; monitor index health.
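Query-time synonym expansion, as recommended above, might look like the following sketch. The synonym table, the postings, and the `expand` and `search` helpers are hypothetical; a real deployment would curate synonyms per locale and run them through the same analyzer as the indexed text.

```python
# Hypothetical synonym table; a real one would be curated per locale.
SYNONYMS = {
    "sneakers": {"trainers"},
    "trainers": {"sneakers"},
    "tee": {"tshirt"},
}

# Hypothetical postings: term -> set of doc IDs.
index = {
    "sneakers": {1, 4},
    "trainers": {2},
    "red": {1, 2, 3},
}

def expand(term):
    """Return the term plus its query-time synonyms."""
    return {term} | SYNONYMS.get(term, set())

def search(query_terms):
    """OR within each synonym group, AND across the groups."""
    result = None
    for term in query_terms:
        group = set()
        for variant in expand(term):
            group |= index.get(variant, set())
        result = group if result is None else result & group
    return sorted(result or [])

print(search(["red", "sneakers"]))  # also matches red "trainers"
```

Because the expansion happens when the query is parsed, editing the synonym table takes effect immediately and does not require rebuilding the index.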

Challenges

  • Vocabulary drift: New brands/models require analyzer/vocabulary upkeep.
  • Phrase sensitivity: Without stored positions, phrase and proximity queries cannot be answered.
  • Multi-locale: Consistency across analyzers is non-trivial.

Examples

  • Query “air max 270” → intersect the postings for air, max, and 270, then favor documents where the terms appear as an adjacent phrase.
  • Query “gore-tex jacket” → normalized tokens map to the right postings despite hyphenation.
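Both examples can be reproduced with a small positional index. This is a sketch under assumptions: the catalog, the `analyze` tokenizer (which simply splits on non-alphanumerics), and the `match_all` and `match_phrase` helpers are hypothetical, and a real engine would turn the phrase match into a score boost rather than a hard filter.

```python
import re
from collections import defaultdict

def analyze(text):
    """Lowercase and split on non-alphanumerics, so "Gore-Tex" -> ["gore", "tex"]."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical toy catalog.
docs = {
    1: "Nike Air Max 270 sneakers",
    2: "Max Air 270 compressor",
    3: "Gore-Tex jacket",
    4: "GORE TEX rain jacket",
}

# Positional index: term -> {doc_id: [positions]}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(analyze(text)):
        index[term].setdefault(doc_id, []).append(pos)

def match_all(query):
    """Doc IDs containing every query term, in any order."""
    terms = analyze(query)
    if not terms:
        return []
    ids = set(docs)
    for term in terms:
        ids &= set(index.get(term, {}))
    return sorted(ids)

def match_phrase(query):
    """Doc IDs where the query terms occur at consecutive positions."""
    terms = analyze(query)
    hits = []
    for doc_id in match_all(query):
        starts = index[terms[0]][doc_id]
        if any(all(start + i in index[t][doc_id] for i, t in enumerate(terms))
               for start in starts):
            hits.append(doc_id)
    return hits

print(match_all("air max 270"), match_phrase("air max 270"))
print(match_all("gore-tex jacket"), match_phrase("gore-tex jacket"))
```

The intersection alone returns the compressor for “air max 270”; only the position check separates it from the shoe. For “gore-tex jacket”, the hyphenated and spaced spellings normalize to the same tokens, so both jackets are found by the term match.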

Summary

The inverted index is the backbone of fast, explainable text retrieval. With clean analyzers, smart fielding, and a hybrid pipeline, it powers accurate product search at scale.

FAQ

Inverted index vs database index?

A database index (typically a B-tree) speeds up equality and range lookups on column values; an inverted index maps terms to the documents that contain them and is built for full-text retrieval.

Do I still need vectors?

Vectors improve semantic recall, but keep the inverted index for precision, speed, and filtering.

Are positions necessary?

Yes for phrase queries, proximity queries, and highlighting; they are optional for pure bag-of-words matching.