GLOSSARY

Bigram Matching

Bigram matching looks for pairs of adjacent words (like “running shoes”) instead of single keywords. In e-commerce, it boosts precision and intent understanding for compound terms, brand+model names, and common phrases.

What is Bigram Matching?

Bigram matching is a retrieval technique that indexes and matches two-word sequences (word bigrams or “shingles”) from queries and documents. It sits between simple unigram (single-word) matching and full phrase search—capturing word order and short phrases without requiring an exact, quoted match.

How it works (quick)

  • Shingling: Build bigrams from tokens: “wireless noise cancelling headphones” → wireless noise, noise cancelling, cancelling headphones (see the sketch after this list).
  • Multi-field scoring: Keep unigram and bigram fields; score both (e.g., BM25/BM25F) and combine with weights.
  • Order & proximity: Bigrams encode order, reducing false positives where words are far apart.
  • Stopword handling: Optionally drop low-value bigrams (e.g., “of the”) and keep informative ones.
  • Synonyms & hyphens: Map variants (“gore tex” ↔ “gore-tex”; “t-shirt” ↔ “tee shirt”) at the bigram level.
  • Fallbacks: If no bigram matches, fall back to unigrams so recall isn’t lost.
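
A minimal, self-contained sketch of these steps, with a toy tokenizer and flat match counts standing in for a real analyzer and BM25 (the stopword list, the 1.3 weight, and all function names are illustrative assumptions):

  # Shingling, stopword filtering, hyphen folding, and a weighted
  # unigram+bigram score with a unigram fallback. The toy tokenizer and
  # flat counts stand in for a real analyzer and BM25.

  STOPWORDS = {"a", "an", "and", "for", "of", "the"}
  BIGRAM_WEIGHT = 1.3  # assumed boost; tune per corpus (see Best practices)

  def tokenize(text: str) -> list[str]:
      # Toy analyzer: lowercase and fold hyphens, so "gore-tex" == "gore tex".
      return text.lower().replace("-", " ").split()

  def bigrams(tokens: list[str]) -> list[tuple[str, str]]:
      # Word shingles of size 2; drop pairs made entirely of stopwords
      # (e.g., "of the") while keeping informative ones ("ice cream").
      pairs = zip(tokens, tokens[1:])
      return [p for p in pairs if not all(t in STOPWORDS for t in p)]

  def score(query: str, title: str) -> float:
      q_uni, t_uni = tokenize(query), tokenize(title)
      q_bi, t_bi = set(bigrams(q_uni)), set(bigrams(t_uni))
      uni_hits = sum(1 for t in q_uni if t in t_uni)  # unigram recall floor
      bi_hits = len(q_bi & t_bi)                      # order/adjacency signal
      # Fallback: with zero bigram hits the unigram score still counts,
      # so recall is preserved.
      return uni_hits + BIGRAM_WEIGHT * bi_hits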

Why it matters in e-commerce

  • Compound terms understood: “trail running”, “ice cream”, “air max”, “apple watch” resolve as units.
  • Brand + model accuracy: “nikon d750”, “iphone 15”, “air max 270” get cleaner matches.
  • Fewer false positives: Reduces hits where words appear far apart or in the wrong order.
  • Better SERP quality: Improves autocomplete scoring, category relevance, and product detail page (PDP) ranking.

Best practices

  • Dual index fields: title_unigram, title_bigram (and likewise for key attributes); boost bigrams in titles/attributes more than descriptions (see the mapping sketch after this list).
  • Tune weights: Start with bigram weight at 1.2–1.5× the unigram title score; adjust via A/B tests (NDCG/CTR).
  • Language-aware analyzers: Proper tokenization, case/diacritics folding; be careful with stemming that might break bigrams.
  • Typos & variants: Keep fuzzy on unigrams; use exact on bigrams, with synonym bridges for common variants.
  • Index size control: Limit bigram fields to short, high-signal text (titles, key attributes) to avoid bloat.
  • Hybrid pipeline: Unigram recall → vector/semantic retrieval → re-ranking with bigram & semantic signals.
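
A sketch of the dual-field setup, assuming Elasticsearch 8.x and its built-in shingle token filter; index and field names are illustrative, and the matching query side appears under the FAQ below:

  # One stored title, analyzed two ways: default unigrams in "title" plus
  # a bigram subfield "title.bigram", scoped to short, high-signal text.
  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")

  es.indices.create(
      index="products",
      settings={
          "analysis": {
              "filter": {
                  "bigram_filter": {
                      "type": "shingle",
                      "min_shingle_size": 2,
                      "max_shingle_size": 2,
                      "output_unigrams": False,  # unigrams stay in the parent field
                  }
              },
              "analyzer": {
                  "bigram_analyzer": {
                      "type": "custom",
                      "tokenizer": "standard",
                      "filter": ["lowercase", "bigram_filter"],
                  }
              },
          }
      },
      mappings={
          "properties": {
              "title": {
                  "type": "text",
                  "fields": {
                      "bigram": {"type": "text", "analyzer": "bigram_analyzer"}
                  },
              }
          }
      },
  )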

Challenges & trade-offs

  • Index/storage growth: Bigrams multiply tokens; scope fields carefully.
  • Misspellings break pairs: Use unigram fuzzy fallback and spelling correction.
  • Stopwords & hyphens: Poor normalization can drop useful pairs or split brands (e.g., “t-shirt”).
  • Morphology/locales: In highly inflected or CJK languages, use robust segmentation or character n-grams (a toy sketch follows this list).
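
For unsegmented CJK text, character bigrams can substitute for word shingles; a toy sketch, where a production system would use a dedicated CJK or morphological analyzer instead:

  def char_bigrams(text: str) -> list[str]:
      # "冰淇淋机" (ice cream maker) -> ["冰淇", "淇淋", "淋机"]
      chars = [c for c in text if not c.isspace()]
      return [a + b for a, b in zip(chars, chars[1:])]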

Examples (storefront)

  • “air max 270” → strong boosts from air max and max 270 in titles.
  • “wireless charger” → demotes pages where wireless and charger are far apart; promotes tight matches.
  • “ice cream maker” vs “ice maker” → bigrams help disambiguate (demonstrated in the snippet after this list).
  • “gore tex jacket” ↔ “gore-tex jacket” handled via bigram synonyms.
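
The third and fourth examples, run through the score() sketch from “How it works” (product titles here are invented):

  # Shared unigrams keep "ice maker" in recall, but bigram hits separate
  # the intents; hyphen folding equates the gore-tex variants.
  print(score("ice cream maker", "Deluxe Ice Cream Maker"))  # 3 + 1.3*2 = 5.6
  print(score("ice cream maker", "Countertop Ice Maker"))    # 2 + 1.3*0 = 2.0
  print(bigrams(tokenize("gore-tex jacket")) ==
        bigrams(tokenize("gore tex jacket")))                # True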

Summary

Bigram matching captures short phrases and word order, lifting precision on high-value compound terms while keeping recall via unigrams. In modern product search, it’s a low-latency way to improve relevance before semantic retrieval and re-ranking step in.

FAQ

Bigram vs phrase search?

Phrase search requires exact adjacency (often quoted). Bigrams reward adjacency but don’t require a strict phrase match.

Do I also need trigrams?

Use them sparingly (e.g., for brand+model+number patterns). Trigrams add index cost; bigrams + unigrams + re-ranking are usually enough.
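
If trigrams do earn their cost, one option is to widen the shingle filter from the Best-practices sketch; an illustrative variant, scoped to short fields where brand+model+number matters:

  trigram_filter = {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3,  # emits both bigrams and trigrams
      "output_unigrams": False,
  }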

Will bigrams hurt recall?

Not if you combine with unigrams and keep fallbacks; bigrams sharpen precision while unigrams preserve recall.

How do I add bigrams to BM25?

Index a separate bigram field with a 2-gram analyzer and include it in BM25F with an appropriate boost.
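
Assuming the mapping sketched under “Best practices”, the query side might look like this (Elasticsearch scores each field with BM25 by default; most_fields sums the per-field scores, a common stand-in for true BM25F):

  from elasticsearch import Elasticsearch

  es = Elasticsearch("http://localhost:9200")
  resp = es.search(
      index="products",
      query={
          "multi_match": {
              "query": "wireless charger",
              "fields": ["title", "title.bigram^1.3"],  # boost the bigram field
              "type": "most_fields",
          }
      },
  )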

What about multilingual stores?

Use locale-specific analyzers and maintain bigram synonym maps per language.