GLOSSARY

Stemming

Stemming reduces words to a common root (runs → run, running → run). It boosts recall but can hurt precision.

What is Stemming?

Stemming trims words to a root form using heuristic suffix-stripping rules (e.g., the Porter stemmer), so the result need not be a dictionary word (Porter maps “studies” to “studi”). Lemmatization instead uses vocabulary and grammar to find the dictionary lemma; stemming is faster but rougher.

How It Works (quick)

  • Tokenize → stem: Apply language-specific suffix rules (e.g., “shoes” → “shoe”, “running” → “run”).
  • Index & query: Stem at both index and query time to align forms.
  • Field strategy: Apply stemming on long text fields; keep exact fields (brand/SKU) unstemmed.
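
The index-and-query step above can be sketched in a few lines of Python. This is a minimal illustration, not a production stemmer: `toy_stem`, `analyze`, `build_index`, and `search` are hypothetical names, and the two suffix rules are a tiny stand-in for a real algorithm such as Porter's.

```python
import re

VOWELS = "aeiou"

def toy_stem(token):
    """Toy English stemmer (assumption: two rules only, for illustration)."""
    t = token.lower()
    # Rule 1: strip "-ing" when the remaining stem still contains a vowel.
    if t.endswith("ing") and any(c in VOWELS for c in t[:-3]):
        t = t[:-3]
        # Undouble a trailing doubled consonant ("runn" -> "run"), except l/s/z.
        if len(t) >= 2 and t[-1] == t[-2] and t[-1] not in VOWELS + "lsz":
            t = t[:-1]
    # Rule 2: strip a plural "-s", but leave "-ss" words ("dress") alone.
    elif t.endswith("s") and not t.endswith("ss") and len(t) > 3:
        t = t[:-1]
    return t

def analyze(text):
    """Tokenize, then stem. The same function serves indexing and querying."""
    return [toy_stem(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]

def build_index(docs):
    """Build an inverted index: stemmed term -> set of document IDs."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents that contain every stemmed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

# Hypothetical catalog snippets.
docs = {1: "Running shoes for trail", 2: "Leather jacket", 3: "Wool socks"}
index = build_index(docs)
jacket_hits = search(index, "jackets")      # plural query, singular title
shoe_hits = search(index, "running shoe")   # both terms align after stemming
```

Because `analyze` runs on both sides, the query “jackets” and the title “jacket” reduce to the same term and match. Stemming only one side would leave the forms misaligned.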

Why It Matters in E-commerce

  • Recall: Matches morphological variants (“jacket” ↔ “jackets”).
  • Localization: Helps across plural/case systems (EN, DE, ES), if tuned per locale.

Best Practices

  • Mix with lemmatization: Prefer lemma where available; fall back to stem for speed.
  • Per-locale analyzers: Use the right stemmer (or none) per language.
  • Protect precision: Never stem SKU/MPN/brand exact fields; be cautious in titles.
  • Phrase integrity: Combine with phrase/proximity so stems don’t break units.
  • Evaluate: Track Precision/Recall trade-offs and zero-result changes.
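
To make the evaluation step concrete, the sketch below compares precision and recall for one judged query with and without stemming. The document IDs and relevance judgments are made up for illustration, and `precision_recall` is a hypothetical helper, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given sets of document IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judged query: IDs a human marked as relevant.
relevant = {1, 2, 3}
without_stemming = {1}         # exact-form match only
with_stemming = {1, 2, 3, 9}   # variants matched, plus one false positive

p_before, r_before = precision_recall(without_stemming, relevant)
p_after, r_after = precision_recall(with_stemming, relevant)
```

In this made-up case recall rises from one third to 1.0 while precision drops from 1.0 to 0.75, which is the trade-off worth tracking across a full query set rather than a single query.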

Challenges

  • Over-stemming (“organization” → “organ”), under-stemming, compounds (e.g., German), and brand collisions.
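
One way to surface over-stemming before it reaches users is to group the vocabulary by stem and review any group that holds more than one word. The sketch below assumes a deliberately aggressive toy stemmer (`aggressive_stem`, a hypothetical function with made-up rules); in practice you would plug in whichever stemmer your analyzer actually uses.

```python
from collections import defaultdict

def aggressive_stem(word):
    """Toy stemmer (assumption): strip the first matching suffix, longest first."""
    w = word.lower()
    for suffix in ("ization", "ations", "ation", "s"):
        # Keep at least three characters of stem.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def stem_collisions(vocabulary):
    """Map each stem to the distinct words that collapse onto it (groups of 2+)."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[aggressive_stem(word)].add(word.lower())
    return {stem: words for stem, words in groups.items() if len(words) > 1}

vocab = ["organization", "organ", "organs", "jacket", "jackets"]
collisions = stem_collisions(vocab)
```

The “jacket” group is the intended behavior, while the “organ” group merges unrelated meanings. The code only flags candidates; deciding which groups are harmful still takes human review.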

Examples

  • Query “jackets” finds items titled “jacket.”
  • Stem in descriptions, but keep exact brand unstemmed to avoid drift.
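
The second example can be sketched as a per-field analyzer map: text fields are tokenized and stemmed, while the brand field is matched as a single exact value. All names here (`FIELD_ANALYZERS`, `analyze_text`, `analyze_exact`, `matches`) and the sample product are hypothetical, and `light_stem` is a one-rule stand-in for a real stemmer.

```python
import re

def light_stem(token):
    """Toy plural stripper (assumption): drop a final "s", but not "ss"."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze_text(value):
    """Stemmed analyzer for long text fields."""
    return {light_stem(t) for t in re.findall(r"[a-z0-9]+", value.lower())}

def analyze_exact(value):
    """Exact analyzer: the whole value, lowercased, as a single term."""
    return {value.strip().lower()}

# Hypothetical field strategy: stem text fields, keep brand exact.
FIELD_ANALYZERS = {
    "title": analyze_text,
    "description": analyze_text,
    "brand": analyze_exact,
}

def matches(product, field, query):
    """True if every analyzed query term appears in the analyzed field."""
    analyzer = FIELD_ANALYZERS[field]
    return analyzer(query) <= analyzer(product[field])

product = {
    "title": "Waterproof jacket",
    "description": "Lightweight shell with taped seams",
    "brand": "Vans",
}
title_match = matches(product, "title", "jackets")   # stemmed field
brand_exact = matches(product, "brand", "Vans")      # exact field
brand_drift = matches(product, "brand", "van")       # must not match
```

If the brand field used `analyze_text` instead, the query “van” and the brand “Vans” would both reduce to “van” and match, which is the kind of drift the exact field avoids.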

Summary

Use stemming to lift recall on general text, but protect exact fields and high-precision phrases. Favor lemmatization where quality matters, and measure the impact.

FAQ

Stemming vs lemmatization? Lemmas are more accurate; stems are faster.

Use stemming with vectors? Yes. The lexical side of matching still benefits from sensible normalization.