What is Stemming?
Stemming reduces words to a root form by applying heuristic suffix-stripping rules (e.g., the Porter stemmer); the resulting stem need not be a dictionary word. Lemmatization, by contrast, uses vocabulary and grammar to find the dictionary lemma. Stemming is faster but less precise.
How It Works (quick)
- Tokenize → stem: Apply language-specific suffix rules (e.g., “shoes” → “shoe”, “running” → “run”).
- Index & query: Stem at both index and query time to align forms.
- Field strategy: Apply stemming on long text fields; keep exact fields (brand/SKU) unstemmed.
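The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production analyzer: `toy_stem` and its suffix rules are simplified assumptions (not the Porter algorithm), and the documents, field names, and the brand "Acme" are hypothetical.

```python
import re
from collections import defaultdict

# Toy suffix rules, tried in order. Illustrative assumptions only; a real
# stemmer such as Porter applies many more rules and conditions.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("s", "")]

# Field strategy: stem text fields, keep exact fields as-is.
STEMMED_FIELDS = {"title", "description"}
EXACT_FIELDS = {"brand", "sku"}


def toy_stem(token):
    """Strip one common English suffix; leave short or '-ss' words untouched."""
    word = token.lower()
    if word.endswith("ss"):
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)] + replacement
            # Undouble a final consonant left by "-ing": "runn" -> "run".
            if (suffix == "ing" and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return word


def analyze(text, stem):
    """Tokenize on letters/digits, then optionally stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens] if stem else tokens


def build_index(docs):
    """Inverted index keyed by (field, term); only stemmed fields are stemmed."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in analyze(text, stem=field in STEMMED_FIELDS):
                index[(field, term)].add(doc_id)
    return index


def search(index, query):
    """Return ids of docs where every query term matches within one field."""
    matches = set()
    for field in STEMMED_FIELDS | EXACT_FIELDS:
        # Analyze the query the same way the field was analyzed at index time.
        terms = analyze(query, stem=field in STEMMED_FIELDS)
        if not terms:
            continue
        postings = [index.get((field, term), set()) for term in terms]
        matches |= set.intersection(*postings)
    return matches


# Hypothetical catalog used only for this sketch.
DOCS = {
    1: {"title": "Waterproof jacket", "description": "Built for running in the rain",
        "brand": "Acme", "sku": "JKT-100S"},
    2: {"title": "Trail shoes", "description": "Lightweight shoes for runners",
        "brand": "Acme", "sku": "SHO-200"},
}
INDEX = build_index(DOCS)
```

With this setup, `search(INDEX, "jackets")` matches document 1 through its stemmed title. The SKU shows why exact fields stay unstemmed: `analyze("JKT-100S", stem=True)` would turn the token `100s` into `100`, while the exact analyzer keeps it intact.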
Why It Matters in E-commerce
- Recall: Matches morphological variants (“jacket” ↔ “jackets”).
- Localization: Handles plural and case inflections across languages (e.g., EN, DE, ES), provided the stemmer is tuned per locale.
Best Practices
- Mix with lemmatization: Prefer lemmas where a lemmatizer is available; fall back to stemming where none exists or speed matters more.
- Per-locale analyzers: Use the right stemmer (or none) per language.
- Protect precision: Never stem SKU/MPN/brand exact fields; be cautious in titles.
- Phrase integrity: Combine stemming with phrase or proximity matching so multi-word units still match as units.
- Evaluate: Track precision/recall trade-offs and changes in the zero-result rate.
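The evaluation step can be made concrete with a small comparison harness. The sketch below is one possible shape, assuming you already have relevance judgments per query; the query results and document ids are made-up toy data, not measurements, and treating precision as 0.0 for an empty result set is a convention chosen here.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given sets of document ids."""
    hits = len(retrieved & relevant)
    # Convention for this sketch: an empty result set scores 0.0 precision.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def zero_result_rate(results_by_query):
    """Fraction of queries that returned nothing."""
    if not results_by_query:
        return 0.0
    empty = sum(1 for docs in results_by_query.values() if not docs)
    return empty / len(results_by_query)


def report(results_by_query, relevant_by_query):
    """Per-query (precision, recall) plus the overall zero-result rate."""
    per_query = {
        query: precision_recall(docs, relevant_by_query[query])
        for query, docs in results_by_query.items()
    }
    return per_query, zero_result_rate(results_by_query)


# Hypothetical judgments and results, for illustration only.
RELEVANT = {"jackets": {1, 2}, "organ": {7}}
BASELINE = {"jackets": set(), "organ": {7}}     # no stemming
STEMMED = {"jackets": {1, 2}, "organ": {7, 8}}  # doc 8 is an over-stemming match

baseline_scores, baseline_zero = report(BASELINE, RELEVANT)
stemmed_scores, stemmed_zero = report(STEMMED, RELEVANT)
```

In this toy data, stemming removes the zero-result query for "jackets" but halves precision for "organ". That is the kind of trade-off worth watching per query rather than only in aggregate.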
Challenges
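- Over-stemming conflates unrelated words (“organization” → “organ”); under-stemming leaves related forms unmerged; compounds (e.g., German) are not split by suffix rules; and brand names can collide with the stems of common words.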
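One way to spot over-stemming and brand collisions before they reach users is to group the vocabulary by stem and review groups that merge several surface forms. The sketch below assumes a deliberately aggressive toy stemmer (`aggressive_stem`, not a real algorithm) and a hypothetical brand term "nomads"; a protected-word list then exempts that term from stemming.

```python
from collections import defaultdict


def aggressive_stem(word):
    """Toy stemmer that over-stems on purpose; the rules are assumptions."""
    word = word.lower()
    for suffix in ("ization", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def protected_stem(word, protected):
    """Leave protected terms (e.g., brand names) unstemmed."""
    word = word.lower()
    return word if word in protected else aggressive_stem(word)


def find_collisions(vocabulary, stem):
    """Map each stem to the distinct words it merges, where more than one."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[stem(word)].add(word.lower())
    return {s: sorted(words) for s, words in groups.items() if len(words) > 1}


# Hypothetical vocabulary; "nomads" stands in for a brand name.
VOCAB = ["organ", "organs", "organization", "jacket", "jackets", "nomad", "nomads"]
PROTECTED = {"nomads"}

unprotected = find_collisions(VOCAB, aggressive_stem)
protected = find_collisions(VOCAB, lambda w: protected_stem(w, PROTECTED))
```

Here `unprotected` groups "organization" with "organ" and "nomads" with "nomad", while `protected` no longer contains the "nomad" group. Groups such as "jacket"/"jackets" are the intended conflations; the review is about telling those apart from harmful ones.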
Examples
- Query “jackets” finds items titled “jacket.”
- Stem descriptions, but keep the exact brand field unstemmed so brand queries do not drift to unrelated matches.
Summary
Use stemming to lift recall on general text, but protect exact fields and high-precision phrases. Favor lemmatization where quality matters, and measure the impact.