GLOSSARY

Stemming

Stemming reduces inflected words to a common root (runs → run, running → run). It boosts recall but can hurt precision.

What is Stemming?

Stemming trims words to a root form using heuristic rules (e.g., Porter stemmer). Unlike lemmatization (which uses vocabulary/grammar to find the dictionary lemma), stemming is faster but rougher.

How It Works (quick)

Tokenize → stem: Apply language-specific suffix rules (e.g., “shoes” → “shoe”, “running” → “run”).
Index & query: Stem at both index and query time to align forms.
Field strategy: Apply stemming on long text fields; keep exact fields (brand/SKU) unstemmed.
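
A minimal sketch of the three steps above, assuming a toy suffix stripper rather than a real Porter or Snowball stemmer. The names `toy_stem`, `build_index`, `search`, and the sample `PRODUCTS` are illustrative only; a production engine would rely on its own per-field analyzers.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(word):
    """Toy English suffix rules (NOT the Porter algorithm)."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undouble a trailing doubled consonant: "runn" -> "run".
        if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeioulsz":
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w


# Hypothetical catalog: one long text field, one exact field.
PRODUCTS = {
    1: {"brand": "Acme", "description": "Waterproof jacket for running"},
    2: {"brand": "Runs", "description": "Leather shoes"},
}


def build_index(products):
    """Stem the description field; keep the brand field exact (lowercased only)."""
    index = defaultdict(set)
    for pid, product in products.items():
        for token in tokenize(product["description"]):
            index[("description", toy_stem(token))].add(pid)
        index[("brand", product["brand"].lower())].add(pid)
    return index


def search(index, query):
    """Apply the same stemmer at query time for the stemmed field only."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(("description", toy_stem(token)), set())
        hits |= index.get(("brand", token), set())
    return hits
```

In this sketch the query "jackets" reaches the description containing "jacket", while the query "run" does not match the brand "Runs", because the brand field is never stemmed.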

Why It Matters in E-commerce

Recall: Matches morphological variants (“jacket” ↔ “jackets”).
Localization: Handles plural and case inflections across languages (EN, DE, ES), provided the stemmer is tuned per locale.

Best Practices

Mix with lemmatization: Prefer lemmas where a lemmatizer is available; fall back to stems where it is not, or where speed matters more.
Per-locale analyzers: Use the right stemmer (or none) per language.
Protect precision: Never stem SKU/MPN/brand exact fields; be cautious in titles.
Phrase integrity: Combine stemming with phrase or proximity matching so stemmed terms don’t break multi-word units.
Evaluate: Track precision/recall trade-offs and changes in the zero-result rate.
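
One way to make the evaluation step concrete is to compare retrieved sets against judged-relevant sets before and after enabling stemming. The sketch below assumes made-up product IDs and relevance judgments; `precision_recall` and `zero_result_rate` are hypothetical helper names, not part of any search engine API.

```python
from typing import Dict, Set, Tuple


def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant (0.0 when undefined)."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def zero_result_rate(results_per_query: Dict[str, Set[int]]) -> float:
    """Share of queries that returned nothing at all."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for hits in results_per_query.values() if not hits)
    return empty / len(results_per_query)


# Made-up judgments for a single query, for illustration only.
RELEVANT = {1, 2, 3}
RETRIEVED_UNSTEMMED = {1}           # exact form only
RETRIEVED_STEMMED = {1, 2, 3, 9}    # variants found, plus one false match

before = precision_recall(RETRIEVED_UNSTEMMED, RELEVANT)
after = precision_recall(RETRIEVED_STEMMED, RELEVANT)
```

With these invented numbers, stemming raises recall from one third to 1.0 while precision drops from 1.0 to 0.75, which is the kind of trade-off worth tracking per locale and per field.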

Challenges

Common pitfalls are over-stemming (“organization” → “organ”), under-stemming (related forms left unmerged), compound words (e.g., in German), and brand names that collide with stemmed common words.
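
A rough way to surface over-stemming and brand collisions is to group the catalog vocabulary by stem and review any stem shared by several surface forms. The sketch below assumes a deliberately aggressive toy stemmer (`aggressive_stem`) and a made-up brand name ("Pines"); it only lists candidates for manual review and cannot tell harmless merges from harmful ones.

```python
from collections import defaultdict

# Toy suffix list, longest first; an assumption for illustration, not a real stemmer.
SUFFIXES = ("ization", "ation", "s")


def aggressive_stem(word, protected=frozenset()):
    """Strip the first matching suffix unless the word is protected."""
    w = word.lower()
    if w in protected:
        return w
    for suffix in SUFFIXES:
        # Keep at least three characters of stem.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def stem_groups(vocabulary, protected=frozenset()):
    """Return only the stems that merge two or more distinct words."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[aggressive_stem(word, protected)].add(word.lower())
    return {stem: words for stem, words in groups.items() if len(words) > 1}


VOCABULARY = ["organization", "organ", "jackets", "jacket", "Pines", "pine"]

# Without protection, the hypothetical brand "Pines" merges with "pine".
unprotected = stem_groups(VOCABULARY)

# Protecting the brand keeps it as its own term.
protected = stem_groups(VOCABULARY, protected=frozenset({"pines"}))
```

Here the "jacket" group is a desirable merge, the "organ" group is over-stemming, and the "pine" group disappears once the brand is protected.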

Examples

Query “jackets” finds items titled “jacket.”
Stem descriptions, but keep the exact brand field unstemmed to avoid matching drift.

Summary

Use stemming to lift recall on general text, but protect exact fields and high-precision phrases. Favor lemmatization where quality matters, and measure the impact.

FAQ

Stemming vs lemmatization? Lemmas are more accurate; stems are faster.

Use stemming with vectors? Yes. Lexical matching still benefits from sensible normalization.