GLOSSARY

Lemmatization

Lemmatization reduces words to their dictionary form (lemma). It improves matching across inflected variants without breaking meaning.

What is Lemmatization?

Lemmatization maps inflected forms to a single lemma (running/runs/ran → run) using a lexicon and morphological rules. Unlike stemming, which chops suffixes and can produce non-words, lemmatization aims for valid base forms, preserving semantics and grammar.
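The contrast with stemming can be sketched in a few lines. This is a toy illustration, not a production lemmatizer: the LEXICON table and both function names are made up for the example, and a real lemmatizer would combine a far larger dictionary with morphological rules.

```python
# Toy lexicon (an assumption for illustration): surface form -> lemma.
LEXICON = {
    "running": "run",
    "runs": "run",
    "ran": "run",
    "jackets": "jacket",
    "mice": "mouse",
}

def lemmatize(token: str) -> str:
    """Return the lexicon lemma, or the lowercased token if it is unknown."""
    word = token.lower()
    return LEXICON.get(word, word)

def naive_stem(token: str) -> str:
    """Strip one common suffix with no dictionary check (deliberately crude)."""
    word = token.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so very short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("running", "runs", "ran", "mice"):
    print(w, "-> lemma:", lemmatize(w), "| stem:", naive_stem(w))
```

In this sketch the stemmer turns "running" into the non-word "runn" and leaves the irregular forms "ran" and "mice" unchanged, while the lexicon lookup maps all three verb forms to "run".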

How It Works (quick)

  • Analyze tokens: Detect part of speech and morphology.
  • Lookup & rules: Use language models/lexicons to map to the lemma.
  • Field-aware: Apply to searchable fields (title, description); protect brand, SKU, model fields from changes.
  • Multilingual: Choose analyzers per locale; handle diacritics and compounds.
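A minimal sketch of the field-aware and per-locale steps above, assuming a dictionary-backed lemmatizer. The LEXICONS tables, the LEMMATIZED_FIELDS set, the analyze_field helper, and the sample product are all hypothetical; a real search engine would configure this through its own analyzer settings.

```python
from typing import Dict, List

# Hypothetical per-locale lexicons; real systems load full dictionaries.
LEXICONS: Dict[str, Dict[str, str]] = {
    "en": {"jackets": "jacket", "shoes": "shoe"},
    "de": {"jacken": "jacke", "schuhe": "schuh"},
}

# Only these fields are lemmatized; all others are kept verbatim.
LEMMATIZED_FIELDS = {"title", "description"}

def analyze_field(field: str, text: str, locale: str) -> List[str]:
    """Lemmatize searchable fields; return protected fields as one untouched token."""
    if field not in LEMMATIZED_FIELDS:
        return [text]
    lexicon = LEXICONS.get(locale, {})  # unknown locale: tokens pass through
    return [lexicon.get(tok, tok) for tok in text.lower().split()]

# Made-up product record for illustration.
product = {"title": "Running Jackets", "brand": "Runs & Co", "sku": "RJ-2000S"}
analyzed = {name: analyze_field(name, value, "en") for name, value in product.items()}
print(analyzed)
```

Here the title is reduced to the tokens "running" and "jacket", while the brand and SKU come back exactly as stored, so a trailing "s" in a model code is never mistaken for a plural.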

Why It Matters in E-commerce

  • Better recall: Maps variants such as “jackets”, “jacket’s”, and “jacketed” to “jacket” without the over-matching that aggressive stemming tends to cause.
  • Cleaner facets: Consistent attribute extraction across variants.
  • Global catalogs: Handles inflection-heavy languages (e.g., Czech, Hungarian, German).

Best Practices

  • Prefer lemmatization over aggressive stemming for product text.
  • Run POS-aware lemmatizers; nouns vs verbs behave differently.
  • Keep exact fields for SKUs/brands untouched.
  • Evaluate per language with NDCG/MRR and error analysis.
  • Fall back to light stemming only where lemmatizers are weak.
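The per-language evaluation step can be sketched with plain implementations of MRR and NDCG. The assumptions here: each query's results are given as a list of relevance grades in ranked order, gain is linear, and NDCG is normalized against the ideal ordering of the same retrieved list (a common simplification). The function names and sample judgments are invented for illustration.

```python
import math
from typing import Sequence

def reciprocal_rank(relevance: Sequence[float]) -> float:
    """1/rank of the first relevant result, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: Sequence[Sequence[float]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs) if runs else 0.0

def dcg(relevance: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))

def ndcg(relevance: Sequence[float]) -> float:
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# Made-up judgments for two queries, before and after enabling lemmatization.
baseline = [[0, 1, 0], [0, 0, 1]]
lemmatized = [[1, 0, 0], [0, 1, 0]]
print("MRR baseline:", mean_reciprocal_rank(baseline))
print("MRR lemmatized:", mean_reciprocal_rank(lemmatized))
```

Running the same comparison per language, and reading the queries whose scores dropped, covers both the metric and the error-analysis halves of the practice.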

Challenges

  • Ambiguity without POS.
  • Out-of-vocabulary terms (new brands/models).
  • Compound words and hyphenation.
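The first two challenges can be illustrated with a lookup keyed on both the word and its part of speech. This is a sketch under assumptions: the LEMMAS table and the POS labels are hypothetical, the tag is supplied by the caller rather than by a real tagger, and "Zorvex" is an invented brand name.

```python
from typing import Dict, Tuple

# Hypothetical (word, part of speech) -> lemma table.
LEMMAS: Dict[Tuple[str, str], str] = {
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
    ("jackets", "NOUN"): "jacket",
}

def lemmatize(token: str, pos: str) -> str:
    """POS-aware lookup; out-of-vocabulary tokens pass through lowercased."""
    word = token.lower()
    return LEMMAS.get((word, pos), word)

print(lemmatize("leaves", "NOUN"))   # same surface form, noun reading
print(lemmatize("leaves", "VERB"))   # same surface form, verb reading
print(lemmatize("Zorvex", "NOUN"))   # unknown brand is left alone
```

Passing unknown tokens through unchanged is one conservative choice for new brands and models: it gives up some recall, but it never rewrites a term the lexicon has not seen.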

Examples

  • A search for “men’s running jackets” aligns with product text containing “men running jacket” and “running-jacket”.
  • Hungarian queries match inflected product titles via lemma alignment.
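The English example can be reproduced with a small normalizer. It is a sketch only: the one-entry LEXICON, the normalize function, and the regular expressions are assumptions made for the example, and it handles just possessive endings, hyphens, and dictionary lookup.

```python
import re
from typing import List

# Toy lexicon (an assumption); "running" is kept as-is because no POS is available.
LEXICON = {"jackets": "jacket"}

def normalize(text: str) -> List[str]:
    """Lowercase, drop possessive 's, split on spaces and hyphens, then lemmatize."""
    text = text.lower()
    text = re.sub(r"[’']s\b", "", text)          # "men’s" -> "men"
    tokens = re.split(r"[\s\-]+", text.strip())  # "running-jacket" -> two tokens
    return [LEXICON.get(tok, tok) for tok in tokens if tok]

query = normalize("men’s running jackets")
print(query)
print(query == normalize("men running jacket"))
print(set(normalize("running-jacket")) <= set(query))
```

After normalization the query and “men running jacket” produce identical token lists, and the tokens of “running-jacket” are a subset of the query's tokens, which is what lets all three align at match time.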

Summary

Lemmatization standardizes word forms to improve recall and precision across languages, which makes it well suited to product text, as long as brands and codes keep their exact-match handling.

FAQ

Lemmatization vs stemming? Lemmas are dictionary forms; stemming just strips endings.

Does it slow search? Somewhat, but caching lemma lookups and analyzing text ahead of time, rather than per request, mitigate the cost.
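One way to picture the caching part is memoizing the lemma lookup so repeated tokens are resolved only once. This is a minimal sketch using the standard library's functools.lru_cache; the LEXICON table, the cached_lemma name, and the cache size are assumptions.

```python
from functools import lru_cache

# Toy lexicon (an assumption); a real lookup would be far more expensive.
LEXICON = {"jackets": "jacket", "shoes": "shoe"}

@lru_cache(maxsize=10_000)
def cached_lemma(token: str) -> str:
    """Resolve a token's lemma once; later calls with the same token hit the cache."""
    word = token.lower()
    return LEXICON.get(word, word)

for tok in ["Jackets", "shoes", "Jackets"]:
    cached_lemma(tok)

# cache_info() reports hits and misses, useful for sizing the cache.
print(cached_lemma.cache_info())
```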

Apply to every field? No. Leave SKUs, brand names (including their casing), and other exact-match keyword fields untouched.