GLOSSARY

Language Detection

Language detection identifies the language a piece of text is written in. Stores use it to pick the right analyzer and synonyms for each market and, combined with locale signals, the right currency and units.

What is Language Detection?

Language detection (lang-id) identifies the primary language of input text (and sometimes script/locale). It’s used to route queries and content to the proper tokenizer, stemmer, stopwords, and synonym sets so search behaves correctly per market.

How It Works (quick)

  • Signals: Character n-grams, word lists, Unicode script, punctuation and diacritics.
  • Models: Naive Bayes/SVM, fastText-style classifiers, or compact neural models (e.g., CLD3-like).
  • Outputs: ISO codes (e.g., en, de, hu) plus a confidence score; optional multi-language tags.
  • Routing: Query/content → detected language → select analyzer, synonyms, spell-check, and locale formatting.
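
The signal, model, output, and routing steps above can be sketched in a few lines. This is a minimal illustration, not a production detector: it scores character trigrams with an add-one-smoothed Naive Bayes model trained on one toy sentence per language. The sample sentences, the analyzer names, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy training text per language (assumption: real systems train on large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and runs to the shoes shop",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y corre a la tienda de zapatillas",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft zum schuhgeschäft",
}

# Hypothetical analyzer names keyed by ISO 639-1 code.
ANALYZERS = {"en": "english", "es": "spanish", "de": "german"}


def trigrams(text):
    """Character 3-grams over lowercased text, padded so word edges count."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one trigram frequency profile per language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def detect(text, profiles):
    """Return (language, confidence) from add-one-smoothed log-likelihoods."""
    vocab_size = len(set().union(*profiles.values()))
    grams = trigrams(text)
    scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
        )
    best = max(scores, key=scores.get)
    # Softmax over the log-scores gives a rough, uncalibrated confidence.
    norm = sum(math.exp(s - scores[best]) for s in scores.values())
    return best, 1.0 / norm


def route(text, profiles):
    """Detect the language, then pick the matching analyzer."""
    lang, confidence = detect(text, profiles)
    return {"lang": lang, "confidence": confidence, "analyzer": ANALYZERS[lang]}


PROFILES = train(SAMPLES)
```

On this toy data, route("zapatillas trail impermeables 45", PROFILES) selects the es analyzer. The confidence value is only a relative score between the three profiles and would need calibration before being compared against a threshold.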

Why It Matters in E-commerce

  • Correct matching: Stemming/lemmatization and stopwords must fit the language.
  • Synonyms per locale: “Trainers” ↔ “sneakers” (en-GB vs en-US); brand casing preserved.
  • Cross-border: Pair the detected language with locale signals to show the right currency/units and prefer local content.
  • Evaluation: Segment zero-results/CTR by detected language to spot gaps.
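
The evaluation point can be made concrete with a small aggregation over a search log. This is a sketch under assumptions: the log row shape (a dict with "lang" and "result_count" keys) and the function name are invented for illustration.

```python
from collections import defaultdict


def zero_result_rate_by_language(log_rows):
    """Return {language: share of queries that returned no results}."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for row in log_rows:
        lang = row["lang"]
        totals[lang] += 1
        if row["result_count"] == 0:
            zeros[lang] += 1
    return {lang: zeros[lang] / totals[lang] for lang in totals}


# Hypothetical log sample: each row is one query with its detected language.
SAMPLE_LOG = [
    {"lang": "en", "result_count": 12},
    {"lang": "en", "result_count": 0},
    {"lang": "hu", "result_count": 0},
    {"lang": "hu", "result_count": 0},
]
```

A language whose rate is far above the others is a candidate for missing synonyms, a wrong analyzer, or systematic misdetection.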

Best Practices

  • Confidence thresholds with fallbacks (user profile, domain/locale, Accept-Language).
  • Short text handling: Use query history/session signals for 1–2 word inputs.
  • Code-switching: Allow multiple tags; prefer the language of key entities (brand/model often untranslatable).
  • Caching: Memoize frequent queries; avoid per-keystroke model calls.
  • QA: Track misdetections, especially between close pairs (pt↔es, cs↔sk).
  • Governance: Store detected language on documents at ingest time; re-run on major copy updates.
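
A minimal sketch of the threshold, fallback, and caching practices above. The 0.80 threshold, the supported-language set, the default language, and all function names are assumptions chosen for illustration; the Accept-Language parsing is simplified and ignores edge cases such as malformed parameters beyond q-values.

```python
from functools import lru_cache

MIN_CONFIDENCE = 0.80           # assumed threshold; tune against labelled queries
SUPPORTED = {"en", "es", "de"}  # assumed set of languages with analyzers
DEFAULT_LANG = "en"             # assumed last-resort default


def parse_accept_language(header):
    """Return primary language subtags ordered by q-value, highest first."""
    ranked = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        ranked.append((quality, tag.strip().split("-")[0].lower()))
    ranked.sort(key=lambda item: item[0], reverse=True)  # stable for ties
    return [lang for _, lang in ranked]


def resolve_language(detected, confidence, profile_lang=None,
                     site_lang=None, accept_language=""):
    """Trust the detector above the threshold, otherwise walk the fallbacks."""
    if detected in SUPPORTED and confidence >= MIN_CONFIDENCE:
        return detected
    fallbacks = [profile_lang, site_lang, *parse_accept_language(accept_language)]
    for candidate in fallbacks:
        if candidate in SUPPORTED:
            return candidate
    return DEFAULT_LANG


def make_cached_detector(detect, maxsize=10_000):
    """Memoize a detect(query) callable on the whitespace/case-normalized query."""
    cached = lru_cache(maxsize=maxsize)(detect)

    def cached_detect(query):
        return cached(" ".join(query.lower().split()))

    return cached_detect
```

For example, resolve_language("pt", 0.4, accept_language="de-DE,de;q=0.9,en;q=0.8") falls through to de, because the detection is below the threshold and no profile or site language is given. Normalizing the query before the cache lookup lets "Nike  Air" and "nike air" share one entry.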

Challenges

  • Very short queries, brand names, emojis; languages with shared scripts; mixed-language product titles.

Examples

  • Query “zapatillas trail impermeables 45” → route to es analyzer, apply Spanish synonyms, return local PDPs.
  • PDP copy in de but specs in en → dual-tag fields; search works in both languages.
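
The dual-tag example can be sketched as a per-field indexing plan. The field names, analyzer names, and the index_plan function are hypothetical; the per-field language codes are assumed to come from a detector run at ingest time.

```python
# Hypothetical analyzer names keyed by ISO 639-1 code.
ANALYZERS = {"de": "german", "en": "english"}


def index_plan(field_langs):
    """Map each field to its analyzer and collect document-level language tags."""
    field_analyzers = {field: ANALYZERS[lang] for field, lang in field_langs.items()}
    languages = sorted(set(field_langs.values()))
    return {"field_analyzers": field_analyzers, "languages": languages}


# PDP with German copy and English specs, as in the example above.
PDP_FIELD_LANGS = {"description": "de", "specs": "en"}
```

Here index_plan(PDP_FIELD_LANGS) analyzes the description with the German analyzer and the specs with the English one, and tags the document with both de and en so it is eligible for queries in either language.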

Summary

Language detection picks the right linguistic and locale settings so search feels native everywhere. Use confidence thresholds, fallbacks, and multi-tag support to stay robust on short and mixed inputs.

FAQ

Language vs locale? Language is what (en, de); locale adds where/how (en-US vs en-GB for units/currency).
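
To illustrate the split, a locale tag such as en-GB can be broken into a language part (which drives the analyzer) and a region part (which drives currency). This is a simplified sketch: it handles only two-letter regions, and the function and table names are assumptions.

```python
# Region-to-currency table (ISO 4217 codes); a small illustrative subset.
CURRENCY_BY_REGION = {"US": "USD", "GB": "GBP", "DE": "EUR"}


def split_locale(tag):
    """Split a tag like 'en-GB' into (language, region); region may be None."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    # Take the first two-letter alphabetic subtag after the language as the
    # region, which skips four-letter script subtags such as 'Hant'.
    region = next(
        (p.upper() for p in parts[1:] if len(p) == 2 and p.isalpha()), None
    )
    return language, region


def currency_for(tag):
    """Return the currency for the tag's region, or None if unknown."""
    _, region = split_locale(tag)
    return CURRENCY_BY_REGION.get(region)
```

Both en-US and en-GB share the en analyzer but resolve to different currencies, which is why detected language alone is not enough to pick locale formatting.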

Do I need ML? Usually, yes. Statistical models are more reliable at scale; rules alone (word lists, script checks) break down on short, messy inputs.