What is Language Detection?
Language detection (lang-id) identifies the primary language of input text (and sometimes script/locale). It’s used to route queries and content to the proper tokenizer, stemmer, stopwords, and synonym sets so search behaves correctly per market.
How It Works (quick)
- Signals: Character n-grams, word lists, Unicode script, punctuation and diacritics.
- Models: Naive Bayes/SVM, fastText-style classifiers, or compact neural models (e.g., CLD3-like).
- Outputs: ISO codes (e.g., en, de, hu) plus a confidence score; optional multi-language tags.
- Routing: Query/content → detected language → select analyzer, synonyms, spell-check, and locale formatting.
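The signals, model, output, and routing steps above can be sketched in a few lines of Python. This is a toy illustration, not a production detector: it assumes a character-trigram Naive Bayes model with add-one smoothing, and the training sentences, the ANALYZERS mapping, and all function names are hypothetical. A real system would train on large corpora or use an off-the-shelf detector.

```python
import math
from collections import Counter

# Hypothetical, tiny training samples; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and waterproof running shoes for men",
    "de": "der schnelle braune fuchs springt über den faulen hund und wasserdichte laufschuhe für herren",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y zapatillas impermeables para hombre",
}

# Hypothetical mapping from language code to a search analyzer name.
ANALYZERS = {"en": "english", "de": "german", "es": "spanish"}


def char_ngrams(text, n=3):
    """Signal extraction: overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count profile per language."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def detect(text, profiles, n=3):
    """Return (language code, confidence) using smoothed log-likelihoods."""
    vocab = set().union(*profiles.values())
    log_scores = {}
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing so unseen n-grams do not zero out a language.
        log_scores[lang] = sum(
            math.log((counts[gram] + 1) / (total + len(vocab) + 1))
            for gram in char_ngrams(text, n)
        )
    # Normalize the log scores into a confidence in (0, 1].
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def route(query, profiles):
    """Routing: detected language selects the analyzer for this query."""
    lang, confidence = detect(query, profiles)
    return {"lang": lang, "confidence": confidence, "analyzer": ANALYZERS[lang]}


PROFILES = train(SAMPLES)
```

On this toy data, route("zapatillas trail impermeables 45", PROFILES) selects the Spanish analyzer. Because the confidence here comes from a model trained on three sentences, it is overconfident and should not be read as a calibrated probability.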
Why It Matters in E-commerce
- Correct matching: Stemming/lemmatization and stopwords must fit the language.
- Synonyms per locale: “Trainers” ↔ “sneakers” (en-GB vs en-US); brand casing preserved.
- Cross-border: Show the right currency/units and prefer local content.
- Evaluation: Segment zero-results/CTR by detected language to spot gaps.
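As a rough illustration of the evaluation point, the sketch below groups search-log rows by detected language and computes a zero-result rate per language. The row shape (a dict with "lang" and "result_count" keys) and the function name are assumptions made for the example, not a real log schema.

```python
from collections import defaultdict


def zero_result_rate_by_language(log_rows):
    """Return {language: share of searches that returned no results}."""
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for row in log_rows:
        lang = row["lang"]
        totals[lang] += 1
        if row["result_count"] == 0:
            zeros[lang] += 1
    return {lang: zeros[lang] / totals[lang] for lang in totals}


# Made-up rows purely to show the call shape.
example_rows = [
    {"lang": "en", "result_count": 12},
    {"lang": "en", "result_count": 3},
    {"lang": "hu", "result_count": 0},
    {"lang": "hu", "result_count": 5},
]
rates = zero_result_rate_by_language(example_rows)
```

A language whose rate sits well above the others is a candidate for missing synonyms, a wrong analyzer, or systematic misdetection.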
Best Practices
- Confidence thresholds with fallbacks (user profile, domain/locale, Accept-Language).
- Short text handling: Use query history/session signals for 1–2 word inputs.
- Code-switching: Allow multiple tags; prefer the language of key entities (brand/model often untranslatable).
- Caching: Memoize frequent queries; avoid per-keystroke model calls.
- QA: Track misdetections, especially between close pairs (pt↔es, cs↔sk).
- Governance: Store detected language on documents at ingest time; re-run on major copy updates.
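The threshold, fallback, and caching practices above could look roughly like the following. Everything here is an assumption for illustration: the 0.80 threshold, the fallback order (user profile, then site locale, then Accept-Language), the function names, and slow_detect, which is a stub standing in for a real model call.

```python
from functools import lru_cache

CONFIDENCE_THRESHOLD = 0.80  # assumed value; tune against labeled queries
MODEL_CALLS = {"count": 0}   # only here to make the caching visible


def parse_accept_language(header):
    """Return primary language subtags from an Accept-Language header, best first."""
    ranked = []
    for part in header.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip()
        if not tag or tag == "*":
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        ranked.append((quality, tag.split("-")[0].lower()))
    ranked.sort(key=lambda item: -item[0])  # stable sort keeps header order on ties
    return [lang for _, lang in ranked]


def resolve_language(detected, confidence, profile_lang=None,
                     site_locale=None, accept_language="", default="en"):
    """Trust the detector above the threshold, otherwise walk the fallbacks."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return detected, "detector"
    header_langs = parse_accept_language(accept_language)
    fallbacks = (
        ("profile", profile_lang),
        ("site_locale", site_locale),
        ("accept_language", header_langs[0] if header_langs else None),
    )
    for source, candidate in fallbacks:
        if candidate:
            return candidate, source
    return default, "default"


def slow_detect(query):
    """Hypothetical stand-in for an expensive model call."""
    MODEL_CALLS["count"] += 1
    return ("es", 0.95) if "zapatillas" in query else ("en", 0.40)


@lru_cache(maxsize=10_000)
def cached_detect(normalized_query):
    """Memoize detection so repeated queries skip the model."""
    return slow_detect(normalized_query)


def detect_for_search(query, **fallback_signals):
    # Normalize case and whitespace so near-identical queries share a cache entry.
    normalized = " ".join(query.lower().split())
    lang, confidence = cached_detect(normalized)
    return resolve_language(lang, confidence, **fallback_signals)
```

Only the detection result is cached; the fallback signals are resolved per request, since the same short query can come from users in different markets.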
Challenges
- Very short queries, brand names, emojis; languages with shared scripts; mixed-language product titles.
Examples
- Query “zapatillas trail impermeables 45” → route to es analyzer, apply Spanish synonyms, return local product detail pages (PDPs).
- PDP copy in de but specs in en → dual-tag fields; search works in both languages.
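The dual-tag example can be sketched as per-field detection at ingest time. The field names, the field_langs and langs keys, the 0.8 cutoff, and stub_detect are all hypothetical; any detector that returns a (language, confidence) pair could be passed in instead.

```python
def tag_fields(doc, detect, min_conf=0.8, text_fields=("description", "specs")):
    """Detect a language per text field and store the tags on a copy of the document."""
    field_langs = {}
    for name in text_fields:
        text = doc.get(name)
        if not text:
            continue
        lang, confidence = detect(text)
        if confidence >= min_conf:
            field_langs[name] = lang
    tagged = dict(doc)
    tagged["field_langs"] = field_langs                   # per-field analyzer choice
    tagged["langs"] = sorted(set(field_langs.values()))   # document-level multi-tag
    return tagged


def stub_detect(text):
    """Hypothetical detector stub used only to make the example runnable."""
    return ("de", 0.97) if "wasserdicht" in text.lower() else ("en", 0.93)


product = {
    "sku": "TR-45",
    "description": "Wasserdichter Trailschuh mit griffiger Sohle",
    "specs": "Upper: mesh. Drop: 8 mm. Weight: 310 g.",
}
tagged_product = tag_fields(product, stub_detect)
```

Fields below the confidence cutoff are left untagged rather than guessed, so they can fall back to a default analyzer or be queued for review.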
Summary
Language detection picks the right linguistic and locale settings so search feels native everywhere. Use confidences, fallbacks, and multi-tag support to stay robust on short and mixed inputs.