GLOSSARY

Computational Linguistics

Computational linguistics studies how computers can process, analyze, and interpret human language. In e-commerce search, it underpins tokenization, morphological analysis, parsing, and semantic interpretation, so that results match what shoppers intend.

What is Computational Linguistics?

Computational linguistics (CL) is the discipline at the intersection of linguistics and computer science that builds algorithms and models for language. It spans core tasks—tokenization, morphology, parsing, semantics, pragmatics—and modern methods from probabilistic models to transformers.

How It Works (quick)

  • Text processing: Tokenization, sentence splitting, diacritics/case folding; handling hyphens and compounds.
  • Morphology & lexicon: Lemmatization/stemming; handling inflection/derivation and irregular forms.
  • Parsing & syntax: Part-of-speech tagging and dependency/constituency parsing for structure.
  • Semantics: NER, entity linking, word sense, embeddings, and intent classification.
  • Multilingual: Locale-aware analyzers, transliteration, and cross-lingual representations.
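As an illustration of the text-processing step, here is a minimal sketch of case folding, diacritic folding, and hyphen-aware tokenization using only the Python standard library. The function names (fold, tokenize) and the split_hyphens flag are hypothetical, chosen for this example; a production analyzer would normally be locale-specific.

```python
import re
import unicodedata

def fold(text: str) -> str:
    """Case-fold, then strip combining marks left by NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Runs of letters/digits, optionally joined by single hyphens ("t-shirt").
TOKEN_RE = re.compile(r"[^\W_]+(?:-[^\W_]+)*")

def tokenize(text: str, split_hyphens: bool = False) -> list[str]:
    """Tokenize folded text; optionally break hyphenated tokens into parts."""
    tokens = TOKEN_RE.findall(fold(text))
    if split_hyphens:
        tokens = [part for tok in tokens for part in tok.split("-")]
    return tokens

print(tokenize("Café T-Shirt, Größe XL"))
print(tokenize("Café T-Shirt, Größe XL", split_hyphens=True))
```

Note that this kind of folding only removes marks that Unicode can decompose. Letters such as "ł" or "ø" have no decomposition and would need an explicit mapping, which is one reason per-locale analyzers are preferred.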

Why It Matters in E-commerce

  • Better query understanding: Handle typos, spelling variants, and inflected forms, which matters most in highly inflected languages such as Czech, Polish, or Finnish.
  • Attribute extraction: Pull brand, material, color, use-case from titles/descriptions for facets.
  • Search relevance: Improve recall/precision with stemming/lemmatization and synonyms.
  • Global catalogs: Consistent results across languages and scripts.

Best Practices

  • Locale-specific analyzers: Tailor tokenization and morphology to each language; avoid a single one-size-fits-all analyzer.
  • Controlled vocabularies: Canonical attributes + synonym lists; align with taxonomy.
  • Hybrid retrieval: Lexical (BM25) plus semantic embeddings; use CL to bridge vocabulary gaps.
  • Quality loop: Evaluate per language with NDCG/CTR and error analysis; retrain on fresh data.
  • Edge cases: Hyphenation (“t-shirt” vs “tee shirt”), compounds, accents, brand casing.
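The hybrid retrieval practice needs some way to combine a lexical ranking with a semantic one. This glossary does not prescribe a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one commonly used option. The document IDs are made up, and k = 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the same query from two retrievers.
lexical = ["sku1", "sku2", "sku3"]    # e.g. BM25 order
semantic = ["sku3", "sku1", "sku4"]   # e.g. embedding-similarity order

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

RRF works on ranks rather than raw scores, so BM25 scores and embedding similarities do not need to be normalized to a common scale before fusing.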

Challenges

  • Ambiguity: Polysemy (e.g., apple fruit vs brand) and compound segmentation.
  • Data sparsity: Low-resource languages and long-tail terms.
  • Performance: NLP pipelines must meet storefront latency targets.
  • Maintenance: Dictionaries/synonyms drift; models need periodic updates.

Examples

  • Attribute tagging: Extract material (GORE-TEX), use-case (trail), gender (men), and size (45) from product titles to power filters.
  • Multilingual analyzers: Handle diacritics and stemming for Central/Eastern European locales.
  • Query rewrite: Normalize “gtx” ↔ “GORE-TEX”; expand sofa ↔ couch.
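The attribute tagging and query rewrite examples can be sketched with a small dictionary-based approach. Everything here is illustrative: the vocabulary, the synonym table, the sample title, and the assumption that a bare number between 35 and 50 is an EU shoe size are all hypothetical, and a real system would typically use a maintained taxonomy or a trained tagger.

```python
import re

# Hypothetical controlled vocabulary: surface form -> (attribute, canonical value).
VOCAB = {
    "gore-tex": ("material", "GORE-TEX"),
    "gtx": ("material", "GORE-TEX"),
    "trail": ("use_case", "trail"),
    "men": ("gender", "men"),
    "mens": ("gender", "men"),
    "men's": ("gender", "men"),
}
# Hypothetical two-way synonyms used for query expansion.
SYNONYMS = {"sofa": {"couch"}, "couch": {"sofa"}}

# Lowercase ASCII tokens, keeping internal hyphens and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def tag_attributes(title: str) -> dict[str, str]:
    """Map title tokens to canonical attribute values."""
    attrs: dict[str, str] = {}
    for tok in TOKEN_RE.findall(title.lower()):
        if tok in VOCAB:
            name, value = VOCAB[tok]
            attrs[name] = value
        elif tok.isdigit() and 35 <= int(tok) <= 50:
            attrs["size"] = tok  # assumption: bare number is an EU shoe size
    return attrs

def rewrite_query(query: str) -> list[list[str]]:
    """Return alternatives per token (OR within a list, AND across lists)."""
    rewritten = []
    for tok in TOKEN_RE.findall(query.lower()):
        canonical = VOCAB[tok][1].lower() if tok in VOCAB else tok
        alternatives = {canonical} | SYNONYMS.get(canonical, set())
        rewritten.append(sorted(alternatives))
    return rewritten

print(tag_attributes("Men's GORE-TEX Trail Running Shoe 45"))
print(rewrite_query("GTX sofa"))
```

Normalizing "gtx" to one canonical form while expanding "sofa" to both variants reflects the difference between an abbreviation of a single attribute value and a true synonym pair.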

Summary

Computational linguistics gives search the building blocks to process, analyze, and understand language, supporting accurate, multilingual e-commerce discovery from the query to the product detail page (PDP).

FAQ

CL vs NLP—what’s the difference?

NLP is the practical application; CL also studies the linguistic theory and algorithms behind those applications.

Do I still need BM25 with transformers?

Yes. Keep BM25 for lexical recall of exact terms, and add embeddings and re-ranking to capture semantic similarity.

How do I measure impact?

Track zero-result rate, NDCG, CTR, and conversion, broken down by language and category.
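For reference, NDCG@k can be computed from graded relevance judgments as in the sketch below. It assumes linear gain (some teams use 2^rel - 1 instead) and that the judged list contains every relevant item for the query, so the ideal ordering can be derived by sorting the same list.

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of returned results, in ranked order (hypothetical data).
print(ndcg([3, 2, 1], k=3))   # already ideally ordered
print(ndcg([0, 1], k=2))      # the only relevant item is ranked second
```

Computing this per language and category, as suggested above, helps show whether an analyzer change helps one locale while hurting another.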

What about typos and diacritics?

Enable fuzzy matching and diacritic folding while preserving exact matches for titles/brands.
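One way to read this advice is as a tiered lookup: try an exact match first, then a diacritic-folded match, and only then a fuzzy match. The sketch below illustrates that order with the standard library. The match_term function, the sample brand list, and the 0.8 cutoff are assumptions for this example, and difflib's similarity ratio stands in for the edit-distance matching a search engine would typically provide.

```python
import difflib
import unicodedata
from typing import Optional

def fold(text: str) -> str:
    """Case-fold and strip combining diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_term(query: str, vocabulary: list[str],
               cutoff: float = 0.8) -> Optional[tuple[str, str]]:
    """Return (matched term, tier), preferring exact > folded > fuzzy."""
    if query in vocabulary:
        return query, "exact"
    # Index vocabulary by folded form; the first term wins on collisions.
    folded: dict[str, str] = {}
    for term in vocabulary:
        folded.setdefault(fold(term), term)
    key = fold(query)
    if key in folded:
        return folded[key], "folded"
    close = difflib.get_close_matches(key, list(folded), n=1, cutoff=cutoff)
    if close:
        return folded[close[0]], "fuzzy"
    return None

brands = ["Lancôme", "Nike", "Adidas"]  # hypothetical brand vocabulary
for q in ["Nike", "lancome", "addidas", "xyz"]:
    print(q, "->", match_term(q, brands))
```

Returning the tier alongside the match lets ranking boost exact hits over folded or fuzzy ones, so that loosened matching does not displace exact title or brand matches.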