GLOSSARY

Corpus

A corpus is a large, organized collection of text or data. Online stores use corpora such as product copy, reviews, and help articles to train and evaluate search and NLP models.

What is a Corpus?

A corpus (plural: corpora) is a curated dataset—usually text, sometimes multimodal—compiled for training, tuning, or evaluating search and language systems. In e-commerce, typical corpora include product titles/descriptions, attributes, FAQs, policies, chats, and reviews.

How It Works (quick)

  • Sourcing: Export from CMS/PIM, help center, reviews, chat logs (with consent).
  • Cleaning & normalization: Deduplicate, fix encodings, strip boilerplate, standardize attributes and units.
  • Annotations (optional): Labels for entities, attributes, intents, or relevance judgments.
  • Splits: Train/validation/test; use time-based splits (train on older data, test on newer data) to detect drift.
  • Governance: Track licenses/consents, PII redaction, versioning, and lineage.
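The cleaning and split steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the record layout (a "text" field and a "created" date), the helper names, and the sample records are all assumptions made for the example.

```python
import html
import re
import unicodedata
from datetime import date

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher, good enough for a sketch
WS_RE = re.compile(r"\s+")

def clean_text(raw: str) -> str:
    """Strip HTML tags and entities, normalize Unicode, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)                  # "&nbsp;" -> non-breaking space, etc.
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    return WS_RE.sub(" ", text).strip()

def dedupe(records):
    """Keep the first record for each case-insensitive text value."""
    seen = set()
    unique = []
    for rec in records:
        key = rec["text"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def time_split(records, cutoff: date):
    """Train on records created before the cutoff, test on the rest."""
    train = [r for r in records if r["created"] < cutoff]
    test = [r for r in records if r["created"] >= cutoff]
    return train, test

# Hypothetical sample records, for illustration only.
raw_records = [
    {"text": "<p>Cotton&nbsp;T-Shirt</p>", "created": date(2023, 1, 10)},
    {"text": "cotton t-shirt", "created": date(2023, 2, 1)},
    {"text": "Wool   Scarf", "created": date(2024, 3, 5)},
]
cleaned = [{**r, "text": clean_text(r["text"])} for r in raw_records]
unique = dedupe(cleaned)
train, test = time_split(unique, cutoff=date(2024, 1, 1))
```

Exact-match deduplication only catches identical copies after normalization; near-duplicate vendor copy usually needs fuzzy matching, which is outside this sketch.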

Why it Matters in E-commerce

  • Better models: High-quality corpora improve tokenization, synonyms, embeddings, and attribute extraction.
  • Evaluation you trust: Realistic test sets gauge recall/precision and business KPIs.
  • Multilingual scale: Per-locale corpora enable strong international search.

Best Practices

  • Representativeness: Include head + long-tail queries/content; cover seasonal items.
  • PII hygiene: Remove personal data; aggregate where possible.
  • Balanced labels: Avoid bias toward a few brands/categories.
  • Version & monitor: Number each corpus release (v1, v2, …) and record which corpus version each model version was trained and evaluated on.
  • Legal: Respect content rights; document third-party sources.
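Two of the practices above, PII hygiene and versioning, lend themselves to a short sketch. The regular expressions below are deliberately simple assumptions: they will miss some personal data and may also mask non-personal digit runs such as order numbers, so treat them as a starting point rather than a complete redaction scheme. The manifest fields and version names are hypothetical.

```python
import hashlib
import re

# Simplified patterns, assumed for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def corpus_fingerprint(docs) -> str:
    """Order-independent SHA-256 over all documents, usable as a version ID."""
    digest = hashlib.sha256()
    for doc in sorted(docs):
        digest.update(doc.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()

docs = [redact_pii("Contact jane.doe@example.com or +1 (555) 123-4567 about my order.")]

# Hypothetical manifest linking a model version to the corpus it was built on.
manifest = {
    "corpus_version": "v2",
    "corpus_sha256": corpus_fingerprint(docs),
    "model_version": "search-ranker-v5",
}
```

Storing the fingerprint next to the model version makes it possible to tell later whether two models were trained on the same corpus contents.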

Challenges

  • Drift: New brands/terms appear and old ones fade, so refresh the corpus regularly.
  • Noise: Vendor copy/paste, HTML artifacts, duplicates.
  • Imbalance: Over-represented categories distort learning.
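Drift and imbalance can both be monitored with simple counts. The sketch below assumes whitespace tokenization and uses the out-of-vocabulary rate as one possible drift signal; the function names and sample data are illustrative assumptions, not a standard API.

```python
from collections import Counter

def tokenize(text: str):
    """Naive lowercase whitespace tokenizer, assumed for this sketch."""
    return text.lower().split()

def oov_rate(reference_docs, new_docs) -> float:
    """Share of tokens in new_docs that never occur in reference_docs."""
    vocab = {tok for doc in reference_docs for tok in tokenize(doc)}
    new_tokens = [tok for doc in new_docs for tok in tokenize(doc)]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for tok in new_tokens if tok not in vocab)
    return unseen / len(new_tokens)

def category_shares(categories):
    """Fraction of records per category, to spot over-represented ones."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Hypothetical data: last season's product titles versus this season's.
reference = ["red cotton shirt", "blue denim jeans"]
recent = ["red linen shirt", "recycled denim jacket"]
drift = oov_rate(reference, recent)

shares = category_shares(["shoes", "shoes", "shoes", "bags"])
```

A rising out-of-vocabulary rate between corpus versions suggests new brands or terms are arriving; a category share far above its share of the catalog or of traffic suggests the corpus needs rebalancing.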

Examples

  • Query–result relevance set for NDCG/MRR evaluation.
  • Attribute extraction corpus with labeled brand/material/use-case.
  • Multilingual product corpus for stemming/synonyms per market.
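For the relevance-set example, the two metrics named above can be computed as follows. This is a minimal sketch assuming graded relevance labels listed in ranked order, linear gain for NDCG, and an ideal ranking built from the same judged results; other NDCG variants (for example exponential gain) give different numbers.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank of the first relevant result across queries."""
    if not ranked_relevance_lists:
        return 0.0
    total = 0.0
    for rels in ranked_relevance_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance_lists)

# Hypothetical judgments: one list of relevance grades per query, in ranked order.
judgments = [[0, 1, 0], [1, 0, 0]]
score_mrr = mrr(judgments)          # first hit at rank 2, then rank 1
score_ndcg = ndcg([3, 2, 1], k=3)   # already in ideal order
```

A query with no relevant result contributes zero to MRR, and a query with no positive labels gets an NDCG of zero here by convention.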

Summary

A well-governed corpus is the foundation for trustworthy search and NLP. Curate, clean, and version it—with PII safeguards—to power accurate, multilingual discovery and reliable evaluation.

FAQ

Is a dataset the same as a corpus?

Not exactly. A corpus is one kind of dataset: typically text-centric and curated for language/search tasks. Every corpus is a dataset, but not every dataset is a corpus.

Do I need human labels?

For evaluation and extraction tasks, yes. For unsupervised pretraining, not necessarily.

How big should it be?

Enough to reflect your catalog and languages; prioritize quality and coverage over sheer size.