Corpus

A corpus is a large, organized collection of text or other data. Online stores use corpora such as product copy, reviews, and help articles to train and evaluate search and NLP models.

What is a Corpus?

A corpus (plural: corpora) is a curated dataset, usually text and sometimes multimodal, compiled for training, tuning, or evaluating search and language systems. In e-commerce, typical corpora include product titles and descriptions, attributes, FAQs, policies, chats, and reviews.

How It Works (quick)

Sourcing: Export from the CMS/PIM, help center, reviews, and chat logs (with consent).
Cleaning & normalization: Deduplicate, fix encodings, strip boilerplate, standardize attributes and units.
Annotations (optional): Labels for entities, attributes, intents, or relevance judgments.
Splits: Divide the corpus into train/validation/test sets; prefer time-based splits (train on older data, test on newer) so that drift shows up in evaluation.
Governance: Track licenses/consents, PII redaction, versioning, and lineage.
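
The cleaning and splitting steps above can be sketched roughly as follows. This is a minimal illustration rather than a production pipeline: the record fields ("text", "created"), the helper names, and the sample data are all assumptions made for the example, and real catalogs usually need more careful boilerplate removal and near-duplicate detection.

```python
import html
import re
import unicodedata
from datetime import date


def clean_text(raw: str) -> str:
    """Unescape HTML entities, drop tags, normalize Unicode and whitespace."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)        # crude tag stripping (assumption)
    text = unicodedata.normalize("NFKC", text)  # folds e.g. non-breaking spaces
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records):
    """Keep the first record for each case-insensitive cleaned text."""
    seen = set()
    unique = []
    for rec in records:
        cleaned = clean_text(rec["text"])
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append({**rec, "text": cleaned})
    return unique


def time_split(records, cutoff):
    """Records dated before the cutoff go to train, the rest to test."""
    train = [r for r in records if r["created"] < cutoff]
    test = [r for r in records if r["created"] >= cutoff]
    return train, test


# Hypothetical sample records, for illustration only.
records = [
    {"text": "<p>Cotton&nbsp;T-Shirt</p>", "created": date(2023, 5, 1)},
    {"text": "cotton t-shirt", "created": date(2023, 6, 1)},
    {"text": "Wool  Scarf", "created": date(2024, 2, 1)},
]
corpus = deduplicate(records)
train, test = time_split(corpus, cutoff=date(2024, 1, 1))
```

Splitting by date rather than at random keeps newer items out of training, so the test set behaves more like the content the model will meet after launch.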

Why it Matters in E-commerce

Better models: High-quality corpora improve tokenization, synonyms, embeddings, and attribute extraction.
Evaluation you trust: Realistic test sets gauge recall/precision and business KPIs.
Multilingual scale: Per-locale corpora enable strong international search.

Best Practices

Representativeness: Include both head and long-tail queries and content; cover seasonal items.
PII hygiene: Remove personal data; aggregate where possible.
Balanced labels: Avoid bias toward a few brands/categories.
Version & monitor: Number each corpus release (v1, v2, and so on) and record which corpus version each model was trained and evaluated on.
Legal: Respect content rights; document third-party sources.
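
Two of these practices, PII hygiene and versioning, lend themselves to a short sketch. The regular expressions below are deliberately rough assumptions (they will miss some personal data and may over-match things like long order numbers), and the function names are hypothetical; a real deployment would typically add a dedicated PII detection step and human review.

```python
import hashlib
import re

# Rough patterns for illustration only; not exhaustive PII detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def corpus_fingerprint(docs) -> str:
    """Short content hash that changes whenever any document changes.

    Documents are sorted first, so the fingerprint ignores ordering.
    """
    digest = hashlib.sha256()
    for doc in sorted(docs):
        digest.update(doc.encode("utf-8"))
        digest.update(b"\x00")  # separator so document boundaries matter
    return digest.hexdigest()[:12]


redacted = redact_pii("Mail jane.doe@example.com or call +1 (555) 123-4567.")
version_id = corpus_fingerprint(["Cotton T-Shirt", "Wool Scarf"])
```

Storing a fingerprint like this next to each trained model is one simple way to tie a model version back to the exact corpus contents it saw.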

Challenges

Drift: New brands and terms appear while old ones fade, so refresh the corpus regularly.
Noise: Vendor copy/paste, HTML artifacts, duplicates.
Imbalance: Over-represented categories distort learning.
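
Imbalance in particular is easy to check before training. The sketch below assumes each document carries a single category label and flags categories above a chosen share of the corpus; the 50% threshold and the sample labels are arbitrary assumptions for illustration.

```python
from collections import Counter


def category_shares(labels):
    """Fraction of the corpus that falls in each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def over_represented(labels, max_share=0.5):
    """Categories whose share of the corpus exceeds max_share."""
    shares = category_shares(labels)
    return sorted(cat for cat, share in shares.items() if share > max_share)


# Hypothetical label distribution: 6 shoes, 3 bags, 1 hat.
labels = ["shoes"] * 6 + ["bags"] * 3 + ["hats"] * 1
flagged = over_represented(labels)
```

Flagged categories can then be down-sampled, or under-represented ones topped up, before the corpus is used for training.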

Examples

Query–result relevance set for NDCG/MRR evaluation.
Attribute extraction corpus with labeled brand/material/use-case.
Multilingual product corpus for stemming/synonyms per market.
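
For the relevance set, the two metrics named above can be computed as in the following sketch. It assumes graded relevance judgments listed in ranked order, uses the linear-gain form of DCG, and computes the ideal ranking only from the judged results that were returned; other conventions exist (for example exponential gain), so treat this as one reasonable variant rather than the definitive formula.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same judgments in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_lists):
    """Average of 1/rank of the first relevant result for each query."""
    total = 0.0
    for rels in ranked_lists:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_lists) if ranked_lists else 0.0


# Each list holds the judged relevance of one query's results, in ranked order.
perfect = ndcg_at_k([3, 2, 1], k=3)
swapped = ndcg_at_k([0, 1], k=2)
mrr = mean_reciprocal_rank([[0, 1], [1, 0]])
```

NDCG rewards placing the most relevant items first across the whole top-k, while MRR only looks at the position of the first relevant result, so the two are often reported together.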

Summary

A well-governed corpus is the foundation for trustworthy search and NLP. Curate, clean, and version it, with PII safeguards in place, to power accurate, multilingual discovery and reliable evaluation.

FAQ

Is a dataset the same as a corpus?

Not exactly. A corpus is one kind of dataset: typically text-centric and curated for language or search tasks.

Do I need human labels?

For evaluation and extraction tasks, yes. For unsupervised pretraining, not necessarily.

How big should it be?

Enough to reflect your catalog and languages; prioritize quality and coverage over sheer size.