A corpus is a large, organized collection of text or data. Online stores use corpora such as product copy, reviews, and help articles to train and evaluate search and NLP systems.
A corpus (plural: corpora) is a curated dataset, usually text and sometimes multimodal, compiled for training, tuning, or evaluating search and language systems. In e-commerce, typical corpora include product titles and descriptions, attributes, FAQs, policies, chat transcripts, and reviews.
A well-governed corpus is the foundation for trustworthy search and NLP. Curate, clean, and version it, with safeguards for personally identifiable information (PII), to power accurate multilingual discovery and reliable evaluation.
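The curate, clean, and version steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function names (redact_pii, clean, build_corpus, corpus_version) are hypothetical, and the two regular expressions are rough assumptions that catch only simple email addresses and phone numbers. Real PII handling usually needs a dedicated detection tool and human review.

```python
import hashlib
import re

# Illustrative patterns only (an assumption): they catch simple emails and
# phone numbers and will miss many other kinds of PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def clean(text: str) -> str:
    """Redact PII, then collapse runs of whitespace into single spaces."""
    return " ".join(redact_pii(text).split())


def build_corpus(raw_docs):
    """Clean each document, drop empties and case-insensitive duplicates."""
    seen = set()
    docs = []
    for raw in raw_docs:
        doc = clean(raw)
        key = doc.casefold()
        if doc and key not in seen:
            seen.add(key)
            docs.append(doc)
    return docs


def corpus_version(docs) -> str:
    """Derive a short, content-based version ID from the cleaned documents."""
    digest = hashlib.sha256()
    for doc in docs:
        digest.update(doc.encode("utf-8"))
        digest.update(b"\n")  # separator so document boundaries affect the hash
    return digest.hexdigest()[:12]


raw = [
    "Great  shoes! Contact me at jane@example.com",
    "great shoes! contact me at jane@example.com",  # duplicate after cleaning
    "",
    "Call +1 (555) 123-4567 for returns",
]
docs = build_corpus(raw)
print(docs)
print(corpus_version(docs))
```

Because the version ID is derived from the content, any change to the cleaned corpus yields a new ID, which makes it possible to tie an evaluation result to the exact corpus it was run against.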
Is a dataset the same as a corpus?
Not exactly. Every corpus is a dataset, but the term usually implies a text-centric collection curated for language or search tasks.
Do I need human labels?
For evaluation and for supervised tasks such as extraction, yes. For unsupervised pretraining, not necessarily.
How big should it be?
Large enough to reflect your catalog and the languages you serve; prioritize quality and coverage over sheer size.
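One way to make "coverage" concrete is to count documents per language and category and flag the combinations that fall short. The sketch below assumes each document is a dict with hypothetical "lang" and "category" fields, and the min_docs threshold is an arbitrary placeholder rather than a recommended value.

```python
from collections import Counter


def coverage_gaps(docs, expected_languages, expected_categories, min_docs=50):
    """Return {(language, category): count} for under-represented pairs.

    The "lang" and "category" keys are assumed field names for this sketch.
    """
    counts = Counter((d["lang"], d["category"]) for d in docs)
    return {
        (lang, cat): counts[(lang, cat)]
        for lang in expected_languages
        for cat in expected_categories
        if counts[(lang, cat)] < min_docs
    }


sample = [{"lang": "en", "category": "shoes"}] * 3 + [
    {"lang": "de", "category": "shoes"}
]
gaps = coverage_gaps(sample, ["en", "de"], ["shoes", "bags"], min_docs=2)
print(gaps)
```

A report like this shows where to collect or curate more text before adding volume elsewhere.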