Golden Set

A golden set is a small, trusted test set with correct answers. Stores use it to check if search changes help or hurt before going live.

What is a Golden Set?

A golden set (gold standard) is a curated, versioned collection of test cases—typically queries with expected results and labels—used to evaluate search relevance and guard against regressions. Unlike live A/B data, it’s stable and audited, so teams can compare releases over time.

How It Works (quick)

  • Scope: Choose top intents (navigational, transactional, informational), tricky long-tail, and high-revenue queries.
  • Judgments: Human-labeled relevance grades (e.g., 0/1/2), must-have items, and disallowed items.
  • Format: Query → candidate results → labels, notes, locale, device.
  • Metrics: Compute NDCG/MRR/Precision@k, zero-results rate, and “must-have present @k” (a minimal metric sketch follows this list).
  • Governance: Version the set (v1, v2…), record annotator guidelines, and keep change logs.
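
To make the format and metrics concrete, here is a minimal Python sketch. The case schema, the SKU IDs, and the linear-gain NDCG variant are illustrative assumptions, not a standard.

    import math

    # One golden-set case; field names ("query", "labels", "must_have")
    # are illustrative, not a standard schema.
    case = {
        "query": "winter running shoes 45 gtx",
        "locale": "de-DE",
        "labels": {"sku-123": 2, "sku-456": 1, "sku-789": 0},  # graded 0/1/2
        "must_have": ["sku-123"],
    }

    def dcg(gains):
        # Linear gain; some teams use 2**rel - 1 instead.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg_at_k(ranked_ids, labels, k=10):
        gains = [labels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
        ideal_dcg = dcg(sorted(labels.values(), reverse=True)[:k])
        return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

    def must_have_present_at_k(ranked_ids, must_have, k=10):
        top = set(ranked_ids[:k])
        return all(doc_id in top for doc_id in must_have)

    ranked = ["sku-456", "sku-123", "sku-999"]  # engine output for the query
    print(round(ndcg_at_k(ranked, case["labels"]), 3))        # ~0.86
    print(must_have_present_at_k(ranked, case["must_have"]))  # True

Running the same script against each release’s output gives comparable scores on identical queries, which is the point of keeping the set stable.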

Why It Matters in E-commerce

  • Prevents breakage: Catch regressions from synonym tweaks, boosting, or model updates.
  • Decision speed: Compare variants offline before A/B tests (a simple gating sketch follows this list).
  • Alignment: Encodes business rules (compliance, ACLs, brand priorities) into evaluation.
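
As a sketch of that offline comparison, the gate below blocks a candidate ranker that regresses mean NDCG@10 on the golden set. The scores and the 0.01 tolerance are made up for illustration, not recommendations.

    # Illustrative offline gate: block a release when the candidate regresses
    # mean NDCG@10 on the golden set by more than a chosen tolerance.
    def gate(baseline_scores, candidate_scores, tolerance=0.01):
        base = sum(baseline_scores) / len(baseline_scores)
        cand = sum(candidate_scores) / len(candidate_scores)
        return cand >= base - tolerance  # True = safe to proceed to A/B

    baseline  = [0.82, 0.75, 0.91]  # per-query NDCG@10, current ranker
    candidate = [0.84, 0.70, 0.90]  # same queries, candidate ranker
    print("pass" if gate(baseline, candidate) else "fail")  # fail: drop > 0.01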

Best Practices

  • Balance head vs long-tail, categories, and locales.
  • Include SKU/MPN cases, brand+model, and common typos.
  • Add “must include/exclude” constraints for policy/assortment.
  • Refresh quarterly; track drift and retire stale cases (a versioned-file sketch follows this list).
  • Keep annotator rubrics with examples; do spot audits for consistency.
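
One way to implement the versioning and refresh bullets is a single versioned file with a changelog and per-case status. The layout and field names below are assumptions, not a standard; the guidelines file name is hypothetical.

    golden_set = {
        "version": "v3",
        "annotator_guidelines": "rubric-2024-06.md",  # hypothetical file
        "changelog": [
            {"version": "v3", "note": "retired 12 stale cases; added typo queries"},
            {"version": "v2", "note": "added de-DE locale coverage"},
        ],
        "cases": [
            {"id": "q-0042", "query": "vegan leather tote zipper",
             "locale": "en-US", "status": "active"},  # or "retired"
        ],
    }

Retired cases stay in the file with their status flipped, so old evaluation runs remain reproducible.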

Challenges

  • Bias toward head terms if sampling isn’t balanced.
  • Stale expectations as the catalog and assortment change.
  • Disagreement between judges; rubrics and spot audits help.
  • Cost to maintain across locales.

Examples

  • Query set for “gift card”, “returns”, “winter running shoes 45 gtx”, plus long-tail like “vegan leather tote zipper”.
  • Must-have: brand landing page for “gift card”; disallow out-of-stock (OOS) items in the top 10 (a constraint check is sketched below).
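
A minimal sketch of checking those constraints, assuming the engine returns result IDs and a stock lookup is available; all IDs and names here are hypothetical.

    # Illustrative check for the "gift card" constraints above.
    def check_constraints(ranked_ids, must_have, in_stock, k=10):
        top = ranked_ids[:k]
        errors = []
        for doc_id in must_have:
            if doc_id not in top:
                errors.append(f"must-have {doc_id} missing from top-{k}")
        for doc_id in top:
            if not in_stock.get(doc_id, True):  # unknown items pass
                errors.append(f"out-of-stock {doc_id} in top-{k}")
        return errors

    ranked = ["page-gift-card", "sku-111", "sku-222"]
    print(check_constraints(ranked, ["page-gift-card"], {"sku-222": False}))
    # -> ['out-of-stock sku-222 in top-10']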

Summary

A golden set is your repeatable truth set for search quality. Curate it carefully, version it, and use it to gate changes before they hit customers.

FAQ

Golden set vs A/B testing?

Golden sets are offline and repeatable; A/B tests show real user behavior. Use both.

How big should it be?

Often 200–1,000 well-chosen queries per locale/category mix.

Can I auto-generate it?

Seed it from search analytics; humans still judge and finalize the labels.