What is a Golden Set?
A golden set (gold standard) is a curated, versioned collection of test cases—typically queries with expected results and labels—used to evaluate search relevance and guard against regressions. Unlike live A/B data, it’s stable and audited, so teams can compare releases over time.
How It Works (quick)
- Scope: Choose top intents (navigational, transactional, informational), tricky long-tail, and high-revenue queries.
- Judgments: Human-labeled relevance grades (e.g., 0/1/2), must-have items, and disallowed items.
- Format: Each case records the query, its candidate results with labels, plus notes, locale, and device (see the record sketch after this list).
- Metrics: Compute NDCG@k, MRR, and Precision@k, plus zero-result rate and “must-have present @k” (see the metrics sketch after this list).
- Governance: Version the set (v1, v2…), record annotator guidelines, and keep change logs.
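The record sketch referenced in the format bullet: one JSON object per line (JSONL) works well because cases diff cleanly in version control. All field names here are assumptions; adapt them to your own schema.

```python
import json

# One golden-set case (hypothetical schema): query, context, graded
# labels per result ID, and policy constraints. Stored one per line
# in a .jsonl file so versions diff cleanly.
example_line = json.dumps({
    "query": "winter running shoes 45 gtx",
    "locale": "de-DE",
    "device": "mobile",
    "labels": {"sku-123": 2, "sku-456": 1, "sku-789": 0},  # 0/1/2 grades
    "must_have": ["sku-123"],      # must appear in the top k
    "disallowed": ["sku-oos-1"],   # must never appear (e.g., out of stock)
    "notes": "Long-tail query; 45 is a size, gtx a product variant.",
    "version": "v2",
})

case = json.loads(example_line)
print(case["query"], case["labels"])
```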
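And the metrics sketch: minimal, dependency-free implementations of NDCG@k, MRR, Precision@k, and “must-have present @k” over one ranked result list, assuming graded labels as in the record above.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over a list of relevance grades."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, labels, k):
    """NDCG@k: DCG of the actual ranking divided by the ideal DCG."""
    gains = [labels.get(r, 0) for r in ranked_ids]
    idcg = dcg_at_k(sorted(labels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def mrr(ranked_ids, labels):
    """Reciprocal rank of the first relevant (grade > 0) result."""
    for i, r in enumerate(ranked_ids):
        if labels.get(r, 0) > 0:
            return 1.0 / (i + 1)
    return 0.0

def precision_at_k(ranked_ids, labels, k):
    """Share of the top k results that are relevant (grade > 0)."""
    return sum(1 for r in ranked_ids[:k] if labels.get(r, 0) > 0) / k

def must_have_present_at_k(ranked_ids, must_have, k):
    """True if every must-have result ID appears in the top k."""
    top = set(ranked_ids[:k])
    return all(m in top for m in must_have)

# Example with the record above: sku-123 (grade 2) ranked second.
ranked = ["sku-456", "sku-123", "sku-789"]
labels = {"sku-123": 2, "sku-456": 1, "sku-789": 0}
print(ndcg_at_k(ranked, labels, 3), mrr(ranked, labels))
print(must_have_present_at_k(ranked, ["sku-123"], 3))  # True
```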
Why It Matters in E-commerce
- Prevents breakage: Catch regressions from synonym tweaks, boosting rules, or ranking-model updates.
- Decision speed: Compare variants offline before A/B tests.
- Alignment: Encodes business rules (compliance, ACLs, brand priorities) into evaluation.
Best Practices
- Balance head vs. long-tail queries, categories, and locales.
- Include SKU/MPN lookups, brand+model queries, and common typos.
- Add “must include/exclude” constraints for policy and assortment rules (see the constraint-check sketch after this list).
- Refresh quarterly; track drift and retire stale cases.
- Keep annotator rubrics with examples; do spot audits for consistency.
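A minimal sketch of the must-include/exclude check referenced above, assuming each case is a dict carrying `must_have` and `disallowed` ID lists as in the earlier record sketch:

```python
def check_constraints(ranked_ids, case, k=10):
    """Return human-readable policy violations for one golden-set case."""
    top = set(ranked_ids[:k])
    violations = []
    for rid in case.get("must_have", []):
        if rid not in top:
            violations.append(f"must-have {rid} missing from top {k}")
    for rid in case.get("disallowed", []):
        if rid in top:
            violations.append(f"disallowed {rid} present in top {k}")
    return violations
```

Unlike graded metrics, these checks are pass/fail, which makes them a natural hard gate: any violation blocks the release regardless of how the averages move.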
Challenges
- Bias toward head terms at the expense of the long tail.
- Expectations go stale as the catalog and query mix shift.
- Disagreement between judges on relevance grades.
- Cost of maintaining judgments across every locale.
Examples
- Query set for “gift card”, “returns”, “winter running shoes 45 gtx”, plus long-tail like “vegan leather tote zipper”.
- Must-have: brand landing page for “gift card”; disallow out-of-stock (OOS) items in the top 10.
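Rendered in the hypothetical record schema from earlier, the “gift card” case might look like this (IDs are invented for illustration):

```python
gift_card_case = {
    "query": "gift card",
    "locale": "en-US",
    "device": "desktop",
    "labels": {"page-gift-cards": 2},    # the landing page, grade 2
    "must_have": ["page-gift-cards"],    # must rank in the top 10
    "disallowed": ["sku-oos-777"],       # example out-of-stock item
    "notes": "Navigational intent; landing page must appear in top 10.",
    "version": "v2",
}
```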
Summary
A golden set is your repeatable truth set for search quality. Curate it carefully, version it, and use it to gate changes before they hit customers.
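In practice, “gating” often means a CI step that scores a candidate ranker against the current baseline on the golden set and fails the build on regression. A minimal sketch reusing the metric helpers above; the run format and the 0.01 threshold are assumptions to tune for your traffic:

```python
def evaluate(run, golden_set, k=10):
    """Mean NDCG@k of a run ({query: ranked result IDs}) over all cases."""
    scores = [ndcg_at_k(run[c["query"]], c["labels"], k) for c in golden_set]
    return sum(scores) / len(scores)

def gate(baseline_run, candidate_run, golden_set, max_drop=0.01):
    """Fail the release if mean NDCG@10 drops by more than max_drop."""
    base = evaluate(baseline_run, golden_set)
    cand = evaluate(candidate_run, golden_set)
    if cand < base - max_drop:
        raise SystemExit(f"Regression: NDCG {base:.3f} -> {cand:.3f}")
    print(f"OK: NDCG {base:.3f} -> {cand:.3f}")
```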