Golden Set

A golden set is a small, trusted test set with correct answers. Stores use it to check if search changes help or hurt before going live.

What is a Golden Set?

A golden set (gold standard) is a curated, versioned collection of test cases—typically queries with expected results and labels—used to evaluate search relevance and guard against regressions. Unlike live A/B data, it’s stable and audited, so teams can compare releases over time.

How It Works (quick)

  • Scope: Choose top intents (navigational, transactional, informational), tricky long-tail, and high-revenue queries.
  • Judgments: Human-labeled relevance grades (e.g., 0/1/2), must-have items, and disallowed items.
  • Format: Query → candidate results → labels, notes, locale, device.
  • Metrics: Compute NDCG/MRR/Precision@k, zero-results rate, and “must-have present @k” (a minimal metric sketch follows this list).
  • Governance: Version the set (v1, v2…), record annotator guidelines, and keep change logs.
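
To make the format and metrics concrete, here is a minimal Python sketch. The case schema, the SKU IDs, and the linear-gain NDCG variant are illustrative assumptions, not a standard.

    import math

    # One golden-set case; field names ("query", "labels", "must_have")
    # are illustrative, not a standard schema.
    case = {
        "query": "winter running shoes 45 gtx",
        "locale": "de-DE",
        "labels": {"sku-123": 2, "sku-456": 1, "sku-789": 0},  # graded 0/1/2
        "must_have": ["sku-123"],
    }

    def dcg(gains):
        # Linear gain; some teams use 2**rel - 1 instead.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg_at_k(ranked_ids, labels, k=10):
        gains = [labels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
        ideal_dcg = dcg(sorted(labels.values(), reverse=True)[:k])
        return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

    def must_have_present_at_k(ranked_ids, must_have, k=10):
        top = set(ranked_ids[:k])
        return all(doc_id in top for doc_id in must_have)

    ranked = ["sku-456", "sku-123", "sku-999"]  # engine output for the query
    print(round(ndcg_at_k(ranked, case["labels"]), 3))        # ~0.86
    print(must_have_present_at_k(ranked, case["must_have"]))  # True

Running the same script against each release’s output gives comparable scores on identical queries, which is the point of keeping the set stable.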

Why It Matters in E-commerce

  • Prevents breakage: Catch regressions from synonym tweaks, boosting, or model updates.
  • Decision speed: Compare variants offline before A/B tests (a simple gating sketch follows this list).
  • Alignment: Encodes business rules (compliance, ACLs, brand priorities) into evaluation.
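
As a sketch of that offline comparison, the gate below blocks a candidate ranker that regresses mean NDCG@10 on the golden set. The scores and the 0.01 tolerance are made up for illustration, not recommendations.

    # Illustrative offline gate: block a release when the candidate regresses
    # mean NDCG@10 on the golden set by more than a chosen tolerance.
    def gate(baseline_scores, candidate_scores, tolerance=0.01):
        base = sum(baseline_scores) / len(baseline_scores)
        cand = sum(candidate_scores) / len(candidate_scores)
        return cand >= base - tolerance  # True = safe to proceed to A/B

    baseline  = [0.82, 0.75, 0.91]  # per-query NDCG@10, current ranker
    candidate = [0.84, 0.70, 0.90]  # same queries, candidate ranker
    print("pass" if gate(baseline, candidate) else "fail")  # fail: drop > 0.01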

Best Practices

  • Balance head vs long-tail, categories, and locales.
  • Include SKU/MPN cases, brand+model, and common typos.
  • Add “must include/exclude” constraints for policy/assortment.
  • Refresh quarterly; track drift and retire stale cases (a versioned-file sketch follows this list).
  • Keep annotator rubrics with examples; do spot audits for consistency.
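
One way to implement the versioning and refresh bullets is a single versioned file with a changelog and per-case status. The layout and field names below are assumptions, not a standard; the guidelines file name is hypothetical.

    golden_set = {
        "version": "v3",
        "annotator_guidelines": "rubric-2024-06.md",  # hypothetical file
        "changelog": [
            {"version": "v3", "note": "retired 12 stale cases; added typo queries"},
            {"version": "v2", "note": "added de-DE locale coverage"},
        ],
        "cases": [
            {"id": "q-0042", "query": "vegan leather tote zipper",
             "locale": "en-US", "status": "active"},  # or "retired"
        ],
    }

Retired cases stay in the file with their status flipped, so old evaluation runs remain reproducible.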

Challenges

  • Bias toward head terms if sampling isn’t balanced.
  • Stale expectations as the catalog and assortment change.
  • Disagreement between judges; rubrics and spot audits help.
  • Cost to maintain across locales.

Examples

  • Query set for “gift card”, “returns”, “winter running shoes 45 gtx”, plus long-tail like “vegan leather tote zipper”.
  • Must-have: brand landing page for “gift card”; disallow out-of-stock (OOS) items in the top 10 (a constraint check is sketched below).
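
A minimal sketch of checking those constraints, assuming the engine returns result IDs and a stock lookup is available; all IDs and names here are hypothetical.

    # Illustrative check for the "gift card" constraints above.
    def check_constraints(ranked_ids, must_have, in_stock, k=10):
        top = ranked_ids[:k]
        errors = []
        for doc_id in must_have:
            if doc_id not in top:
                errors.append(f"must-have {doc_id} missing from top-{k}")
        for doc_id in top:
            if not in_stock.get(doc_id, True):  # unknown items pass
                errors.append(f"out-of-stock {doc_id} in top-{k}")
        return errors

    ranked = ["page-gift-card", "sku-111", "sku-222"]
    print(check_constraints(ranked, ["page-gift-card"], {"sku-222": False}))
    # -> ['out-of-stock sku-222 in top-10']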

Summary

A golden set is your repeatable truth set for search quality. Curate it carefully, version it, and use it to gate changes before they hit customers.

FAQ

Golden set vs A/B testing?

Golden sets are offline and repeatable; A/B tests show real user behavior. Use both.

How big should it be?

Often 200–1,000 well-chosen queries per locale/category mix.

Can I auto-generate it?

Seed it from search analytics; humans still judge and finalize the labels.