GLOSSARY

Inverse Document Frequency (IDF)

Inverse Document Frequency down-weights common words and up-weights rare ones. In product search, IDF helps rank products that match distinctive terms (e.g., “GORE-TEX”, “merino”) above those that match only generic ones (“the”, “shoe”).

What is IDF?

Inverse Document Frequency (IDF) measures how informative a term is across the collection. Terms that appear in many documents get a low IDF; rare terms get a high IDF. IDF is a core piece of TF-IDF and BM25 scoring.

Typical (smoothed) formula:

idf(t) = ln(1 + (N − df(t) + 0.5) / (df(t) + 0.5))

where N is the number of documents and df(t) is the number containing term t.
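
As a rough illustration, the smoothed formula can be written as a small Python function. The name idf and the catalog numbers below (10,000 products, a material term found in 12 of them, a generic term found in 6,500) are made up for the example.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    if not 0 <= doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 0 and n_docs")
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical catalog statistics, for illustration only.
N = 10_000
idf_rare = idf(N, 12)        # rare term, e.g. a material name
idf_common = idf(N, 6_500)   # generic term found in most products

print(f"rare: {idf_rare:.2f}, common: {idf_common:.2f}")
```

With these numbers the rare term scores roughly 6.7 and the common one roughly 0.43. The "+ 1" inside the logarithm keeps the value positive even when a term appears in every document.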

How It Works (quick)

  • Compute document frequency per token/field.
  • Apply log scaling and smoothing to avoid extreme values.
  • Combine with term frequency (TF) and field weights (BM25F) at ranking time.
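
The steps above can be sketched in a few lines of Python. This is a minimal single-field illustration, not a production ranker: the whitespace tokenizer, the four-product catalog, and the helper names are assumptions, k1=1.2 and b=0.75 are commonly used defaults rather than values from this glossary, and BM25F field weights are left out.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real analyzers do much more.
    return text.lower().split()

def build_df(docs):
    # Document frequency: number of documents containing each token.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def bm25_score(query, doc, df, n_docs, avg_len, k1=1.2, b=0.75):
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if term not in tf:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Length normalization: longer documents need more matches.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
    return score

# Toy catalog, made up for the example.
catalog = [
    "merino wool base layer",
    "synthetic base layer",
    "cotton base layer shirt",
    "merino hiking sock",
]
docs = [tokenize(d) for d in catalog]
df = build_df(docs)
avg_len = sum(len(d) for d in docs) / len(docs)

query = tokenize("merino base layer")
scores = [bm25_score(query, d, df, len(docs), avg_len) for d in docs]
ranked = sorted(zip(scores, catalog), reverse=True)
for score, title in ranked:
    print(f"{score:.3f}  {title}")
```

In this toy catalog "merino" appears in two of four products while "base" and "layer" appear in three, so the product matching all three query terms ranks first.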

Why It Matters in E-commerce

  • Precision at top-k: Rare, intent-rich tokens (brand+model, materials) influence ranking more.
  • Disambiguation: Separates “trail running” from generic “running”.
  • Healthy long-tail: Niche queries surface the right SKUs.

Best Practices

  • Per-field IDF: Titles/attributes vs descriptions; avoid one global df.
  • Per-locale stats: Compute IDF by language/storefront.
  • Stopword lists: Remove ultra-common words and navigational boilerplate.
  • Variant handling: Normalize hyphens/accents (e.g., “gore-tex” ↔ “gore tex”).
  • Rebuild cadence: Refresh IDF after big catalog changes.
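
One possible way to combine several of these practices is to normalize tokens first and then keep a separate document-frequency table per locale and field. The sketch below assumes a hypothetical product record with locale, title, and description keys; the stopword list is illustrative only.

```python
import unicodedata
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "and", "for"}  # illustrative, not a real list

def normalize(text: str) -> list[str]:
    # Strip accents, lowercase, and treat hyphens as spaces so that
    # "GORE-TEX" and "gore tex" produce the same tokens.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.lower().replace("-", " ").split()

def build_field_df(products):
    # df[(locale, field)][token] = number of products in that locale
    # whose field contains the token.
    df = defaultdict(Counter)
    n_docs = Counter()
    for p in products:
        n_docs[p["locale"]] += 1
        for field in ("title", "description"):
            tokens = set(normalize(p.get(field, ""))) - STOPWORDS
            df[(p["locale"], field)].update(tokens)
    return df, n_docs

# Hypothetical records, for illustration only.
products = [
    {"locale": "en-US", "title": "GORE-TEX trail shoe",
     "description": "The shoe for wet trails"},
    {"locale": "en-US", "title": "Gore Tex jacket",
     "description": "Waterproof shell"},
    {"locale": "fr-FR", "title": "Veste imperméable",
     "description": "Veste légère"},
]
df, n_docs = build_field_df(products)
print(df[("en-US", "title")]["gore"], n_docs["en-US"])
```

Keeping the tables keyed by (locale, field) means a term that is common in descriptions but rare in titles keeps a high title IDF, and French statistics do not distort English ones.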

Challenges

  • Catalog churn: Rapid adds/removes shift df(t).
  • Synonym expansion: Can blur rarity if applied too early; prefer late binding (expand at query time rather than at index time).
  • Very frequent brand terms: Cap influence so a mega-brand doesn’t dominate.
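
For catalog churn in particular, one option is to keep df(t) and N up to date incrementally instead of recomputing them from scratch. The class below is a hypothetical in-memory sketch (the name DfIndex and its methods are invented for this example); search engines usually maintain these statistics internally.

```python
from collections import Counter

class DfIndex:
    """Tracks N and df(t) as documents are added and removed."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()
        self._terms = {}  # doc_id -> set of terms in that document

    def add(self, doc_id, tokens):
        # Re-adding an existing id replaces the old version.
        if doc_id in self._terms:
            self.remove(doc_id)
        terms = set(tokens)
        self._terms[doc_id] = terms
        self.df.update(terms)
        self.n_docs += 1

    def remove(self, doc_id):
        terms = self._terms.pop(doc_id, None)
        if terms is None:
            return  # unknown id, nothing to do
        for t in terms:
            self.df[t] -= 1
            if self.df[t] == 0:
                del self.df[t]
        self.n_docs -= 1

index = DfIndex()
index.add("sku-1", ["merino", "sock"])
index.add("sku-2", ["merino", "shirt"])
index.remove("sku-1")
print(index.n_docs, dict(index.df))
```

After the removal, "merino" is counted once and "sock" disappears from the table, so IDF values computed from the index reflect the current catalog.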

Examples

  • Query “merino base layer” → high IDF on merino boosts relevant apparel.
  • Query “air max 270” → model number tokens get strong IDF, lifting exact items.

Summary

IDF rewards informative terms and dampens generic ones, sharpening ranking for intent-rich queries—especially valuable in product search.

FAQ

IDF vs TF? TF measures how often a term appears in one document; IDF measures how rare it is across documents.

IDF vs BM25? IDF is one factor inside BM25. BM25 multiplies each query term's IDF by a saturating TF component that also normalizes for document length.
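
For reference, one common way to write the BM25 score of a document D for a query Q is shown below. The symbols f(t, D) (term frequency of t in D), |D| (document length), avgdl (average document length), and the tuning parameters k_1 and b are standard notation, not defined elsewhere in this glossary; implementations differ in small details.

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{idf}(t) \cdot
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
```

The fraction grows with f(t, D) but levels off (saturation), while idf(t) scales each term's contribution by how rare it is across the collection.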