What is IDF?
Inverse Document Frequency (IDF) measures how informative a term is across the collection. Terms that appear in many documents get a low IDF; rare terms get a high IDF. IDF is a core piece of TF-IDF and BM25 scoring.
A typical smoothed formula (the variant used by BM25 in Lucene):
idf(t) = ln((N − df(t) + 0.5) / (df(t) + 0.5) + 1)
where N is the number of documents and df(t) is the number containing term t.
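The formula above translates directly into code. A minimal sketch (the corpus sizes are made up for illustration):

```python
import math

def idf(N: int, df: int) -> float:
    """Smoothed IDF as defined above: ln((N - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

# In a 10,000-document catalog, a term in 5 documents scores far
# higher than a term appearing in 9,000 of them:
rare = idf(10_000, 5)        # high IDF
common = idf(10_000, 9_000)  # low IDF
```

Note that the +0.5 smoothing and the +1 inside the log keep the value finite and non-negative even when df(t) is 0 or close to N.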
How It Works (quick)
- Compute document frequency per token/field.
- Apply log scaling and smoothing to avoid extreme values.
- Combine with term frequency (TF) and field weights (BM25F) at ranking time.
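The three steps above can be sketched end to end. This is a toy example with a hypothetical three-document corpus, using plain TF × IDF rather than full BM25F field weighting:

```python
import math
from collections import Counter

docs = [
    "merino wool base layer",
    "cotton base layer",
    "merino running socks",
]

# Step 1: document frequency per token (count each doc at most once).
df = Counter()
for doc in docs:
    df.update(set(doc.split()))

# Step 2: log scaling with smoothing, as in the formula above.
N = len(docs)
idf = {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

# Step 3: combine with term frequency (TF) at ranking time.
def score(query: str, doc: str) -> float:
    tf = Counter(doc.split())
    return sum(tf[t] * idf.get(t, 0.0) for t in query.split())

ranked = sorted(docs, key=lambda d: score("merino base layer", d), reverse=True)
```

A production engine would add per-field weights (BM25F) and length normalization, but the shape of the computation is the same.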
Why It Matters in E-commerce
- Precision at top-k: Rare, intent-rich tokens (brand+model, materials) influence ranking more.
- Disambiguation: Separates “trail running” from generic “running”.
- Healthy long-tail: Niche queries surface the right SKUs.
Best Practices
- Per-field IDF: Titles/attributes vs descriptions; avoid one global df.
- Per-locale stats: Compute IDF by language/storefront.
- Stopword lists: Remove ultra-common words and navigational boilerplate.
- Variant handling: Normalize hyphens/accents (e.g., “gore-tex” ↔ “gore tex”).
- Rebuild cadence: Refresh IDF after big catalog changes.
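The per-field and per-locale practices amount to keeping separate df statistics per (locale, field) pair. A minimal sketch, with hypothetical helper names (index_doc, idf):

```python
import math
from collections import Counter, defaultdict

df = defaultdict(Counter)   # (locale, field) -> term -> document frequency
doc_count = Counter()       # locale -> number of documents indexed

def index_doc(locale: str, fields: dict[str, str]) -> None:
    doc_count[locale] += 1
    for field, text in fields.items():
        df[(locale, field)].update(set(text.lower().split()))

def idf(locale: str, field: str, term: str) -> float:
    N = doc_count[locale]
    n = df[(locale, field)][term]
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

index_doc("en-US", {"title": "Merino Base Layer",
                    "description": "soft merino wool thermal top"})
index_doc("en-US", {"title": "Cotton Base Layer",
                    "description": "breathable cotton tee"})
```

With this split, a token that is boilerplate in descriptions but rare in titles keeps its discriminating power where it matters, and each storefront's statistics stay independent.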
Challenges
- Catalog churn: Rapid adds/removes shift df(t).
- Synonym expansion: Can blur rarity if applied too early; prefer late binding.
- Very frequent brand terms: Cap influence so a mega-brand doesn’t dominate.
Examples
- Query “merino base layer” → high IDF on merino boosts relevant apparel.
- Query “air max 270” → model number tokens get strong IDF, lifting exact items.
Summary
IDF rewards informative terms and dampens generic ones, sharpening ranking for intent-rich queries. This is especially valuable in product search.