What is IDF?
Inverse Document Frequency (IDF) measures how informative a term is across the collection. Terms that appear in many documents get a low IDF; rare terms get a high IDF. IDF is a core piece of TF-IDF and BM25 scoring.
A typical smoothed formula (the variant used by BM25 in Lucene):
idf(t) = ln((N − df(t) + 0.5) / (df(t) + 0.5) + 1)
where N is the number of documents and df(t) is the number containing term t.
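The formula above translates directly into code. A minimal sketch (the corpus sizes are made up for illustration):

```python
import math

def idf(N: int, df: int) -> float:
    """Smoothed IDF as defined above: ln((N - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

# In a 10,000-document catalog, a term in 5 documents scores far
# higher than a term appearing in 9,000 of them:
rare = idf(10_000, 5)        # high IDF
common = idf(10_000, 9_000)  # low IDF
```

Note that the +0.5 smoothing and the +1 inside the log keep the value finite and non-negative even when df(t) is 0 or close to N.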
How It Works (quick)
- Compute document frequency per token/field.
- Apply log scaling and smoothing to avoid extreme values.
- Combine with term frequency (TF) and field weights (BM25F) at ranking time.
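The three steps above can be sketched end to end. This is a toy example with a hypothetical three-document corpus, using plain TF × IDF rather than full BM25F field weighting:

```python
import math
from collections import Counter

docs = [
    "merino wool base layer",
    "cotton base layer",
    "merino running socks",
]

# Step 1: document frequency per token (count each doc at most once).
df = Counter()
for doc in docs:
    df.update(set(doc.split()))

# Step 2: log scaling with smoothing, as in the formula above.
N = len(docs)
idf = {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

# Step 3: combine with term frequency (TF) at ranking time.
def score(query: str, doc: str) -> float:
    tf = Counter(doc.split())
    return sum(tf[t] * idf.get(t, 0.0) for t in query.split())

ranked = sorted(docs, key=lambda d: score("merino base layer", d), reverse=True)
```

A production engine would add per-field weights (BM25F) and length normalization, but the shape of the computation is the same.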
Why It Matters in E-commerce
- Precision at top-k: Rare, intent-rich tokens (brand+model, materials) influence ranking more.
- Disambiguation: Separates “trail running” from generic “running”.
- Healthy long-tail: Niche queries surface the right SKUs.
Best Practices
- Per-field IDF: Titles/attributes vs descriptions; avoid one global df.
- Per-locale stats: Compute IDF by language/storefront.
- Stopword lists: Remove ultra-common words and navigational boilerplate.
- Variant handling: Normalize hyphens/accents (e.g., “gore-tex” ↔ “gore tex”).
- Rebuild cadence: Refresh IDF after big catalog changes.
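The per-field and per-locale practices amount to keeping separate df statistics per (locale, field) pair. A minimal sketch, with hypothetical helper names (index_doc, idf):

```python
import math
from collections import Counter, defaultdict

df = defaultdict(Counter)   # (locale, field) -> term -> document frequency
doc_count = Counter()       # locale -> number of documents indexed

def index_doc(locale: str, fields: dict[str, str]) -> None:
    doc_count[locale] += 1
    for field, text in fields.items():
        df[(locale, field)].update(set(text.lower().split()))

def idf(locale: str, field: str, term: str) -> float:
    N = doc_count[locale]
    n = df[(locale, field)][term]
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

index_doc("en-US", {"title": "Merino Base Layer",
                    "description": "soft merino wool thermal top"})
index_doc("en-US", {"title": "Cotton Base Layer",
                    "description": "breathable cotton tee"})
```

With this split, a token that is boilerplate in descriptions but rare in titles keeps its discriminating power where it matters, and each storefront's statistics stay independent.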
Challenges
- Catalog churn: Rapid adds/removes shift df(t).
- Synonym expansion: Can blur rarity if applied too early; prefer late binding.
- Very frequent brand terms: Cap influence so a mega-brand doesn’t dominate.
Examples
- Query “merino base layer” → high IDF on merino boosts relevant apparel.
- Query “air max 270” → model number tokens get strong IDF, lifting exact items.
Summary
IDF rewards informative terms and dampens generic ones, sharpening ranking for intent-rich queries. This is especially valuable in product search.