GLOSSARY

Clustering

Clustering automatically groups similar items, such as queries, products, or articles. In online stores, it powers smarter navigation, deduplication, recommendations, and ideas for new categories.

What is Clustering?

Clustering is an unsupervised learning technique that groups items so that members of the same group (cluster) are more similar to each other than to items in other groups. In search and e-commerce, you can cluster queries, products, content, or customers to reveal structure, cut noise, and improve discovery.

How Clustering Works (quick)

  • Represent items as vectors: Use text embeddings (titles, descriptions, attributes), behavior signals (co-views), or hybrid features.
  • Choose an algorithm:
    • k-means / mini-batch k-means (fast, needs k)
    • Hierarchical (tree of clusters, interpretable)
    • DBSCAN/HDBSCAN (auto-finds dense groups, flags outliers)
  • Pick a distance metric: cosine (for embeddings) or Euclidean (for standardized numeric features).
  • Label & use: Auto-label clusters (top terms) or human-label them; feed clusters into navigation, recommendations, and SEO.
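
The steps above can be sketched in a few lines. This is a minimal illustration, not a production recipe: it assumes tiny hand-written 3-dimensional vectors in place of real embeddings, uses cosine similarity, and the function name kmeans_cosine and the farthest-first initialization are choices made for this example only.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, k, iterations=10):
    """Toy k-means on unit vectors (hypothetical helper, not a library API)."""
    points = [normalize(v) for v in vectors]
    # Deterministic farthest-first init: start at the first point, then keep
    # adding the point that is least similar to the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = min(points, key=lambda p: max(cosine(p, c) for c in centroids))
        centroids.append(farthest)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its most similar centroid.
        assignments = [
            max(range(k), key=lambda j: cosine(p, centroids[j])) for p in points
        ]
        # Update step: each centroid becomes the normalized mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return assignments

# Made-up vectors standing in for product embeddings (assumption).
names = ["trail shoe A", "trail shoe B", "rain jacket A", "rain jacket B"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.1], [0.0, 0.8, 0.2]]
labels = kmeans_cosine(vectors, k=2)
print(labels)  # prints [0, 0, 1, 1]
```

In practice a library implementation (mini-batch k-means or HDBSCAN) on real embeddings would replace this loop; the sketch only shows the represent, assign, and update cycle.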

Why It Matters in E-commerce

  • Query clustering: Discover intents and synonyms; create better facets, redirects, or collection pages.
  • Product clustering: Group near-duplicates/variants; improve recommendations and reduce choice overload.
  • Content clustering: Tie guides/FAQs to product themes; surface helpful content on the search results page (SERP).
  • SEO & merchandising: Spot gaps; spin up collection pages for recurring cluster themes (e.g., “winter trail shoes”).

Best Practices

  • Good representations: Use multilingual product embeddings + key attributes; normalize/standardize numerics.
  • Pick k sensibly: Use the elbow method or silhouette score for k-means; use HDBSCAN when k is unknown or cluster densities vary.
  • Human-in-the-loop: Review labels for top clusters; merge/split as needed.
  • Freshness: Recompute on a schedule (weekly/monthly) and after major launches; track drift.
  • Guardrails: Don’t auto-publish clusters; require thresholds (size, CTR) before making new landing pages.
  • Measurement: For query clusters, track zero-result rate ↓, reformulations ↓, and CTR/conversion ↑; for product clusters, track average order value (AOV) and click-through between sibling products.
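
The guardrail bullet can be made concrete with a small gate. This is a sketch under stated assumptions: the ClusterStats fields, the function name, and the threshold values (50 items, 2% CTR) are hypothetical placeholders, and passing the gate would only queue a cluster for human review, never auto-publish it.

```python
from dataclasses import dataclass

@dataclass
class ClusterStats:
    """Hypothetical summary of one cluster; field names are assumptions."""
    label: str
    size: int    # number of queries or products in the cluster
    ctr: float   # click-through rate observed for the cluster, 0.0 to 1.0

def eligible_for_landing_page(cluster, min_size=50, min_ctr=0.02):
    """Return True only when the cluster clears both illustrative thresholds."""
    return cluster.size >= min_size and cluster.ctr >= min_ctr

candidates = [
    ClusterStats("winter trail shoes", size=120, ctr=0.041),
    ClusterStats("misc one-off queries", size=7, ctr=0.090),
    ClusterStats("generic shoes", size=900, ctr=0.004),
]
# Only clusters that pass both checks move on to editorial review.
review_queue = [c.label for c in candidates if eligible_for_landing_page(c)]
print(review_queue)  # prints ['winter trail shoes']
```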

Challenges & Trade-offs

  • Drift & seasonality: Clusters shift with trends; stale clusters mislead.
  • Imbalanced clusters: Some are huge, others tiny; may need re-weighting or split/merge rules.
  • Interpretability: Embedding clusters need clear, human-readable labels.
  • Cold start: Sparse data hurts quality; bootstrap with attributes and editorial input.
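
One possible way to put a number on drift between two refreshes is the Rand index: the share of item pairs that both runs treat the same way (together in both, or apart in both). The text above does not prescribe a drift metric, so treat this as one option among several; it compares every pair, so on a large catalog it would be run on a sample.

```python
from itertools import combinations

def rand_index(old_labels, new_labels):
    """Fraction of item pairs on which two clusterings agree (1.0 = identical grouping)."""
    if len(old_labels) != len(new_labels):
        raise ValueError("both runs must label the same items in the same order")
    pairs = list(combinations(range(len(old_labels)), 2))
    if not pairs:
        return 1.0
    agreeing = sum(
        (old_labels[i] == old_labels[j]) == (new_labels[i] == new_labels[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)

# Cluster ids are arbitrary, so a pure renaming of clusters counts as no drift.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # prints 1.0
# One item moved to the other cluster: half of the six pairs now disagree.
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))  # prints 0.5
```

A falling score across refreshes suggests that labels and any pages built on the old clusters are due for review.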

Examples (storefront)

  • Group long-tail queries like “waterproof trail shoe men 45” with “gtx trail running 45”; build a targeted collection page.
  • Cluster product detail pages (PDPs) by use-case + material to power “similar items” and reduce near-duplicate noise.
  • Cluster help topics (returns, size guide, shipping) and surface the right block on the search results page (SERP).

Summary

Clustering reveals the natural structure in your queries, products, and content. With solid embeddings, periodic refresh, and human review, it unlocks better navigation, smarter recommendations, cleaner catalogs, and SEO-worthy collection ideas, all without slowing the storefront.

FAQ

Clustering vs classification?

Clustering is unsupervised (no labels); classification is supervised (predicts known labels).

How do I decide the number of clusters?

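Use the elbow method or silhouette score for k-means, or a density-based method (HDBSCAN) that infers the number of clusters from the data. Density-based methods still need parameters, such as a minimum cluster size.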
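
As a rough illustration of the silhouette idea, the sketch below scores two candidate groupings of the same four points. It assumes small 2-D points and Euclidean distance purely for readability; the silhouette function here is a plain-Python stand-in, not a library API.

```python
import math

def silhouette(points, labels):
    """Mean silhouette over all points; values near 1 mean tight, well-separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # groups the nearby points together
bad = silhouette(points, [0, 1, 0, 1])   # pairs each point with a distant one
print(good > bad)  # prints True
```

To choose k, the same score would be computed for each candidate k and the highest-scoring value kept, then sanity-checked by a human reviewer.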

Where should clustering run: offline or online?

Usually offline/batch with periodic refresh; use online updates only for small adjustments.
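
One plausible reading of "small adjustments" is assigning a newly arrived item to the nearest centroid from the last batch run, without re-clustering. The sketch below assumes that reading; the centroid values, the assign_online name, and the 0.8 similarity floor are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_online(vector, centroids, min_similarity=0.8):
    """Return the label of the closest batch centroid, or None if nothing is close enough."""
    best_label, best_similarity = None, -1.0
    for label, centroid in centroids.items():
        similarity = cosine_similarity(vector, centroid)
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    return best_label if best_similarity >= min_similarity else None

# Centroids as they might come out of the last offline run (made-up values).
centroids = {"trail shoes": [1.0, 0.0], "rain jackets": [0.0, 1.0]}
print(assign_online([0.9, 0.1], centroids))  # prints trail shoes
print(assign_online([1.0, 1.0], centroids))  # prints None
```

Items that come back as None can wait for the next scheduled batch run, which keeps the online path cheap and leaves structural changes to the offline job.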

How does clustering relate to vector search?

Both use embeddings. Vector search matches one query to items; clustering groups many items/queries to organize the space.

What metrics should I watch?

Silhouette score, cluster size distribution, cluster purity (when a labeled sample is available), plus business KPIs (CTR, conversion, zero-results, recommendation CTR).