GLOSSARY

BERT

BERT helps computers understand the meaning of your words, not just the keywords. In online stores, it makes search results more accurate so shoppers find the right products faster.

What is BERT?

BERT is a family of transformer-based NLP models that learn deep, bidirectional context from text. Pretrained on large corpora and then fine-tuned for specific tasks, BERT excels at understanding meaning beyond keywords—making it valuable for search, chat, and content classification.

How BERT works (quick tour)

  • Bidirectional attention: Unlike left-to-right models, BERT attends to words on both sides of a token, capturing richer context.
  • Pretraining objectives:
    • Masked Language Modeling (MLM): Predict randomly masked tokens (illustrated after this list).
    • Next Sentence Prediction (NSP, in classic BERT): Learn inter-sentence relationships.
  • Fine-tuning: Add a simple head (classification, QA span, similarity) and train briefly on labeled data.
  • Embeddings: BERT turns text into dense vectors; similarity of vectors ≈ semantic similarity.
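
As a concrete illustration of the MLM objective, the snippet below asks a pretrained BERT model to fill in a masked word. It is a minimal sketch that assumes the Hugging Face transformers package and the public bert-base-uncased checkpoint, neither of which is prescribed by this glossary.

    # Minimal MLM demo: BERT predicts a masked token using context on both sides.
    # Assumes the Hugging Face `transformers` package and the public
    # `bert-base-uncased` checkpoint; any BERT-style model would behave similarly.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # "[MASK]" is the mask token used by bert-base-uncased.
    sentence = "These waterproof trail [MASK] are great for winter hiking."
    for prediction in fill_mask(sentence):
        print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")

Because the model reads both sides of the mask, the top predictions are typically footwear terms such as "shoes" or "boots" rather than generic nouns.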

Why BERT matters for e-commerce search

  • Query understanding: Interprets intent in short, messy queries (e.g., “nike trail gtx men 45”).
  • Semantic recall: Finds relevant products/content even without exact keyword matches (see the sketch after this list).
  • Re-ranking: Improves top-k ordering after a fast lexical recall (BM25) step.
  • Content enrichment: Classifies attributes and extracts entities to power facets and filters.
  • Support deflection: Answers policy/FAQ queries directly on the store’s search page.
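
To make the semantic-recall point concrete, the sketch below embeds a short, messy query and a few product titles with a compact BERT-style encoder and ranks the titles by cosine similarity. It assumes the sentence-transformers package and the public all-MiniLM-L6-v2 checkpoint; in a real store you would typically fine-tune an encoder on your own query–product data.

    # Semantic recall sketch: rank product titles against a messy query by
    # cosine similarity of their embeddings. Assumes the `sentence-transformers`
    # package; the all-MiniLM-L6-v2 checkpoint is an illustrative choice.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    query = "nike trail gtx men 45"
    titles = [
        "Nike Pegasus Trail 4 GORE-TEX Men's Waterproof Trail Running Shoes",
        "Adidas Ultraboost Light Road Running Shoes",
        "Nike Sportswear Club Fleece Hoodie",
    ]

    query_vec = model.encode(query, convert_to_tensor=True)
    title_vecs = model.encode(titles, convert_to_tensor=True)

    scores = util.cos_sim(query_vec, title_vecs)[0]
    for score, title in sorted(zip(scores.tolist(), titles), reverse=True):
        print(f"{score:.3f}  {title}")

The waterproof trail shoe should score highest even though the query and the title share few exact tokens.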

Common implementation patterns

  • Dual-encoder (bi-encoder): Encode query and item separately; retrieve via ANN (vector search). Fast, great for recall.
  • Cross-encoder: Score query–document pairs jointly for accurate re-ranking of a small candidate set.
  • Hybrid search: BM25 (lexical) for initial recall → vector retrieval → cross-encoder re-rank (sketched after this list).
  • Distilled/compact models: Use DistilBERT or MiniLM, plus int8 quantization or FP16 inference, for lower latency.
  • Caching & batching: Cache frequent query embeddings; batch requests; use GPUs or high-throughput CPUs.

Best practices

  • Domain adaptation: Further pretrain (continued pretraining) on your product data and support content.
  • Fine-tune with feedback: Use click data and curated labels; evaluate with NDCG@k, MRR, and CTR uplift (see the metrics sketch after this list).
  • Guardrails: Handle OOD queries, apply filters/ACLs before ranking, and keep lexical fallback for edge cases.
  • Latency budgets: Target roughly 100–200 ms for retrieval; limit cross-encoder re-ranking to the top 50–200 candidates.
  • A/B systematically: Test query rewrite, retrieval, and re-rank separately to isolate wins.
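
For the evaluation metrics mentioned above, here is a minimal sketch of NDCG@k and MRR computed from graded relevance labels; the grades and query lists are illustrative.

    # Offline ranking-metric sketch: NDCG@k and MRR from graded relevance labels.
    # Each inner list holds the relevance grades (0–3) of one query's returned
    # results, in the order the search system ranked them; values are illustrative.
    import math

    def dcg_at_k(relevances, k):
        """Discounted cumulative gain over the top-k positions."""
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

    def ndcg_at_k(relevances, k):
        """DCG normalized by the ideal (descending-sorted) ordering."""
        ideal = dcg_at_k(sorted(relevances, reverse=True), k)
        return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

    def mrr(rankings):
        """Mean reciprocal rank of the first relevant (grade > 0) result per query."""
        total = 0.0
        for relevances in rankings:
            for rank, rel in enumerate(relevances, start=1):
                if rel > 0:
                    total += 1.0 / rank
                    break
        return total / len(rankings)

    judged = [[3, 2, 0, 1, 0], [0, 0, 2, 0, 0]]   # two queries, top-5 grades each
    print([round(ndcg_at_k(r, 5), 3) for r in judged])
    print(round(mrr(judged), 3))

CTR uplift and conversion still require online A/B tests; offline metrics like these are for fast iteration on labeled sets.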

Challenges & trade-offs

  • Compute cost/latency: Larger models strain SLAs; choose smaller variants or distill.
  • Hallucinated matches: Overly loose semantic matching can ignore must-have attributes; enforce filters and hard constraints.
  • Data drift & bias: Periodically retrain; monitor per-segment quality (locale, device, intent).
  • Token limits: Long documents need chunking + pooling strategies.
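
For the token-limit point, one common workaround is to split long documents into overlapping chunks, embed each chunk, and pool the chunk vectors into a single document vector. A minimal sketch, assuming the sentence-transformers and numpy packages and an illustrative checkpoint and window size:

    # Long-document sketch: overlapping word-window chunking plus mean pooling of
    # chunk embeddings. Assumes `sentence-transformers` and `numpy`; the 100-word
    # window and 20-word overlap are illustrative defaults, not tuned values.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def chunk_words(text, window=100, overlap=20):
        """Split text into overlapping word windows that fit the encoder's token limit."""
        words = text.split()
        step = window - overlap
        return [" ".join(words[i:i + window])
                for i in range(0, max(len(words) - overlap, 1), step)]

    model = SentenceTransformer("all-MiniLM-L6-v2")

    long_description = " ".join(
        ["Durable waterproof hiking boot with a breathable membrane and grippy outsole."] * 60
    )
    chunks = chunk_words(long_description)

    chunk_vecs = model.encode(chunks)        # one vector per chunk
    doc_vec = np.mean(chunk_vecs, axis=0)    # mean-pool into one document vector
    print(len(chunks), doc_vec.shape)

Max pooling, or indexing each chunk separately and deduplicating hits by product, are common alternatives when the relevant text sits in a single passage.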

Examples (e-commerce)

  • Semantic retrieval: “winter waterproof trail shoes men 45” returns Gore-Tex trail models even if titles differ.
  • Re-ranking PDPs: Cross-encoder boosts items whose descriptions truly match “vegan leather tote with zipper.”
  • Attribute extraction: BERT tags attributes such as material: merino and use-case: commuting from product copy to power filters and PDP badges.
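
The attribute-extraction example maps naturally onto token classification (NER-style tagging) with a BERT model fine-tuned on annotated product copy. The sketch below uses the Hugging Face transformers pipeline; the checkpoint name your-org/bert-product-attributes and its MATERIAL / USE_CASE label set are hypothetical placeholders for your own fine-tuned model.

    # Attribute-extraction sketch: token classification over product copy.
    # The checkpoint name and its MATERIAL / USE_CASE labels are hypothetical;
    # substitute a BERT model fine-tuned on your own annotated product text.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model="your-org/bert-product-attributes",  # hypothetical fine-tuned model
        aggregation_strategy="simple",             # merge word pieces into spans
    )

    copy_text = "Lightweight merino crew sock, ideal for commuting and everyday wear."
    for span in tagger(copy_text):
        # Each span carries a label, the matched text, and a confidence score.
        print(span["entity_group"], span["word"], round(float(span["score"]), 3))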

Summary

BERT brings genuine language understanding to on-site search. With hybrid retrieval and careful optimization, it boosts semantic recall and ordering while meeting storefront latency and governance requirements.

FAQ

Is BERT the same as GPT?

No. BERT is typically encoder-only (understanding); GPT models are decoder-only (generation). They’re complementary.

Can I use BERT for real-time search?

Yes—with compact models, vector indexes (HNSW/IVF), caching, and a small cross-encoder re-rank.
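
A minimal sketch of the vector-index part of that answer, assuming the faiss (faiss-cpu) and sentence-transformers packages; the catalog, model, and index parameters are illustrative:

    # Real-time retrieval sketch: embed the catalog once, index it with HNSW,
    # and answer queries via approximate nearest-neighbor search. Assumes the
    # `faiss-cpu` and `sentence-transformers` packages; sizes here are toy.
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    titles = [
        "Waterproof GORE-TEX trail running shoes",
        "Leather chelsea boots",
        "Merino wool running socks",
    ]

    # Normalize so that inner product equals cosine similarity.
    title_vecs = model.encode(titles, normalize_embeddings=True)

    index = faiss.IndexHNSWFlat(title_vecs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
    index.add(title_vecs)

    query_vec = model.encode(["winter trail shoes men 45"], normalize_embeddings=True)
    scores, ids = index.search(query_vec, 2)   # top-2 approximate neighbors
    for score, idx in zip(scores[0], ids[0]):
        print(f"{score:.3f}  {titles[idx]}")

Caching embeddings for frequent queries and running a small cross-encoder only over the returned candidates keeps end-to-end latency within a typical storefront budget.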

How do I measure success?

Track NDCG@k/MRR, zero-result rate, CTR, conversion, and query reformulation rate by intent segment.

Multilingual stores?

Use multilingual variants (e.g., mBERT) or per-language models; align vectors across languages if catalogs are shared.
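
A short sketch of the shared-catalog case, assuming the sentence-transformers package and the public paraphrase-multilingual-MiniLM-L12-v2 checkpoint (a distilled multilingual encoder whose vector space is aligned across languages); the queries are illustrative:

    # Multilingual sketch: queries in different languages land near the same
    # product vector when the encoder's space is aligned across languages.
    # Assumes `sentence-transformers` and the public multilingual checkpoint below.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    product = "Waterproof hiking boots for winter"
    queries = [
        "waterproof hiking boots",            # English
        "wasserdichte Wanderschuhe",          # German
        "botas de senderismo impermeables",   # Spanish
    ]

    product_vec = model.encode(product, convert_to_tensor=True)
    query_vecs = model.encode(queries, convert_to_tensor=True)

    similarities = util.cos_sim(query_vecs, product_vec)[:, 0]
    for query, score in zip(queries, similarities.tolist()):
        print(f"{score:.3f}  {query}")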