GLOSSARY

Entity Extraction

Entity extraction finds specific things—like brands, models, colors—in text. Stores use it to auto-fill attributes and power filters, search, and analytics.

What is Entity Extraction?

Entity extraction (named-entity recognition and linking) identifies and classifies real-world entities in text—e.g., Brand, Model, Material, Size, Color—and optionally links them to IDs in a controlled vocabulary or knowledge base.

How It Works (quick)

  • Methods: Rule/dictionary matching, statistical NER (e.g., CRF), and transformer models.
  • Normalization: Fold case and diacritics, lemmatize, and unify hyphen variants for matching; emit the brand's canonical casing exactly (e.g., GORE-TEX).
  • Linking: Map to canonical IDs (e.g., brand ID), disambiguate senses.
  • Confidence & review: Thresholds route low-confidence hits to editors.
  • Output: Structured fields used for facets, schema, and ranking.
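
The normalization, linking, and review steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production matcher: the vocabulary, the brand IDs, the two confidence values (1.0 for an exact surface match, 0.75 for a match found only after normalization), and the default threshold are all assumptions made up for the example.

```python
import re
import unicodedata

# Hypothetical controlled vocabulary: normalized form -> (canonical ID, canonical label).
BRAND_VOCAB = {
    "goretex": ("brand:0001", "GORE-TEX"),
    "loreal": ("brand:0002", "L'Oréal"),
}


def normalize(text: str) -> str:
    """Fold case and diacritics; drop spaces, hyphens, and apostrophes."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[\s\-'’]", "", no_marks.casefold())


def extract_brands(text: str, threshold: float = 0.7):
    """Return (accepted, review) lists of brand hits with spans and confidence."""
    tokens = list(re.finditer(r"[\w'’-]+", text))
    accepted, review = [], []
    i = 0
    while i < len(tokens):
        for n in (2, 1):  # try the two-token window first so "Gore Tex" is one hit
            window = tokens[i:i + n]
            if len(window) < n:
                continue
            start, end = window[0].start(), window[-1].end()
            surface = text[start:end]
            entry = BRAND_VOCAB.get(normalize(surface))
            if entry is None:
                continue
            brand_id, label = entry
            # Assumed scoring: exact canonical spelling is trusted more than a variant.
            confidence = 1.0 if surface == label else 0.75
            hit = {
                "id": brand_id,
                "label": label,
                "surface": surface,
                "span": (start, end),
                "confidence": confidence,
            }
            (accepted if confidence >= threshold else review).append(hit)
            i += n
            break
        else:
            i += 1  # no vocabulary entry starts at this token
    return accepted, review


accepted, review = extract_brands("Trail shoe with Gore Tex lining and GORE-TEX membrane")
for hit in accepted:
    print(hit["surface"], "->", hit["label"], hit["id"], hit["confidence"])
```

Raising the threshold (for example `extract_brands(text, threshold=0.9)`) sends variant spellings such as "GORETEX" to the review list instead of accepting them automatically.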

Why It Matters in E-commerce

  • Better filters: Reliable attributes from messy titles/descriptions.
  • SEO & schema: Populate Product/Offer markup; enrich badges on product detail pages (PDPs).
  • Ops efficiency: Faster onboarding of supplier feeds and faster moderation of user-generated content (UGC).
  • Analytics: Clean aggregations by brand/material/model.

Best Practices

  • Keep brand and model dictionaries current; protect trademarks and casing.
  • Use locale-aware analyzers; maintain per-language synonyms.
  • Store extracted spans for audits; log confidence.
  • Retrain with feedback; version models and vocabularies.
  • Combine text + image signals for fashion/home.

Challenges

  • Ambiguity (Apple fruit vs brand), new brands, noisy vendor copy, multi-word entities.

Examples

  • Extract GORE-TEX, trail running, Merino, EU 45 from product text.
  • Link “tee shirt” → canonical T-shirt; “GORETEX” → GORE-TEX.
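
The two examples above can be reproduced with a small synonym table plus a size pattern. The sketch below is only illustrative: the table contents, the entity type names, and the EU size regex are assumptions, and a real catalog would need far larger, locale-specific vocabularies.

```python
import re

# Hypothetical synonym table: lowercase variant -> (entity type, canonical label).
CANONICAL = {
    "gore-tex": ("Brand", "GORE-TEX"),
    "goretex": ("Brand", "GORE-TEX"),
    "merino": ("Material", "Merino"),
    "trail running": ("Activity", "trail running"),
    "tee shirt": ("ProductType", "T-shirt"),
    "t-shirt": ("ProductType", "T-shirt"),
}

# Longest variants first, so multi-word entries win over shorter alternatives.
_VARIANTS = sorted(CANONICAL, key=len, reverse=True)
PHRASE_RE = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in _VARIANTS) + r")\b",
    re.IGNORECASE,
)
# Assumed size format: "EU" followed by a two-digit size, optionally a half size.
SIZE_RE = re.compile(r"\bEU\s?(\d{2}(?:\.5)?)\b", re.IGNORECASE)


def extract_attributes(text: str):
    """Return (entity type, canonical label, span) tuples in text order."""
    found = []
    for m in PHRASE_RE.finditer(text):
        etype, label = CANONICAL[m.group(1).lower()]
        found.append((etype, label, m.span()))
    for m in SIZE_RE.finditer(text):
        found.append(("Size", f"EU {m.group(1)}", m.span()))
    return sorted(found, key=lambda item: item[2])


for etype, label, span in extract_attributes(
    "GORETEX trail running shoe, Merino lining, EU 45"
):
    print(etype, label, span)
```

Keeping the span alongside the canonical label makes it possible to highlight the original wording later, for instance when an editor reviews why "GORETEX" was mapped to GORE-TEX.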

Summary

Entity extraction turns unstructured copy into clean attributes that power filters, schema, and relevance—cutting manual work and boosting findability.

FAQ

Entity extraction vs concept extraction?

Entities are specific, nameable things; concepts are broader topics/attributes. You may run both.

Do I need deep learning?

Not necessarily. Start with dictionaries for brands and sizes; add transformer models when you need robustness on long-tail or unseen entities.

Where to store results?

In dedicated attribute fields with IDs; keep spans for QA.
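
One possible shape for such a record is sketched below. The field names, the version strings, and the split into facet fields and an audit trail are assumptions for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractedAttribute:
    attribute: str      # target attribute field, e.g. "brand"
    canonical_id: str   # ID in the controlled vocabulary
    label: str          # canonical display label
    source_field: str   # where the text came from, e.g. "title"
    span_start: int     # character offsets in the source text, kept for QA
    span_end: int
    confidence: float
    model_version: str  # hypothetical version tags for reproducible audits
    vocab_version: str


def to_index_doc(product_id: str, attributes: list) -> dict:
    """Build a document with ID-only facet fields and a full audit trail."""
    facets = {}
    for attr in attributes:
        facets.setdefault(attr.attribute, []).append(attr.canonical_id)
    return {
        "product_id": product_id,
        "facets": facets,
        "audit": [asdict(attr) for attr in attributes],
    }


brand = ExtractedAttribute(
    attribute="brand",
    canonical_id="brand:0001",
    label="GORE-TEX",
    source_field="title",
    span_start=0,
    span_end=7,
    confidence=0.75,
    model_version="ner-v1",
    vocab_version="brands-v1",
)
print(json.dumps(to_index_doc("sku-123", [brand]), indent=2))
```

Facets hold only canonical IDs so filters stay stable when labels change, while the audit entries keep the span, confidence, and versions needed to review or re-run an extraction.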