GLOSSARY

Phrase Extraction

Phrase extraction finds important multi-word terms in text (like “waterproof trail shoes”). Stores use it to auto-tag products, generate facets, and power smarter suggestions.

What is Phrase Extraction?

Phrase extraction identifies salient multi-word expressions (MWEs) and keyphrases in text, e.g., “merino base layer” or “USB-C fast charger”. Methods range from linguistic patterns (noun-phrase chunking) to statistical measures such as pointwise mutual information (PMI) and C-value, graph-based ranking (TextRank-style), and embedding similarity.

How It Works (quick)

  • Prep: Clean text, detect language, lemmatize; protect brand/SKU casing.
  • Candidate generation: Noun-phrase patterns, collocations (bigrams/trigrams), sliding windows.
  • Scoring: Frequency, positional bias, TF-IDF/PMI/C-value, or vector relevance to domain topics.
  • Filtering: Stoplists, de-dup by lemma, length limits, headword rules.
  • Linking: Map phrases to taxonomy nodes or knowledge-graph IDs.
  • Output: Tags/attributes for search, facets, and analytics.
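
The candidate generation, scoring, and filtering steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: it assumes whitespace-style tokenization, uses adjacent bigrams as candidates, scores them with PMI, and drops candidates that start or end with a stopword. The function name pmi_bigrams, the stoplist, and the min_count threshold are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stoplist (an assumption; a real one would be domain-tuned).
STOPWORDS = frozenset({"for", "with", "and", "the", "of", "in"})

def pmi_bigrams(docs, min_count=2, stopwords=STOPWORDS):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        # Prep: lowercase and keep alphanumeric tokens (hyphens allowed inside).
        tokens = re.findall(r"[a-z0-9][a-z0-9\-]*", doc.lower())
        unigrams.update(tokens)
        # Candidate generation: adjacent pairs within one document.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        # Filtering: rare pairs and pairs bounded by a stopword are dropped.
        if count < min_count or a in stopwords or b in stopwords:
            continue
        # Scoring: PMI = log2( P(a, b) / (P(a) * P(b)) ).
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[f"{a} {b}"] = math.log2(p_ab / (p_a * p_b))
    # Highest-scoring phrases first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = [
    "waterproof trail shoes for men",
    "waterproof trail shoes for women",
    "trail running socks",
]
for phrase, score in pmi_bigrams(docs):
    print(f"{phrase}\t{score:.2f}")
```

Without the stopword check, pairs such as "shoes for" can score as high as genuine phrases, which is why the filtering step matters even on small catalogs.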

Why It Matters in E-commerce

  • Better discovery: Auto-tags feed facets, collections, and product finders.
  • SEO: Surfaces natural long-tail topics for category/collection pages.
  • Ops: Extract specs from vendor copy to fill attributes quickly.

Best Practices

  • Keep a domain lexicon (materials, fits, compatibilities) and refresh quarterly.
  • Normalize hyphens/diacritics; merge singular/plural via lemmas.
  • Whitelist phrases that map to filterable attributes; blacklist vague marketing fluff.
  • Human-in-the-loop: review low-confidence phrases; store spans & confidence.
  • Localize per market; handle compounds (e.g., German) carefully.
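
As a rough sketch of the normalization practice above, the following Python function folds hyphen variants, diacritics, and simple plurals into one comparison key. The function name normalize_phrase is hypothetical, and the plural rule is a deliberately crude stand-in for a real lemmatizer. Stripping diacritics is also an assumption that may not suit every market, so treat the result as a matching key rather than display text.

```python
import re
import unicodedata

def normalize_phrase(phrase):
    """Build a comparison key for de-duplicating extracted phrases."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", phrase)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat ASCII/Unicode hyphens and whitespace runs as a single space.
    spaced = re.sub(r"[\-\u2010-\u2015\s]+", " ", no_marks.lower()).strip()
    # Crude singularization (assumption): strip a trailing "s" from longer
    # words, but leave words ending in "ss" (e.g. "glass") alone.
    words = [
        w[:-1] if len(w) > 3 and w.endswith("s") and not w.endswith("ss") else w
        for w in spaced.split()
    ]
    return " ".join(words)

print(normalize_phrase("T-Shirts"))
print(normalize_phrase("Trail-Shoe") == normalize_phrase("trail  shoes"))
```

Keys like this only merge spelling variants. Pairs such as “tee shirt” and “T-shirt” still need an explicit alias entry in the domain lexicon.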

Challenges

  • Noisy UGC, near-duplicates (“tee shirt” vs “T-shirt”), trademark sensitivity, mixed languages.

Examples

  • Extract “GORE-TEX membrane”, “packable hood”, “EU 45”, “USB-C PD 65W” from titles/descriptions.
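
Spec-like phrases such as “EU 45” or “65W” follow regular shapes, so a pattern-based pass is one plausible way to pull them out before statistical scoring. The sketch below is illustrative only: the attribute names (eu_size, wattage) and the regular expressions are assumptions for this example, not a complete or validated spec grammar.

```python
import re

# Hypothetical attribute patterns; extend per category and market.
SPEC_PATTERNS = {
    "eu_size": re.compile(r"\bEU\s?(\d{2}(?:\.5)?)\b"),
    "wattage": re.compile(r"\b(\d{1,3})\s?W\b"),
}

def extract_specs(text):
    """Return the first match per attribute found in a title or description."""
    found = {}
    for name, pattern in SPEC_PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Keep only the captured value; units are implied by the key.
            found[name] = match.group(1)
    return found

print(extract_specs("Trail shoe, EU 45, packable hood"))
print(extract_specs("USB-C PD 65W charger"))
```

Matching is case-sensitive on purpose here, in line with the advice to protect brand and SKU casing.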
  • Turn frequent phrases into guided search chips or collection pages.

Summary

Phrase extraction converts messy text into actionable, multi-word tags that fuel facets, collections, and relevance—especially valuable for long-tail demand.

FAQ

Phrase extraction vs entity extraction?

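Entity extraction targets specific names and values, such as a brand or a size like “EU 45”. Phrase extraction also captures broader concepts and attributes, such as “packable hood”.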

Does it replace manual tagging?

No—use it to suggest tags, then confirm or correct.

Should I index phrases?

Yes. Index phrases in dedicated bigram/phrase fields so they can contribute to ranking, and keep exact-match fields for SKUs and brands.