GLOSSARY

Parsing

Parsing converts messy input into structured data. Stores use it to read product feeds, HTML, and PDFs so fields, prices, and specs are reliable.

What is Parsing?

Parsing analyzes input (text, HTML, files, logs) and produces a structured representation: a DOM tree, a JSON object, or a set of typed fields. In search pipelines it powers feed ingestion, content extraction, and query understanding.

How It Works (quick)

  • Detect & read: Identify format (HTML, XML, JSON, CSV, PDF) and encoding.
  • Structure: Build DOM/AST; extract tags, text, and attributes; handle microdata/JSON-LD.
  • Normalize: Trim boilerplate, fix encodings, standardize dates/units/currencies.
  • Extract: Map to fields (title, brand, price, size, stock, SKU); capture provenance.
  • Validate: Check records against schemas and required fields; route failures to a dead-letter queue (DLQ) with the error attached.
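
The five steps above can be sketched end to end for a single product page. This is a minimal illustration using only the Python standard library, not a production parser. The required-field contract, the record layout, and the names JsonLdCollector, detect_format, and parse_product_page are all assumptions made for the example, and the DLQ is modeled as a plain list.

```python
import json
from decimal import Decimal, InvalidOperation
from html.parser import HTMLParser

REQUIRED_FIELDS = ("title", "sku", "price")  # hypothetical feed contract


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def detect_format(raw: bytes) -> str:
    """Detect: very rough sniffing on the first non-blank byte."""
    head = raw.lstrip()[:1]
    if head == b"<":
        return "html"
    if head in (b"{", b"["):
        return "json"
    return "unknown"


def parse_product_page(raw: bytes, source: str, dlq: list):
    """Return a product record, or None after logging the failure to dlq."""
    if detect_format(raw) != "html":
        dlq.append({"source": source, "error": "unsupported format"})
        return None
    text = raw.decode("utf-8", errors="replace")  # assumes UTF-8 input

    # Structure: walk the markup and pull out embedded JSON-LD.
    collector = JsonLdCollector()
    collector.feed(text)
    collector.close()
    product = None
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("@type") == "Product":
            product = data
            break
    if product is None:
        dlq.append({"source": source, "error": "no Product JSON-LD found"})
        return None

    # Normalize and extract: trim strings, convert the price to a Decimal.
    offers = product.get("offers")
    if not isinstance(offers, dict):
        offers = {}
    try:
        price = Decimal(str(offers.get("price")))
    except InvalidOperation:
        price = None
    record = {
        "title": str(product.get("name", "")).strip(),
        "sku": str(product.get("sku", "")).strip(),
        "price": price,
        "currency": offers.get("priceCurrency"),
        "source": source,  # provenance
    }

    # Validate: enforce required fields; failures go to the DLQ.
    missing = [f for f in REQUIRED_FIELDS if record[f] in (None, "")]
    if missing:
        dlq.append({"source": source, "error": "missing fields: " + ", ".join(missing)})
        return None
    return record
```

A caller would pass the raw bytes of a page, a source identifier, and a shared list, then inspect that list afterwards to see which pages were rejected and why.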

Why It Matters in E-commerce

  • Data quality: Clean fields drive facets, ranking, and accurate product detail pages (PDPs).
  • Freshness: Reliable parsing keeps prices/stock correct.
  • SEO: Pull titles/descriptions and structured data consistently.

Best Practices

  • Schematize: Define strict contracts for feeds; enforce required fields.
  • Locale-aware: Parse numbers, dates, and currencies by market.
  • Safety: Sanitize HTML; block scripts; prevent injection.
  • Resilience: Retry with fallbacks; quarantine bad rows; keep upserts idempotent so that reprocessing a feed does not create duplicates.
  • Provenance: Store source, timestamp, and transform logs for audits.
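
To make the locale-aware point concrete, here is a small sketch of price parsing by market. The separator table and the name parse_price are assumptions for illustration; a real system would more likely rely on a maintained locale library than on a hand-written table.

```python
from decimal import Decimal, InvalidOperation

DIGITS = "0123456789"

# Hypothetical per-market conventions: (thousands separator, decimal separator)
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "fr_FR": (" ", ","),
}


def parse_price(text: str, market: str) -> Decimal:
    """Parse a displayed price such as '1.234,56 €' using the market's separators."""
    thousands, decimal_sep = SEPARATORS[market]
    # Treat non-breaking and narrow non-breaking spaces as ordinary spaces.
    cleaned = text.replace("\u00a0", " ").replace("\u202f", " ")
    # Keep digits, a minus sign, and the two separators; drop currency symbols.
    cleaned = "".join(
        ch for ch in cleaned if ch in DIGITS or ch in (thousands, decimal_sep, "-")
    )
    cleaned = cleaned.replace(thousands, "").replace(decimal_sep, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"cannot parse price {text!r} for market {market}") from None
```

The same string can mean different amounts in different markets: under these assumed rules "1.234" parses as 1234 for de_DE but as 1.234 for en_US, which is why the market has to travel with the value rather than being guessed from the text.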

Challenges

  • Malformed HTML, mixed locales, vendor inconsistencies, giant files, and PDFs that need OCR.

Examples

  • Vendor CSV → normalized brand/material/price; invalid rows quarantined with reasons.
  • HTML help center → extract headings and anchors for snippet answers.
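
The vendor CSV case might look like the following sketch. The column names (sku, brand, material, price), the normalization rules, and the function name parse_vendor_csv are assumptions; the point is only that every rejected row is kept together with its line number and the reasons it failed.

```python
import csv
import io
from decimal import Decimal, InvalidOperation

REQUIRED = ("sku", "brand", "price")  # hypothetical required columns


def parse_vendor_csv(text: str):
    """Split a vendor CSV into accepted records and quarantined rows."""
    accepted, quarantined = [], []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        line_no = reader.line_num  # last physical line read for this row
        # Short rows yield None values; surplus cells land under the key None.
        clean = {k: (v or "").strip() for k, v in row.items() if k is not None}

        reasons = [f"missing {field}" for field in REQUIRED if not clean.get(field)]

        price = None
        if clean.get("price"):
            try:
                price = Decimal(clean["price"])
            except InvalidOperation:
                price = None
            if price is None or not price.is_finite() or price < 0:
                reasons.append(f"invalid price: {clean['price']!r}")

        if reasons:
            quarantined.append({"line": line_no, "row": row, "reasons": reasons})
            continue

        accepted.append({
            "sku": clean["sku"],
            "brand": " ".join(clean["brand"].split()),  # collapse inner whitespace
            "material": clean.get("material", "").lower(),
            "price": price,
        })
    return accepted, quarantined
```

Keeping the original row alongside the reasons means a quarantined row can be corrected and replayed later without asking the vendor to resend the whole file.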

Summary

Parsing turns heterogeneous inputs into trustworthy fields and text. With schemas, locale awareness, and safety, it fuels accurate search, PDPs, and analytics.

FAQ

Parsing vs OCR?

Parsing reads text that is already digital; OCR first converts images and scans into text, which can then be parsed.

Parsing vs document processing?

Parsing is a step; document processing is the full pipeline (OCR, extraction, enrichment).

Do I need regex?

Often, but regex alone is brittle on nested markup. Combine it with DOM/AST parsing and validators for robustness.
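
As a rough sketch of that combination, the example below uses the standard library HTML parser to locate the right element, a regex to pull the number out of its text, and a range check as the validator. The "price" class name, the accepted range, and the names PriceText and extract_price are assumptions for illustration.

```python
import re
from decimal import Decimal
from html.parser import HTMLParser

PRICE_RE = re.compile(r"(\d+(?:\.\d{1,2})?)")
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class PriceText(HTMLParser):
    """Collects text found inside any element whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # > 0 while inside a 'price' element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "price" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.texts.append(data.strip())


def extract_price(html: str):
    """Return the first plausible price inside a 'price' element, else None."""
    parser = PriceText()
    parser.feed(html)
    parser.close()
    for text in parser.texts:
        match = PRICE_RE.search(text)
        if match:
            value = Decimal(match.group(1))
            if 0 < value < 100000:  # validator: assumed sane price range
                return value
    return None
```

Running the regex over the raw HTML would return the first number it meets anywhere in the page, including struck-through old prices or numbers in comments; scoping it to the text of the intended element avoids that class of error.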