What is Parsing?
Parsing analyzes input (text, HTML, files, logs) to produce a structured representation—DOM trees, JSON objects, or typed fields. In search pipelines it powers feed ingestion, content extraction, and query understanding.
How It Works (quick)
- Detect & read: Identify format (HTML, XML, JSON, CSV, PDF) and encoding.
- Structure: Build DOM/AST; extract tags, text, and attributes; handle microdata/JSON-LD.
- Normalize: Trim boilerplate, fix encodings, standardize dates/units/currencies.
- Extract: Map to fields (title, brand, price, size, stock, SKU); capture provenance.
- Validate: Check against schemas and required fields; route failures to a dead-letter queue (DLQ). A minimal end-to-end sketch follows this list.
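The Python sketch below walks one HTML snippet through the structure → normalize → extract → validate steps using only the standard library. The `data-field` attributes, the required-field set, and the in-memory `dlq` list are illustrative assumptions, not a prescribed feed format.

```python
# Minimal sketch of structure -> normalize -> extract -> validate.
# data-field attributes, REQUIRED set, and dlq list are illustrative assumptions.
from html.parser import HTMLParser
from decimal import Decimal, InvalidOperation

class ProductParser(HTMLParser):
    """Collects text from elements carrying an illustrative data-field attribute."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look for data-field="title" etc.
        self._current = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

REQUIRED = {"title", "price"}  # assumed feed contract

def parse_product(html: str) -> dict:
    parser = ProductParser()
    parser.feed(html)
    record = dict(parser.fields)
    try:
        # Normalize: prices become Decimals so "89.99" is a number, not a string.
        record["price"] = Decimal(record.get("price", ""))
    except InvalidOperation:
        raise ValueError(f"unparseable price: {record.get('price')!r}")
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return record

dlq = []  # stand-in for a real dead-letter queue
snippet = '<div data-field="title">Trail Shoe</div><span data-field="price">89.99</span>'
try:
    print(parse_product(snippet))  # {'title': 'Trail Shoe', 'price': Decimal('89.99')}
except ValueError as err:
    dlq.append({"source": snippet, "error": str(err)})  # quarantine with a reason
```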
Why It Matters in E-commerce
- Data quality: Clean fields drive facets, ranking, and accurate product detail pages (PDPs).
- Freshness: Reliable parsing keeps prices/stock correct.
- SEO: Pull titles/descriptions and structured data consistently.
Best Practices
- Schematize: Define strict contracts for feeds; enforce required fields.
- Locale-aware: Parse numbers, dates, and currencies by market (see the sketch after this list).
- Safety: Sanitize HTML; block scripts; prevent injection.
- Resilience: Retry with fallbacks; quarantine bad rows; keep upserts idempotent.
- Provenance: Store source, timestamp, and transform logs for audits.
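As a rough illustration of market-specific number handling, the sketch below normalizes prices with a hand-rolled separator table. The market codes and separator pairs are assumptions; a production pipeline would more likely lean on a locale library such as Babel or ICU.

```python
# Minimal sketch of locale-aware price normalization.
# The separator table and market codes are assumptions, not a standard.
from decimal import Decimal

# market -> (thousands separator, decimal separator)
SEPARATORS = {"en-US": (",", "."), "de-DE": (".", ","), "fr-FR": ("\u00a0", ",")}

def normalize_price(raw: str, market: str) -> Decimal:
    thousands, decimal_sep = SEPARATORS[market]
    cleaned = raw.replace(thousands, "").replace(decimal_sep, ".")
    return Decimal(cleaned)

print(normalize_price("1,299.00", "en-US"))  # 1299.00
print(normalize_price("1.299,00", "de-DE"))  # 1299.00
```

The same idea extends to dates ("03/04/2025" is March 4 in the US but April 3 in much of Europe), which is why the per-market lookup happens before any numeric or date coercion.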
Challenges
- Malformed HTML, mixed locales, vendor inconsistencies, giant files, and PDFs that need OCR.
Examples
- Vendor CSV → normalized brand/material/price; invalid rows quarantined with reasons (see the sketch below).
- HTML help center → extract headings and anchors for snippet answers.
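A minimal sketch of the CSV case, assuming hypothetical sku/brand/price columns: valid rows are normalized, and invalid rows land in a quarantine list with a reason attached.

```python
# Illustrative vendor-CSV ingestion: normalize valid rows, quarantine bad ones.
# Column names and the sample feed are assumptions for the sketch.
import csv
import io
from decimal import Decimal, InvalidOperation

REQUIRED = ("sku", "brand", "price")

def ingest(csv_text: str):
    accepted, quarantined = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        missing = [col for col in REQUIRED if not row.get(col)]
        if missing:
            quarantined.append({"row": row, "reason": f"missing: {missing}"})
            continue
        try:
            row["price"] = Decimal(row["price"])  # normalize price to a number
        except InvalidOperation:
            quarantined.append({"row": row, "reason": "unparseable price"})
            continue
        row["brand"] = row["brand"].strip().title()  # normalize brand casing
        accepted.append(row)
    return accepted, quarantined

feed = "sku,brand,price\nA1,acme,19.99\nA2,,9.99\nA3,Acme,N/A\n"
ok, bad = ingest(feed)
print(ok)   # one normalized row (A1, brand "Acme", Decimal price)
print(bad)  # A2 (missing brand) and A3 (unparseable price), each with a reason
```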
Summary
Parsing turns heterogeneous inputs into trustworthy fields and text. With schemas, locale awareness, and safety, it fuels accurate search, PDPs, and analytics.