GLOSSARY

Crawler

A crawler discovers and fetches pages so they can be indexed. For online stores, it finds categories and products, respects robots.txt rules, and helps keep search results up to date.

What is a Crawler?

A crawler (spider/robot) is software that discovers URLs, fetches their content, and passes data to an indexer. Crawlers can be public (e.g., search engines) or private (your site crawler for internal search).

How it Works (quick)

  • Discovery: Start from seeds (homepage, sitemaps, feeds) → follow links.
  • Rules: Respect robots.txt, nofollow, crawl-delay, and allow/deny patterns.
  • Fetch & render: Download the HTML; optionally render JavaScript; extract the canonical URL, links, and structured data.
  • Deduplicate & schedule: Avoid duplicates, detect canonicals, and revisit based on change signals.
  • Export/index: Send fields to the index with timestamps and ACL metadata where relevant.
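
Put together, the loop is small. Here is a minimal sketch in Python (requests and beautifulsoup4 are assumed third-party dependencies; the store domain, user agent, and extracted fields are illustrative, not any particular product's API):

  # A deliberately small version of the loop above: seeds -> robots check -> fetch -> extract.
  import collections
  import urllib.parse
  import urllib.robotparser

  import requests
  from bs4 import BeautifulSoup

  SEED = "https://shop.example.com/"        # hypothetical store
  AGENT = "internal-search-bot"

  robots = urllib.robotparser.RobotFileParser(urllib.parse.urljoin(SEED, "/robots.txt"))
  robots.read()

  seen, queue, records = set(), collections.deque([SEED]), []
  while queue and len(records) < 100:       # small cap, since this is only a sketch
      url = queue.popleft()
      if url in seen or not robots.can_fetch(AGENT, url):
          continue                          # rules: skip disallowed or already-seen URLs
      seen.add(url)
      try:
          resp = requests.get(url, headers={"User-Agent": AGENT}, timeout=10)
      except requests.RequestException:
          continue
      if resp.status_code != 200 or "text/html" not in resp.headers.get("Content-Type", ""):
          continue
      soup = BeautifulSoup(resp.text, "html.parser")
      canonical = soup.find("link", rel="canonical")
      records.append({                      # fields handed to the indexer
          "url": url,
          "canonical": canonical.get("href", url) if canonical else url,
          "title": soup.title.get_text(strip=True) if soup.title else "",
      })
      for a in soup.find_all("a", href=True):   # discovery: queue same-host links
          nxt = urllib.parse.urljoin(url, a["href"]).split("#")[0]
          if urllib.parse.urlsplit(nxt).netloc == urllib.parse.urlsplit(SEED).netloc:
              queue.append(nxt)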

Why it Matters in E-commerce

  • Coverage: Ensures categories and product detail pages (PDPs) are found and kept fresh.
  • Speed: Picks up price/stock changes (if pages expose them in HTML/JSON).
  • SEO hygiene: Surfaces issues (redirect chains, broken canonicals, blocked pages).

Best Practices

  • Provide sitemaps and stable linking; avoid orphan pages.
  • Control faceted URLs: canonicalize or block infinite combinations.
  • Pre-render critical content or enable SSR/dynamic rendering.
  • Use HTTP caching (ETag/Last-Modified) to reduce load; see the conditional-request sketch after this list.
  • Prioritize: categories > popular PDPs > long-tail; set revisit rates (second sketch below).
  • Monitor with logs and crawl dashboards; fix 4xx/5xx fast.
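
To make the caching bullet concrete, here is a hedged sketch of conditional revisits using requests (an assumed dependency); a production crawler would persist the validators between runs instead of keeping them in memory:

  # Conditional refetch: send stored validators, treat 304 as "unchanged".
  import requests

  cache = {}  # url -> {"etag": ..., "last_modified": ..., "body": ...}

  def fetch_if_changed(url):
      entry = cache.get(url, {})
      headers = {}
      if entry.get("etag"):
          headers["If-None-Match"] = entry["etag"]
      if entry.get("last_modified"):
          headers["If-Modified-Since"] = entry["last_modified"]
      resp = requests.get(url, headers=headers, timeout=10)
      if resp.status_code == 304:           # not modified: skip re-parsing and re-indexing
          return entry["body"], False
      cache[url] = {
          "etag": resp.headers.get("ETag"),
          "last_modified": resp.headers.get("Last-Modified"),
          "body": resp.text,
      }
      return resp.text, True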
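
And for prioritization, one possible way to encode "categories > popular PDPs > long-tail" is a revisit schedule keyed by page tier; the tier names, intervals, and URLs below are illustrative assumptions, not a standard:

  # Revisit scheduler: higher-value tiers get shorter revisit intervals.
  import heapq
  import time

  REVISIT_SECONDS = {
      "category": 6 * 3600,        # check categories a few times a day
      "popular_pdp": 24 * 3600,    # popular product pages daily
      "long_tail": 7 * 86400,      # long-tail pages weekly
  }

  def schedule(queue, url, tier, now=None):
      now = time.time() if now is None else now
      heapq.heappush(queue, (now + REVISIT_SECONDS[tier], url, tier))

  def due(queue, now=None):
      now = time.time() if now is None else now
      while queue and queue[0][0] <= now:   # pop everything whose revisit time has passed
          yield heapq.heappop(queue)

  q = []
  schedule(q, "https://shop.example.com/c/shoes", "category")
  schedule(q, "https://shop.example.com/p/sku-123", "popular_pdp")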

Challenges

  • Crawl explosion from filters/pagination.
  • JavaScript-only content hidden without rendering support.
  • Parameter noise, locale duplicates, and session IDs.
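
A common mitigation for the parameter and session-ID noise above is to normalize URLs before queuing them, so duplicates collapse into a single crawl key. A sketch, assuming a hand-tuned deny-list of parameters (the list below is hypothetical):

  # Normalize URLs before deduplication: drop session/tracking params, sort the rest.
  from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

  DROP_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign", "gclid"}

  def normalize(url):
      parts = urlsplit(url)
      kept = sorted(
          (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
          if k.lower() not in DROP_PARAMS
      )
      # Lower-case host, keep a non-empty path, drop the fragment.
      return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                         parts.path or "/", urlencode(kept), ""))

  # Both of these collapse to the same crawl key:
  # normalize("https://shop.example.com/c/shoes?sessionid=abc&color=red")
  # normalize("https://shop.example.com/c/shoes?color=red&utm_source=mail")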

Examples

  • Nightly internal crawler updates PDP freshness and flags broken links.
  • JS-heavy catalog uses dynamic rendering so the crawler sees full HTML.

Summary

Crawlers keep your index complete and fresh. With clean linking, sitemaps, and rendering strategies, they discover the right pages without blowing up crawl budget.

FAQ

Crawler vs connector? A crawler reads pages; a connector reads structured sources/APIs. Many stacks use both.

Do I need JS rendering? If critical content loads via JavaScript, yes. Prefer SSR or prerendering for reliability.

How should I handle filters? Use canonicals for core combinations, block noisy parameters, and keep one canonical URL per theme.