What is a Spider?
A spider (or web crawler) is an automated agent that discovers URLs, downloads pages, and sends them to an indexer. Spiders can be external (Googlebot, Bingbot) or internal (your own site-search/crawl jobs for documentation, product detail pages (PDPs), and guides).
How It Works (quick)
- Seeds → frontier: Start from seed URLs (sitemaps, links), maintain a priority queue.
- Fetch & render: Respect robots.txt and meta robots; fetch HTML; render JS if allowed/needed.
- Parse & extract: Follow links, canonical/hreflang, structured data; capture last-modified.
- De-duplicate: Normalize URLs (params, fragments); detect near-duplicates.
- Rate & budget: Throttle per host; schedule revisits by change rate.
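A minimal sketch of that loop, using only the Python standard library; the seed list, politeness delay, fetch limit, and fixed priorities are illustrative assumptions, and rendering, meta-robots handling, and change-rate revisit scheduling are left out.

```python
# A minimal frontier-driven crawl loop (standard library only); the seed list,
# politeness delay, and fixed priorities are illustrative assumptions.
import heapq
import time
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize(url):
    """Drop fragments and lowercase the host so trivial duplicates collapse."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, parts.query, ""))


def crawl(seeds, max_pages=50, delay=1.0):
    frontier = [(0, normalize(s)) for s in seeds]  # (priority, url) min-heap
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}            # de-duplication set
    robots, last_fetch = {}, {}                    # per-host robots cache / throttle clock

    while frontier and max_pages > 0:
        _, url = heapq.heappop(frontier)
        parts = urlsplit(url)
        host = parts.netloc

        # Respect robots.txt, cached per host.
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass
            robots[host] = rp
        if not robots[host].can_fetch("*", url):
            continue

        # Throttle: at most one request per host every `delay` seconds.
        wait = delay - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_fetch[host] = time.time()

        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        max_pages -= 1
        yield url, html

        # Parse links, normalize, de-duplicate, and re-queue.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = normalize(urljoin(url, href))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (1, link))
```

Calling crawl(["https://shop.example.com/"]) (a hypothetical host) yields (url, html) pairs ready for parsing and indexing.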
Why It Matters in E-commerce
- SEO: Ensures PDPs, categories, and content are discoverable and refreshed.
- Site search: Internal spiders feed help-center articles, blog posts, and spec sheets into your index.
- Governance: Prevents crawl bloat from facets/params and stale variants.
Best Practices
- Sitemap hygiene: XML sitemaps with canonical, 200, indexable URLs; keep lastmod accurate (see the first sketch after this list).
- Robots control: Block infinite facets and duplicates; allow core landing pages.
- Canonical & hreflang: Point variants to a single canonical; declare locales.
- JS & SPA: Prefer SSR/ISR for crawlable content; avoid blocking resources.
- Parameters: Whitelist a few SEO-worthy params; add noindex to the rest (see the second sketch after this list).
- Performance: Fast 2xx responses; stable 301s; use 410 for permanent removals.
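For the sitemap-hygiene item, a minimal sketch of a generator that emits only canonical, indexable, 200 URLs with an accurate lastmod; the page records, field names, and hosts are assumptions, not a real catalog schema.

```python
# A minimal sketch of sitemap hygiene (standard library only); page records,
# field names, and URLs are illustrative assumptions from a hypothetical catalog.
from datetime import date
from xml.etree import ElementTree as ET

# Assumed shape of page records from a CMS or product catalog.
pages = [
    {"loc": "https://shop.example.com/sneakers/air-runner",
     "updated": date(2024, 5, 2), "indexable": True, "is_canonical": True, "status": 200},
    {"loc": "https://shop.example.com/sneakers?color=black",
     "updated": date(2024, 5, 2), "indexable": False, "is_canonical": False, "status": 200},
]


def build_sitemap(pages):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        # Sitemap hygiene: only canonical, indexable URLs that return 200.
        if not (page["indexable"] and page["is_canonical"] and page["status"] == 200):
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        # lastmod should reflect a real content change, not the generation time.
        ET.SubElement(url, "lastmod").text = page["updated"].isoformat()
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap(pages))
```

Driving lastmod from a real content-change timestamp rather than the build time is what lets spiders schedule revisits sensibly.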
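And for the parameter item, a small sketch of a whitelist-style normalizer; the SEO_PARAMS set and URLs are assumptions to be replaced with your own rules.

```python
# A sketch of a parameter whitelist (an assumption, not a universal rule): keep a
# few SEO-worthy query parameters and strip the rest before queueing or linking.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters worth keeping in crawlable URLs.
SEO_PARAMS = {"page", "sort"}


def strip_crawl_bloat(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in SEO_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


print(strip_crawl_bloat("https://shop.example.com/sneakers?color=black&size=45&page=2"))
# -> https://shop.example.com/sneakers?page=2
```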
Challenges
- Faceted URL explosions, JS-only content, soft 404s, and inconsistent canonicals.
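One common way to surface the soft-404 problem is to probe a path that cannot exist and see whether the site still answers 200; a rough sketch, with a hypothetical host and probe path:

```python
# Heuristic soft-404 probe (a sketch, not a definitive test): request a path that
# almost certainly does not exist and check whether the site still answers 200.
import uuid
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin
from urllib.request import urlopen


def serves_soft_404s(base_url):
    bogus = urljoin(base_url, f"/definitely-missing-{uuid.uuid4().hex}")
    try:
        status = urlopen(bogus, timeout=10).status
    except HTTPError as err:
        status = err.code          # a real 404/410 lands here
    except URLError:
        return None                # network problem, no verdict
    # A 200 for a nonsense path means "not found" pages look like real pages
    # to a spider, which wastes crawl budget and pollutes the index.
    return status == 200


print(serves_soft_404s("https://shop.example.com/"))
```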
Examples
- Category filters generate URLs like ?size=45&color=black: blocked from indexing but still crawlable, since the filtered pages serve the shopping UX (see the check below).
- An internal crawler ingests returns-policy pages to power on-site answers.
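To make the first example concrete, a small check, assuming a hypothetical shop host, that a filtered URL carries a noindex directive (header or meta tag) while still being fetchable:

```python
# A sketch that checks whether a faceted URL carries a noindex directive, either
# via the X-Robots-Tag header or a robots meta tag. The URL is an assumption.
from html.parser import HTMLParser
from urllib.request import urlopen


class RobotsMetaParser(HTMLParser):
    """Collect content values of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", "").lower())


def is_noindexed(url):
    response = urlopen(url, timeout=10)
    # Header-level directive (useful for non-HTML resources too).
    header = (response.headers.get("X-Robots-Tag") or "").lower()
    parser = RobotsMetaParser()
    parser.feed(response.read().decode("utf-8", "replace"))
    return "noindex" in header or any("noindex" in d for d in parser.directives)


# Faceted filter URL from the example above (hypothetical host).
print(is_noindexed("https://shop.example.com/sneakers?size=45&color=black"))
```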
Summary
Spiders make discovery possible. Give them clean sitemaps, strong canonicals, and safe parameter rules so the right pages get crawled and indexed quickly.