What is an Inverted File?
An inverted file is the on-disk form of an inverted index segment. It holds the term dictionary, the postings lists (doc IDs, term frequencies, positions), and supporting metadata, usually in compressed form. It is the storage unit the search engine reads to serve queries.
How It Works (quick)
- Segments: New writes create segments; background merges consolidate them.
- Core blocks:
  - Term dictionary → pointers to postings.
  - Postings lists → doc IDs + positions/offsets for phrases/highlights.
  - Skip data and block compression → faster jumps within postings, smaller footprint (e.g., block codecs).
- Integrity: Checksums/footers; versioning for safe upgrades.
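As a rough illustration of the core blocks above, the sketch below builds positional postings, delta-encodes each term's doc IDs, packs the gaps as variable-length integers, and keeps a term dictionary of (offset, length) pointers into one byte blob. The function names and the byte layout are assumptions made for this example. Real engines use their own, more elaborate formats with skip data, checksums, and footers.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each term to {doc_id: [positions]} from a {doc_id: text} dict."""
    postings = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    return postings


def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_doc_ids(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_doc_ids(data):
    """Inverse of encode_doc_ids: rebuild absolute doc IDs from gaps."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids


def write_segment(postings):
    """Return (term_dict, blob); term_dict maps term -> (offset, length)."""
    term_dict, blob = {}, bytearray()
    for term in sorted(postings):
        encoded = encode_doc_ids(sorted(postings[term]))
        term_dict[term] = (len(blob), len(encoded))
        blob += encoded
    return term_dict, bytes(blob)


def lookup(term_dict, blob, term):
    """Follow the term dictionary pointer and decode that term's doc IDs."""
    if term not in term_dict:
        return []
    offset, length = term_dict[term]
    return decode_doc_ids(blob[offset:offset + length])
```

Storing gaps rather than absolute doc IDs keeps most values small, which is why variable-length encoding shrinks the postings. Positions are built here but not serialized, to keep the sketch short.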
Why It Matters in E-commerce
- Latency: Well-compressed, block-skippable postings speed search and facets.
- Scale: Efficient files keep storage and I/O in check for large catalogs.
- Freshness: A balanced merge policy keeps new and updated products searchable quickly without slowing queries during heavy updates.
Best Practices
- Tune merge policy: Avoid many tiny segments; keep merges off peak hours.
- Right fields in the right store: Positions for titles/attributes; doc values for price/rating/stock.
- Compression: Use current block codecs, and index or store only the fields you actually query or display.
- Monitoring: Track segment count, merge backlog, cache hit rate, I/O.
- Backups: Snapshot segments and practice restores.
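To make the merge-policy advice more concrete, here is a minimal sketch of two pieces: choosing which segments to merge once there are too many, and merging their postings while dropping deleted documents. The names `pick_merge` and `merge_postings`, the `max_segments` and `merge_factor` parameters, and the "merge the smallest first" rule are all assumptions for illustration, not the policy of any particular engine.

```python
import heapq


def pick_merge(segment_sizes, max_segments=10, merge_factor=4):
    """Return indices of the smallest segments to merge, or [] if none needed."""
    if len(segment_sizes) <= max_segments:
        return []
    by_size = sorted(range(len(segment_sizes)), key=lambda i: segment_sizes[i])
    return sorted(by_size[:merge_factor])


def merge_postings(segments, deleted=frozenset()):
    """Merge per-segment {term: sorted doc IDs} maps, dropping deleted docs."""
    merged = {}
    for term in sorted(set().union(*segments)):
        lists = [seg[term] for seg in segments if term in seg]
        # heapq.merge keeps the combined doc IDs in sorted order.
        doc_ids = [d for d in heapq.merge(*lists) if d not in deleted]
        if doc_ids:  # a term whose docs were all deleted disappears
            merged[term] = doc_ids
    return merged
```

For example, `pick_merge([100, 5, 3, 80, 4, 2], max_segments=4, merge_factor=3)` selects the three smallest segments. Merging small segments first keeps each merge cheap, at the cost of rewriting some data more than once.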
Challenges
- Bloat from unused fields.
- Hot merging under heavy ingest.
- Schema drift across locales.
Examples
- Keep long descriptions in stored fields; use postings with positions for titles.
- Add vector fields, stored in separate files alongside the inverted file, only if you actually run semantic (embedding-based) retrieval.
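The reason titles benefit from positions is phrase matching: a query such as "running shoes" should match only where the terms are adjacent. The sketch below shows one way to do that over in-memory positional postings of the form {term: {doc_id: [positions]}}. The `phrase_match` function and this data layout are assumptions for the example.

```python
def phrase_match(postings, phrase):
    """Return sorted doc IDs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in postings for t in terms):
        return []
    # Candidate docs must contain every term of the phrase.
    candidates = set(postings[terms[0]])
    for t in terms[1:]:
        candidates &= set(postings[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(postings[t][doc_id]) for t in terms[1:]]
        # A hit needs some start position followed by each later term in order.
        if any(all(start + i + 1 in later[i] for i in range(len(later)))
               for start in postings[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits
```

Without positions, the index could only tell that both terms appear somewhere in the title, so a title like "shoes for running" would match the phrase too.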
Summary
The inverted file is how your logical inverted index lives on disk. A sound merge and compression strategy, combined with field hygiene, delivers fast, reliable storefront search.