Wormhole llms-full.jsonl — RAG and chunking strategy
Purpose: How to index Wormhole’s full documentation export for retrieval-augmented generation without blowing context limits or drowning out Chain 138 canonical facts.
Prerequisite: Download the corpus with INCLUDE_FULL_JSONL=1 bash scripts/doc/sync-wormhole-ai-resources.sh (or --full-jsonl). File: third-party/wormhole-ai-docs/llms-full.jsonl (gitignored; large).
Playbook (tiers): WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md
Category-first retrieval (default policy)
- Before querying llms-full.jsonl, resolve intent:
  - Broad protocol → start from mirrored categories/basics.md or reference.md.
  - Product-specific → pick the matching category file (ntt.md, cctp.md, typescript-sdk.md, etc.) from the mirror or https://wormhole.com/docs/ai/categories/<name>.md.
- Use site-index.json (tier 2) to rank page-level id / title / preview / categories and obtain html_url / resolved_md_url.
- Only then ingest or search full JSONL lines that correspond to those pages (if your pipeline supports filtering by id or URL prefix).
This keeps answers aligned with Wormhole’s own doc structure and reduces irrelevant hits.
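The policy above can be sketched in Python as follows. This is a minimal illustration, not the repo's implementation: the keyword map, the function names pick_category and rank_pages, and the term-overlap scoring are all assumptions. Only the category file names and the title / preview fields come from the text above, and the layout of site-index.json is assumed to reduce to a list of page objects.

```python
# Hypothetical keyword map; only the file names are taken from the doc.
CATEGORY_KEYWORDS = {
    "ntt.md": ("ntt", "native token transfer"),
    "cctp.md": ("cctp",),
    "typescript-sdk.md": ("typescript", "sdk"),
}


def pick_category(query: str, default: str = "basics.md") -> str:
    """Return the category file to consult first for this query."""
    q = query.lower()
    for filename, keywords in CATEGORY_KEYWORDS.items():
        if any(k in q for k in keywords):
            return filename
    return default  # broad protocol questions fall back to the basics file


def rank_pages(query: str, pages: list[dict], top_k: int = 5) -> list[dict]:
    """Rank site-index entries by naive term overlap with title + preview."""
    terms = set(query.lower().split())

    def score(page: dict) -> int:
        text = f"{page.get('title', '')} {page.get('preview', '')}".lower()
        return sum(1 for t in terms if t in text)

    scored = [(score(p), p) for p in pages]
    scored = [sp for sp in scored if sp[0] > 0]          # drop non-matches
    scored.sort(key=lambda sp: sp[0], reverse=True)      # stable, best first
    return [p for _, p in scored[:top_k]]
```

The ids of the pages returned by rank_pages can then serve as the filter for the full JSONL search. A real pipeline would likely replace the overlap score with BM25 or embedding similarity.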
Chunking llms-full.jsonl
The file is JSON Lines: each line is one JSON object (typically one doc page or chunk with metadata).
Recommended:
- Parse line-by-line (streaming); do not load the entire file into RAM for parsing.
- One line = one logical chunk if each object already represents a single page; if objects are huge, split on sections or headings when present in the schema.
- Metadata to store per chunk: at minimum id, title, slug, html_url, and any categories / hash fields present in that line. Prefer storing source URL for citation in agent answers.
- Embeddings: embed title + "\n\n" + body_or_preview (or equivalent text field in the object); keep URL in metadata only for the retriever to return to the user.
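A streaming reader along these lines might look like the sketch below. The JSONL schema is not documented here, so the text-field candidates in TEXT_FIELDS are guesses to verify against a real line; iter_chunks is a hypothetical helper name. Splitting oversized objects on sections is left out.

```python
import json
from typing import Iterable, Iterator

# Metadata keys named in the recommendations above.
META_FIELDS = ("id", "title", "slug", "html_url", "categories", "hash")
# Assumed candidates for the main text field; check the real schema.
TEXT_FIELDS = ("body", "content", "text", "preview")


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one chunk per JSONL line without loading the whole file."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if not isinstance(obj, dict):
            continue
        # First non-empty string among the candidate text fields.
        body = next(
            (obj[f] for f in TEXT_FIELDS if isinstance(obj.get(f), str) and obj[f]),
            "",
        )
        yield {
            "embed_text": f"{obj.get('title', '')}\n\n{body}",
            # The URL stays in metadata only; it is not embedded.
            "metadata": {k: obj[k] for k in META_FIELDS if k in obj},
        }


# Usage (file objects iterate line by line, so memory stays flat):
#   with open("third-party/wormhole-ai-docs/llms-full.jsonl", encoding="utf-8") as fh:
#       for chunk in iter_chunks(fh):
#           ...
```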
Deduplication: if the same hash or id appears across syncs, replace vectors for that id on re-index.
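One way to express the replace-on-re-index rule, using a plain dict as a stand-in for a vector store. The upsert helper and its return values are hypothetical, and skipping work when the hash is unchanged is an added optimization, not something the rule requires.

```python
def upsert(index: dict, chunk_id: str, content_hash: str, vector: list[float]) -> str:
    """Insert or replace the vector stored for chunk_id; report what happened."""
    existing = index.get(chunk_id)
    if existing is not None and existing["hash"] == content_hash:
        return "unchanged"  # same id and same hash: keep the stored vector
    index[chunk_id] = {"hash": content_hash, "vector": vector}
    return "inserted" if existing is None else "replaced"


index: dict = {}
upsert(index, "page-1", "h1", [0.1, 0.2])   # first sync
upsert(index, "page-1", "h2", [0.3, 0.4])   # content changed: vector replaced
```

Keying on id rather than hash keeps exactly one vector per page, so stale text cannot outrank the current version after a sync.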
Query flow (RAG)
flowchart TD
Q[User query] --> Intent{Wormhole product area?}
Intent -->|yes| Cat[Retrieve from category md slice]
Intent -->|unclear| Idx[Search site-index.json previews]
Cat --> NeedFull{Need deeper text?}
Idx --> NeedFull
NeedFull -->|no| Ans[Answer with citations]
NeedFull -->|yes| JSONL[Vector search filtered llms-full.jsonl by category or id]
JSONL --> Ans
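The same flow can be written as a small planning function, which is handy for unit-testing the routing separately from retrieval. The step labels and the name plan_retrieval are illustrative assumptions; how intent and the need for deeper text are decided is left to the caller.

```python
from typing import Optional


def plan_retrieval(category: Optional[str], need_full: bool) -> list[str]:
    """Return the ordered retrieval steps for one query, mirroring the flowchart."""
    # Known product area: category slice; unclear: site-index previews.
    steps = [f"category_md:{category}"] if category else ["site_index_previews"]
    if need_full:
        steps.append("filtered_full_jsonl")  # vector search filtered by category or id
    steps.append("answer_with_citations")
    return steps
```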
Boundaries
- RAG over Wormhole docs improves Wormhole answers; it does not override EXPLORER_TOKEN_LIST_CROSSCHECK.md or CCIP runbooks for Chain 138 deployment truth.
- If a user question mixes both (e.g. “bridge USDC to Chain 138 via Wormhole”), answer in two explicit sections: Wormhole mechanics vs this repo’s CCIP / 138 facts.
Re-sync and audit
- After sync-wormhole-ai-resources.sh, commit or archive third-party/wormhole-ai-docs/manifest.json when you want a recorded snapshot (hashes per file).
- Rebuild or delta-update the vector index when manifest.json changes.
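A delta update can be driven by comparing two manifest snapshots. The sketch below assumes each manifest reduces to a flat mapping of file path to hash string; the actual manifest.json layout may differ and would need adapting. diff_manifests is a hypothetical helper.

```python
def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two path-to-hash mappings and list what needs re-indexing."""
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }


delta = diff_manifests(
    {"llms-full.jsonl": "aaa", "categories/ntt.md": "bbb"},
    {"llms-full.jsonl": "ccc", "categories/ntt.md": "bbb", "categories/cctp.md": "ddd"},
)
# Re-embed files in delta["added"] and delta["changed"]; drop vectors for delta["removed"].
```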