proxmox/docs/04-configuration/WORMHOLE_AI_RESOURCES_RAG.md

Wormhole llms-full.jsonl — RAG and chunking strategy

Purpose: How to index Wormhole's full documentation export for retrieval-augmented generation without blowing context limits or drowning out Chain 138 canonical facts.

Prerequisite: Download the corpus with INCLUDE_FULL_JSONL=1 bash scripts/doc/sync-wormhole-ai-resources.sh (or --full-jsonl). File: third-party/wormhole-ai-docs/llms-full.jsonl (gitignored; large).

Playbook (tiers): WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md


Category-first retrieval (default policy)

  1. Before querying llms-full.jsonl, resolve intent:
    • Broad protocol → start from mirrored categories/basics.md or reference.md.
    • Product-specific → pick the matching category file (ntt.md, cctp.md, typescript-sdk.md, etc.) from the mirror or https://wormhole.com/docs/ai/categories/<name>.md.
  2. Use site-index.json (tier 2) to rank page-level id / title / preview / categories and obtain html_url / resolved_md_url.
  3. Only then ingest or search full JSONL lines that correspond to those pages (if your pipeline supports filtering by id or URL prefix).

This keeps answers aligned with Wormhole's own doc structure and reduces irrelevant hits.
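The tier-2 ranking step above can be sketched as a small scoring pass over site-index.json. This is a minimal illustration, assuming each index entry carries id / title / preview / categories fields as the doc describes; the exact field names should be checked against the real export schema.

```python
import json


def rank_pages(index_path, query, top_k=5):
    """Rank site-index.json entries by naive keyword overlap.

    Field names (id, title, preview, categories) are assumptions
    based on this doc's description of tier 2, not a verified schema.
    A real pipeline would use embeddings here; plain term matching
    keeps the sketch self-contained.
    """
    terms = set(query.lower().split())
    with open(index_path, encoding="utf-8") as f:
        pages = json.load(f)
    scored = []
    for page in pages:
        haystack = " ".join([
            page.get("title", ""),
            page.get("preview", ""),
            " ".join(page.get("categories", [])),
        ]).lower()
        score = sum(1 for t in terms if t in haystack)
        if score:
            scored.append((score, page))
    scored.sort(key=lambda pair: -pair[0])
    return [page for _, page in scored[:top_k]]
```

The returned pages carry their html_url / resolved_md_url, so step 3 can restrict full-JSONL ingestion to exactly those ids.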


Chunking llms-full.jsonl

The file is JSON Lines: each line is one JSON object (typically one doc page or chunk with metadata).

Recommended:

  • Parse line-by-line (streaming); do not load the entire file into RAM for parsing.
  • One line = one logical chunk if each object already represents a single page; if objects are huge, split on sections or headings when present in the schema.
  • Metadata to store per chunk: at minimum id, title, slug, html_url, and any categories / hash fields present in that line. Prefer storing source URL for citation in agent answers.
  • Embeddings: embed title + "\n\n" + body_or_preview (or equivalent text field in the object); keep URL in metadata only for the retriever to return to the user.

Deduplication: if the same hash or id appears across syncs, replace vectors for that id on re-index.
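The streaming-parse and replace-on-reindex rules above can be combined in one pass. A minimal sketch, assuming each JSONL line carries an id (falling back to slug) as described in the metadata bullet; neither name is a verified contract of the export.

```python
import json


def iter_chunks(jsonl_path):
    """Stream llms-full.jsonl one line at a time (never load it whole).

    Yields (chunk_id, record) pairs. The id/slug fallback is an
    assumption about the export schema, flagged in the lead-in.
    """
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield record.get("id") or record.get("slug"), record


def upsert(index, jsonl_path):
    """Re-index with replace-on-duplicate semantics.

    If the same id appears across syncs, the later record wins,
    matching the deduplication rule above.
    """
    for chunk_id, record in iter_chunks(jsonl_path):
        index[chunk_id] = record  # last write wins on re-sync
    return index
```

In a real pipeline `index[chunk_id] = record` would be an embed-and-upsert call into the vector store, keyed by the same id.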


Query flow (RAG)

```mermaid
flowchart TD
  Q[User query] --> Intent{Wormhole product area?}
  Intent -->|yes| Cat[Retrieve from category md slice]
  Intent -->|unclear| Idx[Search site-index.json previews]
  Cat --> NeedFull{Need deeper text?}
  Idx --> NeedFull
  NeedFull -->|no| Ans[Answer with citations]
  NeedFull -->|yes| JSONL[Vector search filtered llms-full.jsonl by category or id]
  JSONL --> Ans
```
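The same routing decision can be expressed as a tiny dispatcher. This is an illustrative sketch only: the step labels and the keyword-based category match are inventions for the example, not a real API.

```python
def route_query(query, known_categories, need_full_text=False):
    """Mirror the flowchart: category slice first, site-index
    fallback when the product area is unclear, and the full JSONL
    only when deeper text is actually needed.

    Returns an ordered list of (source, filter) steps; the labels
    are illustrative, not names of real components.
    """
    q = query.lower()
    matched = [c for c in known_categories if c in q]
    steps = []
    if matched:
        steps.append(("category-md", matched))
    else:
        steps.append(("site-index", None))
    if need_full_text:
        # Filter the vector search by the matched categories, if any.
        steps.append(("llms-full-jsonl", matched or None))
    steps.append(("answer-with-citations", None))
    return steps
```

The point of the sketch is the ordering: the expensive JSONL tier is reached only after the cheaper tiers, and always carries a category filter when one is known.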

Boundaries

  • RAG over Wormhole docs improves Wormhole answers; it does not override EXPLORER_TOKEN_LIST_CROSSCHECK.md or CCIP runbooks for Chain 138 deployment truth.
  • If a user question mixes both (e.g. “bridge USDC to Chain 138 via Wormhole”), answer in two explicit sections: Wormhole mechanics vs. this repo's CCIP / Chain 138 facts.

Re-sync and audit

  • After sync-wormhole-ai-resources.sh, commit or archive third-party/wormhole-ai-docs/manifest.json when you want a recorded snapshot (hashes per file).
  • Rebuild or delta-update the vector index when manifest.json changes.
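A delta update can key off the per-file hashes in manifest.json. A minimal sketch, assuming the manifest is (or can be flattened to) a file-path-to-hash map; the actual manifest layout written by the sync script should be checked before relying on this shape.

```python
import json


def changed_files(old_manifest_path, new_manifest_path):
    """Return paths whose hash changed, or that are new, between
    two manifest.json snapshots.

    Assumes each manifest is a flat {path: hash} object; adapt the
    loading step if the real manifest nests its entries.
    """
    with open(old_manifest_path, encoding="utf-8") as f:
        old = json.load(f)
    with open(new_manifest_path, encoding="utf-8") as f:
        new = json.load(f)
    return sorted(path for path, h in new.items() if old.get(path) != h)
```

Re-embed only the returned paths (and drop vectors for paths present in the old manifest but absent from the new one) instead of rebuilding the whole index on every sync.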