proxmox/docs/04-configuration/WORMHOLE_AI_RESOURCES_RAG.md
defiQUG 0f70fb6c90 feat(wormhole): AI docs mirror, MCP server, playbook, RAG, verify script
- Playbook + RAG doc; Cursor rule; sync script + manifest snapshot
- mcp-wormhole-docs: resources + wormhole_doc_search (read-only)
- verify-wormhole-ai-docs-setup.sh health check

Wire pnpm-workspace + lockfile + AGENTS/MCP_SETUP/MASTER_INDEX in a follow-up if not already committed.

Made-with: Cursor
2026-03-31 21:05:06 -07:00

# Wormhole `llms-full.jsonl` — RAG and chunking strategy
**Purpose:** How to index Wormhole's full documentation export for retrieval-augmented generation without exceeding context limits or drowning out Chain 138 canonical facts.
**Prerequisite:** Download the corpus with `INCLUDE_FULL_JSONL=1 bash scripts/doc/sync-wormhole-ai-resources.sh` (or `--full-jsonl`). File: `third-party/wormhole-ai-docs/llms-full.jsonl` (gitignored; large).
**Playbook (tiers):** [WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md](WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md)
---
## Category-first retrieval (default policy)
1. **Before** querying `llms-full.jsonl`, resolve intent:
   - **Broad protocol** → start from mirrored `categories/basics.md` or `reference.md`.
   - **Product-specific** → pick the matching category file (`ntt.md`, `cctp.md`, `typescript-sdk.md`, etc.) from the mirror or `https://wormhole.com/docs/ai/categories/<name>.md`.
2. Use **`site-index.json`** (tier 2) to rank **page-level** `id` / `title` / `preview` / `categories` and obtain `html_url` / `resolved_md_url`.
3. Only then ingest or search **full JSONL** lines that correspond to those pages (if your pipeline supports filtering by `id` or URL prefix).
This keeps answers aligned with Wormhole's own doc structure and reduces irrelevant hits.
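The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: the keyword map, `pick_category_file`, and `rank_pages` are hypothetical, and the sketch assumes the entries of `site-index.json` have already been loaded as a list of dicts carrying the `title` / `preview` / `categories` fields named above. Plain keyword overlap stands in for whatever ranking the real pipeline uses.

```python
from __future__ import annotations

import re
from typing import Iterable

# Hypothetical keyword -> category file map. The file names come from this doc;
# the trigger keywords are illustrative assumptions.
CATEGORY_KEYWORDS = {
    "ntt.md": {"ntt"},
    "cctp.md": {"cctp", "usdc"},
    "typescript-sdk.md": {"typescript", "sdk"},
}
DEFAULT_CATEGORY = "basics.md"  # broad-protocol fallback


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_category_file(query: str) -> str:
    """Step 1: resolve intent to a category file before touching the full JSONL."""
    words = tokenize(query)
    for filename, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return filename
    return DEFAULT_CATEGORY


def rank_pages(query: str, pages: Iterable[dict], top_k: int = 5) -> list[dict]:
    """Step 2: rank page-level index entries by keyword overlap with the query."""
    words = tokenize(query)
    scored = []
    for page in pages:
        text = " ".join(
            [
                page.get("title") or "",
                page.get("preview") or "",
                " ".join(page.get("categories") or []),
            ]
        )
        score = len(words & tokenize(text))
        if score > 0:
            scored.append((score, page))
    # Sort on the score only, so the page dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:top_k]]
```

The `id` values of the ranked pages can then serve as the filter for step 3, if the pipeline supports filtering by `id`.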
---
## Chunking `llms-full.jsonl`
The file is **JSON Lines**: each line is one JSON object (typically one doc page or chunk with metadata).
**Recommended:**
- **Parse line-by-line** (streaming); do not load the entire file into RAM for parsing.
- **One line = one logical chunk** if each object already represents a single page; if objects are huge, split on `sections` or headings when present in the schema.
- **Metadata to store per chunk:** at minimum `id`, `title`, `slug`, `html_url`, and any `categories` / `hash` fields present in that line. Prefer storing **source URL** for citation in agent answers.
- **Embeddings:** embed `title + "\n\n" + body_or_preview` (or the equivalent text field in the object). Do not embed the URL; keep it in metadata only, so the retriever can return it to the user as a citation.
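A minimal sketch of the recommendations above, assuming each line is a JSON object. The metadata field names are taken from the list above; the candidate text fields in `TEXT_FIELDS` are assumptions, since the exact schema of `llms-full.jsonl` is not specified here. `iter_chunks` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator

# Metadata fields named in this doc; only those present on a line are kept.
METADATA_FIELDS = ("id", "title", "slug", "html_url", "categories", "hash")
# Assumed candidate names for the main text field; adjust to the real schema.
TEXT_FIELDS = ("body", "content", "text", "preview")


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Stream JSON Lines input and yield one chunk per well-formed object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if not isinstance(record, dict):
            continue
        # First non-empty string among the assumed text fields, else "".
        body = next(
            (record[f] for f in TEXT_FIELDS if isinstance(record.get(f), str) and record[f]),
            "",
        )
        title = record.get("title") or ""
        yield {
            # The URL stays in metadata for citation and is not embedded.
            "metadata": {k: record[k] for k in METADATA_FIELDS if k in record},
            "embedding_text": f"{title}\n\n{body}".strip(),
        }
```

Passing an open file handle (for example `open("third-party/wormhole-ai-docs/llms-full.jsonl", encoding="utf-8")`) as `lines` keeps memory use flat, because Python iterates a file object one line at a time.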
**Deduplication:** if the same `hash` or `id` appears across syncs, replace vectors for that id on re-index.
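One way to read the deduplication rule is as an upsert keyed by `id`, with `hash` used to skip unchanged pages. The sketch below uses a hypothetical in-memory store and an injected `embed` callable; a real vector database would provide its own upsert call, and the chunk shape (`metadata` plus `embedding_text`) is an assumption of this example.

```python
from typing import Callable, Optional, Sequence


class InMemoryIndex:
    """Toy vector store keyed by document id (illustrative only)."""

    def __init__(self, embed: Callable[[str], Sequence[float]]) -> None:
        self._embed = embed
        self._entries: dict = {}  # id -> {"hash", "vector", "metadata"}

    def upsert(self, chunk: dict) -> str:
        """Insert or replace the vector for chunk["metadata"]["id"]."""
        meta = chunk["metadata"]
        doc_id = meta["id"]
        new_hash = meta.get("hash")
        existing = self._entries.get(doc_id)
        # Same id and same known hash: content is unchanged, skip re-embedding.
        if existing is not None and new_hash is not None and existing["hash"] == new_hash:
            return "unchanged"
        self._entries[doc_id] = {
            "hash": new_hash,
            "vector": list(self._embed(chunk["embedding_text"])),
            "metadata": meta,
        }
        return "replaced" if existing is not None else "inserted"

    def get(self, doc_id: str) -> Optional[dict]:
        return self._entries.get(doc_id)

    def __len__(self) -> int:
        return len(self._entries)
```

When a line carries no `hash`, this sketch always re-embeds and replaces, which is the conservative choice.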
---
## Query flow (RAG)
```mermaid
flowchart TD
Q[User query] --> Intent{Wormhole product area?}
Intent -->|yes| Cat[Retrieve from category md slice]
Intent -->|unclear| Idx[Search site-index.json previews]
Cat --> NeedFull{Need deeper text?}
Idx --> NeedFull
NeedFull -->|no| Ans[Answer with citations]
NeedFull -->|yes| JSONL[Vector search filtered llms-full.jsonl by category or id]
JSONL --> Ans
```
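The flowchart can be expressed as a small routing function. This is a sketch under stated assumptions: every callable passed in (`detect_category`, `search_category`, `search_index`, `needs_full_text`, `search_full_jsonl`) is a hypothetical hook that a real pipeline would supply, and hits are assumed to be dicts that may carry an `id`.

```python
from typing import Callable, Optional


def answer_query(
    query: str,
    detect_category: Callable[[str], Optional[str]],
    search_category: Callable[[str, str], list],
    search_index: Callable[[str], list],
    needs_full_text: Callable[[str, list], bool],
    search_full_jsonl: Callable[[str, Optional[str], list], list],
) -> dict:
    """Route a query through category, site index, then (optionally) full JSONL."""
    category = detect_category(query)
    if category is not None:
        # Intent is clear: retrieve from the category markdown slice.
        stage, hits = "category", search_category(category, query)
    else:
        # Intent is unclear: search site-index.json previews.
        stage, hits = "site_index", search_index(query)

    if needs_full_text(query, hits):
        # Deeper text needed: vector search over llms-full.jsonl,
        # filtered by category or by the page ids found so far.
        page_ids = [hit["id"] for hit in hits if "id" in hit]
        stage, hits = "full_jsonl", search_full_jsonl(query, category, page_ids)

    # The caller builds the cited answer from these hits.
    return {"stage": stage, "hits": hits}
```

Returning the `stage` alongside the hits makes it easy to log how often queries fall through to the full corpus.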
---
## Boundaries
- RAG over Wormhole docs improves **Wormhole** answers; it does **not** override [EXPLORER_TOKEN_LIST_CROSSCHECK.md](../11-references/EXPLORER_TOKEN_LIST_CROSSCHECK.md) or CCIP runbooks for **Chain 138** deployment truth.
- If a user question mixes both (e.g. “bridge USDC to Chain 138 via Wormhole”), answer in **two explicit sections**: Wormhole mechanics vs this repo's CCIP / 138 facts.
---
## Re-sync and audit
- After `sync-wormhole-ai-resources.sh`, commit or archive **`third-party/wormhole-ai-docs/manifest.json`** when you want a recorded snapshot (hashes per file).
- Rebuild or delta-update the vector index when `manifest.json` changes.
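A delta update can be driven by comparing two manifest snapshots. The sketch below assumes each snapshot has already been reduced to a flat `{relative_path: hash}` mapping; the actual layout of `manifest.json` is not described here, so that reduction step and the helper names `diff_manifests` and `needs_reindex` are hypothetical.

```python
def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two {path: hash} snapshots and list what changed."""
    return {
        "added": sorted(path for path in new if path not in old),
        "changed": sorted(path for path in new if path in old and new[path] != old[path]),
        "removed": sorted(path for path in old if path not in new),
    }


def needs_reindex(diff: dict) -> bool:
    """True when any file was added, changed, or removed since the last snapshot."""
    return any(diff[key] for key in ("added", "changed", "removed"))
```

Files under `added` and `changed` would be re-chunked and re-embedded, and vectors for files under `removed` would be deleted from the index.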