# Wormhole `llms-full.jsonl` — RAG and chunking strategy

**Purpose:** How to index Wormhole’s full documentation export for retrieval-augmented generation without blowing context limits or drowning out Chain 138 canonical facts.

**Prerequisite:** Download the corpus with `INCLUDE_FULL_JSONL=1 bash scripts/doc/sync-wormhole-ai-resources.sh` (or `--full-jsonl`). File: `third-party/wormhole-ai-docs/llms-full.jsonl` (gitignored; large).

**Playbook (tiers):** [WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md](WORMHOLE_AI_RESOURCES_LLM_PLAYBOOK.md)

---

## Category-first retrieval (default policy)

1. **Before** querying `llms-full.jsonl`, resolve intent:
   - **Broad protocol** → start from mirrored `categories/basics.md` or `reference.md`.
   - **Product-specific** → pick the matching category file (`ntt.md`, `cctp.md`, `typescript-sdk.md`, etc.) from the mirror or `https://wormhole.com/docs/ai/categories/<name>.md`.
2. Use **`site-index.json`** (tier 2) to rank **pages** by their `id` / `title` / `preview` / `categories` fields and to obtain each page's `html_url` / `resolved_md_url`.
3. Only then ingest or search the **full JSONL** lines that correspond to those pages (if your pipeline supports filtering by `id` or URL prefix).

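
A minimal Python sketch of the three steps above, under stated assumptions: the keyword map `CATEGORY_KEYWORDS`, the helper names, and the scoring weights are hypothetical. `pages` is assumed to be a list of site-index entries carrying the fields named in step 2, and `categories` is assumed to be a list of category names that match the category file stems. Loading `site-index.json` is left out because its top-level layout is not described here.

```python
from typing import Iterable, List, Set

# Hypothetical keyword -> category name map; only the category names
# (ntt, cctp, typescript-sdk, basics) come from the policy above.
CATEGORY_KEYWORDS = {
    "ntt": "ntt",
    "cctp": "cctp",
    "sdk": "typescript-sdk",
    "typescript": "typescript-sdk",
}
BROAD_DEFAULT = "basics"


def pick_category(query: str) -> str:
    """Step 1: resolve intent to a category, falling back to the broad one."""
    for word in query.lower().split():
        if word in CATEGORY_KEYWORDS:
            return CATEGORY_KEYWORDS[word]
    return BROAD_DEFAULT


def rank_pages(query: str, pages: Iterable[dict], category: str, top_k: int = 5) -> List[dict]:
    """Step 2: score entries by term overlap in title/preview, boosted on category match."""
    terms = set(query.lower().split())
    scored = []
    for page in pages:
        text = f"{page.get('title', '')} {page.get('preview', '')}".lower()
        score = sum(1 for term in terms if term in text)
        if category in page.get("categories", []):
            score += 2  # arbitrary boost, chosen only for illustration
        if score > 0:
            scored.append((score, page))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:top_k]]


def allowed_ids(ranked: Iterable[dict]) -> Set[str]:
    """Step 3: ids used to filter which llms-full.jsonl lines get searched."""
    return {page["id"] for page in ranked if "id" in page}
```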
Resolving intent first keeps answers aligned with Wormhole’s own doc structure and reduces irrelevant hits.

---

## Chunking `llms-full.jsonl`

The file is **JSON Lines**: each line is one JSON object (typically one doc page or chunk with metadata).

**Recommended:**

- **Parse line-by-line** (streaming); do not load the entire file into RAM for parsing.
- **One line = one logical chunk** if each object already represents a single page; if objects are huge, split on `sections` or headings when present in the schema.
- **Metadata to store per chunk:** at minimum `id`, `title`, `slug`, `html_url`, and any `categories` / `hash` fields present in that line. Prefer storing the **source URL** for citation in agent answers.
- **Embeddings:** embed `title + "\n\n" + body_or_preview` (or the equivalent text field in the object); keep the URL in metadata only, so the retriever can return it to the user.

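
The recommendations in this section can be sketched in Python as follows. This is an illustration under assumptions, not the repo's indexer: the candidate text keys in `TEXT_KEYS` are guesses at the schema, the in-memory `dict` stands in for a real vector store, and the function names are hypothetical. Keying the store by `id` means a re-index overwrites the earlier entry instead of adding a duplicate.

```python
import json
from typing import Dict, Iterable, Iterator

METADATA_KEYS = ("id", "title", "slug", "html_url", "categories", "hash")
# Assumption: the page text lives under one of these keys; adjust to the real schema.
TEXT_KEYS = ("body", "content", "text", "preview")


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one chunk per non-empty JSONL line without buffering the whole input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        body = next((obj[k] for k in TEXT_KEYS if isinstance(obj.get(k), str)), "")
        yield {
            # Keep only the metadata fields that are present on this line.
            "metadata": {k: obj[k] for k in METADATA_KEYS if k in obj},
            # Text to embed: title, blank line, body. The URL stays in metadata.
            "embed_text": f"{obj.get('title', '')}\n\n{body}",
        }


def upsert(index: Dict[str, dict], chunk: dict) -> None:
    """Replace whatever is stored for this id; chunks without an id are skipped."""
    key = chunk["metadata"].get("id")
    if key is not None:
        index[key] = chunk


def build_index(path: str) -> Dict[str, dict]:
    """Stream the file line by line; a file object is itself an iterator of lines."""
    index: Dict[str, dict] = {}
    with open(path, encoding="utf-8") as handle:
        for chunk in iter_chunks(handle):
            upsert(index, chunk)
    return index
```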
**Deduplication:** if the same `hash` or `id` appears across syncs, replace the vectors for that `id` on re-index instead of adding new ones.

---

## Query flow (RAG)

```mermaid
flowchart TD
    Q[User query] --> Intent{Wormhole product area?}
    Intent -->|yes| Cat[Retrieve from category md slice]
    Intent -->|unclear| Idx[Search site-index.json previews]
    Cat --> NeedFull{Need deeper text?}
    Idx --> NeedFull
    NeedFull -->|no| Ans[Answer with citations]
    NeedFull -->|yes| JSONL[Vector search filtered llms-full.jsonl by category or id]
    JSONL --> Ans
```
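
The flowchart above could be wired up roughly as below. This is a sketch only: every callable passed in (`search_category`, `search_site_index`, `search_full_jsonl`, `need_deeper_text`) is a hypothetical hook for your own retriever, and returning only the deeper hits on the "yes" branch is an assumption, since the diagram does not say whether first-pass hits are kept.

```python
from typing import Callable, List, Optional, Set

Hits = List[dict]


def retrieve(
    query: str,
    product_area: Optional[str],
    search_category: Callable[[str, str], Hits],
    search_site_index: Callable[[str], Hits],
    search_full_jsonl: Callable[[str, Set[str]], Hits],
    need_deeper_text: Callable[[str, Hits], bool],
) -> Hits:
    """Route a query through the category slice or site index, then JSONL if needed."""
    if product_area is not None:
        # Intent is clear: read from the matching category file.
        hits = search_category(query, product_area)
    else:
        # Intent is unclear: fall back to site-index previews.
        hits = search_site_index(query)

    if not need_deeper_text(query, hits):
        return hits

    # Restrict the vector search to the pages found in the first pass.
    ids = {hit["id"] for hit in hits if "id" in hit}
    return search_full_jsonl(query, ids)
```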
---
## Boundaries

- RAG over Wormhole docs improves **Wormhole** answers; it does **not** override [EXPLORER_TOKEN_LIST_CROSSCHECK.md](../11-references/EXPLORER_TOKEN_LIST_CROSSCHECK.md) or CCIP runbooks for **Chain 138** deployment truth.
- If a user question mixes both (e.g. “bridge USDC to Chain 138 via Wormhole”), answer in **two explicit sections**: Wormhole mechanics vs this repo’s CCIP / 138 facts.

---

## Re-sync and audit

- After `sync-wormhole-ai-resources.sh`, commit or archive **`third-party/wormhole-ai-docs/manifest.json`** when you want a recorded snapshot (hashes per file).
- Rebuild or delta-update the vector index when `manifest.json` changes.
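
One way to decide between a rebuild and a delta update is to diff two manifest snapshots. The sketch below assumes, without confirmation from the sync script, that `manifest.json` decodes to a flat mapping of file path to hash; if the real layout is nested, extract that mapping first. The function name `changed_files` is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def changed_files(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two path -> hash mappings and report what needs re-indexing."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }


def load_manifest(path: str) -> Dict[str, str]:
    """Read a manifest snapshot (assumed to be a flat JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

If all three lists come back empty, the index can be left alone; otherwise re-embed only the files listed under `added` and `modified` and drop vectors for those under `removed`.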