proxmox/docs/04-configuration/PHOENIX_TTS_API_CONTRACT.md

# Phoenix TTS API contract (ElevenLabs-compatible)

**Last Updated:** 2026-02-10
**Purpose:** Let virtual-banker (and other apps) switch from ElevenLabs to a Phoenix-hosted TTS service by changing only the endpoint URL.

---

## Required endpoints

The Phoenix TTS service **must** implement the same HTTP contract as ElevenLabs for the paths below. Paths are relative to wherever the service is mounted (the app’s `/tts` or similar) and are shown here with the `/v1` prefix.

### 1. Sync text-to-speech

- **Method:** `POST`
- **Path:** `/v1/text-to-speech/:voice_id`
- **Headers:**
  - `Content-Type: application/json`
  - `Accept: audio/mpeg`
  - Auth: either `xi-api-key: <key>` or `Authorization: Bearer <token>` (configurable in client)
- **Body (JSON):**
  ```json
  {
    "text": "Hello world",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75,
      "style": 0,
      "use_speaker_boost": true
    }
  }
  ```
- **Response:** `200 OK`, body = raw **mp3** bytes (`audio/mpeg`).
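
As an illustration of the sync contract above, a client request might be assembled as follows. This is a minimal sketch using only the Python standard library. The names `build_tts_request` and `synthesize` are hypothetical, and the sketch assumes `base_url` already ends in the `/v1` prefix. It sends the `xi-api-key` header; a Bearer-token client would send `Authorization: Bearer <token>` instead.

```python
import json
import urllib.parse
import urllib.request


def build_tts_request(base_url: str, voice_id: str, text: str,
                      api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST for /v1/text-to-speech/:voice_id.

    Assumption: base_url already includes the /v1 prefix,
    e.g. "https://phoenix.example.com/tts/v1".
    """
    voice = urllib.parse.quote(voice_id, safe="")
    url = f"{base_url.rstrip('/')}/text-to-speech/{voice}"
    body = {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
            "style": 0,
            "use_speaker_boost": True,
        },
    }
    headers = {
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
        "xi-api-key": api_key,
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"),
        headers=headers, method="POST",
    )


def synthesize(request: urllib.request.Request, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw mp3 bytes of the response body."""
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from the network call lets the same builder point at either ElevenLabs or Phoenix by changing only `base_url`.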

### 2. Streaming text-to-speech

- **Method:** `POST`
- **Path:** `/v1/text-to-speech/:voice_id/stream`
- **Headers:** Same as sync.
- **Body:** Same JSON as sync.
- **Response:** `200 OK`, body = mp3 bytes delivered as a **stream** (`audio/mpeg`, same format as sync).
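
On the client side, a streamed response can be consumed incrementally rather than buffered in full. The sketch below is illustrative only: `stream_path` and `iter_audio_chunks` are hypothetical helper names, and the chunk size is an arbitrary choice. It reads from any file-like object, which is what `urllib.request.urlopen` returns.

```python
from typing import BinaryIO, Iterator


def stream_path(voice_id: str) -> str:
    """Return the streaming path for a voice, as listed in this contract."""
    return f"/v1/text-to-speech/{voice_id}/stream"


def iter_audio_chunks(response: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield mp3 data piece by piece until the response body is exhausted."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk
```

A caller would typically open the request with `urllib.request.urlopen(...)` and pass each chunk to a player or file as it arrives, for example `for chunk in iter_audio_chunks(response): out.write(chunk)`.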

### 3. Health (recommended)

- **Method:** `GET`
- **Path:** `/health`, served next to the version prefix under the TTS base path (e.g. `https://phoenix.example.com/tts/health` if the base is `.../tts/v1`)
- **Response:** `200 OK` (body optional; used for readiness).
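
A readiness probe could derive the health URL from the configured TTS base URL. The following is a sketch under one assumption, taken from the example above: the last path segment of the base URL is the version prefix (such as `v1`) and `/health` replaces it. The function names `health_url` and `is_ready` are hypothetical.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def health_url(base_url: str) -> str:
    """Map e.g. https://host/tts/v1 to https://host/tts/health.

    Assumption: the final path segment of base_url is the version prefix.
    """
    parts = urlsplit(base_url)
    segments = [s for s in parts.path.split("/") if s]
    segments = segments[:-1]  # drop the version segment, if any
    path = "/" + "/".join(segments + ["health"])
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if GET /health answers 200 OK; the body is ignored."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```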

---

## Optional

- **Auth:** If Phoenix uses a different scheme (e.g. Bearer only), clients set `TTS_AUTH_HEADER_NAME` / `TTS_AUTH_HEADER_VALUE`; the API contract itself does not change.
- **Visemes:** For better lip-sync, a future endpoint could return phoneme/viseme timings; client would call it when available.
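
To make the auth option concrete, a client might resolve its auth header as sketched below. How virtual-banker actually reads `TTS_AUTH_HEADER_NAME` and `TTS_AUTH_HEADER_VALUE` is not specified here, so this is an assumption: both are treated as environment variables, and the `xi-api-key` header is used as the fallback when either is unset. The function name `auth_headers` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def auth_headers(api_key: str,
                 env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick the auth header for TTS requests.

    Assumption: an override applies only when both variables are set;
    otherwise the ElevenLabs-style xi-api-key header is sent.
    """
    env = os.environ if env is None else env
    name = env.get("TTS_AUTH_HEADER_NAME")
    value = env.get("TTS_AUTH_HEADER_VALUE")
    if name and value:
        return {name: value}
    return {"xi-api-key": api_key}
```

For a Bearer-only Phoenix deployment, setting `TTS_AUTH_HEADER_NAME=Authorization` and `TTS_AUTH_HEADER_VALUE="Bearer <token>"` would then switch schemes without touching request code.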

---

## Reference

- Virtual-banker TTS client: `virtual-banker/backend/tts` (see `backend/tts/README.md`).
- ElevenLabs TTS API: [Text-to-speech](https://elevenlabs.io/docs/api-reference/text-to-speech), [Stream](https://elevenlabs.io/docs/api-reference/text-to-speech/stream).