Bring Your Own LLM (BYOL) - Vaani Backend API

Overview

BYOL lets you replace the default platform LLM with any inference engine you control — an on-premise model, a fine-tuned model, a Google ADK agent, a LangGraph workflow, or anything else that can speak the Vaani WebSocket protocol. When BYOL is enabled for an agent, every response turn is routed to your WebSocket server instead of the built-in provider (OpenAI, Google, Groq, etc.). The rest of the pipeline — STT, TTS, telephony, transcripts, Langfuse tracing — stays exactly the same.

How to enable BYOL

1. Open your agent in the dashboard.
2. Go to Brain → Reasoning Language Model (LLM).
3. Switch to the Bring your Own LLM (BYOL) tab.
4. Paste your WebSocket URL (e.g. wss://your-server.example.com/chat/stream).
5. Click Test Connection to verify reachability, then Save URL.
6. Choose a Fallback LLM: either No fallback or Use platform LLM.

The URL is stored under agent_config.persona.senses_capabilities.brain.llm.extra_params.llm_websocket_url and takes effect from the next call.
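
To make the storage path concrete, here is a minimal sketch that reads the URL back out of an agent configuration. It assumes the configuration is available to you as a nested Python dict whose root is the agent_config object; the function name byol_url and the scheme check are illustrative and not part of the Vaani API.

```python
from typing import Optional
from urllib.parse import urlparse

# Key path taken from the documentation; the dict root is assumed to be agent_config.
URL_PATH = (
    "persona", "senses_capabilities", "brain",
    "llm", "extra_params", "llm_websocket_url",
)


def byol_url(agent_config: dict) -> Optional[str]:
    """Return the configured BYOL WebSocket URL, or None if it is absent or invalid."""
    node = agent_config
    for key in URL_PATH:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # Only accept WebSocket schemes (assumption: plain ws:// is allowed for local testing).
    if isinstance(node, str) and urlparse(node).scheme in ("ws", "wss"):
        return node
    return None
```

A missing key anywhere along the path yields None rather than an exception, which is convenient when BYOL has never been configured for the agent.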

WebSocket protocol

Your server must implement the following JSON message exchange over a persistent WebSocket connection. Vaani opens one connection per call (identified by session_id / room name) and sends one request per agent turn.

Connection handshake

Immediately after the WebSocket is accepted, your server must send two JSON frames in order:

{ "interaction_type": "config",    "content": "Server ready" }
{ "interaction_type": "greeting",  "content": "Hello" }

These frames are consumed by the Vaani agent and discarded — they are only used to confirm the connection is live. The content strings may be anything.
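
The handshake can be sketched independently of any particular WebSocket library. In the sketch below, send_text is a hypothetical stand-in for whatever coroutine your framework offers for sending one text frame (for example, the send_text method of a FastAPI WebSocket); the function names are illustrative.

```python
import asyncio
import json
from typing import Awaitable, Callable, List

# Hypothetical alias: any coroutine function that sends one text frame.
SendText = Callable[[str], Awaitable[None]]


def handshake_frames() -> List[dict]:
    """Return the two frames the protocol expects, in the required order."""
    return [
        {"interaction_type": "config", "content": "Server ready"},
        {"interaction_type": "greeting", "content": "Hello"},
    ]


async def send_handshake(send_text: SendText) -> None:
    """Send the config frame, then the greeting frame, each as JSON text."""
    for frame in handshake_frames():
        await send_text(json.dumps(frame))
```

Call send_handshake right after accepting the connection and before reading any request frames.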

Agent → Your server (request)

For every agent turn (after the user finishes speaking) Vaani sends:

{
  "interaction_type": "response_required",
  "response_id": 1,
  "transcript": [
    { "role": "system", "content": "You are a pen salesman …" },
    { "role": "user", "content": "What pens do you sell?" },
    { "role": "assistant", "content": "We carry …" },
    { "role": "user", "content": "Do you have fountain pens?" }
  ]
}

| Field | Type | Description |
| --- | --- | --- |
| `interaction_type` | `"response_required"` | Always this value for a normal turn |
| `response_id` | integer | Monotonically increasing; echoed back in every response chunk |
| `transcript` | array | Full conversation history for this call, including the system prompt as the first `"system"` message |

The system prompt built from your agent configuration is always the first entry with "role": "system". Your server must apply it as the LLM’s instruction/system message for each turn. If you create an ADK LlmAgent or a LangChain chain, pass this text as the agent instruction or system message so the configured persona is honoured.
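
As a rough sketch of handling the request frame, the snippet below parses one frame and separates the system prompt from the rest of the transcript so that each can be passed to your model in the right place. The TurnRequest class and parse_turn_request function are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class TurnRequest:
    response_id: int
    system_prompt: str
    messages: List[dict]  # remaining transcript entries, oldest first


def parse_turn_request(raw: str) -> TurnRequest:
    """Parse a response_required frame and split off the system prompt."""
    frame = json.loads(raw)
    if frame.get("interaction_type") != "response_required":
        raise ValueError("not a response_required frame")
    transcript = frame["transcript"]
    # Per the protocol, the system prompt is always the first entry.
    if not transcript or transcript[0].get("role") != "system":
        raise ValueError("transcript must start with a system message")
    return TurnRequest(
        response_id=frame["response_id"],
        system_prompt=transcript[0]["content"],
        messages=transcript[1:],
    )
```

The system_prompt field is what you would pass as the agent instruction or system message; messages holds the user and assistant turns that follow it.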

Your server → Agent (streaming response)

Stream back one or more chunks, each as a JSON frame:

{
  "response_type": "response",
  "response_id": 1,
  "content": "Yes, we carry ",
  "content_complete": false
}

Send a final frame with "content_complete": true and optionally empty content to signal end of turn:

{
  "response_type": "response",
  "response_id": 1,
  "content": "beautiful Pelikan M800 models.",
  "content_complete": true
}

| Field | Type | Description |
| --- | --- | --- |
| `response_type` | `"response"` | Always this value |
| `response_id` | integer | Must match the `response_id` from the request |
| `content` | string | Text chunk to speak; may be empty on the final frame |
| `content_complete` | boolean | `true` on the last frame of a turn |

You may add "end_call": true to the final frame to signal that the agent should hang up the call after speaking the response.
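
One way to produce the streaming frames is a small generator that wraps the text chunks coming from your model. This is a sketch under the assumption that closing the turn with an empty final frame suits your use case; response_frames is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def response_frames(
    response_id: int, chunks: Iterable[str], end_call: bool = False
) -> Iterator[str]:
    """Yield JSON frames for one turn, ending with a content_complete frame."""
    for text in chunks:
        yield json.dumps({
            "response_type": "response",
            "response_id": response_id,  # must echo the request's response_id
            "content": text,
            "content_complete": False,
        })
    # The final frame carries empty content, which the protocol allows.
    final = {
        "response_type": "response",
        "response_id": response_id,
        "content": "",
        "content_complete": True,
    }
    if end_call:
        final["end_call"] = True  # ask the agent to hang up after speaking
    yield json.dumps(final)
```

Because the final frame is always emitted separately, the generator works with chunk sources of unknown length, such as a token stream.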

Keep-alive (ping/pong)

If your server sends a keep-alive frame, Vaani will echo it back immediately:

{ "response_type": "ping_pong" }
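
Since the echoed keep-alive arrives on the same connection as turn requests, your receive loop needs to tell the two apart. The sketch below assumes the echo is identical to the frame your server sent; classify_incoming and its return labels are hypothetical.

```python
import json

# The keep-alive frame, serialised once for reuse.
PING_FRAME = json.dumps({"response_type": "ping_pong"})


def classify_incoming(raw: str) -> str:
    """Label an incoming frame as 'turn', 'pong', or 'unknown'."""
    frame = json.loads(raw)
    if frame.get("interaction_type") == "response_required":
        return "turn"
    if frame.get("response_type") == "ping_pong":
        return "pong"  # assumed to be Vaani's echo of our keep-alive
    return "unknown"
```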

Fallback behaviour

You can configure what happens when your server is unreachable or returns an error:

| Setting | Behaviour |
| --- | --- |
| No fallback | The turn fails silently; the agent waits for the next user utterance |
| Use platform LLM | The agent falls back to the primary provider configured in the From Providers tab; if that also fails, it tries the fallback provider |

The fallback mode is stored as extra_params.fallback in the agent config ("none" or "platform").
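
The two modes can be illustrated with a small model of the decision order. This is not Vaani's implementation, only a sketch of the behaviour described above: each provider is represented by a hypothetical zero-argument callable that returns the reply text or raises on failure, and returning None when every provider fails is an assumption.

```python
from typing import Callable, Optional

# Hypothetical: a provider returns reply text or raises an exception.
Provider = Callable[[], str]


def resolve_turn(
    byol: Provider,
    fallback_mode: str,
    primary: Provider,
    fallback_provider: Provider,
) -> Optional[str]:
    """Return the reply for one turn, or None if the turn fails silently."""
    if fallback_mode not in ("none", "platform"):
        raise ValueError("fallback_mode must be 'none' or 'platform'")
    try:
        return byol()
    except Exception:
        if fallback_mode == "none":
            return None  # no fallback: wait for the next user utterance
    # "platform": try the primary provider, then the fallback provider.
    for provider in (primary, fallback_provider):
        try:
            return provider()
        except Exception:
            continue
    return None  # assumption: nothing is spoken if every provider fails
```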

Session management

Each call opens a new connection at <ws_url>/<call_id>; the call_id (the room name) is appended automatically. Your server should use the session_id or this path component to isolate per-call state (e.g. conversation memory, tool state). When the call ends, Vaani closes the WebSocket cleanly. You can also expose a DELETE /session/{session_id} endpoint on your server so Vaani can explicitly clean up state (see the example ADK server).
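
Per-call isolation can be as simple as a dictionary keyed by the last path component. The sketch below keeps state in memory only, so it is lost on restart; SessionStore, session_id_from_path, and the shape of the state dict are all hypothetical.

```python
from typing import Dict


def session_id_from_path(path: str) -> str:
    """Return the last path component of the connection path."""
    return path.rstrip("/").rsplit("/", 1)[-1]


class SessionStore:
    """In-memory per-call state, keyed by session id."""

    def __init__(self) -> None:
        self._sessions: Dict[str, dict] = {}

    def get_or_create(self, session_id: str) -> dict:
        """Return the state for this call, creating it on first use."""
        return self._sessions.setdefault(session_id, {"memory": [], "tools": {}})

    def close(self, session_id: str) -> None:
        """Drop the state for a finished call; unknown ids are ignored."""
        self._sessions.pop(session_id, None)

    def __len__(self) -> int:
        return len(self._sessions)
```

Call get_or_create when the connection is accepted and close when the WebSocket closes or a cleanup request arrives, so that state from one call never leaks into another.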

Reference implementation

The vaani-adk-byol-example repository contains a complete FastAPI server (adk_server.py) that:

- Implements the full WebSocket protocol above
- Runs a Google ADK agent (LlmAgent) with the system prompt injected per session
- Exposes REST (/chat), SSE (/chat/sse), and WebSocket (/chat/stream/{session_id}) endpoints
- Can be deployed locally and exposed with ngrok in under 5 minutes

# Install
cd vaani-adk-byol-example
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp example.env .env   # add your GOOGLE_API_KEY

# Run
uvicorn adk_server:app --host 0.0.0.0 --port 8090 --reload

# Expose publicly
ngrok http 8090
# → paste wss://xxxx.ngrok-free.app/chat/stream into the BYOL tab
#   (the call_id is appended automatically, matching /chat/stream/{session_id})
