IVR, Voice Bot & Interpreter¶
Overview¶
The IVR (Interactive Voice Response) system is the first point of contact for incoming calls. It combines classic DTMF navigation with AI-based speech recognition and provides access to the Voice Bot and the real-time interpreter.
IVR system¶
Call flow¶
Incoming call
│
▼
DID lookup (Platform API)
│
▼
IVR menu assigned?
├── Yes → Personalized greeting (TTS)
│ │
│ ▼
│ Wait for input (DTMF + speech in parallel)
│ │
│ ▼
│ Execute action
│
└── No → Fallback (Extension / Ring Group)
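The branching in the call flow above can be sketched in a few lines. This is only an illustration: the `IvrMenu` class, the `DID_TO_MENU` table (standing in for the Platform API lookup), and the phone number are hypothetical and not taken from the platform code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IvrMenu:
    name: str
    greeting_tts: str


# Hypothetical in-memory stand-in for the DID lookup against the Platform API.
DID_TO_MENU = {
    "+4930123456": IvrMenu("Hauptmenue", "Willkommen bei xynap."),
}


def route_incoming_call(did: str) -> str:
    """Return the next step for a call to `did`, following the call flow."""
    menu: Optional[IvrMenu] = DID_TO_MENU.get(did)
    if menu is None:
        # No IVR menu assigned: fall back to an extension or ring group.
        return "fallback"
    # Menu assigned: play the greeting, then wait for DTMF or speech input.
    return f"greet:{menu.name}"
```

A DID without an assigned menu never reaches the greeting step, which matches the fallback branch of the diagram.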
DTMF menu¶
| Key | Action |
|---|---|
| 0 | Switchboard (Ring Group) |
| 1 | Voice Bot (LiveKit Agent) |
| 2 | Forward to an employee |
| 3 | Record a message |
| 4 | Start interpreter |
| 9 | Repeat the menu |
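One way to represent this table in code is a plain lookup from digit to action. The names `ring_group`, `livekit_agent`, `extension` and `interpreter` are the `type` values used in the menu JSON in this document; `voicemail` and `repeat` are assumed names for keys 3 and 9, and replaying the menu on an unknown key is also an assumption.

```python
# Digit-to-action lookup mirroring the DTMF menu table.
DTMF_ACTIONS = {
    "0": "ring_group",
    "1": "livekit_agent",
    "2": "extension",
    "3": "voicemail",    # assumed name, not given in the document
    "4": "interpreter",
    "9": "repeat",       # assumed name, not given in the document
}


def action_for_digit(digit: str) -> str:
    """Map a pressed key to an action; unknown keys replay the menu (assumption)."""
    return DTMF_ACTIONS.get(digit, "repeat")
```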
Voice navigation¶
In parallel with DTMF, the caller's speech is analyzed:
Caller speaks
│
▼
audio_fork → WebSocket /ivr
│
▼
STT (speech recognition)
│
▼
Intent detection → action
Speech recognition runs through audio_fork, which forwards the audio stream to a WebSocket endpoint in real time. Intent detection then maps spoken keywords to the corresponding DTMF actions.
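The keyword-to-digit mapping could look roughly like the sketch below. The keyword lists are hypothetical examples; the actual intent detection is not described in this document and may work quite differently.

```python
import re
from typing import Optional

# Hypothetical keyword lists per DTMF digit (German and English examples).
INTENT_KEYWORDS = {
    "0": ("zentrale", "switchboard", "reception"),
    "1": ("assistent", "assistant", "bot"),
    "2": ("mitarbeiter", "employee"),
    "3": ("nachricht", "message"),
    "4": ("dolmetscher", "interpreter"),
    "9": ("wiederholen", "repeat"),
}


def digit_for_transcript(transcript: str) -> Optional[str]:
    """Return the DTMF digit whose keywords occur in the transcript, if any."""
    words = set(re.findall(r"\w+", transcript.lower()))
    for digit, keywords in INTENT_KEYWORDS.items():
        if words.intersection(keywords):
            return digit
    return None  # no keyword recognized: keep waiting for input
```

Returning a digit rather than an action keeps speech and DTMF input on the same code path, so both can feed a single action lookup.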
IVR menu administration¶
# Manage IVR menus
GET /api/v1/sip/ivr-menus
POST /api/v1/sip/ivr-menus
{
  "name": "Hauptmenue",
  "greeting_tts": "Willkommen bei xynap. Druecken Sie 1 fuer...",
  "timeout": 10,
  "max_retries": 3,
  "actions": [
    {"digit": "0", "type": "ring_group", "target_id": 1},
    {"digit": "1", "type": "livekit_agent", "target_id": null},
    {"digit": "2", "type": "extension", "target_id": 1000},
    {"digit": "4", "type": "interpreter", "target_id": null}
  ]
}
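For illustration, the POST request could be prepared from Python with the standard library as follows. The base URL is an assumption (the document does not state the API host), authentication is omitted, and the request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed host; not stated in the document


def build_create_menu_request(menu: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an IVR menu."""
    body = json.dumps(menu).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/sip/ivr-menus",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


menu = {
    "name": "Hauptmenue",
    "greeting_tts": "Willkommen bei xynap.",
    "timeout": 10,
    "max_retries": 3,
    "actions": [{"digit": "1", "type": "livekit_agent", "target_id": None}],
}
request = build_create_menu_request(menu)
# urllib.request.urlopen(request) would send it once host and auth are known.
```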
TTS (Text to Speech)¶
ElevenLabs: IVR announcements¶
ElevenLabs is used for IVR greetings and menu announcements (high-quality, natural-sounding voices).
- Use: personalized greetings, menu announcements
- Cache: /var/lib/xynap/voicebot/tts-audio/ivr_greetings/
- Format: WAV, 8 kHz / 16 bit (SIP compatible)
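As a rough illustration of the audio format, the following sketch wraps raw PCM samples in a WAV container with Python's standard `wave` module. It assumes the input is already 16-bit PCM sampled at 8 kHz (no resampling happens here) and that the audio is mono, which the document does not state explicitly.

```python
import io
import wave


def wrap_pcm_as_sip_wav(pcm: bytes) -> bytes:
    """Wrap raw 16-bit PCM samples (already at 8 kHz) in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)     # mono: an assumption, not stated in the document
        wav.setsampwidth(2)     # 16 bit = 2 bytes per sample
        wav.setframerate(8000)  # 8 kHz
        wav.writeframes(pcm)
    return buffer.getvalue()
```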
TTS caching
Generated audio files are cached so that identical announcements do not have to be generated again. When the greeting text changes, the cache is invalidated automatically.
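One common way to get this behavior is to key each cached file by a hash of the greeting text, so a changed text simply misses the cache. The sketch below assumes that scheme; the file naming and the `synthesize` callback (standing in for the TTS call) are hypothetical, and the platform may invalidate its cache differently.

```python
import hashlib
from pathlib import Path
from typing import Callable

CACHE_DIR = Path("/var/lib/xynap/voicebot/tts-audio/ivr_greetings")


def cache_path_for(greeting_text: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Derive the cache file name from the text (assumed naming scheme)."""
    digest = hashlib.sha256(greeting_text.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.wav"


def get_or_generate(
    greeting_text: str,
    synthesize: Callable[[str], bytes],  # hypothetical stand-in for the TTS call
    cache_dir: Path = CACHE_DIR,
) -> bytes:
    """Return cached audio for the text, generating it only on a cache miss."""
    path = cache_path_for(greeting_text, cache_dir)
    if path.exists():
        return path.read_bytes()
    audio = synthesize(greeting_text)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

With this scheme no explicit invalidation step is needed, although files for outdated texts remain on disk until they are cleaned up separately.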
Piper: interpreter TTS¶
Piper (local TTS) is used for the interpreter service:
- Use: real-time translation output
- Advantage: no external API dependency, runs locally
- Models: German and English voices preinstalled
LiveKit Voice Bot¶
Architecture¶
FreeSWITCH
│ SIP INVITE to *99 or IVR key 1
▼
sofia/external-ipv4/test@127.0.0.1:5070
│
▼
LiveKit SIP Bridge (livekit-sip)
│ Creates LiveKit room
▼
LiveKit Server (livekit)
│ Dispatches agent
▼
LiveKit Agent Worker (livekit-agent)
│ Python, livekit-agents 1.4
▼
AI interaction (STT → LLM → TTS)
Configuration¶
| Component | Config file | Container |
|---|---|---|
| LiveKit Server | /etc/xynap/livekit/livekit.yaml | livekit |
| SIP Bridge | /etc/xynap/livekit-sip/sip.yaml | livekit-sip |
| Agent | Source code in container | livekit-agent |
SIP Trunk and Dispatch¶
| Parameter | Value |
|---|---|
| SIP Trunk ID | ST_kwNbrEg4YHSv |
| Dispatch Rule ID | SDR_k4XfgYznebZA |
| SIP target | 127.0.0.1:5070 |
| Test extension | *99 |
Agent Worker¶
The Agent Worker is a Python process based on livekit-agents 1.4:
- STT (Speech-to-Text): transcription of the caller's speech
- LLM: processing and response generation
- TTS: speech output of the response
The agent starts automatically when a new participant joins the LiveKit room (dispatch rule).
Interpreter (real-time translation)¶
Overview¶
The interpreter provides bidirectional real-time translation during a call. It is activated via IVR key 4.
Pipeline¶
Caller speaks (German)
│
▼
Whisper STT (transcription)
│ Text (DE)
▼
LibreTranslate (translation DE → EN)
│ Text (EN)
▼
Ollama (optional post-processing / context understanding)
│ Text (EN, refined)
▼
Piper TTS (speech output)
│ Audio (EN)
▼
Output to the other party
The same pipeline runs in the reverse direction for the other party.
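The stage order and the two directions can be modeled as a small composition of callables. This is a minimal sketch with stub stages: `make_pipeline` and the stub functions are hypothetical and only stand in for Whisper, LibreTranslate, Ollama and Piper; none of their real APIs are used here.

```python
from typing import Callable, Optional


def make_pipeline(
    stt: Callable[[bytes], str],
    translate: Callable[[str, str, str], str],
    tts: Callable[[str], bytes],
    refine: Optional[Callable[[str], str]] = None,  # optional LLM post-processing
    source: str = "de",
    target: str = "en",
) -> Callable[[bytes], bytes]:
    """Chain STT, translation, optional refinement and TTS for one direction."""

    def run(audio: bytes) -> bytes:
        text = stt(audio)
        translated = translate(text, source, target)
        if refine is not None:
            translated = refine(translated)
        return tts(translated)

    return run


# Stub stages so the sketch runs without any speech or translation service.
def stub_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_translate(text: str, source: str, target: str) -> str:
    return f"[{source}>{target}] {text}"


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


forward = make_pipeline(stub_stt, stub_translate, stub_tts, source="de", target="en")
backward = make_pipeline(stub_stt, stub_translate, stub_tts, source="en", target="de")
```

Building one pipeline per direction keeps the reverse path identical apart from the swapped language pair.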
Technical details¶
| Component | Description |
|---|---|
| Container | interpreter-bridge (host network) |
| Source code | /usr/local/xynap/interpreter/ |
| WebSocket | ws://127.0.0.1:9001/interpret |
| STT | Whisper (OpenAI) |
| Translation | LibreTranslate (local) |
| LLM | Ollama (local, optional post-processing) |
| TTS | Piper (local) |
Audio flow¶
Participant A Participant B
│ │
│ Audio (DE) │
├──────► audio_fork ────────────►│
│ │ │
│ ▼ │
│ Whisper STT │
│ │ │
│ ▼ │
│ LibreTranslate │
│ │ │
│ ▼ │
│ Piper TTS (EN) │
│ │ │
│ └──────────► Audio (EN) │
│ │
│ Audio (EN) │
│◄──── same pipeline ───────────┤
│ (reversed) │
Latency
The end-to-end latency of the translation pipeline is typically 2–4 seconds, depending on sentence length and GPU load. Whisper and Piper run on the local RTX 4000 Ada (20 GB VRAM).
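To see where the time goes, each stage can be timed separately. The helper below is a generic sketch, not part of the interpreter bridge: `timed_stages` is a hypothetical name, and the stages passed in would be whatever callables wrap the real services.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def timed_stages(
    stages: List[Tuple[str, Callable[[Any], Any]]],
    payload: Any,
) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record each stage's wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings
```

Summing the recorded values gives the end-to-end processing time for one utterance, excluding network transport.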
GPU resources
Whisper, Ollama and Piper share the GPU. When all services are used at the same time, bottlenecks can occur. The interpreter currently has priority for GPU resources.
Interplay of components¶
┌─────────────────┐
│ Platform API │
│ (SIP module) │
└────────┬────────┘
│ xml-curl
▼
┌─────────┐ ┌────────────────────────┐ ┌─────────────┐
│ SIP │────►│ FreeSWITCH │────►│ LiveKit │
│Provider │ │ │ │ SIP Bridge │
└─────────┘ │ IVR ──► DTMF/speech │ └──────┬──────┘
│ │ │ │
│ ┌────┴────┐ │ ┌──────┴──────┐
│ │ Key 1 │─────────┼────►│ Voice Bot │
│ │ Key 4 │─────┐ │ │ (Agent) │
│ └─────────┘ │ │ └─────────────┘
└────────────────────┼───┘
│
┌──────┴──────┐
│ Interpreter │
│ Bridge │
└──────┬──────┘
│
┌───────┬───────┼───────┐
▼ ▼ ▼ ▼
Whisper Libre Ollama Piper
STT Translate LLM TTS