Voice Interface (VUI)
Orchestrate STT, LLM, and TTS to build full voice conversational agents.
Building a Voice User Interface (VUI) requires chaining three distinct Addis AI capabilities into a single, cohesive loop. This guide demonstrates how to build a server-side orchestrator that takes user audio and returns an AI voice response.
The Voice Pipeline
A typical voice interaction follows a request-response cycle known as the "Voice Loop". A minimal sketch of one turn follows the list below.
- Transcribe (STT): Convert user audio (Amharic/Oromo) into text.
- Reason (LLM): Send that text to the Chat API to get an intelligent response.
- Speak (TTS): Convert the AI's text response back into audio.
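One turn of the loop, stripped to its essence, is three awaited calls that each feed the next. The transcribe, reason, and speak helpers here are hypothetical wrappers around the API calls shown later in this guide:

// One turn of the Voice Loop: user audio in, AI audio out.
// transcribe(), reason(), and speak() are hypothetical wrappers around
// the STT, Chat, and TTS requests shown in the orchestrator below.
async function voiceTurn(userAudioBuffer) {
  const userText = await transcribe(userAudioBuffer); // STT: audio -> text
  const aiText = await reason(userText);              // LLM: text -> response text
  const aiAudio = await speak(aiText);                // TTS: response text -> audio
  return { userText, aiText, aiAudio };
}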
Server-Side Orchestrator
To minimize latency and manage secrets, this pipeline should run on your server. The client sends one audio file, and the server returns the audio response (and text transcript).
This Express.js route handles the entire pipeline in one request.
const express = require('express');
const multer = require('multer');
const FormData = require('form-data');
const fetch = require('node-fetch');
const app = express();
const upload = multer(); // Memory storage
const API_KEY = process.env.ADDIS_AI_KEY;
app.post('/api/voice-chat', upload.single('audio'), async (req, res) => {
  try {
    // --- STEP 1: SPEECH TO TEXT ---
    const sttForm = new FormData();
    sttForm.append('audio', req.file.buffer, { filename: 'input.wav', contentType: req.file.mimetype });
    sttForm.append('request_data', JSON.stringify({ language_code: 'am' }));

    const sttRes = await fetch("https://api.addisassistant.com/api/v2/stt", {
      method: "POST",
      headers: { "x-api-key": API_KEY, ...sttForm.getHeaders() },
      body: sttForm
    });
    const sttData = await sttRes.json();
    const userText = sttData.data.transcription;

    // --- STEP 2: TEXT GENERATION (LLM) ---
    const chatRes = await fetch("https://api.addisassistant.com/api/v1/chat_generate", {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-Key": API_KEY },
      body: JSON.stringify({
        model: "Addis-፩-አሌፍ",
        prompt: userText,
        target_language: "am"
      })
    });
    const chatData = await chatRes.json();
    const aiText = chatData.response_text;

    // --- STEP 3: TEXT TO SPEECH (TTS) ---
    const ttsRes = await fetch("https://api.addisassistant.com/api/v1/audio", {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-Key": API_KEY },
      body: JSON.stringify({
        text: aiText,
        language: "am"
      })
    });
    const ttsData = await ttsRes.json();

    // Return everything to the frontend
    res.json({
      user_transcript: userText,
      ai_text: aiText,
      audio_base64: ttsData.audio
    });
  } catch (error) {
    console.error(error);
    res.status(500).json({ error: "Voice pipeline failed" });
  }
});

The same pipeline in Python:

import requests
import json
API_KEY = "sk_YOUR_KEY"
BASE_URL = "https://api.addisassistant.com"
def run_voice_pipeline(audio_file_path):
    # 1. Transcribe (STT)
    with open(audio_file_path, 'rb') as audio_file:
        files = [('audio', ('input.wav', audio_file, 'audio/wav'))]
        stt_payload = {'request_data': json.dumps({"language_code": "am"})}
        stt_res = requests.post(f"{BASE_URL}/api/v2/stt", headers={"x-api-key": API_KEY}, files=files, data=stt_payload)
    user_text = stt_res.json()['data']['transcription']
    print(f"User said: {user_text}")

    # 2. Chat (LLM)
    chat_payload = {
        "model": "Addis-፩-አሌፍ",
        "prompt": user_text,
        "target_language": "am"
    }
    chat_res = requests.post(f"{BASE_URL}/api/v1/chat_generate", headers={"X-API-Key": API_KEY}, json=chat_payload)
    ai_text = chat_res.json()['response_text']
    print(f"AI replied: {ai_text}")

    # 3. Speak (TTS)
    tts_payload = {
        "text": ai_text,
        "language": "am"
    }
    tts_res = requests.post(f"{BASE_URL}/api/v1/audio", headers={"X-API-Key": API_KEY}, json=tts_payload)
    audio_base64 = tts_res.json()['audio']
    return audio_base64

Frontend Implementation
On the client side (React, Flutter, etc.), your job is to record audio, send it to your new orchestrator endpoint, and play the result.
Record Audio
Use the MediaRecorder API (Web) or a package such as flutter_sound (Mobile) to capture user input. A browser sketch follows the list below.
- Format: WAV or MP3.
- Sample Rate: 16kHz is sufficient for speech.
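A minimal MediaRecorder sketch is shown below. Note that most browsers record WebM/Opus natively, so producing WAV/MP3 may require re-encoding (or accepting WebM server-side), which this sketch leaves out:

// Capture microphone audio in the browser with MediaRecorder.
let mediaRecorder;
let chunks = [];

async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(stream);
  chunks = [];
  mediaRecorder.ondataavailable = (e) => chunks.push(e.data);
  mediaRecorder.start();
}

function stopRecording() {
  return new Promise((resolve) => {
    mediaRecorder.onstop = () => resolve(new Blob(chunks, { type: mediaRecorder.mimeType }));
    mediaRecorder.stop();
    mediaRecorder.stream.getTracks().forEach((track) => track.stop()); // release the mic
  });
}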
Send to Server
Upload the blob to your /api/voice-chat endpoint.
const formData = new FormData();
formData.append('audio', audioBlob, 'input.wav');
const res = await fetch('/api/voice-chat', { method: 'POST', body: formData });
const data = await res.json();

Play Response
Wrap the Base64 audio string in a data URI and play it immediately.
const audio = new Audio("data:audio/wav;base64," + data.audio_base64);
audio.play();

Latency & Optimization
The "Request-Response" model adds up latency (STT time + LLM time + TTS time). To build a truly conversational experience, you should consider these optimizations.
Parallel Execution
Don't wait for the full text. In advanced setups, stream the text from the LLM and start generating TTS audio for the first sentence while the rest of the response is still being generated, as sketched below.
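This sketch assumes you can consume the LLM output as a stream of text chunks (represented here by a hypothetical textChunks async iterable) and that synthesize() wraps the TTS call shown earlier. Sentences are sent to TTS as soon as they are complete, while playback stays in order:

// Pipeline TTS with LLM generation: synthesize each sentence as soon as it
// is complete instead of waiting for the full response.
// `textChunks` is a hypothetical async iterable of streamed LLM text;
// `synthesize(text)` and `playAudio(audio)` wrap TTS and playback.
async function streamSpeech(textChunks, synthesize, playAudio) {
  let buffer = '';
  const audioJobs = []; // TTS requests start as soon as each sentence is complete
  for await (const chunk of textChunks) {
    buffer += chunk;
    // Split on sentence-ending punctuation (Ethiopic "።" or Latin . ! ?).
    const sentences = buffer.split(/(?<=[።.!?])\s+/);
    buffer = sentences.pop(); // keep the unfinished sentence in the buffer
    for (const sentence of sentences) audioJobs.push(synthesize(sentence));
  }
  if (buffer.trim()) audioJobs.push(synthesize(buffer));
  // Play clips in order; later clips are usually ready before the previous one ends.
  for (const job of audioJobs) await playAudio(await job);
}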
Realtime API
The ultimate solution: if latency is critical (e.g., live customer support), switch to our Realtime API. It handles STT, logic, and TTS on the server over a single WebSocket connection with sub-300ms latency.
VAD (Voice Activity Detection)
Implement Voice Activity Detection on the client. Only stop recording when the user has been silent for 500ms-1000ms. Sending silence to the API wastes time and money.
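A rough browser-side silence detector built on the Web Audio API is sketched below; the 0.01 RMS threshold and 800 ms window are arbitrary starting points that you should tune for your microphones and environment:

// Stop the recorder once the microphone has been quiet for ~800 ms.
function stopOnSilence(stream, mediaRecorder, { threshold = 0.01, silenceMs = 800 } = {}) {
  const audioCtx = new AudioContext();
  const analyser = audioCtx.createAnalyser();
  audioCtx.createMediaStreamSource(stream).connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  let silentSince = null;

  const timer = setInterval(() => {
    analyser.getFloatTimeDomainData(samples);
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    if (rms < threshold) {
      silentSince = silentSince ?? Date.now();
      if (Date.now() - silentSince >= silenceMs) {
        clearInterval(timer);
        audioCtx.close();
        mediaRecorder.stop(); // the user has stopped talking; send the audio now
      }
    } else {
      silentSince = null; // speech detected, reset the silence timer
    }
  }, 100);
}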
Context
Remember to store user_text and ai_text from each turn in your database or session. On the next turn, pass the accumulated history to the conversation_history parameter to keep the conversation going, as sketched below.
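Here is a sketch of that bookkeeping on the server, reusing the chat_generate call from the orchestrator above. The exact shape of conversation_history entries (shown here as role/content pairs) is an assumption; check the Chat API reference for the required format:

// Keep a per-session history and send it with every chat_generate call.
// NOTE: the { role, content } entry shape is an assumption, not confirmed
// by this guide -- verify it against the Chat API reference.
const history = []; // in practice, load/store this per user session

async function chatWithContext(userText) {
  const chatRes = await fetch("https://api.addisassistant.com/api/v1/chat_generate", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-API-Key": API_KEY },
    body: JSON.stringify({
      model: "Addis-፩-አሌፍ",
      prompt: userText,
      target_language: "am",
      conversation_history: history
    })
  });
  const { response_text } = await chatRes.json();

  // Persist both sides of the turn for the next request.
  history.push({ role: "user", content: userText });
  history.push({ role: "assistant", content: response_text });
  return response_text;
}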