# Speech to Text (Transcription)
Transcribe audio into text for Amharic and Afan Oromo.
Our Speech-to-Text (STT) technology is powered by specialized models trained to accurately recognize Ethiopian languages. Unlike generic transcription services, our models are optimized for the specific phonemes, accents, and dialects of Amharic and Afan Oromo.
## Endpoint

POST `https://api.addisassistant.com/api/v2/stt`
## Usage Guide

This endpoint requires a `multipart/form-data` request. You must send the audio file binary alongside a JSON string containing the configuration.
### Transcribe a File

Upload an audio file to get the text transcription.

> **Note:** `request_data` must be a stringified JSON object.
**cURL**

```bash
curl --location 'https://api.addisassistant.com/api/v2/stt' \
--header 'x-api-key: sk_YOUR_KEY' \
--form 'audio=@"/path/to/voice_note.wav"' \
--form 'request_data="{ \"language_code\": \"am\" }"'
```

**JavaScript**

```javascript
const formData = new FormData();

// 1. Append the file
// Assuming 'fileInput' is an HTML <input type="file">
formData.append("audio", fileInput.files[0]);

// 2. Append the metadata as a stringified JSON object
formData.append("request_data", JSON.stringify({
  language_code: "am"
}));

const response = await fetch("https://api.addisassistant.com/api/v2/stt", {
  method: "POST",
  headers: {
    "x-api-key": "sk_YOUR_KEY"
    // Do NOT set the Content-Type header manually for FormData;
    // the browser sets it automatically with the multipart boundary.
  },
  body: formData
});

const data = await response.json();
console.log("Transcription:", data.data.transcription);
```

**Python**

```python
import requests
import json

url = "https://api.addisassistant.com/api/v2/stt"
headers = {"x-api-key": "sk_YOUR_KEY"}

# 1. Prepare the metadata as a stringified JSON object
payload = {
    "request_data": json.dumps({"language_code": "am"})
}

# 2. Open the file and send the multipart request
files = [
    ("audio", ("voice.wav", open("/path/to/voice.wav", "rb"), "audio/wav"))
]

response = requests.post(url, headers=headers, data=payload, files=files)
print(response.json()["data"]["transcription"])
```

## API Reference
### Form Data Parameters

These fields are sent as multipart form data.

| Prop | Type |
|---|---|
| `audio` | file (binary) |
| `request_data` | string (stringified JSON) |
### Request Data Object

These parameters go inside the `request_data` JSON string.

| Prop | Type |
|---|---|
| `language_code` | string |
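Because `request_data` travels as a plain string form field, it is easy to accidentally send a nested object instead of a stringified one. A minimal sketch of building it correctly (the helper name is illustrative, not part of any SDK; `"am"` is the only language code shown on this page):

```python
import json

def build_request_data(language_code: str) -> str:
    # The API expects this form field as a *string* containing JSON,
    # not as a nested JSON object -- hence json.dumps here.
    return json.dumps({"language_code": language_code})

payload = {"request_data": build_request_data("am")}
print(payload["request_data"])  # {"language_code": "am"}
```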
## Response Schema

```json
{
  "status": "success",
  "data": {
    "transcription": "ሰላም እንኳን ደህና መጣችሁ",
    "usage_metadata": {
      "totalBilledDuration": "15s",
      "requestId": "69b60667-0000-2a1e-b6d3-d4f547fe6724"
    }
  },
  "confidence": 0.982
}
```

## Supported Formats
We support standard audio containers. For the fastest processing, we recommend WAV.
| Format | Content Types (MIME) |
|---|---|
| WAV | audio/wav, audio/x-wav, audio/wave |
| MP3 | audio/mpeg, audio/mp3 |
| M4A | audio/mp4, audio/x-m4a |
| WebM | audio/webm |
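The table above can drive a small client-side check so unsupported uploads fail fast. This is a hypothetical helper (not part of any official SDK), assuming you pick one canonical MIME type per extension:

```python
import os

# One canonical MIME type per supported extension, taken from the
# formats table above. (Illustrative helper, not an official client.)
SUPPORTED_MIME = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".m4a": "audio/mp4",
    ".webm": "audio/webm",
}

def content_type_for(path: str) -> str:
    """Return the MIME type to send for a file, or raise if unsupported."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return SUPPORTED_MIME[ext]
    except KeyError:
        raise ValueError(f"Unsupported audio format: {ext or path}")

print(content_type_for("voice_note.wav"))  # audio/wav
```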
## Best Practices

To ensure high accuracy (WER < 10%), follow these recording guidelines.
### Audio Specs

- **Sample Rate:** 16 kHz or higher is recommended for clarity.
- **Channel:** Mono is preferred. Stereo files are supported but are mixed down to mono before processing.
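One way to verify (or generate) audio that meets these specs is Python's standard `wave` module. This sketch writes a one-second 16 kHz mono test tone and reads the header back to confirm it matches the recommendation:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # 16 kHz, the recommended minimum

# Write a 1-second 440 Hz tone as 16 kHz mono, 16-bit PCM.
with wave.open("test_tone.wav", "wb") as wf:
    wf.setnchannels(1)   # mono
    wf.setsampwidth(2)   # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
        for n in range(SAMPLE_RATE)  # 1 second of audio
    )
    wf.writeframes(frames)

# Verify the file matches the recommended spec before uploading.
with wave.open("test_tone.wav", "rb") as wf:
    print(wf.getnchannels(), wf.getframerate())  # 1 16000
```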
### Environment

- **Noise:** Background noise significantly degrades accuracy. Record in a quiet environment.
- **Distance:** Keep the speaker 10-30 cm from the microphone for optimal volume levels.
### Constraints

- Maximum audio duration: 60 seconds
- Maximum file size: 10 MB

### Limitations
- **Speakers:** The model is optimized for single-speaker audio. Overlapping voices may result in skipped words.
- **Technical Terms:** Rare technical jargon or heavy code-switching (mixing in English) may have lower accuracy.
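Given the 60-second and 10 MB constraints above, a client-side preflight check can reject bad files before spending an upload. A hypothetical helper for WAV input (the limits mirror this page; the function itself is illustrative, and duration probing is shown for WAV only):

```python
import os
import wave

MAX_DURATION_S = 60                 # maximum clip length per the constraints above
MAX_SIZE_BYTES = 10 * 1024 * 1024   # 10 MB upload limit

def preflight_check(path: str) -> None:
    """Validate a WAV file against the documented limits before uploading.

    Raises ValueError if the file exceeds either limit.
    """
    size = os.path.getsize(path)
    if size > MAX_SIZE_BYTES:
        raise ValueError(f"File is {size} bytes; limit is {MAX_SIZE_BYTES}")
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if duration > MAX_DURATION_S:
        raise ValueError(f"Clip is {duration:.1f}s; limit is {MAX_DURATION_S}s")
```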