STT Endpoint
Configure the Speech-to-Text (STT) endpoint for transcription.
Overview
Intervu uses an OpenAI-compatible STT endpoint to transcribe audio in real time. We recommend Speaches for local, private transcription.
For speaker diarization (multi-speaker detection), we recommend WhisperX ASR Service, which provides reliable speaker embeddings.
Reference Links
- Speaches Installation Guide — Official documentation
- WhisperX ASR Service Docker Hub — Container images
Configuration
Open Settings (gear icon) and locate the STT section:
| Setting | Description | Example |
|---|---|---|
| STT Endpoint | URL for the STT API | http://localhost:8000/v1/audio/transcriptions |
| STT Model | Model identifier | Systran/faster-distil-whisper-small.en |
| STT API Key | API key (if required) | Leave empty for Speaches |
Default Values
STT Endpoint: http://localhost:8000/v1/audio/transcriptions
STT Model: Systran/faster-distil-whisper-small.en
STT API Key: (empty)
Testing the Connection
After configuring the endpoint:
- Click the Test STT button in Settings
- Intervu sends a silent audio sample to verify the connection
- A success message confirms the endpoint is working
Test First
Always test the STT endpoint before starting an interview. This catches configuration errors early.
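To make the idea of a silent test sample concrete, here is a minimal sketch of how such a sample could be built in memory. It is not Intervu's actual implementation: the function name `make_silent_wav`, the 1-second duration, and the 16 kHz mono 16-bit format are all assumptions chosen for illustration.

```python
import io
import wave


def make_silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return an in-memory mono 16-bit PCM WAV that contains only silence."""
    frame_count = int(seconds * sample_rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        # All-zero samples decode as digital silence.
        wav.writeframes(b"\x00\x00" * frame_count)
    return buffer.getvalue()


sample = make_silent_wav()
print(f"Silent sample size: {len(sample)} bytes")
```

A sample like this is enough to check that the endpoint is reachable and answers in the expected format, without depending on a microphone.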
Backend Options
Auto-Detection
Intervu automatically detects the STT backend type based on the endpoint URL:
| Endpoint Pattern | Backend Type | Notes |
|---|---|---|
| /asr | whisperx-asr-service | Recommended for diarization |
| /v1/audio/transcriptions | OpenAI-compatible | Standard Whisper API format |
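The mapping in the table can be sketched as a small path check. This is an illustration only, not Intervu's real detection code: the function name `detect_backend` and the returned labels are hypothetical, and treating every non-`/asr` URL as OpenAI-compatible is an assumption.

```python
from urllib.parse import urlparse


def detect_backend(endpoint_url: str) -> str:
    """Guess the STT backend type from the path of the endpoint URL."""
    # Ignore the query string and any trailing slash before matching.
    path = urlparse(endpoint_url).path.rstrip("/")
    if path.endswith("/asr"):
        return "whisperx-asr-service"
    # Assumption: anything else is handled as an OpenAI-compatible endpoint.
    return "openai-compatible"


print(detect_backend("http://localhost:9000/asr"))
print(detect_backend("http://localhost:8000/v1/audio/transcriptions"))
```

Matching on the path rather than the host or port means the same rule works for local containers and remote services alike.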
OpenAI-Compatible (Speaches)
Recommended for: Standard transcription, single speaker
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cuda
    container_name: speaches
    ports:
      - 8000:8000
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
volumes:
  hf-hub-cache:
Prerequisites:
- NVIDIA Container Toolkit installed
Endpoint: http://localhost:8000/v1/audio/transcriptions
WhisperX ASR Service
Recommended for: Speaker diarization, panel interviews
services:
  whisperx-asr:
    image: learnedmachine/whisperx-asr-service
    container_name: whisperx-asr
    ports:
      - "${PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-large-v3}
      - ASR_ENGINE=${ASR_ENGINE:-faster_whisper}
      - HF_TOKEN=${HF_TOKEN:?HF_TOKEN is required for diarization}
    volumes:
      - hf_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --connect-timeout 5 http://localhost:9000/docs || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    restart: unless-stopped
volumes:
  hf_cache:
Prerequisites:
- NVIDIA Container Toolkit installed
- Hugging Face token with access to diarization models (see below)
Hugging Face Token Setup:
Your HF_TOKEN must have access to these three Hugging Face models:
- pyannote/speaker-diarization-3.1 — Accept the license agreement
- pyannote/segmentation-3.0 — Accept the license agreement
- pyannote/speaker-diarization-community-1 — Accept the license agreement
Visit each model page above and click "Access repository" or "Accept license" to grant your token access.
Setup:
Create a .env file in the same directory as your compose.yaml:
# .env
HF_TOKEN=your_huggingface_token_here
PORT=9000
ASR_MODEL=large-v3
ASR_ENGINE=faster_whisper
Then deploy:
docker compose pull
docker compose up -d
Endpoint: http://localhost:9000/asr
Why whisperx-asr-service?
- Returns speaker embeddings for cross-chunk voice tracking
- Proven by the Speakr project for reliable diarization
- Compatible with Intervu's speaker monitoring features
Model Selection
Choose a model based on your needs:
| Model | Speed | Accuracy | VRAM | Languages |
|---|---|---|---|---|
| Systran/faster-distil-whisper-small.en | ⚡ Fast | Good | ~1GB | English only |
| Systran/faster-distil-whisper-medium.en | Medium | Better | ~2GB | English only |
| Systran/faster-whisper-small | Fast | Good | ~1GB | Multilingual |
| openai/whisper-medium | Slow | Best | ~5GB | Multilingual |
Recommendations
- Real-time interviews: Use faster-distil-whisper-small.en for best speed
- Non-English: Use faster-whisper-small or another multilingual model (models with the .en suffix are English only)
- Maximum accuracy: Use whisper-medium or whisper-large
- Speaker diarization: Use whisperx-asr-service with any model
Advanced STT Settings
In Settings → Advanced:
Silence RMS Threshold
Audio below this level is treated as silence and not sent to STT.
- Default: 0.005
- Higher: More aggressive filtering (may miss quiet speech)
- Lower: Less filtering (may send noise to STT)
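The effect of the threshold is easier to see with numbers. The sketch below assumes audio samples are floats normalised to the range -1.0 to 1.0, which is an assumption about Intervu's internals; the helper names `rms` and `is_silence` are hypothetical.

```python
import math
from typing import Sequence

SILENCE_RMS_THRESHOLD = 0.005  # default value of the setting


def rms(samples: Sequence[float]) -> float:
    """Root mean square of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silence(samples: Sequence[float],
               threshold: float = SILENCE_RMS_THRESHOLD) -> bool:
    """True when the block is quieter than the threshold and can be skipped."""
    return rms(samples) < threshold


quiet_noise = [0.001, -0.001] * 800   # faint background hiss
speech_like = [0.1, -0.1] * 800       # much louder signal

print(is_silence(quiet_noise))
print(is_silence(speech_like))
```

Raising the threshold moves more blocks into the "silence" bucket, which is why very quiet speech can be dropped when the value is set too high.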
Audio Chunk Duration
Length of audio chunks sent to STT.
- Default: 3 seconds
- Lower: Faster response, but more API calls
- Higher: Fewer API calls, but slower response
- With diarization: Minimum 6 seconds (enforced automatically)
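The trade-off and the diarization minimum can be expressed in a few lines. This is a sketch under stated assumptions, not Intervu's code: the names `effective_chunk_seconds` and `requests_per_minute` are hypothetical, and the request rate assumes audio is sent continuously with no silence filtered out.

```python
DEFAULT_CHUNK_SECONDS = 3
DIARIZATION_MIN_CHUNK_SECONDS = 6


def effective_chunk_seconds(configured: float, diarization_enabled: bool) -> float:
    """Apply the 6-second minimum when diarization is enabled."""
    if diarization_enabled:
        return max(configured, DIARIZATION_MIN_CHUNK_SECONDS)
    return configured


def requests_per_minute(chunk_seconds: float) -> float:
    """Upper bound on STT calls per minute for continuous audio."""
    return 60 / chunk_seconds


for diarization in (False, True):
    chunk = effective_chunk_seconds(DEFAULT_CHUNK_SECONDS, diarization)
    print(f"diarization={diarization}: {chunk}s chunks, "
          f"up to {requests_per_minute(chunk):.0f} calls per minute")
```

With the default setting, enabling diarization halves the number of calls but doubles the wait before each piece of text can appear.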
Hallucination Phrases
Comma-separated phrases that should be ignored.
- Default: you,thank you,thanks,thanks for watching,thank you for watching,thanks for listening,thank you for listening,bye,goodbye,the end,so,okay,hmm,uh,um,oh,ah,i
- Purpose: Whisper sometimes hallucinates these phrases from silence
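One plausible way to apply such a list is to drop a chunk only when its whole transcript matches a listed phrase. The sketch below makes that assumption explicit; the whole-chunk match, the case and punctuation normalisation, and the helper names are all hypothetical and may differ from what Intervu does.

```python
import string
from typing import Set

DEFAULT_PHRASES = (
    "you,thank you,thanks,thanks for watching,thank you for watching,"
    "thanks for listening,thank you for listening,bye,goodbye,the end,"
    "so,okay,hmm,uh,um,oh,ah,i"
)


def parse_phrases(setting: str) -> Set[str]:
    """Split the comma-separated setting into a set of lowercase phrases."""
    return {p.strip().lower() for p in setting.split(",") if p.strip()}


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def is_hallucination(transcript: str, phrases: Set[str]) -> bool:
    """True when the entire transcript is one of the ignored phrases."""
    return normalize(transcript) in phrases


phrases = parse_phrases(DEFAULT_PHRASES)
print(is_hallucination("Thank you.", phrases))
print(is_hallucination("Thank you for the question.", phrases))
```

Matching the whole chunk rather than a substring keeps real sentences that merely contain words such as "so" or "okay".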
Using Other STT Services
Intervu supports any OpenAI-compatible STT endpoint.
OpenAI Whisper API
Endpoint: https://api.openai.com/v1/audio/transcriptions
Model: whisper-1
API Key: sk-...
Azure Speech Services
Endpoint: https://your-region.api.cognitive.microsoft.com/openai/deployments/whisper/audio/transcriptions?api-version=2024-02-15-preview
Model: whisper
API Key: your-azure-key
Custom Endpoint Requirements
Your endpoint must:
- Accept POST requests to /v1/audio/transcriptions
- Accept a multipart form containing the audio file
- Accept a model parameter
- Return JSON: { "text": "transcribed text" }
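For anyone implementing or debugging a custom endpoint, the request and response shapes can be sketched with the standard library alone. The form field name `file` follows the OpenAI transcription API; that Intervu uses the same field name, the `chunk.wav` filename, and the helper names below are assumptions for illustration.

```python
import json
import uuid
from typing import Tuple


def build_multipart(audio: bytes, model: str,
                    filename: str = "chunk.wav") -> Tuple[bytes, str]:
    """Build a multipart/form-data body with a model field and an audio file.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + audio + tail, f"multipart/form-data; boundary={boundary}"


def parse_transcription(response_body: bytes) -> str:
    """Extract the transcript from a JSON response of the form {"text": ...}."""
    return json.loads(response_body)["text"]


body, content_type = build_multipart(b"RIFF-placeholder-audio", "whisper-1")
print(content_type.split(";")[0])
print(parse_transcription(b'{"text": "transcribed text"}'))
```

The body and header returned by `build_multipart` can be passed to any HTTP client as the payload of a POST request to the endpoint.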
Troubleshooting
Connection Failed
- Verify the STT service is running: docker ps
- Check the endpoint URL is correct
- Try curl http://localhost:8000/v1/models
Slow Transcription
- Use a smaller model
- Ensure GPU acceleration is working
- Check logs: docker logs <container_name>
Poor Accuracy
- Speak clearly and close to the microphone
- Try a larger model
- Adjust silence threshold in Advanced settings
Hallucinated Text
- Check hallucination phrases in Advanced settings
- Add common false positives to the list
Diarization Not Working
- Verify you're using a diarization-compatible backend (whisperx-asr-service)
- Check the endpoint URL uses the /asr path
- Disable and re-enable diarization to trigger validation
- Check backend logs for errors
Next Steps
- LLM Endpoint — Configure answer generation
- Advanced Settings — Fine-tune behavior
- Speaker Diarization — Multi-speaker setup