STT Endpoint

Configure the Speech-to-Text (STT) endpoint for transcription.

Overview

Intervu uses an OpenAI-compatible STT endpoint to transcribe audio in real time. We recommend Speaches for local, private transcription.

For speaker diarization (multi-speaker detection), we recommend WhisperX ASR Service, which provides reliable speaker embeddings.

Reference Links

Speaches Installation Guide — Official documentation
WhisperX ASR Service Docker Hub — Container images

Configuration

Open Settings (gear icon) and locate the STT section:

Setting	Description	Example
STT Endpoint	URL for the STT API	`http://localhost:8000/v1/audio/transcriptions`
STT Model	Model identifier	`Systran/faster-distil-whisper-small.en`
STT API Key	API key (if required)	Leave empty for Speaches

Default Values

STT Endpoint: http://localhost:8000/v1/audio/transcriptions
STT Model: Systran/faster-distil-whisper-small.en
STT API Key: (empty)

Testing the Connection

After configuring the endpoint:

Click the Test STT button in Settings
Intervu sends a silent audio sample to verify the connection
A success message confirms the endpoint is working

Test First

Always test the STT endpoint before starting an interview. This catches configuration errors early.
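To reproduce a similar check outside the app, you need a silent audio sample to send. The sketch below builds one in memory with Python's standard library. The format (1 second, 16 kHz, mono, 16-bit PCM) and the function name `make_silent_wav` are assumptions for illustration; the sample Intervu actually sends is not documented here.

```python
import io
import wave


def make_silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return a 16-bit mono PCM WAV file (as bytes) containing only silence."""
    frame_count = int(seconds * sample_rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * frame_count)  # all-zero samples
    return buffer.getvalue()


sample = make_silent_wav()
```

The resulting bytes can be posted to the configured endpoint as the audio file. A working OpenAI-compatible endpoint should answer with a JSON body, typically with empty or near-empty text for silence.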

Backend Options

Auto-Detection

Intervu automatically detects the STT backend type based on the endpoint URL:

Endpoint Pattern	Backend Type	Notes
`/asr`	whisperx-asr-service	Recommended for diarization
`/v1/audio/transcriptions`	OpenAI-compatible	Standard Whisper API format
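The table above can be read as a simple path check. The following is a minimal sketch of that idea, not Intervu's actual implementation: the function name `detect_backend`, the returned labels, and the `"unknown"` fallback are assumptions, since the behavior for other paths is not documented.

```python
from urllib.parse import urlparse


def detect_backend(endpoint_url: str) -> str:
    """Guess the STT backend type from the endpoint path (illustrative only)."""
    path = urlparse(endpoint_url).path.rstrip("/")
    if path.endswith("/asr"):
        return "whisperx-asr-service"
    if path.endswith("/v1/audio/transcriptions"):
        return "openai-compatible"
    return "unknown"  # assumption: fallback behavior is not documented


print(detect_backend("http://localhost:9000/asr"))
print(detect_backend("http://localhost:8000/v1/audio/transcriptions"))
```

Matching on the parsed path rather than the raw string keeps query parameters and trailing slashes from affecting the result.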

OpenAI-Compatible (Speaches)

Recommended for: Standard transcription, single speaker

yaml

services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cuda
    container_name: speaches
    ports:
      - 8000:8000
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu

volumes:
  hf-hub-cache:

Prerequisites:

NVIDIA Container Toolkit installed

Endpoint: http://localhost:8000/v1/audio/transcriptions

WhisperX ASR Service

Recommended for: Speaker diarization, panel interviews

yaml

services:
  whisperx-asr:
    image: learnedmachine/whisperx-asr-service
    container_name: whisperx-asr
    ports:
      - "${PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-large-v3}
      - ASR_ENGINE=${ASR_ENGINE:-faster_whisper}
      - HF_TOKEN=${HF_TOKEN:?HF_TOKEN is required for diarization}
    volumes:
      - hf_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --connect-timeout 5 http://localhost:9000/docs || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    restart: unless-stopped

volumes:
  hf_cache:

Prerequisites:

NVIDIA Container Toolkit installed
Hugging Face token with access to diarization models (see below)

Hugging Face Token Setup:

Your HF_TOKEN must have access to these three Hugging Face models:

pyannote/speaker-diarization-3.1 — Accept the license agreement
pyannote/segmentation-3.0 — Accept the license agreement
pyannote/speaker-diarization-community-1 — Accept the license agreement

Visit each model page above and click "Access repository" or "Accept license" to grant your token access.

Setup:

Create a .env file in the same directory as your compose.yaml:

bash

# .env
HF_TOKEN=your_huggingface_token_here
PORT=9000
ASR_MODEL=large-v3
ASR_ENGINE=faster_whisper

Then deploy:

bash

docker compose pull
docker compose up -d

Endpoint: http://localhost:9000/asr

Why whisperx-asr-service?

Returns speaker embeddings for cross-chunk voice tracking
Proven by the Speakr project for reliable diarization
Compatible with Intervu's speaker monitoring features

Model Selection

Choose a model based on your needs:

Model	Speed	Accuracy	VRAM	Languages
`Systran/faster-distil-whisper-small.en`	⚡ Fast	Good	~1GB	English only
`Systran/faster-distil-whisper-medium.en`	Medium	Better	~2GB	English only
`Systran/faster-whisper-small`	Fast	Good	~1GB	Multilingual
`openai/whisper-medium`	Slow	Best	~5GB	Multilingual

Recommendations

Real-time interviews: Use faster-distil-whisper-small.en for best speed
Non-English: Use faster-whisper-small or another multilingual model (models with the .en suffix are English only)
Maximum accuracy: Use whisper-medium or whisper-large
Speaker diarization: Use whisperx-asr-service with any model

Advanced STT Settings

In Settings → Advanced:

Silence RMS Threshold

Audio below this level is treated as silence and not sent to STT.

Default: 0.005
Higher: More aggressive filtering (may miss quiet speech)
Lower: Less filtering (may send noise to STT)
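To make the threshold concrete, here is a minimal sketch of an RMS silence gate. It assumes samples are floats normalized to the range -1.0 to 1.0 and that "below" means strictly less than the threshold; the function names are illustrative, and Intervu's own implementation may differ.

```python
import math
from typing import Sequence

SILENCE_RMS_THRESHOLD = 0.005  # default value from the setting above


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of normalized audio samples (-1.0 to 1.0)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silence(samples: Sequence[float],
               threshold: float = SILENCE_RMS_THRESHOLD) -> bool:
    """True when the chunk is quieter than the threshold and can be skipped."""
    return rms(samples) < threshold


quiet_chunk = [0.001, -0.001] * 100   # RMS 0.001: treated as silence
speech_chunk = [0.1, -0.1] * 100      # RMS 0.1: sent to STT
print(is_silence(quiet_chunk), is_silence(speech_chunk))
```

Raising the threshold moves more quiet chunks into the "silence" bucket, which is why a high value can drop soft speech.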

Audio Chunk Duration

Length of audio chunks sent to STT.

Default: 3 seconds
Lower: Faster response, but more API calls
Higher: Fewer API calls, but slower response
With diarization: Minimum 6 seconds (enforced automatically)
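The trade-off and the diarization minimum can be sketched in a few lines. The helper names below are hypothetical, and the clamping logic is an assumption based on the "enforced automatically" note above.

```python
MIN_DIARIZATION_CHUNK_SECONDS = 6.0  # minimum stated above for diarization


def effective_chunk_seconds(configured: float = 3.0,
                            diarization: bool = False) -> float:
    """Chunk length actually used, raising it to the minimum when diarizing."""
    if diarization:
        return max(configured, MIN_DIARIZATION_CHUNK_SECONDS)
    return configured


def requests_per_minute(chunk_seconds: float) -> float:
    """Approximate STT calls per minute of continuous speech."""
    return 60.0 / chunk_seconds


for enabled in (False, True):
    seconds = effective_chunk_seconds(3.0, diarization=enabled)
    print(enabled, seconds, requests_per_minute(seconds))
```

With the 3-second default this works out to roughly 20 requests per minute of continuous speech, and about 10 once diarization raises the chunk length to 6 seconds.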

Hallucination Phrases

Comma-separated phrases that should be ignored.

Default: you,thank you,thanks,thanks for watching,thank you for watching,thanks for listening,thank you for listening,bye,goodbye,the end,so,okay,hmm,uh,um,oh,ah,i
Purpose: Whisper sometimes hallucinates these phrases from silence
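One plausible way to apply such a list is to drop a transcript only when the whole chunk, after normalization, equals one of the phrases. The sketch below assumes exact whole-chunk matching, case-insensitive, with surrounding punctuation stripped; the actual matching rules in Intervu are not documented here, and the function names are illustrative.

```python
import string

DEFAULT_PHRASES = (
    "you,thank you,thanks,thanks for watching,thank you for watching,"
    "thanks for listening,thank you for listening,bye,goodbye,the end,"
    "so,okay,hmm,uh,um,oh,ah,i"
)


def parse_phrases(csv: str) -> set[str]:
    """Split the comma-separated setting into a set of lowercase phrases."""
    return {p.strip().lower() for p in csv.split(",") if p.strip()}


def normalize(text: str) -> str:
    """Lowercase, trim surrounding punctuation, and collapse whitespace."""
    trimmed = text.strip().lower().strip(string.punctuation + " ")
    return " ".join(trimmed.split())


def is_hallucination(transcript: str, phrases: set[str]) -> bool:
    """True when the entire transcript is one of the ignored phrases."""
    return normalize(transcript) in phrases


phrases = parse_phrases(DEFAULT_PHRASES)
print(is_hallucination("Thank you.", phrases))
print(is_hallucination("Thank you for the question.", phrases))
```

Matching the whole chunk rather than substrings means a real sentence that merely contains "thank you" is still kept.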

Using Other STT Services

Intervu supports any OpenAI-compatible STT endpoint.

OpenAI Whisper API

bash

Endpoint: https://api.openai.com/v1/audio/transcriptions
Model: whisper-1
API Key: sk-...

Azure OpenAI (Whisper)

bash

Endpoint: https://your-region.api.cognitive.microsoft.com/openai/deployments/whisper/audio/transcriptions?api-version=2024-02-15-preview
Model: whisper
API Key: your-azure-key

Custom Endpoint Requirements

Your endpoint must:

Accept a POST request to /v1/audio/transcriptions
Accept a multipart form containing the audio file
Accept a model parameter
Return JSON: { "text": "transcribed text" }
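As a rough illustration of that contract, the sketch below builds the multipart request and parses the JSON response using only the standard library. The form field names `file` and `model` follow the OpenAI transcription API. The function names, the `audio/wav` content type, and the Bearer authorization header are assumptions; some services expect a different header for the API key.

```python
import json
import urllib.request
import uuid


def build_multipart(model: str, audio: bytes,
                    filename: str = "audio.wav") -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + closing
    return body, f"multipart/form-data; boundary={boundary}"


def parse_transcription(raw: bytes) -> str:
    """Extract the transcript from a { "text": ... } JSON response."""
    return json.loads(raw)["text"]


def transcribe(endpoint: str, model: str, audio: bytes, api_key: str = "") -> str:
    """POST audio to the endpoint and return the transcribed text."""
    body, content_type = build_multipart(model, audio)
    headers = {"Content-Type": content_type}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumption: Bearer auth
    request = urllib.request.Request(endpoint, data=body,
                                     headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_transcription(response.read())
```

For example, `transcribe("http://localhost:8000/v1/audio/transcriptions", "Systran/faster-distil-whisper-small.en", wav_bytes)` would exercise a local Speaches instance, where `wav_bytes` is the content of a WAV file you supply.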

Troubleshooting

Connection Failed

Verify that the STT service is running: docker ps
Check that the endpoint URL is correct
Try curl http://localhost:8000/v1/models

Slow Transcription

Use a smaller model
Ensure GPU acceleration is working
Check logs: docker logs <container_name>

Poor Accuracy

Speak clearly and close to the microphone
Try a larger model
Adjust silence threshold in Advanced settings

Hallucinated Text

Check hallucination phrases in Advanced settings
Add common false positives to the list

Diarization Not Working

Verify you're using a diarization-compatible backend (whisperx-asr-service)
Check the endpoint URL uses /asr path
Disable and re-enable diarization to trigger validation
Check backend logs for errors

Next Steps

LLM Endpoint — Configure answer generation
Advanced Settings — Fine-tune behavior
Speaker Diarization — Multi-speaker setup
