
STT Endpoint

Configure the Speech-to-Text (STT) endpoint for transcription.

Overview

Intervu uses an OpenAI-compatible STT endpoint to transcribe audio in real time. We recommend Speaches for local, private transcription.

For speaker diarization (multi-speaker detection), we recommend WhisperX ASR Service, which provides reliable speaker embeddings.


Configuration

Open Settings (gear icon) and locate the STT section:

| Setting | Description | Example |
| --- | --- | --- |
| STT Endpoint | URL for the STT API | http://localhost:8000/v1/audio/transcriptions |
| STT Model | Model identifier | Systran/faster-distil-whisper-small.en |
| STT API Key | API key (if required) | Leave empty for Speaches |

Default Values

STT Endpoint: http://localhost:8000/v1/audio/transcriptions
STT Model: Systran/faster-distil-whisper-small.en
STT API Key: (empty)

Testing the Connection

After configuring the endpoint:

  1. Click the Test STT button in Settings
  2. Intervu sends a silent audio sample to verify the connection
  3. A success message confirms the endpoint is working
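
The same check can be approximated outside the app. The sketch below is not Intervu's implementation: it assumes the endpoint accepts an OpenAI-style multipart upload with `file` and `model` fields and an optional Bearer token, and the helper names (`make_silent_wav`, `build_multipart`, `check_stt`) are hypothetical.

```python
import io
import json
import urllib.request
import uuid
import wave


def make_silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV file that contains only silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def build_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one audio file; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'.encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def check_stt(endpoint: str, model: str, api_key: str = "") -> str:
    """POST one second of silence and return the transcript text."""
    body, content_type = build_multipart({"model": model}, "silence.wav", make_silent_wav())
    headers = {"Content-Type": content_type}
    if api_key:  # assumed Bearer scheme, as used by the OpenAI API
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(endpoint, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["text"]
```

With the default Speaches setup, `check_stt("http://localhost:8000/v1/audio/transcriptions", "Systran/faster-distil-whisper-small.en")` should return without raising; a connection error or HTTP error points to a misconfigured endpoint or model.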

Test First

Always test the STT endpoint before starting an interview. This catches configuration errors early.


Backend Options

Auto-Detection

Intervu automatically detects the STT backend type based on the endpoint URL:

| Endpoint Pattern | Backend Type | Notes |
| --- | --- | --- |
| /asr | whisperx-asr-service | Recommended for diarization |
| /v1/audio/transcriptions | OpenAI-compatible | Standard Whisper API format |
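
As an illustration of path-based detection, the sketch below classifies an endpoint using the two patterns in the table. The function name `detect_backend`, the suffix-matching rule, and the `"unknown"` fallback are assumptions; Intervu's actual matching logic is not documented here.

```python
from urllib.parse import urlparse


def detect_backend(endpoint: str) -> str:
    """Classify an STT endpoint by the suffix of its URL path."""
    path = urlparse(endpoint).path.rstrip("/")
    if path.endswith("/asr"):
        return "whisperx-asr-service"
    if path.endswith("/v1/audio/transcriptions"):
        return "openai-compatible"
    return "unknown"  # assumed fallback for paths outside the table


print(detect_backend("http://localhost:9000/asr"))
print(detect_backend("http://localhost:8000/v1/audio/transcriptions"))
```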

OpenAI-Compatible (Speaches)

Recommended for: Standard transcription, single speaker

```yaml
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cuda
    container_name: speaches
    ports:
      - 8000:8000
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu

volumes:
  hf-hub-cache:
```

Prerequisites:

  • NVIDIA Container Toolkit installed

Endpoint: http://localhost:8000/v1/audio/transcriptions

WhisperX ASR Service

Recommended for: Speaker diarization, panel interviews

```yaml
services:
  whisperx-asr:
    image: learnedmachine/whisperx-asr-service
    container_name: whisperx-asr
    ports:
      - "${PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-large-v3}
      - ASR_ENGINE=${ASR_ENGINE:-faster_whisper}
      - HF_TOKEN=${HF_TOKEN:?HF_TOKEN is required for diarization}
    volumes:
      - hf_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --connect-timeout 5 http://localhost:9000/docs || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    restart: unless-stopped

volumes:
  hf_cache:
```

Prerequisites:

  • NVIDIA Container Toolkit installed
  • Hugging Face token with access to diarization models (see below)

Hugging Face Token Setup:

Your HF_TOKEN must have access to these three Hugging Face models:

  1. pyannote/speaker-diarization-3.1: accept the license agreement
  2. pyannote/segmentation-3.0: accept the license agreement
  3. pyannote/speaker-diarization-community-1: accept the license agreement

Sign in with the account that owns the token, visit each model page above, and click "Access repository" or "Accept license". Access is granted to the account, and the token inherits it.

Setup:

Create a .env file in the same directory as your compose.yaml:

```bash
# .env
HF_TOKEN=your_huggingface_token_here
PORT=9000
ASR_MODEL=large-v3
ASR_ENGINE=faster_whisper
```

Then deploy:

```bash
docker compose pull
docker compose up -d
```

Endpoint: http://localhost:9000/asr

Why whisperx-asr-service?

  • Returns speaker embeddings for cross-chunk voice tracking
  • Proven by the Speakr project for reliable diarization
  • Compatible with Intervu's speaker monitoring features

Model Selection

Choose a model based on your needs:

| Model | Speed | Accuracy | VRAM | Languages |
| --- | --- | --- | --- | --- |
| Systran/faster-distil-whisper-small.en | ⚡ Fast | Good | ~1 GB | English only |
| Systran/faster-distil-whisper-medium.en | Medium | Better | ~2 GB | English only |
| Systran/faster-whisper-small | Fast | Good | ~1 GB | Multilingual |
| openai/whisper-medium | Slow | Best | ~5 GB | Multilingual |

Recommendations

  • Real-time interviews: Use faster-distil-whisper-small.en for best speed
  • Non-English: Use faster-whisper-small or another multilingual model (names ending in .en are English-only)
  • Maximum accuracy: Use whisper-medium or whisper-large
  • Speaker diarization: Use whisperx-asr-service with any model

Advanced STT Settings

In Settings → Advanced:

Silence RMS Threshold

Audio below this level is treated as silence and not sent to STT.

  • Default: 0.005
  • Higher: More aggressive filtering (may miss quiet speech)
  • Lower: Less filtering (may send noise to STT)
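
To make the threshold concrete, here is a minimal sketch of an RMS silence gate. It assumes samples are floats normalized to the range -1.0 to 1.0; the function names are hypothetical and Intervu's own audio pipeline may differ.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root mean square level of a chunk of normalized samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silence(samples: Sequence[float], threshold: float = 0.005) -> bool:
    """True when the chunk is quieter than the threshold and can be skipped."""
    return rms(samples) < threshold


# A 440 Hz tone at two amplitudes, sampled at 16 kHz for 0.1 s.
quiet = [0.001 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
loud = [0.1 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
print(is_silence(quiet), is_silence(loud))
```

Raising `threshold` makes more chunks count as silence, which is why a high value can drop quiet speech.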

Audio Chunk Duration

Length of audio chunks sent to STT.

  • Default: 3 seconds
  • Lower: Faster response, but more API calls
  • Higher: Fewer API calls, but slower response
  • With diarization: Minimum 6 seconds (enforced automatically)
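
The trade-off can be put in numbers. This sketch assumes continuous audio with no silence filtering and models the 6-second diarization minimum as a simple clamp; the names are hypothetical.

```python
DIARIZATION_MIN_SECONDS = 6.0


def effective_chunk_seconds(requested: float, diarization: bool = False) -> float:
    """Raise the chunk length to the 6-second minimum when diarization is on."""
    if diarization:
        return max(requested, DIARIZATION_MIN_SECONDS)
    return requested


def requests_per_minute(chunk_seconds: float) -> float:
    """STT calls needed to cover one minute of continuous audio."""
    return 60.0 / chunk_seconds


for requested in (1.5, 3.0, 6.0):
    print(requested, requests_per_minute(requested))
```

At the 3-second default this works out to 20 requests per minute; with diarization enabled the effective 6-second chunks halve that to 10, at the cost of waiting longer for each transcript.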

Hallucination Phrases

Comma-separated phrases that should be ignored.

  • Default: you,thank you,thanks,thanks for watching,thank you for watching,thanks for listening,thank you for listening,bye,goodbye,the end,so,okay,hmm,uh,um,oh,ah,i
  • Purpose: Whisper sometimes hallucinates these phrases from silence
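
A filter of this kind might look like the sketch below. It assumes a transcript is dropped only when the whole chunk, after lowercasing and trimming punctuation, equals a listed phrase. That matching rule is an assumption, chosen because substring matching on entries such as "so" or "i" would remove legitimate speech; the function names are hypothetical.

```python
import string


def parse_phrases(setting: str) -> set[str]:
    """Split the comma-separated setting into a set of lowercase phrases."""
    return {p.strip().lower() for p in setting.split(",") if p.strip()}


def normalize(text: str) -> str:
    """Lowercase and trim surrounding whitespace and punctuation."""
    return text.lower().strip(string.punctuation + string.whitespace)


def is_hallucination(text: str, phrases: set[str]) -> bool:
    """True when the entire transcript matches an ignored phrase."""
    return normalize(text) in phrases


phrases = parse_phrases("you,thank you,thanks for watching,bye")
print(is_hallucination(" Thank you. ", phrases))
print(is_hallucination("Thank you for the question.", phrases))
```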

Using Other STT Services

Intervu supports any OpenAI-compatible STT endpoint.

OpenAI Whisper API

```text
Endpoint: https://api.openai.com/v1/audio/transcriptions
Model: whisper-1
API Key: sk-...
```

Azure OpenAI (Whisper)

```text
Endpoint: https://your-region.api.cognitive.microsoft.com/openai/deployments/whisper/audio/transcriptions?api-version=2024-02-15-preview
Model: whisper
API Key: your-azure-key
```

Custom Endpoint Requirements

Your endpoint must:

  • Accept a POST request to /v1/audio/transcriptions
  • Accept a multipart form (multipart/form-data) containing the audio file
  • Accept a model parameter
  • Return JSON of the form { "text": "transcribed text" }
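
For testing the settings without a GPU, a stub endpoint that meets these requirements can be sketched with the Python standard library. The form field name `file` follows the OpenAI convention and is an assumption here, as are the names `handle_transcription`, `StubSTTHandler`, and `make_server`; the stub returns a fixed transcript and performs no real transcription.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_transcription(path: str, content_type: str, body: bytes) -> tuple[int, dict]:
    """Check a request against the requirements; return (status, JSON payload)."""
    if path.split("?", 1)[0] != "/v1/audio/transcriptions":
        return 404, {"error": "unknown path"}
    if not content_type.startswith("multipart/form-data"):
        return 400, {"error": "expected multipart/form-data"}
    # Crude field check; a real service would parse the multipart body.
    if b'name="file"' not in body or b'name="model"' not in body:
        return 400, {"error": "missing file or model field"}
    return 200, {"text": "stub transcript"}


class StubSTTHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_transcription(
            self.path, self.headers.get("Content-Type", ""), self.rfile.read(length)
        )
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def make_server(port: int = 8000) -> HTTPServer:
    """Create the stub server bound to localhost only."""
    return HTTPServer(("127.0.0.1", port), StubSTTHandler)
```

Calling `make_server().serve_forever()` starts the stub on port 8000, after which the default STT Endpoint value can be pointed at it.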

Troubleshooting

Connection Failed

  • Verify that the STT service is running: docker ps
  • Check that the endpoint URL is correct
  • Query the models endpoint: curl http://localhost:8000/v1/models

Slow Transcription

  • Use a smaller model
  • Ensure GPU acceleration is working
  • Check logs: docker logs <container_name>

Poor Accuracy

  • Speak clearly and close to the microphone
  • Try a larger model
  • Adjust silence threshold in Advanced settings

Hallucinated Text

  • Check hallucination phrases in Advanced settings
  • Add common false positives to the list

Diarization Not Working

  • Verify you're using a diarization-compatible backend (whisperx-asr-service)
  • Check that the endpoint URL uses the /asr path
  • Disable and re-enable diarization to trigger validation
  • Check backend logs for errors

Made with ❤️ by Aldrick Bonaobra