STT Endpoint

Configure the Speech-to-Text (STT) endpoint for transcription.

Overview

Intervu uses an OpenAI-compatible STT endpoint to transcribe audio in real time. We recommend Speaches for local, private transcription.


Configuration

Open Settings (gear icon) and locate the STT section:

| Setting | Description | Example |
| --- | --- | --- |
| STT Endpoint | URL for the STT API | http://localhost:8000/v1/audio/transcriptions |
| STT Model | Model identifier | Systran/faster-distil-whisper-small.en |
| STT API Key | API key (if required) | Leave empty for Speaches |

Default Values

STT Endpoint: http://localhost:8000/v1/audio/transcriptions
STT Model: Systran/faster-distil-whisper-small.en
STT API Key: (empty)

Testing the Connection

After configuring the endpoint:

  1. Click the Test STT button in Settings
  2. Intervu sends a silent audio sample to verify the connection
  3. A success message confirms the endpoint is working
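
For readers who want to reproduce a similar check by hand, the sketch below builds a short silent WAV file in memory. The exact sample Intervu sends is not documented here, so the duration, sample rate, and the helper name `make_silent_wav` are assumptions chosen for illustration.

```python
import io
import wave


def make_silent_wav(duration_s: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV that contains only silence.

    Hypothetical helper: the format Intervu actually uses for its
    connection test is an assumption here.
    """
    n_frames = int(duration_s * sample_rate)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)  # all-zero samples are silence
    return buf.getvalue()
```

The resulting bytes can be posted to the configured endpoint as the audio file; a working endpoint should answer with a JSON body even if the transcript is empty.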

Test First

Always test the STT endpoint before starting an interview. This catches configuration errors early.


Model Selection

Choose a model based on your needs:

| Model | Speed | Accuracy | VRAM | Languages |
| --- | --- | --- | --- | --- |
| Systran/faster-distil-whisper-small.en | ⚡ Fast | Good | ~1GB | English only |
| Systran/faster-distil-whisper-medium.en | Medium | Better | ~2GB | English only |
| Systran/faster-whisper-small | Fast | Good | ~1GB | Multilingual |
| openai/whisper-medium | Slow | Best | ~5GB | Multilingual |

Recommendations

  • Real-time interviews: Use faster-distil-whisper-small.en for best speed
  • Non-English: Use faster-whisper-small (without the .en suffix) or another multilingual model
  • Maximum accuracy: Use whisper-medium or whisper-large

Advanced STT Settings

In Settings → Advanced:

Silence RMS Threshold

Audio below this level is treated as silence and not sent to STT.

  • Default: 0.005
  • Higher: More aggressive filtering (may miss quiet speech)
  • Lower: Less filtering (may send noise to STT)
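
To make the threshold concrete, here is a minimal sketch of an RMS silence gate. It assumes samples are floats normalized to the range -1.0 to 1.0, which is consistent with a default of 0.005 but is not stated by the setting itself; the function names are illustrative and not Intervu's own.

```python
import math
from typing import Sequence

SILENCE_RMS_THRESHOLD = 0.005  # default value of the setting


def rms(samples: Sequence[float]) -> float:
    """Root mean square of samples (assumed normalized to [-1.0, 1.0])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silence(samples: Sequence[float],
               threshold: float = SILENCE_RMS_THRESHOLD) -> bool:
    """True when the audio level is below the threshold, so it would be skipped."""
    return rms(samples) < threshold
```

Under this sketch, a constant amplitude of 0.001 counts as silence, while an amplitude of 0.1 is passed on to STT.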

Audio Chunk Duration

Length of audio chunks sent to STT.

  • Default: 3 seconds
  • Lower: Faster response, but more API calls
  • Higher: Fewer API calls, but slower response
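
The trade-off can be seen by counting requests. The sketch below splits a buffer of samples into fixed-length chunks; the function name and the 16 kHz sample rate used in the usage note are assumptions for illustration, not values taken from Intervu.

```python
from typing import Iterator, Sequence


def split_into_chunks(samples: Sequence[float],
                      sample_rate: int,
                      chunk_seconds: float = 3.0) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of at most chunk_seconds of audio.

    The last chunk may be shorter than the others.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least 1 sample")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]
```

With 10 seconds of audio at an assumed 16 kHz, the 3-second default produces 4 chunks (and so 4 API calls), while a 1-second chunk duration produces 10.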

Hallucination Phrases

Comma-separated phrases that should be ignored.

  • Default: you,thank you,thanks,thanks for watching,thank you for watching,thanks for listening,thank you for listening,bye,goodbye,the end,so,okay,hmm,uh,um,oh,ah,i
  • Purpose: Whisper sometimes hallucinates these phrases from silence
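
A filter of this kind might look like the sketch below. The matching rule is an assumption: the whole transcript is compared against the list after trimming whitespace and punctuation and lowercasing. Intervu's actual rule may differ, and the function names are hypothetical.

```python
import string
from typing import Set

DEFAULT_PHRASES = (
    "you,thank you,thanks,thanks for watching,thank you for watching,"
    "thanks for listening,thank you for listening,bye,goodbye,the end,"
    "so,okay,hmm,uh,um,oh,ah,i"
)


def parse_phrases(csv: str) -> Set[str]:
    """Turn the comma-separated setting into a set of lowercase phrases."""
    return {p.strip().lower() for p in csv.split(",") if p.strip()}


def is_hallucination(text: str, phrases: Set[str]) -> bool:
    """True when the whole transcript is one of the listed phrases.

    Assumed rule: exact match after stripping surrounding whitespace
    and ASCII punctuation, ignoring case.
    """
    normalized = text.strip(string.whitespace + string.punctuation).lower()
    return normalized in phrases
```

Under this rule a transcript of " Thank you. " is dropped, while "Thank you for the question" is kept because only part of it matches a listed phrase.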

Using Other STT Services

Intervu supports any OpenAI-compatible STT endpoint.

OpenAI Whisper API

Endpoint: https://api.openai.com/v1/audio/transcriptions
Model: whisper-1
API Key: sk-...

Azure OpenAI Service (Whisper)

Endpoint: https://your-region.api.cognitive.microsoft.com/openai/deployments/whisper/audio/transcriptions?api-version=2024-02-15-preview
Model: whisper
API Key: your-azure-key

Custom Endpoint Requirements

Your endpoint must:

  • Accept a POST request to /v1/audio/transcriptions
  • Accept a multipart form containing the audio file
  • Accept a model parameter in the same form
  • Return JSON with the transcript in a text field: { "text": "transcribed text" }
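
As a rough illustration of these requirements, the sketch below builds such a request with the Python standard library and reads the transcript from the response. The form field name `file` and the `Authorization: Bearer` header follow the OpenAI API convention; other services may expect different names, and both helper functions are hypothetical.

```python
import json
import urllib.request
import uuid
from typing import Optional


def build_transcription_request(endpoint: str, model: str, audio: bytes,
                                api_key: Optional[str] = None,
                                filename: str = "audio.wav") -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the model name and audio file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Text field: the model identifier.
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # File field: "file" is the name used by the OpenAI API (assumed here).
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:  # leave empty for services that need no key
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def parse_transcription(response_body: bytes) -> str:
    """Extract the transcript from a {"text": "..."} JSON response."""
    return json.loads(response_body)["text"]
```

To send the request, pass it to `urllib.request.urlopen` and hand the bytes returned by `.read()` to `parse_transcription`.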

Troubleshooting

Connection Failed

  • Verify Speaches is running: docker ps
  • Check the endpoint URL is correct
  • Try curl http://localhost:8000/v1/models

Slow Transcription

  • Use a smaller model
  • Ensure GPU acceleration is working
  • Check Speaches logs: docker logs speaches

Poor Accuracy

  • Speak clearly and close to the microphone
  • Try a larger model
  • Adjust the Silence RMS Threshold in Advanced settings

Hallucinated Text

  • Check hallucination phrases in Advanced settings
  • Add common false positives to the list

Made with ❤️ by Aldrick Bonaobra