Speaker Diarization
Multi-speaker identification for panel interviews and multi-participant conversations.
Overview
Speaker diarization automatically identifies and labels different speakers in your interview audio. Instead of just "interviewer" and "you", Intervu can detect individual speakers like "John", "Sarah", or "Manager #1".
When to Use
Speaker diarization is useful for:
- Panel interviews — Multiple interviewers asking questions
- Group interviews — Several candidates being interviewed together
- Team meetings — Multiple participants in the conversation
- Complex Q&A — Following which interviewer asked what
Enabling Speaker Diarization
1. Prerequisites
You need an STT backend that supports diarization:
Recommended: learnedmachine/whisperx-asr-service
# docker/whisperx-asr/compose.yaml
services:
  whisperx-asr:
    image: learnedmachine/whisperx-asr-service
    container_name: whisperx-asr
    ports:
      - "${PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-large-v3}
      - ASR_ENGINE=${ASR_ENGINE:-faster_whisper}
      - HF_TOKEN=${HF_TOKEN:?HF_TOKEN is required for diarization}
    volumes:
      - hf_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --connect-timeout 5 http://localhost:9000/docs || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    restart: unless-stopped
volumes:
  hf_cache:
Prerequisites:
- NVIDIA Container Toolkit installed (nvidia-container-toolkit)
- Hugging Face token with access to diarization models (see below)
Hugging Face Token Setup:
Your HF_TOKEN must have access to these three Hugging Face models:
- pyannote/speaker-diarization-3.1 — Accept the license agreement
- pyannote/segmentation-3.0 — Accept the license agreement
- pyannote/speaker-diarization-community-1 — Accept the license agreement
Visit each model page above and click "Access repository" or "Accept license" to grant your token access.
Setup:
Create a .env file in the same directory as your compose.yaml:
# .env
HF_TOKEN=your_huggingface_token_here
PORT=9000
ASR_MODEL=large-v3
ASR_ENGINE=faster_whisper
Then deploy:
docker compose pull
docker compose up -d
Endpoint: http://localhost:9000/asr
Alternative: OpenAI-compatible endpoint with diarization support
2. Configuration
- Open Settings (gear icon)
- Go to Advanced Settings
- Toggle Speaker Diarization
The system will validate that your STT backend supports diarization before enabling.
3. Adjust Settings
| Setting | Description | Default |
|---|---|---|
| Speaker Diarization | Enable/disable | Off |
| Min Speakers | Minimum speakers to detect | 2 |
| Remember Speakers | Save voice profiles across sessions | Off |
Min Speakers (2-6): The minimum number of speakers to look for. Set higher for larger panels.
Remember Speakers: When enabled, speaker voice profiles (embeddings) are saved to disk and restored on the next session. This enables cross-session speaker recognition.
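The three settings and the 2-6 range for Min Speakers can be modeled as a small validated record. This is only a minimal sketch: the class name `DiarizationSettings` and its field names are hypothetical and do not reflect Intervu's actual settings schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiarizationSettings:
    """Hypothetical container mirroring the settings table above."""

    enabled: bool = False            # Speaker Diarization (default: Off)
    min_speakers: int = 2            # Min Speakers (default: 2, allowed 2-6)
    remember_speakers: bool = False  # Remember Speakers (default: Off)

    def __post_init__(self) -> None:
        # Reject values outside the documented 2-6 range.
        if not 2 <= self.min_speakers <= 6:
            raise ValueError("min_speakers must be between 2 and 6")


panel = DiarizationSettings(enabled=True, min_speakers=4)
print(panel.min_speakers)  # 4
```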
How It Works
Speaker Detection
When enabled, Intervu:
- Sends system audio to the STT backend with diarize=true
- Receives per-speaker segments with embeddings
- Matches speakers across audio chunks using cosine similarity
- Assigns labels like SPEAKER_00, SPEAKER_01, etc.
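To illustrate the first step, the sketch below adds diarize=true to an STT endpoint URL. The helper name `build_asr_url` is hypothetical, and only the `diarize` parameter comes from this page; check your backend's /docs page for the parameters it actually accepts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_asr_url(endpoint: str, diarize: bool = True) -> str:
    """Return the endpoint URL with a diarize=<true|false> query parameter.

    Existing query parameters on the endpoint are preserved.
    """
    parts = urlsplit(endpoint)
    query = dict(parse_qsl(parts.query))
    query["diarize"] = "true" if diarize else "false"
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "")
    )


print(build_asr_url("http://localhost:9000/asr"))
# http://localhost:9000/asr?diarize=true
```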
Cross-Chunk Consistency
The system maintains speaker identity across audio chunks using:
- 256-dim speaker embeddings — Voice fingerprints
- Cosine similarity matching — Compare embeddings between chunks
- Weighted moving average (30% new / 70% existing) — Update speaker profiles
- Ambiguity check — Skip matching when top 2 candidates are within 5% similarity
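The matching rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Intervu's implementation: the acceptance threshold `MATCH_THRESHOLD` is a made-up value, "within 5% similarity" is read as an absolute difference of 0.05 between the top two cosine scores, and the demo uses 3-dimensional vectors in place of 256-dimensional embeddings.

```python
import math

MATCH_THRESHOLD = 0.6    # assumption: minimum similarity to accept a match
AMBIGUITY_MARGIN = 0.05  # assumption: "within 5%" read as absolute 0.05
NEW_WEIGHT = 0.3         # 30% new / 70% existing, as described above


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_speaker(embedding, profiles):
    """Return the best-matching label, or None if no match or ambiguous."""
    scored = sorted(
        ((cosine(embedding, vec), label) for label, vec in profiles.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < MATCH_THRESHOLD:
        return None  # nothing similar enough
    if len(scored) > 1 and scored[0][0] - scored[1][0] < AMBIGUITY_MARGIN:
        return None  # top two candidates too close: skip matching
    return scored[0][1]


def update_profile(profiles, label, embedding):
    """Blend a new embedding into the stored profile (weighted moving average)."""
    old = profiles[label]
    profiles[label] = [
        (1 - NEW_WEIGHT) * o + NEW_WEIGHT * n for o, n in zip(old, embedding)
    ]


profiles = {"SPEAKER_00": [1.0, 0.0, 0.0], "SPEAKER_01": [0.0, 1.0, 0.0]}
print(match_speaker([0.9, 0.1, 0.0], profiles))  # SPEAKER_00
print(match_speaker([1.0, 1.0, 0.0], profiles))  # None (ambiguous)
```

What happens when `match_speaker` returns None (for example, creating a new speaker versus leaving the segment unassigned) is left to the caller in this sketch.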
Display in Transcript
[SPEAKER_00]: Can you tell me about your experience?
[SPEAKER_01]: I've worked with React for 5 years...
[SPEAKER_00]: That's great. What about Node.js?
Speaker Monitor
When diarization is enabled, a speaker icon appears in the title bar. Click it to open the Speaker Monitor panel.
Panel Contents
| Info | Description |
|---|---|
| Speaker Name | Current display name (or SPEAKER_XX) |
| First Seen | When the speaker was first detected |
| Last Spoke | Relative time since last speech |
| Message Count | Total messages from this speaker |
| Last Messages | Preview of last 3 messages |
Actions
- Rename: Click the speaker name to edit
- Clear: Reset all speaker data (names + embeddings)
- Accept/Dismiss: AI naming suggestions (see below)
AI-Assisted Speaker Naming
Intervu can automatically detect speaker names from conversational cues.
How It Works
The LLM analyzes transcript content for name mentions:
Transcript: "Hi Anna, can you tell me about your background?"
"Sure, John. I've worked in..."
Detected: SPEAKER_00 = "John"
SPEAKER_01 = "Anna"
Suggestions
When the LLM detects potential names:
- Sparkle hint (✨) appears next to unnamed speakers
- Click the speaker label to see the suggestion
- Accept to apply the name, or Dismiss to ignore
Trigger Conditions
- Minimum 5 non-'you' transcript entries
- 8-second debounce after last transcript update
- Only suggests for unnamed speakers (SPEAKER_XX)
- Dismissed suggestions are remembered per session
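The trigger conditions above amount to a simple gate in front of the LLM call. The following sketch shows one way to express that gate; the function name, the `(label, text)` entry shape, and the timestamp arguments are assumptions for illustration only.

```python
import re

MIN_ENTRIES = 5          # minimum non-'you' transcript entries
DEBOUNCE_SECONDS = 8.0   # quiet period after the last transcript update
UNNAMED = re.compile(r"^SPEAKER_\d+$")


def speakers_to_suggest(entries, speaker_map, dismissed, last_update, now):
    """Return unnamed speaker labels eligible for a naming suggestion.

    entries: list of (label, text) tuples in transcript order.
    speaker_map: labels that already have a display name.
    dismissed: labels whose suggestion was dismissed this session.
    last_update / now: timestamps in seconds.
    """
    others = [label for label, _ in entries if label.lower() != "you"]
    if len(others) < MIN_ENTRIES:
        return []
    if now - last_update < DEBOUNCE_SECONDS:
        return []
    unique = dict.fromkeys(others)  # unique labels, first-seen order
    return [
        label
        for label in unique
        if UNNAMED.match(label)
        and label not in speaker_map
        and label not in dismissed
    ]
```

For example, with five entries from SPEAKER_00 and SPEAKER_01, SPEAKER_00 already named, and at least 8 seconds since the last update, only SPEAKER_01 would be returned.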
Speaker Naming
Manual Renaming
Click any speaker label in the transcript to rename:
- Click the label (e.g., [SPEAKER_00])
- Type a display name (e.g., "John Smith")
- Press Enter to confirm
Names persist in your settings across sessions.
Speaker Map
The speaker map stores display names:
{
"SPEAKER_00": "John Smith",
"SPEAKER_01": "Sarah Chen",
"SPEAKER_02": "Manager #3"
}
This is saved in your settings and restored on app restart.
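A speaker map of this shape is easy to work with as a plain dictionary. The sketch below shows a rename and a lookup that falls back to the raw label; the helper names are hypothetical, and treating a blank name as "remove the entry" is an assumption rather than documented behavior.

```python
import json


def rename_speaker(speaker_map, label, name):
    """Return a new map with label -> name; a blank name removes the entry."""
    updated = dict(speaker_map)
    name = name.strip()
    if name:
        updated[label] = name
    else:
        updated.pop(label, None)
    return updated


def display_name(speaker_map, label):
    """Return the display name, or the raw label if none is set."""
    return speaker_map.get(label, label)


saved = '{"SPEAKER_00": "John Smith", "SPEAKER_01": "Sarah Chen"}'
speaker_map = json.loads(saved)
speaker_map = rename_speaker(speaker_map, "SPEAKER_02", "Manager #3")
print(display_name(speaker_map, "SPEAKER_02"))  # Manager #3
print(display_name(speaker_map, "SPEAKER_03"))  # SPEAKER_03
```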
LLM Integration
Speaker-Aware Prompting
When diarization is active, the LLM receives additional context:
Detected Speakers:
- SPEAKER_00 (John Smith)
- SPEAKER_01 (Sarah Chen)
[John Smith]: Can you tell me about your experience?
[You]: I've worked with React for 5 years...
[Sarah Chen]: What about Node.js?
Benefits
- Better question attribution — Know who asked what
- More relevant answers — Context from all speakers
- Improved flow — Track conversation across multiple people
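The speaker-aware context shown earlier in this section could be assembled along the following lines. This is a sketch only: the function name and the `(label, text)` entry shape are assumptions, and the real prompt may contain more than this.

```python
def build_speaker_context(entries, speaker_map):
    """Build a 'Detected Speakers' header followed by the named transcript.

    entries: list of (label, text) tuples; the user's own lines use "You".
    speaker_map: mapping from SPEAKER_XX labels to display names.
    """
    labels = dict.fromkeys(label for label, _ in entries if label != "You")
    lines = ["Detected Speakers:"]
    for label in labels:
        name = speaker_map.get(label)
        lines.append(f"- {label} ({name})" if name else f"- {label}")
    for label, text in entries:
        lines.append(f"[{speaker_map.get(label, label)}]: {text}")
    return "\n".join(lines)


names = {"SPEAKER_00": "John Smith", "SPEAKER_01": "Sarah Chen"}
transcript = [
    ("SPEAKER_00", "Can you tell me about your experience?"),
    ("You", "I've worked with React for 5 years..."),
    ("SPEAKER_01", "What about Node.js?"),
]
print(build_speaker_context(transcript, names))
```

Speakers without a display name keep their SPEAKER_XX label in both the header and the transcript lines.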
Audio Chunk Duration
When diarization is enabled, audio chunk duration is automatically adjusted:
- Minimum: 6 seconds (enforced for better speaker separation)
- UI lock: The chunk duration input is disabled with an explanatory note
This trade-off provides better diarization accuracy at the cost of slightly longer initial response time.
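In effect, the configured chunk duration is clamped to a 6-second floor while diarization is on. A minimal sketch of that rule, with a hypothetical function name:

```python
MIN_DIARIZATION_CHUNK_SECONDS = 6.0


def effective_chunk_seconds(requested: float, diarization_enabled: bool) -> float:
    """Return the chunk duration to use, enforcing the 6 s floor when diarizing."""
    if diarization_enabled:
        return max(requested, MIN_DIARIZATION_CHUNK_SECONDS)
    return requested


print(effective_chunk_seconds(3.0, diarization_enabled=True))   # 6.0
print(effective_chunk_seconds(3.0, diarization_enabled=False))  # 3.0
```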
Clearing Speaker Data
Clear Options
The clear menu (trash icon) has three options:
| Option | Effect |
|---|---|
| Clear transcript & answers | Also clears speaker names and embeddings |
| Clear speaker data | Resets speaker names + embeddings only |
| Clear all (including ratings) | Clears everything including ratings |
Cross-Session Persistence
If Remember Speakers is enabled:
- Speaker profiles are saved to speaker-profiles.json
- Restored on next listening session
- Cleared when you click Clear speaker data
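The save, restore, and clear cycle can be sketched as below. The layout of speaker-profiles.json is not documented here, so the `{label: [floats]}` schema and the function names are assumptions made purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_profiles(path, profiles):
    """Write {label: embedding} to disk as JSON (assumed schema)."""
    Path(path).write_text(json.dumps(profiles), encoding="utf-8")


def load_profiles(path):
    """Load saved profiles; return {} if the file is missing or unreadable."""
    p = Path(path)
    if not p.exists():
        return {}
    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def clear_profiles(path):
    """Delete the saved profiles, if any ("Clear speaker data")."""
    Path(path).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "speaker-profiles.json"
    save_profiles(path, {"SPEAKER_00": [0.1, 0.2]})
    print(load_profiles(path))  # {'SPEAKER_00': [0.1, 0.2]}
    clear_profiles(path)
    print(load_profiles(path))  # {}
```

Returning an empty mapping on a missing or corrupt file means a damaged profile store degrades to a fresh session instead of an error.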
Troubleshooting
Diarization Not Working
Symptoms: All speech shows as single speaker
Causes:
- STT backend doesn't support diarization
- Backend returns error
Solutions:
- Verify you're using a compatible backend (whisperx-asr-service)
- Check STT endpoint URL in Settings
- Disable and re-enable diarization to re-validate
Too Many Speakers Detected
Symptoms: More speakers than actual people
Solutions:
- Set Min Speakers no higher than the number of people actually speaking (a minimum above the real count forces extra speakers)
- Check audio quality (noise can create false speakers)
- Use "Clear speaker data" to reset
Speakers Switching Incorrectly
Symptoms: Same person labeled as different speakers
Causes:
- Similar voices
- Short audio chunks
- Cross-talk (people speaking simultaneously)
Solutions:
- Ensure audio chunk duration is at least 6 seconds
- Minimize cross-talk when possible
- Manually rename consistent misidentifications
Backend Error on Enable
Symptoms: Toggle switches off immediately
Solutions:
- Check STT backend is running
- Verify backend supports the /asr endpoint or OpenAI-compatible diarization
- Check logs for specific error message
High CPU/Memory Usage
Symptoms: System slowdown during diarization
Causes:
- Diarization is computationally expensive
- Large number of speakers
Solutions:
- Use GPU-accelerated STT backend
- Reduce Min Speakers setting
- Disable when not needed
Performance Tips
Optimal Chunk Duration
Keep audio chunks at 6+ seconds for best results. The UI enforces this automatically.
Backend Selection
| Backend | Diarization | Embedding Support | Notes |
|---|---|---|---|
| learnedmachine/whisperx-asr-service | ✅ | ✅ | Recommended |
| OpenAI-compatible | Varies | Varies | Check provider |
Speaker Count
Set Min Speakers to the actual expected number. Higher values increase processing time.
Limitations
Current Constraints
| Limitation | Details |
|---|---|
| Only system audio diarized | Microphone (your voice) is always "you" |
| Cross-talk | Simultaneous speech may cause misattribution |
| Short utterances | Brief phrases (< 2s) may not diarize well |
| Voice similarity | Very similar voices may be confused |
| Backend dependent | Requires compatible STT service |
Known Issues
- 500 errors: Some backends fail on NaN embeddings. Intervu automatically retries without embeddings.
- Session reset: Speaker data is lost on app restart unless "Remember Speakers" is enabled.
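The retry behavior described for 500 errors follows a common fallback pattern. The sketch below is transport-agnostic: `send` is a hypothetical callable standing in for the actual HTTP request, and how embeddings are switched off on the backend is not specified on this page.

```python
def transcribe_with_fallback(send, audio):
    """Request diarization with embeddings, retrying once without them on HTTP 500.

    send(audio, with_embeddings) -> (status_code, payload) is supplied by the caller.
    """
    status, payload = send(audio, True)
    if status == 500:
        # Backend failed (for example on NaN embeddings): retry without embeddings.
        status, payload = send(audio, False)
    if status != 200:
        raise RuntimeError(f"STT backend returned HTTP {status}")
    return payload


def flaky_send(audio, with_embeddings):
    """Fake transport for the demo: fails only when embeddings are requested."""
    if with_embeddings:
        return 500, None
    return 200, {"segments": [], "embeddings": None}


print(transcribe_with_fallback(flaky_send, b"\x00\x01"))
# {'segments': [], 'embeddings': None}
```

Without embeddings, cross-chunk matching has nothing to compare for that chunk, so speaker labels from a fallback response may not line up with earlier chunks.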
Comparison: Diarization On vs Off
| Feature | Diarization Off | Diarization On |
|---|---|---|
| Speakers | Interviewer / You | Individual speakers + You |
| Labels | [Interviewer], [You] | [SPEAKER_00], [You] |
| Speaker monitor | Hidden | Visible |
| AI naming | N/A | Available |
| Chunk duration | 3s default | 6s minimum |
| CPU usage | Lower | Higher |
| Backend | Any OpenAI-compatible | Requires diarization support |
Next Steps
- Interview Timer — Track interview duration
- Advanced Mode — Dual-LLM question extraction
- STT Endpoint Configuration — Backend setup