Speaker Diarization

Multi-speaker identification for panel interviews and multi-participant conversations.

Overview

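Speaker diarization automatically distinguishes and labels the different speakers in your interview audio. Instead of just "interviewer" and "you", Intervu tracks each speaker separately: first under generic labels such as SPEAKER_00 and SPEAKER_01, then under names like "John", "Sarah", or "Manager #1" once you rename a speaker or accept an AI naming suggestion.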

When to Use

Speaker diarization is useful for:

Panel interviews — Multiple interviewers asking questions
Group interviews — Several candidates being interviewed together
Team meetings — Multiple participants in the conversation
Complex Q&A — Following which interviewer asked what

Enabling Speaker Diarization

1. Prerequisites

You need an STT backend that supports diarization:

Recommended: learnedmachine/whisperx-asr-service

yaml

# docker/whisperx-asr/compose.yaml
services:
  whisperx-asr:
    image: learnedmachine/whisperx-asr-service
    container_name: whisperx-asr
    ports:
      - "${PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-large-v3}
      - ASR_ENGINE=${ASR_ENGINE:-faster_whisper}
      - HF_TOKEN=${HF_TOKEN:?HF_TOKEN is required for diarization}
    volumes:
      - hf_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --connect-timeout 5 http://localhost:9000/docs || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    restart: unless-stopped

volumes:
  hf_cache:

Prerequisites:

NVIDIA Container Toolkit installed (nvidia-container-toolkit)
Hugging Face token with access to diarization models (see below)

Hugging Face Token Setup:

Your HF_TOKEN must belong to an account that has access to these three Hugging Face models:

pyannote/speaker-diarization-3.1 — Accept the license agreement
pyannote/segmentation-3.0 — Accept the license agreement
pyannote/speaker-diarization-community-1 — Accept the license agreement

Visit each model page above while signed in to the account that owns the token, then click "Access repository" or "Accept license" so the token can download the model.

Setup:

Create a .env file in the same directory as your compose.yaml:

bash

# .env
HF_TOKEN=your_huggingface_token_here
PORT=9000
ASR_MODEL=large-v3
ASR_ENGINE=faster_whisper

Then deploy:

bash

docker compose pull
docker compose up -d

Endpoint: http://localhost:9000/asr
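
Before pointing Intervu at the endpoint, it can help to confirm that the container has finished loading its models. The sketch below probes the same /docs URL that the compose health check uses. The function name, default base URL, and timeout are illustrative and are not part of Intervu or the service.

```python
import http.client
import urllib.error
import urllib.request


def backend_is_ready(base_url: str = "http://localhost:9000", timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/docs answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/docs", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not ready yet".
        return False
```

Calling backend_is_ready() in a loop with a short sleep mirrors what the health check does during the 120-second start period.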

Alternative: OpenAI-compatible endpoint with diarization support

2. Configuration

Open Settings (gear icon)
Go to Advanced Settings
Toggle Speaker Diarization

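Intervu validates that your STT backend supports diarization before enabling the feature. If validation fails, the toggle switches back off (see Backend Error on Enable under Troubleshooting).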

3. Adjust Settings

Setting	Description	Default
Speaker Diarization	Enable/disable	Off
Min Speakers	Minimum speakers to detect	2
Remember Speakers	Save voice profiles across sessions	Off

Min Speakers (2-6): The minimum number of speakers to look for. Set higher for larger panels.

Remember Speakers: When enabled, speaker voice profiles (embeddings) are saved to disk and restored on the next session. This enables cross-session speaker recognition.
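
The three settings and their documented defaults can be modeled as a small validated record. This is only a sketch: the class and field names are hypothetical and do not reflect how Intervu actually stores its settings.

```python
from dataclasses import dataclass

MIN_SPEAKERS_RANGE = (2, 6)  # documented range for Min Speakers


@dataclass(frozen=True)
class DiarizationSettings:
    enabled: bool = False            # Speaker Diarization: Off by default
    min_speakers: int = 2            # Min Speakers: default 2
    remember_speakers: bool = False  # Remember Speakers: Off by default

    def __post_init__(self) -> None:
        # Reject values outside the documented 2-6 range.
        low, high = MIN_SPEAKERS_RANGE
        if not low <= self.min_speakers <= high:
            raise ValueError(f"min_speakers must be between {low} and {high}")
```

For a four-person panel, DiarizationSettings(enabled=True, min_speakers=4) would be a valid configuration, while min_speakers=7 raises a ValueError.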

How It Works

Speaker Detection

When enabled, Intervu:

Sends system audio to the STT backend with diarize=true
Receives per-speaker segments with embeddings
Matches speakers across audio chunks using cosine similarity
Assigns labels like SPEAKER_00, SPEAKER_01, etc.
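
As a rough illustration of the first and last steps, the sketch below builds a request URL with diarize=true and formats labels in the SPEAKER_XX style. Only the diarize flag comes from the text above. The min_speakers query parameter and both function names are assumptions; check your backend's API documentation for the parameters it actually accepts.

```python
from urllib.parse import urlencode


def build_asr_url(endpoint: str, min_speakers: int = 2) -> str:
    """Append diarization query parameters to the STT endpoint URL."""
    params = {
        "diarize": "true",             # documented: diarize=true
        "min_speakers": min_speakers,  # assumption: parameter name not confirmed
    }
    return f"{endpoint}?{urlencode(params)}"


def speaker_label(index: int) -> str:
    """Format a zero-based speaker index as SPEAKER_00, SPEAKER_01, ..."""
    return f"SPEAKER_{index:02d}"
```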

Cross-Chunk Consistency

The system maintains speaker identity across audio chunks using:

256-dim speaker embeddings — Voice fingerprints
Cosine similarity matching — Compare embeddings between chunks
Weighted moving average (30% new / 70% existing) — Update speaker profiles
Ambiguity check — Skip matching when top 2 candidates are within 5% similarity
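
The matching rules above can be sketched in a few lines of plain Python. The 30%/70% weights and the 5% ambiguity margin come from the list. The margin is read here as an absolute gap of 0.05 in cosine similarity, and MATCH_THRESHOLD is an assumed value, since the text does not state a minimum similarity. Function names are illustrative, and the toy vectors in the usage note have 3 dimensions rather than 256.

```python
import math

NEW_WEIGHT = 0.3         # documented: 30% new / 70% existing
AMBIGUITY_MARGIN = 0.05  # documented 5%, interpreted as an absolute similarity gap
MATCH_THRESHOLD = 0.6    # assumption: minimum similarity is not stated in the text


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, profiles):
    """Return the best-matching speaker id, or None if nothing matches clearly."""
    scored = sorted(
        ((cosine_similarity(embedding, profile), speaker_id)
         for speaker_id, profile in profiles.items()),
        reverse=True,
    )
    if not scored or scored[0][0] < MATCH_THRESHOLD:
        return None
    # Ambiguity check: skip matching when the top two candidates are too close.
    if len(scored) > 1 and scored[0][0] - scored[1][0] < AMBIGUITY_MARGIN:
        return None
    return scored[0][1]


def update_profile(existing, new):
    """Weighted moving average: 70% existing profile, 30% new embedding."""
    return [(1 - NEW_WEIGHT) * e + NEW_WEIGHT * n for e, n in zip(existing, new)]
```

With profiles {"SPEAKER_00": [1, 0, 0], "SPEAKER_01": [0, 1, 0]}, the embedding [0.9, 0.1, 0] matches SPEAKER_00, while [1, 1, 0] is equally close to both and is skipped as ambiguous. Blending slowly toward new embeddings keeps a single noisy chunk from dragging a profile away from the speaker's established voice.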

Display in Transcript

[SPEAKER_00]: Can you tell me about your experience?
[SPEAKER_01]: I've worked with React for 5 years...
[SPEAKER_00]: That's great. What about Node.js?

Speaker Monitor

When diarization is enabled, a speaker icon appears in the title bar. Click it to open the Speaker Monitor panel.

Panel Contents

Info	Description
Speaker Name	Current display name (or `SPEAKER_XX`)
First Seen	When the speaker was first detected
Last Spoke	Relative time since last speech
Message Count	Total messages from this speaker
Last Messages	Preview of last 3 messages
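
The fields in this table map naturally onto a small per-speaker record. The sketch below is one hypothetical way to model it and does not reflect Intervu's internal data structure. A bounded deque keeps only the three most recent messages for the preview.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class SpeakerRecord:
    speaker_id: str                      # e.g. "SPEAKER_00"
    first_seen: datetime                 # First Seen
    display_name: Optional[str] = None   # set after a rename
    last_spoke: Optional[datetime] = None
    message_count: int = 0
    last_messages: deque = field(default_factory=lambda: deque(maxlen=3))

    def record_message(self, text: str, at: datetime) -> None:
        self.last_spoke = at
        self.message_count += 1
        self.last_messages.append(text)  # oldest entry drops out beyond 3

    @property
    def name(self) -> str:
        # Fall back to the raw label until the speaker is renamed.
        return self.display_name or self.speaker_id
```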

Actions

Rename: Click the speaker name to edit
Clear: Reset all speaker data (names + embeddings)
Accept/Dismiss: AI naming suggestions (see below)

AI-Assisted Speaker Naming

Intervu can automatically detect speaker names from conversational cues.

How It Works

The LLM analyzes transcript content for name mentions:

Transcript: "Hi Anna, can you tell me about your background?"
            "Sure, John. I've worked in..."

Detected: SPEAKER_00 = "John"
          SPEAKER_01 = "Anna"

Suggestions

When the LLM detects potential names:

Sparkle hint (✨) appears next to unnamed speakers
Click the speaker label to see the suggestion
Accept to apply the name, or Dismiss to ignore

Trigger Conditions

Minimum 5 non-'you' transcript entries
8-second debounce after last transcript update
Only suggests for unnamed speakers (SPEAKER_XX)
Dismissed suggestions are remembered per session
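
Taken together, these conditions amount to a gate that decides which speakers are eligible for a naming request. The sketch below is an illustration under several assumptions: transcript entries are (speaker_id, text) pairs, the user's own entries carry the id "you", a speaker counts as unnamed when it has no entry in the speaker map, and dismissals are tracked per speaker id. All names are hypothetical.

```python
MIN_ENTRIES = 5         # documented: minimum 5 non-'you' entries
DEBOUNCE_SECONDS = 8.0  # documented: 8-second debounce
USER_ID = "you"         # assumption: id used for the user's own entries


def speakers_needing_suggestions(entries, speaker_map, dismissed, last_update, now):
    """Return unnamed, non-dismissed speaker ids, or [] if the gate is closed.

    entries: list of (speaker_id, text) tuples
    last_update / now: timestamps in seconds
    """
    others = [speaker_id for speaker_id, _ in entries if speaker_id != USER_ID]
    if len(others) < MIN_ENTRIES:
        return []
    if now - last_update < DEBOUNCE_SECONDS:
        return []
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return [
        speaker_id
        for speaker_id in dict.fromkeys(others)
        if speaker_id not in speaker_map and speaker_id not in dismissed
    ]
```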

Speaker Naming

Manual Renaming

Click any speaker label in the transcript to rename:

Click the label (e.g., [SPEAKER_00])
Type a display name (e.g., "John Smith")
Press Enter to confirm

Names persist in your settings across sessions.

Speaker Map

The speaker map stores display names:

json

{
  "SPEAKER_00": "John Smith",
  "SPEAKER_01": "Sarah Chen",
  "SPEAKER_02": "Manager #3"
}

This is saved in your settings and restored on app restart.
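
A map like this is typically used as a lookup with a fallback to the raw label. The sketch below shows that idea along with a rename helper. Both function names are hypothetical, and treating an empty name as "revert to the raw label" is an assumption rather than documented behavior.

```python
import json


def display_name(speaker_id: str, speaker_map: dict) -> str:
    """Return the stored display name, or the raw label if none is set."""
    return speaker_map.get(speaker_id, speaker_id)


def rename_speaker(speaker_map: dict, speaker_id: str, new_name: str) -> dict:
    """Return a copy of the map with speaker_id renamed."""
    updated = dict(speaker_map)
    name = new_name.strip()
    if name:
        updated[speaker_id] = name
    else:
        updated.pop(speaker_id, None)  # assumption: empty name reverts to raw label
    return updated


speaker_map = json.loads('{"SPEAKER_00": "John Smith", "SPEAKER_01": "Sarah Chen"}')
speaker_map = rename_speaker(speaker_map, "SPEAKER_02", "Manager #3")
```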

LLM Integration

Speaker-Aware Prompting

When diarization is active, the LLM receives additional context:

Detected Speakers:
- SPEAKER_00 (John Smith)
- SPEAKER_01 (Sarah Chen)

[John Smith]: Can you tell me about your experience?
[You]: I've worked with React for 5 years...
[Sarah Chen]: What about Node.js?
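
A context block in this shape could be assembled roughly as follows. This is a sketch of the formatting only: the function name, the (speaker_id, text) entry shape, and the use of "You" as the id for the user's own lines are assumptions, and the actual prompt Intervu sends may contain more than this.

```python
def build_speaker_context(entries, speaker_map):
    """Format detected speakers and transcript lines for the LLM prompt.

    entries: list of (speaker_id, text); the user's own lines use the id "You".
    """
    # Unique diarized speakers in first-seen order, excluding the user.
    seen = list(dict.fromkeys(sid for sid, _ in entries if sid != "You"))
    lines = ["Detected Speakers:"]
    for sid in seen:
        name = speaker_map.get(sid)
        lines.append(f"- {sid} ({name})" if name else f"- {sid}")
    lines.append("")
    for sid, text in entries:
        # Named speakers appear under their display name, others under the raw id.
        lines.append(f"[{speaker_map.get(sid, sid)}]: {text}")
    return "\n".join(lines)
```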

Benefits

Better question attribution — Know who asked what
More relevant answers — Context from all speakers
Improved flow — Track conversation across multiple people

Audio Chunk Duration

When diarization is enabled, audio chunk duration is automatically adjusted:

Minimum: 6 seconds (enforced for better speaker separation)
UI lock: The chunk duration input is disabled with an explanatory note

This trade-off provides better diarization accuracy at the cost of slightly longer initial response time.
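
The adjustment can be pictured as a simple floor on the configured value. The sketch assumes that a configured duration above 6 seconds is kept as-is. The text only states the 6-second minimum, so Intervu may instead pin the value at exactly 6 seconds while the input is locked.

```python
DIARIZATION_MIN_CHUNK_SECONDS = 6.0  # documented minimum when diarization is on


def effective_chunk_seconds(configured: float, diarization_enabled: bool) -> float:
    """Apply the diarization floor to the user's configured chunk duration."""
    if diarization_enabled:
        return max(configured, DIARIZATION_MIN_CHUNK_SECONDS)
    return configured
```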

Clearing Speaker Data

Clear Options

The clear menu (trash icon) has three options:

Option	Effect
Clear transcript & answers	Clears the transcript and answers, along with speaker names and embeddings
Clear speaker data	Resets speaker names + embeddings only
Clear all (including ratings)	Clears everything including ratings

Cross-Session Persistence

If Remember Speakers is enabled:

Speaker profiles are saved to speaker-profiles.json
Restored on next listening session
Cleared when you click Clear speaker data
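
The save, restore, and clear cycle might look like the sketch below. Only the file name speaker-profiles.json comes from the text. The JSON layout (speaker id mapped to an embedding list), the file's location, and the function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_profiles(path: Path, profiles: dict) -> None:
    """Write {speaker_id: embedding} to disk as JSON."""
    path.write_text(json.dumps(profiles), encoding="utf-8")


def load_profiles(path: Path) -> dict:
    """Restore saved profiles, or start empty if nothing was saved."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def clear_profiles(path: Path) -> None:
    """Remove saved profiles, as Clear speaker data would."""
    path.unlink(missing_ok=True)
```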

Troubleshooting

Diarization Not Working

Symptoms: All speech is attributed to a single speaker

Causes:

STT backend doesn't support diarization
Backend returns error

Solutions:

Verify you're using a compatible backend (whisperx-asr-service)
Check STT endpoint URL in Settings
Disable and re-enable diarization to re-validate

Too Many Speakers Detected

Symptoms: More speakers than actual people

Solutions:

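Lower the Min Speakers setting if it exceeds the number of people actually speaking (a minimum that is too high forces extra speakers)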
Check audio quality (noise can create false speakers)
Use "Clear speaker data" to reset

Speakers Switching Incorrectly

Symptoms: Same person labeled as different speakers

Causes:

Similar voices
Short audio chunks
Cross-talk (people speaking simultaneously)

Solutions:

Ensure audio chunk duration is at least 6 seconds
Minimize cross-talk when possible
Manually rename consistent misidentifications

Backend Error on Enable

Symptoms: Toggle switches off immediately

Solutions:

Check STT backend is running
Verify that the backend exposes the /asr endpoint or an OpenAI-compatible endpoint with diarization support
Check logs for specific error message

High CPU/Memory Usage

Symptoms: System slowdown during diarization

Causes:

Diarization is computationally expensive
Large number of speakers

Solutions:

Use GPU-accelerated STT backend
Reduce Min Speakers setting
Disable when not needed

Performance Tips

Optimal Chunk Duration

Keep audio chunks at 6+ seconds for best results. The UI enforces this automatically.

Backend Selection

Backend	Diarization	Embedding Support	Notes
`learnedmachine/whisperx-asr-service`	✅	✅	Recommended
OpenAI-compatible	Varies	Varies	Check provider

Speaker Count

Set Min Speakers to the actual expected number. Higher values increase processing time.

Limitations

Current Constraints

Limitation	Details
Only system audio diarized	Microphone (your voice) is always "you"
Cross-talk	Simultaneous speech may cause misattribution
Short utterances	Brief phrases (< 2s) may not diarize well
Voice similarity	Very similar voices may be confused
Backend dependent	Requires compatible STT service

Known Issues

500 errors: Some backends fail on NaN embeddings. Intervu automatically retries without embeddings.
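Session reset: Speaker voice profiles (embeddings) are lost on app restart unless "Remember Speakers" is enabled. Display names in the speaker map are stored in your settings and persist either way.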
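
The retry behavior for 500 errors can be sketched with an injected send function, which keeps the example independent of any particular HTTP client. How Intervu actually asks the backend for embeddings is not documented here, so the boolean flag, the function names, and the (status, body) return shape are all assumptions.

```python
from typing import Callable, Tuple


def transcribe_with_fallback(
    send: Callable[[bytes, bool], Tuple[int, dict]],
    audio: bytes,
) -> Tuple[dict, bool]:
    """Request embeddings first; on HTTP 500, retry once without them.

    send(audio, with_embeddings) must return (status_code, parsed_body).
    Returns (body, got_embeddings).
    """
    status, body = send(audio, True)
    got_embeddings = True
    if status == 500:
        # Some backends fail on NaN embeddings, so retry without them.
        status, body = send(audio, False)
        got_embeddings = False
    if status != 200:
        raise RuntimeError(f"STT backend returned HTTP {status}")
    return body, got_embeddings
```

When got_embeddings is False, the chunk still carries per-speaker segments, but there is no voice fingerprint to feed into cross-chunk matching for that chunk.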

Comparison: Diarization On vs Off

Feature	Diarization Off	Diarization On
Speakers	Interviewer / You	Individual speakers + You
Labels	`[Interviewer]`, `[You]`	`[SPEAKER_00]`, `[You]`
Speaker monitor	Hidden	Visible
AI naming	N/A	Available
Chunk duration	3s default	6s minimum
CPU usage	Lower	Higher
Backend	Any OpenAI-compatible	Requires diarization support

Next Steps

Interview Timer — Track interview duration
Advanced Mode — Dual-LLM question extraction
STT Endpoint Configuration — Backend setup
