Speaker Diarization

Multi-speaker identification for panel interviews and multi-participant conversations.

Overview

Speaker diarization automatically identifies and labels the different speakers in your interview audio. Instead of just "interviewer" and "you", Intervu distinguishes individual speakers, which can then be labeled "John", "Sarah", or "Manager #1".

When to Use

Speaker diarization is useful for:

  • Panel interviews — Multiple interviewers asking questions
  • Group interviews — Several candidates being interviewed together
  • Team meetings — Multiple participants in the conversation
  • Complex Q&A — Following which interviewer asked what

Enabling Speaker Diarization

1. Prerequisites

You need an STT backend that supports diarization:

Recommended: learnedmachine/whisperx-asr-service

yaml
# docker/whisperx-asr/compose.yaml
services:
  whisperx-asr:
    image: learnedmachine/whisperx-asr-service
    container_name: whisperx-asr
    ports:
      - "${PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-large-v3}
      - ASR_ENGINE=${ASR_ENGINE:-faster_whisper}
      - HF_TOKEN=${HF_TOKEN:?HF_TOKEN is required for diarization}
    volumes:
      - hf_cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --connect-timeout 5 http://localhost:9000/docs || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s
    restart: unless-stopped

volumes:
  hf_cache:

Requirements for this compose file:

  • NVIDIA Container Toolkit installed (nvidia-container-toolkit)
  • Hugging Face token with access to diarization models (see below)

Hugging Face Token Setup:

Your HF_TOKEN must have access to these three Hugging Face models:

  1. pyannote/speaker-diarization-3.1 — Accept the license agreement
  2. pyannote/segmentation-3.0 — Accept the license agreement
  3. pyannote/speaker-diarization-community-1 — Accept the license agreement

Visit each model page above while logged in to the account that owns the token, then click "Access repository" or "Accept license" to grant the token access.

Setup:

Create a .env file in the same directory as your compose.yaml:

bash
# .env
HF_TOKEN=your_huggingface_token_here
PORT=9000
ASR_MODEL=large-v3
ASR_ENGINE=faster_whisper

Then deploy:

bash
docker compose pull
docker compose up -d

Endpoint: http://localhost:9000/asr
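
To check the service by hand before pointing Intervu at it, a request URL can be assembled as in the sketch below. Only `diarize=true` and the `/asr` path come from this page; the `output` and `min_speakers` parameter names and the `build_asr_url` helper are assumptions, so confirm the exact names on the service's /docs page.

```python
from urllib.parse import urlencode


def build_asr_url(base_url: str = "http://localhost:9000", min_speakers: int = 2) -> str:
    """Return the /asr URL with diarization enabled.

    Parameter names other than `diarize` are assumptions, not confirmed API.
    """
    query = urlencode({
        "output": "json",          # assumed parameter name
        "diarize": "true",         # named on this page
        "min_speakers": min_speakers,  # assumed parameter name
    })
    return f"{base_url.rstrip('/')}/asr?{query}"


print(build_asr_url())
```

The audio itself would then be sent to this URL as an HTTP POST; the form field name the service expects is not stated on this page.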

Alternative: OpenAI-compatible endpoint with diarization support

2. Configuration

  1. Open Settings (gear icon)
  2. Go to Advanced Settings
  3. Toggle Speaker Diarization

The system validates that your STT backend supports diarization before enabling the feature.

3. Adjust Settings

| Setting | Description | Default |
|---|---|---|
| Speaker Diarization | Enable/disable | Off |
| Min Speakers | Minimum speakers to detect | 2 |
| Remember Speakers | Save voice profiles across sessions | Off |

Min Speakers (2-6): The minimum number of speakers to look for. Set higher for larger panels.

Remember Speakers: When enabled, speaker voice profiles (embeddings) are saved to disk and restored on the next session. This enables cross-session speaker recognition.


How It Works

Speaker Detection

When enabled, Intervu:

  1. Sends system audio to the STT backend with diarize=true
  2. Receives per-speaker segments with embeddings
  3. Matches speakers across audio chunks using cosine similarity
  4. Assigns labels like SPEAKER_00, SPEAKER_01, etc.

Cross-Chunk Consistency

The system maintains speaker identity across audio chunks using:

  • 256-dim speaker embeddings — Voice fingerprints
  • Cosine similarity matching — Compare embeddings between chunks
  • Weighted moving average (30% new / 70% existing) — Update speaker profiles
  • Ambiguity check — Skip matching when top 2 candidates are within 5% similarity
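
The matching rules above can be sketched roughly as follows. This is an illustration, not Intervu's actual code: the `SpeakerRegistry` class, the 0.6 match threshold, and the reading of "within 5%" as an absolute 0.05 gap in cosine similarity are all assumptions, and 3-dimensional vectors stand in for the 256-dim embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SpeakerRegistry:
    """Hypothetical cross-chunk speaker tracker."""

    def __init__(self, match_threshold=0.6, ambiguity_margin=0.05, new_weight=0.3):
        self.match_threshold = match_threshold    # assumed value
        self.ambiguity_margin = ambiguity_margin  # assumed reading of "within 5%"
        self.new_weight = new_weight              # 30% new / 70% existing
        self.profiles = {}                        # label -> running embedding

    def assign(self, embedding):
        """Return a SPEAKER_XX label, or None when the match is ambiguous."""
        scored = sorted(
            ((cosine_similarity(embedding, p), label) for label, p in self.profiles.items()),
            reverse=True,
        )
        # No profile is close enough: register a new speaker.
        if not scored or scored[0][0] < self.match_threshold:
            label = f"SPEAKER_{len(self.profiles):02d}"
            self.profiles[label] = list(embedding)
            return label
        # Top two candidates are too close to call: skip matching.
        if len(scored) > 1 and scored[0][0] - scored[1][0] < self.ambiguity_margin:
            return None
        # Confident match: blend the new embedding into the stored profile.
        best = scored[0][1]
        old = self.profiles[best]
        self.profiles[best] = [
            (1 - self.new_weight) * o + self.new_weight * n
            for o, n in zip(old, embedding)
        ]
        return best
```

Returning `None` for the ambiguous case leaves the stored profiles untouched, so an uncertain chunk cannot drag two similar voices toward each other.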

Display in Transcript

[SPEAKER_00]: Can you tell me about your experience?
[SPEAKER_01]: I've worked with React for 5 years...
[SPEAKER_00]: That's great. What about Node.js?

Speaker Monitor

When diarization is enabled, a speaker icon appears in the title bar. Click it to open the Speaker Monitor panel.

Panel Contents

| Info | Description |
|---|---|
| Speaker Name | Current display name (or SPEAKER_XX) |
| First Seen | When the speaker was first detected |
| Last Spoke | Relative time since last speech |
| Message Count | Total messages from this speaker |
| Last Messages | Preview of last 3 messages |

Actions

  • Rename: Click the speaker name to edit
  • Clear: Reset all speaker data (names + embeddings)
  • Accept/Dismiss: AI naming suggestions (see below)

AI-Assisted Speaker Naming

Intervu can automatically detect speaker names from conversational cues.

How It Works

The LLM analyzes transcript content for name mentions:

Transcript: "Hi Anna, can you tell me about your background?"
            "Sure, John. I've worked in..."

Detected: SPEAKER_00 = "John"
          SPEAKER_01 = "Anna"

Suggestions

When the LLM detects potential names:

  1. Sparkle hint (✨) appears next to unnamed speakers
  2. Click the speaker label to see the suggestion
  3. Accept to apply the name, or Dismiss to ignore

Trigger Conditions

  • Minimum 5 non-'you' transcript entries
  • 8-second debounce after last transcript update
  • Only suggests for unnamed speakers (SPEAKER_XX)
  • Dismissed suggestions are remembered per session
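
Read together, the trigger conditions amount to a small gating check. The sketch below is one way to express it; the function name, the `(label, text)` entry shape, and the use of plain timestamps in seconds are assumptions made for illustration.

```python
MIN_ENTRIES = 5          # minimum non-'you' transcript entries
DEBOUNCE_SECONDS = 8.0   # quiet period after the last transcript update


def speakers_needing_suggestion(entries, speaker_map, dismissed, last_update, now):
    """Return the labels that are eligible for an AI naming suggestion.

    entries:      list of (label, text) tuples; label "you" is the microphone
    speaker_map:  dict of label -> display name for already-named speakers
    dismissed:    set of labels whose suggestion was dismissed this session
    """
    others = [label for label, _ in entries if label != "you"]
    if len(others) < MIN_ENTRIES:
        return []
    if now - last_update < DEBOUNCE_SECONDS:
        return []
    return [
        label
        for label in sorted(set(others))
        if label not in speaker_map and label not in dismissed
    ]
```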

Speaker Naming

Manual Renaming

Click any speaker label in the transcript to rename:

  1. Click the label (e.g., [SPEAKER_00])
  2. Type a display name (e.g., "John Smith")
  3. Press Enter to confirm

Names persist in your settings across sessions.

Speaker Map

The speaker map stores display names:

json
{
  "SPEAKER_00": "John Smith",
  "SPEAKER_01": "Sarah Chen",
  "SPEAKER_02": "Manager #3"
}

This is saved in your settings and restored on app restart.


LLM Integration

Speaker-Aware Prompting

When diarization is active, the LLM receives additional context:

Detected Speakers:
- SPEAKER_00 (John Smith)
- SPEAKER_01 (Sarah Chen)

[John Smith]: Can you tell me about your experience?
[You]: I've worked with React for 5 years...
[Sarah Chen]: What about Node.js?
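
A context block in this shape could be assembled from the transcript and the speaker map as sketched below. The `build_speaker_context` helper and the `(label, text)` entry format are hypothetical; only the output layout is taken from the example above.

```python
def build_speaker_context(entries, speaker_map):
    """Render a speaker list plus transcript for the LLM prompt.

    entries:     list of (label, text) in spoken order; "you" is the microphone
    speaker_map: dict of SPEAKER_XX label -> display name
    """
    def shown_name(label):
        if label == "you":
            return "You"
        # Fall back to the raw label when no display name is set.
        return speaker_map.get(label, label)

    labels = sorted({label for label, _ in entries if label != "you"})
    lines = ["Detected Speakers:"]
    for label in labels:
        name = speaker_map.get(label)
        lines.append(f"- {label} ({name})" if name else f"- {label}")
    lines.append("")
    lines.extend(f"[{shown_name(label)}]: {text}" for label, text in entries)
    return "\n".join(lines)
```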

Benefits

  • Better question attribution — Know who asked what
  • More relevant answers — Context from all speakers
  • Improved flow — Track conversation across multiple people

Audio Chunk Duration

When diarization is enabled, audio chunk duration is automatically adjusted:

  • Minimum: 6 seconds (enforced for better speaker separation)
  • UI lock: The chunk duration input is disabled with an explanatory note

This trade-off provides better diarization accuracy at the cost of slightly longer initial response time.
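
The adjustment is effectively a floor on the configured value. A minimal sketch, with a hypothetical function name:

```python
MIN_DIARIZATION_CHUNK_SECONDS = 6.0


def effective_chunk_seconds(requested: float, diarization_enabled: bool) -> float:
    """Apply the 6-second minimum only while diarization is enabled."""
    if diarization_enabled:
        return max(requested, MIN_DIARIZATION_CHUNK_SECONDS)
    return requested
```

With the 3-second default, enabling diarization raises the chunk to 6 seconds, while a longer configured value is left unchanged.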


Clearing Speaker Data

Clear Options

The clear menu (trash icon) has three options:

| Option | Effect |
|---|---|
| Clear transcript & answers | Also clears speaker names and embeddings |
| Clear speaker data | Resets speaker names + embeddings only |
| Clear all (including ratings) | Clears everything including ratings |

Cross-Session Persistence

If Remember Speakers is enabled:

  • Speaker profiles are saved to speaker-profiles.json
  • Restored on next listening session
  • Cleared when you click Clear speaker data

Troubleshooting

Diarization Not Working

Symptoms: All speech shows as a single speaker

Causes:

  • STT backend doesn't support diarization
  • Backend returns error

Solutions:

  • Verify you're using a compatible backend (whisperx-asr-service)
  • Check STT endpoint URL in Settings
  • Disable and re-enable diarization to re-validate

Too Many Speakers Detected

Symptoms: More speakers than actual people

Solutions:

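  • Lower the Min Speakers setting if it exceeds the real number of participants (a minimum above the true count forces extra speakers)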
  • Check audio quality (noise can create false speakers)
  • Use "Clear speaker data" to reset

Speakers Switching Incorrectly

Symptoms: Same person labeled as different speakers

Causes:

  • Similar voices
  • Short audio chunks
  • Cross-talk (people speaking simultaneously)

Solutions:

  • Ensure audio chunk duration is at least 6 seconds
  • Minimize cross-talk when possible
  • Manually rename consistent misidentifications

Backend Error on Enable

Symptoms: Toggle switches off immediately

Solutions:

  • Check STT backend is running
  • Verify backend supports /asr endpoint or OpenAI-compatible diarization
  • Check logs for specific error message

High CPU/Memory Usage

Symptoms: System slowdown during diarization

Causes:

  • Diarization is computationally expensive
  • Large number of speakers

Solutions:

  • Use GPU-accelerated STT backend
  • Reduce Min Speakers setting
  • Disable when not needed

Performance Tips

Optimal Chunk Duration

Keep audio chunks at 6+ seconds for best results. The UI enforces this automatically.

Backend Selection

| Backend | Diarization | Embedding Support | Notes |
|---|---|---|---|
| learnedmachine/whisperx-asr-service | Yes | Yes | Recommended |
| OpenAI-compatible | Varies | Varies | Check provider |

Speaker Count

Set Min Speakers to the actual expected number. Higher values increase processing time.


Limitations

Current Constraints

| Limitation | Details |
|---|---|
| Only system audio diarized | Microphone (your voice) is always "you" |
| Cross-talk | Simultaneous speech may cause misattribution |
| Short utterances | Brief phrases (< 2s) may not diarize well |
| Voice similarity | Very similar voices may be confused |
| Backend dependent | Requires compatible STT service |

Known Issues

  • 500 errors: Some backends fail on NaN embeddings. Intervu automatically retries without embeddings.
  • Session reset: Speaker voice profiles (embeddings) are lost on app restart unless "Remember Speakers" is enabled.
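
The retry behaviour for 500 errors can be pictured as follows. This is a sketch under stated assumptions: `send` is a hypothetical transport callable returning `(status_code, body)`, and the way Intervu actually asks the backend to omit embeddings is not documented here.

```python
def transcribe_with_fallback(send, audio: bytes):
    """Request embeddings first; on HTTP 500, retry once without them.

    send(audio, with_embeddings) -> (status_code, body) is a hypothetical callable.
    """
    status, body = send(audio, True)
    if status == 500:
        # Some backends fail on NaN embeddings, so try again without them.
        status, body = send(audio, False)
    if status != 200:
        raise RuntimeError(f"STT backend returned HTTP {status}")
    return body
```

A chunk transcribed without embeddings has nothing to compare against stored profiles, so it cannot take part in cross-chunk speaker matching.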

Comparison: Diarization On vs Off

| Feature | Diarization Off | Diarization On |
|---|---|---|
| Speakers | Interviewer / You | Individual speakers + You |
| Labels | [Interviewer], [You] | [SPEAKER_00], [You] |
| Speaker monitor | Hidden | Visible |
| AI naming | N/A | Available |
| Chunk duration | 3s default | 6s minimum |
| CPU usage | Lower | Higher |
| Backend | Any OpenAI-compatible | Requires diarization support |

Made with ❤️ by Aldrick Bonaobra