LLM Endpoint

Configure the Language Model endpoint for answer generation.

Overview

Intervu uses an OpenAI-compatible LLM to generate suggested answers based on the interview transcript and your resume.


Configuration

Open Settings (gear icon) and locate the LLM section:

| Setting | Description | Example |
|---|---|---|
| LLM Endpoint | URL for the LLM API | http://localhost:11434/v1/chat/completions |
| LLM Model | Model identifier | llama3.2:3b |
| LLM API Key | API key (if required) | Leave empty for Ollama |

Default Values

LLM Endpoint: http://localhost:11434/v1/chat/completions
LLM Model: llama3.2:3b
LLM API Key: (empty)
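
As a rough illustration of how these three settings might map onto an HTTP request, the sketch below builds the headers for an OpenAI-compatible call. `LLMSettings` and `build_headers` are hypothetical names, not part of Intervu. The Bearer-token header follows the usual OpenAI-compatible convention and is omitted when the key is empty, as with a local Ollama server.

```python
from dataclasses import dataclass


@dataclass
class LLMSettings:
    """Hypothetical container mirroring the three settings above."""
    endpoint: str = "http://localhost:11434/v1/chat/completions"
    model: str = "llama3.2:3b"
    api_key: str = ""  # empty means no authentication, e.g. local Ollama


def build_headers(settings: LLMSettings) -> dict[str, str]:
    """Return HTTP headers for an OpenAI-compatible chat completions call."""
    headers = {"Content-Type": "application/json"}
    if settings.api_key:
        # OpenAI-compatible servers expect the key as a Bearer token.
        headers["Authorization"] = f"Bearer {settings.api_key}"
    return headers
```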

Testing the Connection

After configuring the endpoint:

  1. Click the Test LLM button in Settings
  2. Intervu sends a simple test prompt
  3. A success message containing the model's response confirms the connection is working
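
To reproduce a similar check outside the app, a minimal sketch along these lines may help. It assumes the endpoint follows the OpenAI chat completions response shape (`choices[0].message.content`). The function name `check_llm`, the prompt text, and the injectable `send` parameter are illustrative choices, not Intervu's actual test.

```python
import json
import urllib.request


def check_llm(endpoint, model, api_key="", send=None):
    """Send a one-line prompt and return the model's reply text."""
    body = {"model": model, "messages": [{"role": "user", "content": "Say OK."}]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    # `send` is injectable so the function can be exercised without a server.
    send = send or (lambda req: urllib.request.urlopen(req, timeout=30).read())
    reply = json.loads(send(request))
    return reply["choices"][0]["message"]["content"]
```

Calling `check_llm("http://localhost:11434/v1/chat/completions", "llama3.2:3b")` against a running Ollama instance should return a short reply. A connection error suggests the server is not reachable at that URL.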

Endpoint Options

Ollama

Free, local, private.

bash
# Start Ollama
ollama serve

# Pull a model
ollama pull llama3.2:3b

Configuration:

  • Endpoint: http://localhost:11434/v1/chat/completions
  • Model: llama3.2:3b
  • API Key: Leave empty

OpenWebUI

A self-hosted, OpenAI-compatible interface.

Configuration:

  • Endpoint: https://your-instance/api/chat/completions
  • Model: Your configured model name
  • API Key: Your OpenWebUI API key

OpenAI API

Cloud-based, paid.

Configuration:

  • Endpoint: https://api.openai.com/v1/chat/completions
  • Model: gpt-4o-mini (recommended) or gpt-4o
  • API Key: Your OpenAI API key

Cost

The OpenAI API charges per token. Because the interview transcript is sent to the model, longer interviews can consume a significant number of tokens.
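
For a back-of-the-envelope feel for per-token billing, the sketch below estimates the cost of a single request. The four-characters-per-token figure is only a common rule of thumb for English text, and the prices are parameters you supply from the provider's current price list. Nothing here reflects actual OpenAI pricing.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token count: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Cost of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000
```

With placeholder prices of 1.00 and 2.00 per million tokens, a request with 2,000 input tokens and 300 output tokens comes to 0.0026 under this formula.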


Model Selection

For Ollama

| Model | Parameters | Speed | Quality | Use Case |
|---|---|---|---|---|
| llama3.2:3b | 3B | ⚡ Fast | Good | Real-time interviews |
| llama3.2:latest | 3B | ⚡ Fast | Good | Real-time interviews |
| phi3:mini | 3.8B | ⚡ Fast | Good | Low-resource systems |
| mistral:7b | 7B | Medium | Better | Better answers |
| llama3.1:8b | 8B | Medium | Better | More comprehensive answers |

For OpenAI

| Model | Speed | Cost | Quality |
|---|---|---|---|
| gpt-4o-mini | Fast | Low | Good |
| gpt-4o | Medium | Medium | Excellent |
| gpt-4-turbo | Medium | High | Excellent |

Advanced LLM Settings

In Settings → Advanced:

Max Tokens

Maximum number of tokens in the response.

  • Default: 0 (unlimited)
  • Lower values: Shorter, more concise answers
  • Higher values: Longer, more detailed answers
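
One plausible way a default of 0 could translate into a request is sketched below. It assumes "unlimited" means the `max_tokens` field is simply left out of the body, which is an assumption about Intervu's behavior. `build_body` is a hypothetical helper.

```python
def build_body(model, messages, max_tokens=0):
    """Build a chat completions request body, omitting max_tokens when it is 0."""
    body = {"model": model, "messages": messages}
    if max_tokens > 0:
        # Only cap the response length when the user set an explicit limit.
        body["max_tokens"] = max_tokens
    return body
```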

Thinking Mode

For models that support extended reasoning (e.g., Claude models).

  • Default: Off
  • When enabled: Model may use more tokens for reasoning

How Answers Are Generated

When the interviewer speaks:

  1. Audio is transcribed by the speech-to-text (STT) engine
  2. The transcript, your resume, and the system prompt are sent to the LLM
  3. The LLM generates a suggested answer
  4. The answer streams to the UI in real time
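
The streaming step can be sketched as follows, assuming the endpoint uses the OpenAI-style server-sent events format, where each `data:` line carries a JSON chunk with `choices[0].delta.content` and the stream ends with `data: [DONE]`. `iter_stream_text` is a hypothetical helper and does not describe Intervu's internal code.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

Joining the yielded fragments in order reconstructs the full answer. A UI would instead append each fragment as it arrives.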

Message Format

json
{
  "messages": [
    {"role": "system", "content": "System prompt + resume"},
    {"role": "user", "content": "Interviewer's question"},
    {"role": "assistant", "content": "Your previous answer (if any)"}
  ]
}
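
A minimal sketch of assembling such a list is shown below. It assumes earlier question and answer pairs are replayed before the current question, and that the resume is appended to the system prompt under a plain "Resume:" label. Both are illustrative guesses, and `build_messages` is a hypothetical helper.

```python
def build_messages(system_prompt, resume, question, history=()):
    """Assemble an OpenAI-style messages list for one interviewer question."""
    messages = [
        {"role": "system", "content": f"{system_prompt}\n\nResume:\n{resume}"}
    ]
    # Replay earlier turns so the model can stay consistent with past answers.
    for past_question, past_answer in history:
        messages.append({"role": "user", "content": past_question})
        messages.append({"role": "assistant", "content": past_answer})
    messages.append({"role": "user", "content": question})
    return messages
```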

Troubleshooting

Connection Failed

bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start if not running
ollama serve
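
If the server responds but requests still fail, the configured model may not have been pulled. The sketch below checks for it in the `/api/tags` response, which lists installed models under a `models` array with a `name` field. The helper names are hypothetical, and the comparison is an exact string match on the full tag.

```python
import json
import urllib.request


def model_is_available(model, tags_json):
    """Return True if `model` appears in an Ollama /api/tags response."""
    entries = json.loads(tags_json).get("models", [])
    return model in [entry.get("name", "") for entry in entries]


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the raw /api/tags response; raises URLError if Ollama is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return response.read()
```

For example, `model_is_available("llama3.2:3b", fetch_tags())` returning False suggests running `ollama pull llama3.2:3b`.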

Out of Memory

  • Close other applications
  • Use a smaller model (llama3.2:3b or phi3:mini)
  • Reduce the number of model layers offloaded to the GPU, for example by lowering Ollama's num_gpu option (PARAMETER num_gpu 10 in a Modelfile)

Slow Responses

  • Use a smaller model
  • Ensure GPU acceleration is enabled
  • Reduce max tokens
  • Check network latency (for remote endpoints)

Poor Answer Quality

  • Provide a detailed resume in Settings
  • Customize the system prompt
  • Try a larger, more capable model
  • Rate answers (thumbs up/down) to improve future suggestions

Made with ❤️ by Aldrick Bonaobra