# LLM Endpoint
Configure the Language Model endpoint for answer generation.
## Overview
Intervu uses an OpenAI-compatible LLM to generate suggested answers based on the interview transcript and your resume.
## Configuration
Open Settings (gear icon) and locate the LLM section:
| Setting | Description | Example |
|---|---|---|
| LLM Endpoint | URL for the LLM API | `http://localhost:11434/v1/chat/completions` |
| LLM Model | Model identifier | `llama3.2:3b` |
| LLM API Key | API key (if required) | Leave empty for Ollama |
### Default Values

- LLM Endpoint: `http://localhost:11434/v1/chat/completions`
- LLM Model: `llama3.2:3b`
- LLM API Key: (empty)

## Testing the Connection
After configuring the endpoint:
1. Click the **Test LLM** button in Settings
2. Intervu sends a simple test prompt
3. A success message with the response confirms it's working
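The same check can be reproduced by hand with `curl`. This is a sketch, not Intervu's exact request: the prompt text is an assumption, and the model/endpoint values are the defaults from above.

```shell
# Manual equivalent of the "Test LLM" button (sketch; the exact prompt
# Intervu sends is an assumption). Sending it requires Ollama to be running.
ENDPOINT="http://localhost:11434/v1/chat/completions"
MODEL="llama3.2:3b"

# Build a minimal chat-completion request body.
PAYLOAD=$(printf '{"model":"%s","messages":[{"role":"user","content":"Reply with OK"}]}' "$MODEL")

# Sanity-check that the body is valid JSON before sending.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload: valid JSON"

# Send the request (uncomment once the endpoint is up):
# curl -s "$ENDPOINT" -H "Content-Type: application/json" -d "$PAYLOAD"
```

If the manual request succeeds but the in-app test fails, the problem is likely the endpoint URL or API key entered in Settings rather than the server itself.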
## Endpoint Options
### Ollama (Recommended)
Free, local, private.
```bash
# Start Ollama
ollama serve

# Pull a model
ollama pull llama3.2:3b
```

Configuration:

- Endpoint: `http://localhost:11434/v1/chat/completions`
- Model: `llama3.2:3b`
- API Key: Leave empty
### OpenWebUI
Your self-hosted OpenAI-compatible interface.
Configuration:

- Endpoint: `https://your-instance/api/chat/completions`
- Model: Your configured model name
- API Key: Your OpenWebUI API key
### OpenAI API
Cloud-based, paid.
Configuration:

- Endpoint: `https://api.openai.com/v1/chat/completions`
- Model: `gpt-4o-mini` (recommended) or `gpt-4o`
- API Key: Your OpenAI API key
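Authenticated endpoints like OpenAI and OpenWebUI expect the key as a standard Bearer token. A minimal sketch, using a placeholder key:

```shell
# Sketch of an authenticated chat-completion request (OpenAI shown; the same
# Bearer-token header shape applies to OpenWebUI).
ENDPOINT="https://api.openai.com/v1/chat/completions"
API_KEY="sk-your-key-here"   # placeholder, not a real key

# The Authorization header OpenAI-compatible endpoints expect:
AUTH_HEADER="Authorization: Bearer $API_KEY"
echo "$AUTH_HEADER"

# Send the request (uncomment with a real key):
# curl -s "$ENDPOINT" \
#   -H "Content-Type: application/json" \
#   -H "$AUTH_HEADER" \
#   -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}'
```

This is the same header Intervu builds from the LLM API Key field, which is why the field can stay empty for a local Ollama endpoint that performs no authentication.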
**Cost:** The OpenAI API charges per token, and real-time transcription can generate a significant number of tokens over a longer interview.
## Model Selection
### For Ollama
| Model | Parameters | Speed | Quality | Use Case |
|---|---|---|---|---|
| `llama3.2:3b` | 3B | ⚡ Fast | Good | Real-time interviews |
| `llama3.2:latest` | 3B | ⚡ Fast | Good | Real-time interviews |
| `phi3:mini` | 3.8B | ⚡ Fast | Good | Low-resource systems |
| `mistral:7b` | 7B | Medium | Better | Better answers |
| `llama3.1:8b` | 8B | Medium | Better | More comprehensive answers |
### For OpenAI
| Model | Speed | Cost | Quality |
|---|---|---|---|
| `gpt-4o-mini` | Fast | Low | Good |
| `gpt-4o` | Medium | Medium | Excellent |
| `gpt-4-turbo` | Medium | High | Excellent |
## Advanced LLM Settings
In Settings → Advanced:
### Max Tokens
Maximum number of tokens in the response.
- Default: `0` (unlimited)
- Lower values: shorter, more concise answers
- Higher values: longer, more detailed answers
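A sketch of how a Max Tokens setting typically maps onto an OpenAI-compatible request body: a non-zero value becomes the standard `max_tokens` field, while `0` presumably means the field is simply omitted (an assumption; the field name is standard, but Intervu's exact handling of `0` is not documented here).

```shell
# Map a Max Tokens setting onto the request body (sketch; the treatment of
# 0 as "omit the field" is an assumption).
MAX_TOKENS=256

if [ "$MAX_TOKENS" -gt 0 ]; then
  PAYLOAD=$(printf '{"model":"llama3.2:3b","max_tokens":%d,"messages":[{"role":"user","content":"Hi"}]}' "$MAX_TOKENS")
else
  PAYLOAD='{"model":"llama3.2:3b","messages":[{"role":"user","content":"Hi"}]}'
fi

echo "$PAYLOAD"
```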
### Thinking Mode
For models that support extended reasoning (e.g., Claude models).
- Default: Off
- When enabled: Model may use more tokens for reasoning
## How Answers Are Generated
When the interviewer speaks:
1. Audio is transcribed by STT
2. The transcript, your resume, and the system prompt are sent to the LLM
3. The LLM generates a suggested answer
4. The answer streams to the UI in real time
### Message Format
```json
{
  "messages": [
    {"role": "system", "content": "System prompt + resume"},
    {"role": "user", "content": "Interviewer's question"},
    {"role": "assistant", "content": "Your previous answer (if any)"}
  ]
}
```

## Troubleshooting
### Connection Failed
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start if not running
ollama serve
```

### Out of Memory
- Close other applications
- Use a smaller model (`llama3.2:3b` → `phi3:mini`)
- Reduce GPU layers: `OLLAMA_GPU_LAYERS=10 ollama serve`
### Slow Responses
- Use a smaller model
- Ensure GPU acceleration is enabled
- Reduce max tokens
- Check network latency (for remote endpoints)
### Poor Answer Quality
- Provide a detailed resume in Settings
- Customize the system prompt
- Try a larger, more capable model
- Rate answers (thumbs up/down) to improve future suggestions
## Next Steps
- Resume & Prompt — Configure your background
- Advanced Mode — Dual-LLM question extraction