LLM Endpoint

Configure the Language Model endpoint for answer generation.

Overview

Intervu uses an OpenAI-compatible LLM to generate suggested answers based on the interview transcript, your resume, Q&A Bank, and Company/Role Context.

Configuration

Open Settings (gear icon) and locate the LLM section:

| Setting | Description | Example |
| --- | --- | --- |
| LLM Endpoint | URL for the LLM API | `http://localhost:11434/v1/chat/completions` |
| LLM Model | Model identifier | `llama3.2:3b` |
| LLM API Key | API key (if required) | Leave empty for Ollama |

Default Values

LLM Endpoint: http://localhost:11434/v1/chat/completions
LLM Model: llama3.2:3b
LLM API Key: (empty)

Testing the Connection

After configuring the endpoint:

1. Click the Test LLM button in Settings.
2. Intervu sends a simple test prompt to the configured endpoint.
3. A success message showing the model's response confirms the connection works.
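
A connection test of this kind amounts to one small chat-completion round trip. The sketch below is not Intervu's actual implementation: the helper names and the test prompt are hypothetical, and it assumes only the standard OpenAI-compatible request and response shape.

```python
import json


def build_test_payload(model: str) -> dict:
    """Build a minimal, non-streaming chat-completion request body."""
    return {
        "model": model,
        # Hypothetical test prompt; the prompt Intervu actually sends may differ.
        "messages": [{"role": "user", "content": "Reply with the single word OK."}],
        "stream": False,
    }


def extract_reply(response_body: str) -> str:
    """Return the assistant text from an OpenAI-compatible response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Hand-written sample in the OpenAI-compatible shape, not captured output.
sample_response = json.dumps(
    {"choices": [{"index": 0, "message": {"role": "assistant", "content": "OK"}}]}
)
payload = build_test_payload("llama3.2:3b")
reply = extract_reply(sample_response)
```

If the response cannot be parsed this way, the endpoint is probably not OpenAI-compatible or the URL points at the wrong path.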

Endpoint Options

Ollama (Recommended)

Free, local, private.

```bash
# Start Ollama
ollama serve

# Pull a model
ollama pull llama3.2:3b
```

Configuration:

Endpoint: http://localhost:11434/v1/chat/completions
Model: llama3.2:3b
API Key: Leave empty

OpenWebUI

Your self-hosted OpenAI-compatible interface.

Configuration:

Endpoint: https://your-instance/api/chat/completions
Model: Your configured model name
API Key: Your OpenWebUI API key

OpenAI API

Cloud-based, paid.

Configuration:

Endpoint: https://api.openai.com/v1/chat/completions
Model: gpt-4o-mini (recommended) or gpt-4o
API Key: Your OpenAI API key
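
The options above differ only in endpoint URL, model name, and whether an API key is sent. As a rough sketch, assuming the key travels as a standard `Authorization: Bearer` header (the form the OpenAI API expects), the difference can be expressed like this. The `PRESETS` dictionary and `build_headers` helper are hypothetical names, not part of Intervu.

```python
# Endpoint/model pairs taken from the configurations listed in this section.
PRESETS = {
    "ollama": {
        "endpoint": "http://localhost:11434/v1/chat/completions",
        "model": "llama3.2:3b",
    },
    "openai": {
        "endpoint": "https://api.openai.com/v1/chat/completions",
        "model": "gpt-4o-mini",
    },
}


def build_headers(api_key: str = "") -> dict:
    """Return request headers, adding Authorization only when a key is set."""
    headers = {"Content-Type": "application/json"}
    if api_key.strip():
        headers["Authorization"] = f"Bearer {api_key.strip()}"
    return headers


ollama_headers = build_headers("")            # local Ollama needs no key
openai_headers = build_headers("sk-example")  # placeholder, not a real key
```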

Cost

The OpenAI API charges per token. Because the transcript and your context are sent with each request, longer interviews can consume a significant number of tokens.
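
To get a feel for how usage adds up, here is a back-of-the-envelope estimate. All token counts and per-million-token prices below are placeholder inputs, not OpenAI's actual rates, and the sketch assumes the full context (base prompt plus a growing transcript) is resent with every question.

```python
def estimate_cost(
    base_prompt_tokens: int,
    tokens_added_per_turn: int,
    completion_tokens_per_turn: int,
    turns: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate total cost when the prompt grows by a fixed amount each turn."""
    total_input = 0
    for turn in range(turns):
        # Each request resends the base prompt plus everything added so far.
        total_input += base_prompt_tokens + turn * tokens_added_per_turn
    total_output = turns * completion_tokens_per_turn
    return (
        total_input * input_price_per_million
        + total_output * output_price_per_million
    ) / 1_000_000


# Placeholder numbers: 2,000-token base prompt, 300 tokens added per turn,
# 200-token answers, 20 questions, made-up prices of 1.00 / 2.00 per million.
cost = estimate_cost(2000, 300, 200, 20, 1.00, 2.00)
```

Input tokens grow faster than linearly with the number of questions in this model, which is why long sessions cost disproportionately more than short ones.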

Model Selection

For Ollama

| Model | Parameters | Speed | Quality | Use Case |
| --- | --- | --- | --- | --- |
| `llama3.2:3b` | 3B | ⚡ Fast | Good | Real-time interviews |
| `llama3.2:latest` | 3B | ⚡ Fast | Good | Real-time interviews |
| `phi3:mini` | 3.8B | ⚡ Fast | Good | Low-resource systems |
| `mistral:7b` | 7B | Medium | Better | Better answers |
| `llama3.1:8b` | 8B | Medium | Better | More comprehensive answers |

For OpenAI

| Model | Speed | Cost | Quality |
| --- | --- | --- | --- |
| `gpt-4o-mini` | Fast | Low | Good |
| `gpt-4o` | Medium | Medium | Excellent |
| `gpt-4-turbo` | Medium | High | Excellent |

Advanced LLM Settings

In Settings → Advanced:

Max Tokens

Maximum number of tokens in the response.

Default: 0 (unlimited)
Lower values: Shorter, more concise answers
Higher values: Longer, more detailed answers
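
One plausible way a setting like this maps onto the request is sketched below. It assumes that a value of 0 means the `max_tokens` field is simply left out so the server applies its own limit. That is an assumption about Intervu's behavior, and `apply_max_tokens` is a hypothetical helper.

```python
def apply_max_tokens(payload: dict, max_tokens: int) -> dict:
    """Return a copy of the payload with max_tokens set only when positive."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be 0 (unlimited) or a positive integer")
    result = dict(payload)  # shallow copy so the caller's dict is not modified
    if max_tokens > 0:
        result["max_tokens"] = max_tokens
    return result


base = {"model": "llama3.2:3b", "messages": []}
unlimited = apply_max_tokens(base, 0)   # no max_tokens key is added
capped = apply_max_tokens(base, 256)    # response capped at 256 tokens
```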

Thinking Mode

For models that support extended reasoning (e.g., Claude models).

Default: Off
When enabled: Model may use more tokens for reasoning

How Answers Are Generated

When the interviewer speaks:

1. The audio is transcribed by the speech-to-text (STT) engine.
2. The transcript, your resume, Company/Role Context, Q&A Bank, and system prompt are sent to the LLM.
3. The LLM generates a suggested answer.
4. The answer streams to the UI in real time.
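
The streaming step can be illustrated with a small parser. This is a sketch that assumes the endpoint streams OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). Whether Intervu consumes the stream exactly this way is not stated here, and `iter_stream_text` is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample stream, not captured output.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "I led "}}]}',
    'data: {"choices": [{"delta": {"content": "the migration."}}]}',
    "data: [DONE]",
]
answer = "".join(iter_stream_text(sample_lines))
```

Yielding fragments as they arrive is what lets a UI show the answer while the model is still generating it.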

Message Format

```json
{
  "messages": [
    {"role": "system", "content": "System prompt + resume + company context + Q&A bank"},
    {"role": "user", "content": "Interviewer's question"},
    {"role": "assistant", "content": "Your previous answer (if any)"}
  ]
}
```
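
As an illustration, a message list in this shape could be assembled as follows. The function name, the parameter names, and the choice to join sections with blank lines are assumptions made for the example, not Intervu's actual code.

```python
from typing import Optional


def build_messages(
    system_prompt: str,
    resume: str,
    company_context: str,
    qa_bank: str,
    question: str,
    previous_answer: Optional[str] = None,
) -> list:
    """Assemble a message list in the format described in this section."""
    # Join the non-empty context sections into one system message.
    sections = [system_prompt, resume, company_context, qa_bank]
    system_content = "\n\n".join(s.strip() for s in sections if s and s.strip())
    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": question},
    ]
    if previous_answer:
        messages.append({"role": "assistant", "content": previous_answer})
    return messages


messages = build_messages(
    "You are an interview assistant.",
    "Resume: 5 years of backend development.",
    "Role: Senior Engineer at Example Corp.",
    "",  # empty Q&A Bank is skipped
    "Tell me about a recent project.",
)
```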

Troubleshooting

Connection Failed

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start if not running
ollama serve
```
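
A reachable server can still reject requests if the configured model has not been pulled. The sketch below reads the JSON body returned by `/api/tags` and checks for a model name. `installed_models` is a hypothetical helper, and the sample body is a trimmed, hand-written example rather than real output.

```python
import json


def installed_models(tags_body: str) -> list:
    """Return model names from the JSON body of Ollama's /api/tags response."""
    data = json.loads(tags_body)
    return [m["name"] for m in data.get("models", [])]


# Trimmed sample of an /api/tags response body (real entries carry more fields).
sample_body = json.dumps(
    {"models": [{"name": "llama3.2:3b"}, {"name": "phi3:mini"}]}
)
names = installed_models(sample_body)
model_is_pulled = "llama3.2:3b" in names
```

If the configured model is missing from the list, pull it with `ollama pull` before testing again.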

Out of Memory

Close other applications
Use a smaller model (for example, llama3.1:8b → llama3.2:3b)
Reduce the number of layers offloaded to the GPU (in Ollama this is the `num_gpu` model parameter, for example `PARAMETER num_gpu 10` in a Modelfile)

Slow Responses

Use a smaller model
Ensure GPU acceleration
Reduce max tokens
Check network latency (for remote endpoints)

Poor Answer Quality

Provide a detailed resume in Settings
Fill in Q&A Bank with pre-prepared answers and preferences
Add Company / Role Context with the job description and company info
Customize the system prompt
Try a larger, more capable model
Rate answers (thumbs up/down) to improve future suggestions

Next Steps

Resume, Context & Prompts — Configure your background and context
Advanced Mode — Dual-LLM question extraction
