# LLM Connection Errors

Problems connecting to the LLM endpoint or generating answers.
## Symptoms

- "Failed to start generation" error
- No answers being generated
- Slow answer generation
- Connection timeout errors
- "LLM endpoint not configured" message
## Connection Errors

### LLM Not Running

**Error:**
```
Failed to start generation: Connection refused
```

For Ollama:
```bash
# Check if running
curl http://localhost:11434/api/tags

# Start if not running
ollama serve

# Pull model if missing
ollama pull llama3.2:3b
```

For Speaches:
```bash
# Check if running
docker ps | grep speaches

# Start if not running
docker compose up -d
```

### Wrong Endpoint
Common mistakes:

| Wrong | Correct |
|---|---|
| `http://localhost:11434` | `http://localhost:11434/v1/chat/completions` |
| `http://localhost:8000` | `http://localhost:8000/v1/audio/transcriptions` |
| `https://api.openai.com` | `https://api.openai.com/v1/chat/completions` |
**Include `/v1/chat/completions`:** the endpoint must end with `/v1/chat/completions` for OpenAI compatibility.
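As a quick sanity check, a small shell helper (hypothetical, not part of Intervu) can flag an endpoint that is missing the completions path:

```bash
# Hypothetical helper: flag an endpoint missing the completions path.
check_endpoint() {
  case "$1" in
    */v1/chat/completions) echo "ok: $1" ;;
    *) echo "fix: ${1%/}/v1/chat/completions" ;;
  esac
}

check_endpoint "http://localhost:11434"
# → fix: http://localhost:11434/v1/chat/completions
check_endpoint "http://localhost:11434/v1/chat/completions"
# → ok: http://localhost:11434/v1/chat/completions
```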
### API Key Issues

For OpenAI API:

- Ensure the API key is valid
- Check that the key has sufficient credits
- Verify the key has `chat` permissions
Error messages:

- `401 Unauthorized` → Invalid API key
- `429 Too Many Requests` → Rate limited
- `402 Payment Required` → Insufficient credits

For OpenWebUI:
- Generate API key in Settings → Account
- Copy the key exactly (no extra spaces)
- Verify the endpoint URL
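To see which of these errors applies, you can capture the HTTP status of a test request and map it to the causes above (a sketch; `explain_status` is a hypothetical helper, and the commented `curl` assumes your configured endpoint and key):

```bash
# Sketch: map an HTTP status code to the likely cause (codes from above).
explain_status() {
  case "$1" in
    200) echo "OK: endpoint and key are valid" ;;
    401) echo "Invalid API key" ;;
    402) echo "Insufficient credits" ;;
    429) echo "Rate limited" ;;
    *)   echo "Unexpected status: $1" ;;
  esac
}

# status=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer YOUR_API_KEY" \
#   https://api.openai.com/v1/chat/completions)
explain_status 401
# → Invalid API key
```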
## Testing LLM Connection

### In Intervu

- Open Settings (gear icon)
- Configure the LLM endpoint and model
- Click **Test LLM**
- You should see "Success: [response]"
### Manual Test

```bash
# Test Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'

# Test OpenAI API
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```

## Slow Generation
### Causes

| Cause | Solution |
|---|---|
| Large model | Use smaller model |
| No GPU | Enable GPU acceleration |
| Slow network | Use local LLM |
| Long context | Shorten resume |
| Max tokens too high | Reduce in Advanced Settings |
### Solutions

Use a smaller model:

```bash
# Instead of llama3.2:latest (larger),
# use llama3.2:3b or phi3:mini (smaller, faster)
ollama pull phi3:mini
```

Enable GPU for Ollama:
```bash
# Check that the GPU is detected
nvidia-smi

# Ollama auto-detects the GPU
# If not, set the environment variable
OLLAMA_GPU_LAYERS=35 ollama serve
```

Reduce context:
- Shorten your resume to key points
- Remove outdated experience
- Keep system prompt concise
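To gauge whether your resume is inflating the context, a rough word-based token estimate helps (a sketch: the ~4/3 tokens-per-word ratio is a common rule of thumb for English text, and the sample file path is illustrative):

```bash
# Rough token estimate: English text averages about 4/3 tokens per word.
estimate_tokens() {
  local words
  words=$(wc -w < "$1")
  echo $(( words * 4 / 3 ))
}

printf 'ten words of sample resume text for the token estimate\n' > /tmp/resume-sample.txt
estimate_tokens /tmp/resume-sample.txt
# → 13
```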
## Out of Memory

### Error Messages

```
CUDA out of memory
OOM (Out Of Memory)
```

### Solutions
Close other applications:
- Free up RAM
- Free up VRAM
Use a smaller model:

```bash
# 70B model → 3B model
ollama pull llama3.2:3b

# Or even smaller
ollama pull phi3:mini
```

Reduce GPU layers:
```bash
# Ollama config
OLLAMA_GPU_LAYERS=20 ollama serve
```

Use CPU-only mode:
```bash
# Slower, but uses less VRAM
OLLAMA_GPU_LAYERS=0 ollama serve
```

## Answer Quality Issues
### Generic Answers

Causes:
- Resume not detailed enough
- System prompt too vague
- Model too small
Solutions:
- Expand resume with specific achievements
- Customize system prompt
- Use larger model
- Rate answers to improve future generations
### Irrelevant Answers

Causes:
- Wrong context in transcript
- Model hallucinating
- Poor question detection
Solutions:
- Enable Advanced Mode for question extraction
- Use larger model
- Check transcript is accurate
- Rate down irrelevant answers
### Incomplete Answers

Causes:
- Max tokens too low
- Connection interrupted
- Model stopping early
Solutions:
- Increase Max Tokens in Advanced Settings
- Check network stability
- Use "regenerate" button
## Streaming Issues

### Answer Not Streaming

**Expected:** the answer appears word by word.
If answer appears all at once:
- Streaming may be disabled by model
- Network buffering
- Endpoint doesn't support streaming
For Ollama:

```bash
# Verify streaming works
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Count to ten"}],
    "stream": true
  }'
```

The output should show `data:` lines streaming one by one.
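If you want to reassemble the streamed text from those `data:` lines, a small pipeline works (a sketch; the sample input mimics OpenAI-style chunk JSON, and real responses carry more fields):

```bash
# Sketch: pull "content" fragments out of SSE "data:" lines and join them.
parse_stream() {
  sed -n 's/^data: //p' \
    | grep -v '^\[DONE\]' \
    | sed -n 's/.*"content":"\([^"]*\)".*/\1/p' \
    | tr -d '\n'
}

printf '%s\n' \
  'data: {"choices":[{"delta":{"content":"Hel"}}]}' \
  'data: {"choices":[{"delta":{"content":"lo"}}]}' \
  'data: [DONE]' | parse_stream
# → Hello
```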
### Connection Drops Mid-Stream

Causes:
- Network instability
- Server timeout
- Context too long
Solutions:
- Use local LLM
- Increase server timeout
- Shorten resume
## Advanced Mode Issues

### Extractor LLM Fails

**Error:**
```
Extraction failed, falling back to simple mode
```

Causes:
- Extractor endpoint incorrect
- Extractor model not available
- Insufficient context
Solutions:
- Verify extractor endpoint
- Check extractor model is pulled/downloaded
- Use same endpoint for extraction and answering
### No Questions Extracted

Causes:
- Model not following format
- Interviewer hasn't asked a question
- Extractor model too small
Solutions:
- Use larger model for extraction
- Wait for complete question
- Customize extractor system prompt
## Testing Complete Flow

### End-to-End Test

```bash
# 1. Verify STT
curl http://localhost:8000/v1/models

# 2. Verify LLM
curl http://localhost:11434/api/tags

# 3. Test in Intervu Settings
#    - Click "Test STT"
#    - Click "Test LLM"
#    - Both should succeed

# 4. Start capture
#    - Select audio devices
#    - Click the microphone button
#    - Speak into the mic
#    - Verify the transcript appears

# 5. Verify answer generation
#    - Speak as the interviewer
#    - Wait for the debounce
#    - The answer should stream in
```

## Still Not Working?
### Check Logs

- Open Settings → Open Logs Folder
- Check `app.log` for errors
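A quick way to surface problems in the log (a sketch; `recent_errors` is a hypothetical helper, and the sample file stands in for your real `app.log`):

```bash
# Sketch: show the last few error lines from a log file.
recent_errors() {
  grep -iE 'error|failed' "$1" | tail -n 5
}

printf 'INFO startup\nERROR connection refused\nINFO retry ok\n' > /tmp/app-sample.log
recent_errors /tmp/app-sample.log
# → ERROR connection refused
```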
### Reset Everything

```bash
# Stop services
docker stop speaches
ollama stop

# Restart services
docker start speaches
ollama serve

# Restart Intervu and re-configure endpoints
```

**Windows:**
```powershell
# Delete all Intervu data
Remove-Item -Recurse -Force "$env:APPDATA\intervu"

# Restart Intervu and reconfigure
```

**macOS:**
```bash
# Delete all Intervu data
rm -rf ~/Library/Application\ Support/intervu

# Restart Intervu and reconfigure
```

### Contact Support
If issues persist:

- Export logs from Settings
- Note your configuration:
    - LLM endpoint
    - Model used
    - System specs
- Report the issue
## Next Steps

- LLM Endpoint — LLM configuration
- Basic Mode — Usage guide