VRAM allocation: Qwen2.5-32B IQ4_XS + 64K context (24 GB card)
- Model weights: ~14 GB
- KV-cache (64K ctx): ~8 GB
- Total used: ~22 GB
- Free: ~2 GB
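The ~8 GB KV-cache figure can be sanity-checked from the model's published shape. A rough sketch, assuming Qwen2.5-32B's config (64 layers, 8 KV heads under GQA, head dim 128) and an 8-bit quantized KV cache at roughly 1 byte per element (the quantization assumption is ours, chosen to match the chart):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: float) -> float:
    """KV cache = 2 (K and V) x layers x KV heads x head dim x context tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

# Qwen2.5-32B: 64 layers, 8 KV heads (GQA), head dim 128; ~1 byte/elem for q8 KV
print(kv_cache_gib(64, 8, 128, 65536, 1))  # -> 8.0
```

With an unquantized fp16 cache (2 bytes per element) the same 64K context would need ~16 GB, which is why KV-cache savings are decisive on a 24 GB card.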
⚙ Recommended Architecture
Primary: Qwen2.5-32B (IQ4_XS via llama-server), 64K ctx, ~15-20 tok/s
➔ Fallback #1: Mistral Small 3.1 24B (Ollama), 32K ctx, ~30-50 tok/s
➔ Fallback #2: Claude Haiku (cloud, Anthropic API), 200K ctx, reserved for critical tasks
★ Model Comparison for RTX 3090
Best Overall: Qwen2.5-32B IQ4_XS
- VRAM (64K): ~22-23 GB
- Native context: 131K
- Speed: 15-20 tok/s
- Tool calling: reliable
- Fits 24 GB? Yes

Fastest: Mistral Small 3.1 24B
- VRAM (64K): ~20-22 GB
- Native context: 128K
- Speed: 30-50 tok/s
- Tool calling: excellent
- Fits 24 GB? Yes

Best for Code: Qwen2.5-Coder-32B Q4
- VRAM (64K): ~22 GB
- Native context: 131K
- Speed: 10-20 tok/s
- Tool calling: excellent
- Fits 24 GB? Yes

Does Not Fit: Llama 3.3 70B Q4
- VRAM (64K): >40 GB
- Native context: 128K
- Speed (CPU offload): 0.3-0.5 tok/s
- Tool calling: good
- Fits 24 GB? No
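The ">40 GB" verdict for the 70B model follows from back-of-envelope arithmetic on quantized weight size. A sketch, assuming ~4.8 bits per weight for a Q4_K-class quant (bpw figures vary slightly by quant type, and KV cache comes on top of the weights):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of quantized weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Llama 3.3 70B at ~4.8 bpw: weights alone far exceed a 24 GB card
print(round(weight_gib(70, 4.8), 1))  # -> 39.1
```

Even before adding any KV cache, the weights alone cannot fit, forcing CPU offload and the 0.3-0.5 tok/s speeds listed above.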
⚡ Generation Speed (tok/s @ 64K context): see the per-model speed figures above
⚠ Known Issues & Fixes
Ollama Streaming Breaks Tool Calling
Issues #5769 and #9632: tool_calls deltas are not sent during streaming. Fix: use llama-server instead, or set OLLAMA_DISABLE_STREAMING=true.
Ollama Ignores num_ctx
In OpenAI-compatible mode, the context window silently defaults to 4096 tokens regardless of the model's capacity. Fix: set "injectNumCtxForOpenAICompat": true in openclaw.json, or use the native Ollama API.
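A minimal openclaw.json fragment with that flag set. The key name comes from the fix above; any other keys in your existing config are unaffected (this is a sketch of the one setting, not a complete config):

```json
{
  "injectNumCtxForOpenAICompat": true
}
```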
Context Overflow
Local models degrade quickly as the context window fills. Enable auto-compaction at an 80% threshold, and use the /compact command manually when needed.
Flash Attention Required
Always launch llama-server with --flash-attn: it saves 20-40% of VRAM at large context sizes and is critical for fitting a 32B model with 64K context into 24 GB.
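The 80% auto-compaction threshold above amounts to a simple ratio check. A minimal sketch (the function name and wiring are illustrative, not openclaw's actual internals):

```python
def should_compact(used_tokens: int, ctx_size: int, threshold: float = 0.8) -> bool:
    """Trigger compaction once the context window is mostly full."""
    return used_tokens >= threshold * ctx_size

print(should_compact(53_000, 65_536))  # -> True  (~81% of a 64K window)
print(should_compact(40_000, 65_536))  # -> False (~61%)
```

Compacting before the window is completely full matters because quality drops well before a hard overflow, and compaction itself needs headroom to write the summary.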
✓ Local LLM Compatibility
Works Locally
- File operations & workspace
- Persistent memory
- Shell commands & tools
- Telegram / Slack / Discord
- Sandboxed execution
- Code generation (Qwen-Coder)
- Cron jobs & heartbeat
- Basic web browsing
Needs Cloud LLM
- Complex multi-step reasoning
- Email / calendar integration
- Prompt injection resistance
- Untrusted web content parsing
- Vision / multimodal tasks
- Financial operations
- Tasks >32K tokens context
- Reliable JSON from web search
▶ Quick Start
# 1. Download the IQ4_XS quant
huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF --include "*IQ4_XS*"

# 2. Serve it with llama.cpp (flash attention is required; see Known Issues)
llama-server \
  -m Qwen2.5-32B-Instruct-IQ4_XS.gguf \
  --ctx-size 65536 \
  --n-gpu-layers 65 \
  --flash-attn \
  --port 8080

# 3. Start the gateway
openclaw gateway