VRAM allocation: Qwen2.5-32B IQ4_XS + 64K context (24 GB card)
- Model weights: ~14 GB
- KV-cache (64K ctx): ~8 GB
- Total used: ~22 GB
- Free: ~2 GB
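The ~8 GB KV-cache figure can be sanity-checked from the model's published shape. A rough sketch, assuming Qwen2.5-32B's config (64 layers, 8 KV heads under GQA, head dim 128) and an 8-bit quantized KV cache at roughly 1 byte per element (the quantization assumption is ours, chosen to match the chart):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: float) -> float:
    """KV cache = 2 (K and V) x layers x KV heads x head dim x context tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

# Qwen2.5-32B: 64 layers, 8 KV heads (GQA), head dim 128; ~1 byte/elem for q8 KV
print(kv_cache_gib(64, 8, 128, 65536, 1))  # -> 8.0
```

With an unquantized fp16 cache (2 bytes per element) the same 64K context would need ~16 GB, which is why KV-cache savings are decisive on a 24 GB card.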
⚙ Recommended Architecture
Primary: Qwen2.5-32B (IQ4_XS via llama-server), 64K ctx, ~15-20 tok/s
➔ Fallback #1: Mistral Small 3.1 24B (Ollama), 32K ctx, ~30-50 tok/s
➔ Fallback #2: Claude Haiku (cloud, Anthropic API), 200K ctx, reserved for critical tasks
★ Model Comparison for RTX 3090
Best Overall: Qwen2.5-32B IQ4_XS
- VRAM (64K): ~22-23 GB
- Native context: 131K
- Speed: 15-20 tok/s
- Tool calling: reliable
- Fits 24 GB? Yes

Fastest: Mistral Small 3.1 24B
- VRAM (64K): ~20-22 GB
- Native context: 128K
- Speed: 30-50 tok/s
- Tool calling: excellent
- Fits 24 GB? Yes

Best for Code: Qwen2.5-Coder-32B Q4
- VRAM (64K): ~22 GB
- Native context: 131K
- Speed: 10-20 tok/s
- Tool calling: excellent
- Fits 24 GB? Yes

Does Not Fit: Llama 3.3 70B Q4
- VRAM (64K): >40 GB
- Native context: 128K
- Speed (CPU offload): 0.3-0.5 tok/s
- Tool calling: good
- Fits 24 GB? No
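The ">40 GB" verdict for the 70B model follows from back-of-envelope arithmetic on quantized weight size. A sketch, assuming ~4.8 bits per weight for a Q4_K-class quant (bpw figures vary slightly by quant type, and KV cache comes on top of the weights):

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of quantized weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Llama 3.3 70B at ~4.8 bpw: weights alone far exceed a 24 GB card
print(round(weight_gib(70, 4.8), 1))  # -> 39.1
```

Even before adding any KV cache, the weights alone cannot fit, forcing CPU offload and the 0.3-0.5 tok/s speeds listed above.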
⚡ Generation Speed (tok/s @ 64K context): see the per-model speed figures above
⚠ Known Issues & Fixes
Ollama Streaming Breaks Tool Calling
Issues #5769 and #9632: tool_calls deltas are not sent during streaming. Fix: use llama-server instead, or set OLLAMA_DISABLE_STREAMING=true.
Ollama Ignores num_ctx
In OpenAI-compatible mode, the context window silently defaults to 4096 tokens regardless of the model's capacity. Fix: set "injectNumCtxForOpenAICompat": true in openclaw.json, or use the native Ollama API.
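A minimal openclaw.json fragment with that flag set. The key name comes from the fix above; any other keys in your existing config are unaffected (this is a sketch of the one setting, not a complete config):

```json
{
  "injectNumCtxForOpenAICompat": true
}
```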
Context Overflow
Local models degrade quickly as the context window fills. Enable auto-compaction at an 80% threshold, and use the /compact command manually when needed.
Flash Attention Required
Always launch llama-server with --flash-attn: it saves 20-40% of VRAM at large context sizes and is critical for fitting a 32B model with 64K context into 24 GB.
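The 80% auto-compaction threshold above amounts to a simple ratio check. A minimal sketch (the function name and wiring are illustrative, not openclaw's actual internals):

```python
def should_compact(used_tokens: int, ctx_size: int, threshold: float = 0.8) -> bool:
    """Trigger compaction once the context window is mostly full."""
    return used_tokens >= threshold * ctx_size

print(should_compact(53_000, 65_536))  # -> True  (~81% of a 64K window)
print(should_compact(40_000, 65_536))  # -> False (~61%)
```

Compacting before the window is completely full matters because quality drops well before a hard overflow, and compaction itself needs headroom to write the summary.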
✓ Local LLM Compatibility
Works Locally
- File operations & workspace
- Persistent memory
- Shell commands & tools
- Telegram / Slack / Discord
- Sandboxed execution
- Code generation (Qwen-Coder)
- Cron jobs & heartbeat
- Basic web browsing
Needs Cloud LLM
- Complex multi-step reasoning
- Email / calendar integration
- Prompt injection resistance
- Untrusted web content parsing
- Vision / multimodal tasks
- Financial operations
- Tasks >32K tokens context
- Reliable JSON from web search
▶ Quick Start
# 1. Download the IQ4_XS quant
huggingface-cli download bartowski/Qwen2.5-32B-Instruct-GGUF --include "*IQ4_XS*"

# 2. Serve it with llama.cpp (flash attention is required; see Known Issues)
llama-server \
  -m Qwen2.5-32B-Instruct-IQ4_XS.gguf \
  --ctx-size 65536 \
  --n-gpu-layers 65 \
  --flash-attn \
  --port 8080

# 3. Start the gateway
openclaw gateway