Ollama (Local Models)¶

Run Agent Smith entirely on your own hardware with zero API costs. Ollama serves open-source models locally through an OpenAI-compatible API.

Setup¶

1. Install and start Ollama:

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama serve

Or via Docker:

docker run -d -p 11434:11434 --name ollama ollama/ollama

2. Pull a model:

ollama pull qwen2.5-coder:32b

3. Configure agentsmith.yml:

projects:
  my-api:
    agent:
      type: Ollama
      model: qwen2.5-coder:32b
      endpoint: http://localhost:11434    # Default, can be omitted

No API key required. No secrets section needed.

Recommended Models¶

Model	Size	Tool Calling	Best For
`qwen2.5-coder:32b`	18 GB	Yes	Code generation, best local coding model
`qwen2.5-coder:7b`	4.4 GB	Yes	Fast coding on modest hardware
`llama3.3:70b`	40 GB	Yes	General-purpose, strong reasoning
`mistral-small:24b`	14 GB	Yes	Good balance of speed and quality
`deepseek-r1:32b`	18 GB	No	Reasoning-heavy tasks (no tool calling)
`deepseek-r1:7b`	4.4 GB	No	Lightweight reasoning

Tool Calling Auto-Detection¶

Agent Smith automatically tests whether a model supports native tool calling at startup:

[INF] Connected to Ollama 0.6.2 at http://localhost:11434
[INF] Model qwen2.5-coder:32b: tool_calling=True

Native tools: The model receives tool definitions in OpenAI format and returns structured tool calls. Full agentic loop with file operations, search, and code editing.
Structured text fallback: Models without tool calling receive instructions to output structured text. The agent parses the response to extract actions.

Warning

Models without native tool calling (e.g., deepseek-r1) have limited agentic capability. They work for plan generation and analysis but cannot reliably execute multi-step code changes.

Model Routing¶

Mix model sizes for cost vs. quality:

agent:
  type: Ollama
  model: qwen2.5-coder:32b
  models:
    scout:
      model: qwen2.5-coder:7b       # Fast, lightweight for file discovery
      max_tokens: 4096
    primary:
      model: qwen2.5-coder:32b      # Full power for code execution
      max_tokens: 8192
    planning:
      model: qwen2.5-coder:32b
      max_tokens: 4096
    summarization:
      model: qwen2.5-coder:7b       # Small model for compaction
      max_tokens: 2048

Hybrid Cloud + Local¶

Use Ollama for cheap scouting and cloud for execution:

projects:
  my-api:
    agent:
      type: Claude
      model: claude-sonnet-4-20250514
      models:
        scout:
          model: qwen2.5-coder:7b     # Local, free
          max_tokens: 4096
        primary:
          model: claude-sonnet-4-20250514  # Cloud, high quality
          max_tokens: 8192
        planning:
          model: claude-sonnet-4-20250514
          max_tokens: 4096
        summarization:
          model: qwen2.5-coder:7b     # Local, free
          max_tokens: 2048

Note

Hybrid routing requires both Ollama running locally and a cloud API key configured. The provider type determines the primary execution path -- model routing within the models block can reference any available model.

Pricing¶

Ollama models run on your hardware at zero token cost:

agent:
  pricing:
    models:
      qwen2.5-coder:32b:
        input_per_million: 0.0
        output_per_million: 0.0
      qwen2.5-coder:7b:
        input_per_million: 0.0
        output_per_million: 0.0

Cost tracking still works -- it will show $0.00 for local models, which is useful when mixing local and cloud models to see where money is actually spent.

Hardware Requirements¶

Model Size	RAM Required	GPU VRAM	Notes
7B	8 GB	6 GB	Runs on most machines
14-24B	16 GB	12 GB	Good laptop GPU
32B	24 GB	20 GB	Desktop GPU (RTX 4090)
70B	48 GB	40 GB	Dual GPU or CPU-only (slow)

Tip

Ollama automatically uses GPU acceleration when available. For CPU-only setups, expect 5-10x slower inference. The 7B models are still usable on CPU; 32B+ models need a GPU for practical agentic loop speeds.

Troubleshooting¶

"Cannot connect to Ollama" -- Ensure Ollama is running:

ollama serve        # or: docker start ollama
curl http://localhost:11434/api/version

"Model not found" -- Pull the model first:

ollama pull qwen2.5-coder:32b
ollama list         # Verify it's available

Slow inference -- Check GPU utilization. If Ollama is using CPU:

ollama ps           # Shows running models and GPU usage