How to Run Local LLMs with Ollama: A Complete 2026 Tutorial

A practical guide to running local LLMs with Ollama in 2026. Covers installation, Python integration, tool calling, and OpenAI-compatible API setup.

March 24, 2026

What is Ollama and Why Use It?

Ollama is an open-source framework that lets you run large language models directly on your local machine. Instead of sending data to OpenAI or Anthropic's servers, you can run models like Llama 4, Qwen 3, and DeepSeek locally—giving you complete privacy, no API costs after initial download, and full control over your AI setup.

With the recent release of Ollama supporting tool calling, vision models, and an OpenAI-compatible API, it's become a legitimate alternative to cloud-based LLMs for developers who need local inference. Whether you're building AI agents, experimenting with fine-tuned models, or just want to run AI without internet, Ollama deserves a spot in your toolkit.

Installation

Get Ollama running in minutes. On Windows and macOS, download the installer from the official site. On Linux, run:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, download and run the installer from ollama.com/download.

Verify the installation by running ollama --version in your terminal; you should see the installed version number.

Pulling and Running Your First Model

The simplest way to start is pulling a model and running it interactively:

  • ollama pull llama3.2 — The popular general-purpose 3B model (~2GB download)
  • ollama pull qwen3:0.6b — Small, fast model for testing
  • ollama pull deepseek-r1:7b — Reasoning-focused model

Then run it:

ollama run llama3.2

That's it—you now have a local chatbot. Type your prompts directly in the terminal.

Using Ollama with Python

For programmatic use, install the Python library:

pip install ollama

Then generate text:

import ollama

response = ollama.generate(
    model="llama3.2",
    prompt="Explain quantum computing in simple terms"
)
print(response['response'])
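For long generations you can also stream tokens as they arrive: passing stream=True makes ollama.generate yield chunk dicts whose 'response' field holds the newly generated text. In this sketch, collect_stream is a small illustrative helper (not part of the library) that joins such chunks, and stream_demo requires a running Ollama server:

```python
def collect_stream(chunks):
    """Join the 'response' field of streamed chunk dicts into one string."""
    return "".join(chunk["response"] for chunk in chunks)

def stream_demo():
    # Requires a running Ollama server with llama3.2 pulled.
    import ollama
    stream = ollama.generate(
        model="llama3.2",
        prompt="Explain quantum computing in simple terms",
        stream=True,
    )
    for chunk in stream:
        # Print each piece of text as soon as the model produces it.
        print(chunk["response"], end="", flush=True)
```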

For conversational interfaces, use the chat method, passing the full message history on each call:

import ollama

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."}
]

response = ollama.chat(model="llama3.2", messages=messages)
print(response['message']['content'])
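Note that chat does not store state between calls; you keep the history yourself by appending each reply before sending the next turn. A sketch of that loop, where add_exchange is an illustrative helper (not part of the library) and chat_demo assumes a running server:

```python
def add_exchange(messages, reply_message, user_followup):
    """Append the assistant's reply and the next user turn to the history."""
    messages.append({"role": "assistant", "content": reply_message["content"]})
    messages.append({"role": "user", "content": user_followup})
    return messages

def chat_demo():
    # Requires a running Ollama server with llama3.2 pulled.
    import ollama
    messages = [{"role": "user", "content": "Write a haiku about Python."}]
    response = ollama.chat(model="llama3.2", messages=messages)
    add_exchange(messages, response["message"], "Now make it rhyme.")
    followup = ollama.chat(model="llama3.2", messages=messages)
    print(followup["message"]["content"])
```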

OpenAI-Compatible API

The Ollama server (started automatically on install, or manually with ollama serve) doubles as a drop-in replacement for the OpenAI API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Dummy key
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This works with LangChain, AutoGen, CrewAI, and any library expecting OpenAI-format endpoints.
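Because the endpoint is plain HTTP, you don't even need an SDK. A hedged sketch using only the standard library: build_chat_request is an illustrative helper, and rest_demo assumes a server listening on the default port 11434.

```python
def build_chat_request(model, user_content, system=None):
    """Construct an OpenAI-format chat completion payload."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_content})
    return {"model": model, "messages": messages}

def rest_demo():
    # Requires a running Ollama server on the default port.
    import json
    import urllib.request
    payload = build_chat_request("llama3.2", "Hello!")
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```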

Tool Calling with Ollama

Models like llama3.2 support tool calling—the core of AI agents:

import ollama

def calculate_square_root(x):
    return x ** 0.5

tools = [{
    "type": "function",
    "function": {
        "name": "calculate_square_root",
        "description": "Calculate the square root of a number",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "number", "description": "The number"}
            },
            "required": ["x"]
        }
    }
}]

messages = [{"role": "user", "content": "What's the square root of 144?"}]
response = ollama.chat(model="llama3.2", messages=messages, tools=tools)

if 'tool_calls' in response['message']:
    call = response['message']['tool_calls'][0]
    result = calculate_square_root(call['function']['arguments']['x'])
    # Append the assistant's tool-call turn, then the tool result, before asking again
    messages.append(response['message'])
    messages.append({"role": "tool", "content": str(result)})
    final = ollama.chat(model="llama3.2", messages=messages)
    print(final['message']['content'])
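Once you have several tools, a name-to-function registry keeps the dispatch code flat. A sketch of one (dispatch_tool_call and TOOLS are illustrative helpers, not part of Ollama's API), assuming each tool call carries a function name and an already-parsed arguments dict, as the Python library returns:

```python
def dispatch_tool_call(registry, tool_call):
    """Invoke the registered function named in a tool call with its arguments."""
    fn = tool_call["function"]
    name = fn["name"]
    if name not in registry:
        raise ValueError(f"Unknown tool: {name}")
    # Unpack the parsed arguments dict as keyword arguments.
    return registry[name](**fn["arguments"])

TOOLS = {"calculate_square_root": lambda x: x ** 0.5}
```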

Performance Tuning

Control GPU offloading with the num_gpu option, set per request in the API or in a Modelfile:

  • PARAMETER num_gpu 35 — Offload 35 layers to the GPU
  • PARAMETER num_gpu 0 — CPU-only mode

Choose quantization levels (Q4_0 for speed, Q8_0 for quality). The default Q4_K_M is a good middle ground, keeping most of the full-precision quality at roughly a quarter of the FP16 memory footprint.
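A rough rule of thumb for sizing: weight memory is approximately parameter count times bytes per weight (about 2.0 for FP16, roughly 1.1 for Q8_0, and roughly 0.6 for 4-bit variants including format overhead). The figures below are approximations for back-of-the-envelope planning, not exact numbers, and exclude KV cache and runtime overhead:

```python
# Approximate bytes per parameter, including quantization-format overhead.
BYTES_PER_PARAM = {
    "Q4_0": 0.6,
    "Q4_K_M": 0.6,
    "Q8_0": 1.1,
    "F16": 2.0,
}

def estimate_weights_gb(params_billion, quant):
    """Rough weight footprint in GB; excludes KV cache and runtime overhead."""
    return params_billion * BYTES_PER_PARAM[quant]
```

For example, an 8B model needs around 16GB at FP16 but only about 5GB at Q4_K_M, which is why 4-bit quants fit on consumer GPUs.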

Creating Custom Models

Use a Modelfile to specialize models:

FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a Senior Python Backend Engineer. Only answer with code."

Create it:

ollama create my-coder -f Modelfile

Then run your customized model:

ollama run my-coder
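If you maintain several specialized variants, you can generate Modelfiles from Python instead of writing them by hand. A small sketch, where render_modelfile is a hypothetical helper (not part of Ollama):

```python
def render_modelfile(base, system, params=None):
    """Render a Modelfile string from a base model, parameters, and system prompt."""
    lines = [f"FROM {base}"]
    for key, value in (params or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    lines.append(f'SYSTEM "{system}"')
    return "\n".join(lines) + "\n"
```

Write the result to a file named Modelfile, then build it as before with ollama create my-coder -f Modelfile.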

Essential Commands

  • ollama list — Show downloaded models
  • ollama pull <model> — Download a model
  • ollama rm <model> — Delete to free space
  • ollama ps — Show currently running models

What You Need to Know

Ollama runs locally on port 11434 by default. The API is OpenAI-compatible, meaning most existing code works with minimal changes. Models download once and stay on your machine—no per-token costs. For privacy-sensitive work or offline development, this is a genuine alternative to cloud APIs.

Source: Real Python