What is Ollama and Why Use It?
Ollama is an open-source framework that lets you run large language models directly on your local machine. Instead of sending data to OpenAI or Anthropic's servers, you can run models like Llama 4, Qwen 3, and DeepSeek locally—giving you complete privacy, no API costs after initial download, and full control over your AI setup.
Recent releases have added tool calling, vision models, and an OpenAI-compatible API, making Ollama a legitimate alternative to cloud-based LLMs for developers who need local inference. Whether you're building AI agents, experimenting with fine-tuned models, or just want to run AI without an internet connection, Ollama deserves a spot in your toolkit.
Installation
Get Ollama running in minutes. Download from the official site for Windows, macOS, or Linux. On Linux or macOS, run:
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download and run the official installer from ollama.com (the shell script above is for macOS and Linux only).
Verify the installation by running ollama in your terminal; you should see the usage text listing the available subcommands.
Pulling and Running Your First Model
The simplest way to start is pulling a model and running it interactively:
- ollama pull llama3.2 — the popular 3B model (~2 GB download)
- ollama pull qwen3:0.6b — small, fast model for testing
- ollama pull deepseek-r1:7b — reasoning-focused model
Then run it:
ollama run llama3.2
That's it—you now have a local chatbot. Type your prompts directly in the terminal.
Using Ollama with Python
For programmatic use, install the Python library:
pip install ollama
Then generate text:
import ollama
response = ollama.generate(
    model="llama3.2",
    prompt="Explain quantum computing in simple terms"
)
print(response['response'])
For conversational interfaces, use the chat method, which accepts a list of messages so you can pass the conversation history along with each request:
import ollama
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."}
]
response = ollama.chat(model="llama3.2", messages=messages)
print(response['message']['content'])
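Note that chat itself is stateless between calls; your code is responsible for appending each reply to the message list so later turns see the context. A minimal multi-turn loop with the model call stubbed out (swap fake_chat for ollama.chat against a running server; ask and fake_chat are hypothetical helpers, not part of the ollama library):

```python
def fake_chat(model, messages):
    # Stand-in for ollama.chat so this runs without a server; echoes the last user message.
    return {"message": {"role": "assistant", "content": f"echo: {messages[-1]['content']}"}}

history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question, chat=fake_chat):
    history.append({"role": "user", "content": question})
    reply = chat("llama3.2", history)["message"]
    history.append(reply)  # carry the assistant turn so later calls see the full context
    return reply["content"]

first = ask("Hi")
second = ask("What did I just say?")
# history now holds the system prompt plus two user/assistant pairs
```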
OpenAI-Compatible API
Start the Ollama server and use it as a drop-in replacement for OpenAI:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Dummy key; Ollama ignores it, but the client requires one
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
This works with LangChain, AutoGen, CrewAI, and any library expecting OpenAI-format endpoints.
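Under the hood, all of those clients do the same thing: POST OpenAI-format JSON to Ollama's /v1 endpoint. A stdlib-only sketch of the request they build (build_chat_request is a hypothetical helper; the network call itself is left commented out since it needs a running server):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, messages):
    # Mirror the JSON body an OpenAI-style client would send to this endpoint.
    body = json.dumps({"model": model, "messages": messages}).encode()
    return request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
    )

req = build_chat_request("llama3.2", [{"role": "user", "content": "Hello!"}])
# request.urlopen(req) would return the completion once the Ollama server is up
```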
Tool Calling with Ollama
Models like llama3.2 support tool calling—the core of AI agents:
import ollama

def calculate_square_root(x):
    return x ** 0.5

tools = [{
    "type": "function",
    "function": {
        "name": "calculate_square_root",
        "description": "Calculate the square root of a number",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "number", "description": "The number"}
            },
            "required": ["x"]
        }
    }
}]

messages = [{"role": "user", "content": "What's the square root of 144?"}]
response = ollama.chat(model="llama3.2", messages=messages, tools=tools)

if response['message'].get('tool_calls'):
    # Keep the assistant's tool-call turn in the history before adding the result
    messages.append(response['message'])
    args = response['message']['tool_calls'][0]['function']['arguments']
    result = calculate_square_root(args['x'])
    messages.append({"role": "tool", "content": str(result)})
    final = ollama.chat(model="llama3.2", messages=messages)
    print(final['message']['content'])
Performance Tuning
Control GPU offloading with the num_gpu option (set per request via options, or with PARAMETER num_gpu in a Modelfile):
- num_gpu 35 — offload 35 layers to the GPU
- num_gpu 0 — CPU-only mode
Choose a quantization level to trade speed for quality (Q4_0 is fastest, Q8_0 closest to full precision). The default Q4_K_M retains roughly 95% of full-precision quality while using about a quarter of the FP16 VRAM.
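As a back-of-the-envelope check on those numbers, weight memory is roughly parameter count times bits per weight. The helper below is a rough sketch that ignores the KV cache and runtime overhead, assuming Q4_K_M averages about 4.5 bits per weight:

```python
def estimate_vram_gb(params_billions, bits_per_weight):
    # Weights only: params * bits / 8 bits-per-byte, in GB for billions of params.
    return params_billions * bits_per_weight / 8

# An 8B model as an example:
fp16 = estimate_vram_gb(8, 16)   # 16.0 GB at full FP16 precision
q4 = estimate_vram_gb(8, 4.5)    # 4.5 GB at ~4.5-bit Q4_K_M, roughly a quarter of FP16
```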
Creating Custom Models
Use a Modelfile to specialize models:
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a Senior Python Backend Engineer. Only answer with code."
Create it:
ollama create my-coder -f Modelfile
Then run your customized model:
ollama run my-coder
Essential Commands
- ollama list — show downloaded models
- ollama pull <model> — download a model
- ollama rm <model> — delete a model to free space
- ollama ps — show currently running models
What You Need to Know
Ollama runs locally on port 11434 by default. The API is OpenAI-compatible, meaning most existing code works with minimal changes. Models download once and stay on your machine—no per-token costs. For privacy-sensitive work or offline development, this is a genuine alternative to cloud APIs.