Ollama + gpt-oss Cookbook¶
OpenAI's latest open-weight models, gpt-oss, have been released. They come in two sizes:

- a 20 billion parameter model
- a 120 billion parameter model
These models are Apache 2.0 licensed and can be run locally on your machine. In this cookbook, we will use Ollama to demonstrate their capabilities and test some claims of agentic and chain-of-thought behavior.
Setup¶
First, follow the Ollama README to set up and run a local Ollama instance.
When the Ollama app is running on your local machine:

- All of your local models are automatically served on localhost:11434 (see the quick check after this list)
- Select your model when constructing the LLM, e.g. llm = Ollama(..., model="<model family>:<version>")
- Increase the default timeout (30 seconds) if needed by setting Ollama(..., request_timeout=300.0)
- If you set llm = Ollama(..., model="<model family>") without a version, it will simply look for the latest
- By default, the maximum context window for your model is used. You can manually set the context_window to limit memory usage.
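Before wiring Ollama into LlamaIndex, you can confirm the server is up by hitting the root endpoint directly. A minimal sketch using only the standard library; Ollama's root endpoint returns a short plain-text status message:

import urllib.request

# Quick sanity check that the local Ollama server is reachable.
with urllib.request.urlopen("http://localhost:11434") as resp:
    print(resp.read().decode())  # expected: "Ollama is running"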
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-ollama
Chain-of-thought / Thinking with gpt-oss¶

Ollama supports configuration for thinking when using gpt-oss models. Let's test this out with a few examples.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=True,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)
resp_gen = await llm.astream_complete("What is 1234 * 5678?")

still_thinking = True
print("====== THINKING ======")
async for chunk in resp_gen:
    if still_thinking and chunk.additional_kwargs.get("thinking_delta"):
        print(chunk.additional_kwargs["thinking_delta"], end="", flush=True)
    elif still_thinking:
        still_thinking = False
        print("\n====== ANSWER ======")
    if not still_thinking:
        print(chunk.delta, end="", flush=True)
====== THINKING ======
We need to multiply 1234 by 5678. Let's compute: 1234 * 5678. Use long multiplication or mental: 1234 * 5678 = ? Compute 5678 * 1234. 5678 * 1000 = 5,678,000. 5678 * 200 = 1,135,600. 5678 * 30 = 170,340. 5678 * 4 = 22,712. Sum: 5,678,000 + 1,135,600 = 6,813,600. +170,340 = 6,983,940. +22,712 = 7,006,652. Let's verify: Another way: 1234*5678 = (1200+34)*(5678) = 1200*5678 + 34*5678. 1200*5678= 5678*12*100 = 68,136*100? Wait 5678*12 = 5678*10 + 5678*2 = 56,780 + 11,356 = 68,136. times 100 = 6,813,600. 34*5678 = 5678*30 + 5678*4 = 170,340 + 22,712 = 193,052. Sum 6,813,600 + 193,052 = 7,006,652. Yes. Thus answer is 7,006,652.
====== ANSWER ======
\(1234 \times 5678 = 7{,}006{,}652\).
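If you don't need streaming, the same reasoning trace is available on the final response object. A minimal sketch, assuming the accumulated trace lands under a "thinking" key in additional_kwargs (mirroring the streamed "thinking_delta" field; verify the key name against your installed version):

resp = await llm.acomplete("What is 1234 * 5678?")

# "thinking" is an assumed key mirroring the streamed "thinking_delta" chunks.
print(resp.additional_kwargs.get("thinking"))
print(resp.text)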
Creating agents with gpt-oss¶

While a direct response to a prompt is fine, we can also incorporate tools to get more precise results and build an agent.
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.ollama import Ollama


def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b


llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=False,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)

agent = FunctionAgent(
    tools=[multiply],
    llm=llm,
    system_prompt="You are a helpful assistant that can multiply and add numbers. Always rely on tools for math operations.",
)
from llama_index.core.agent.workflow import (
    ToolCall,
    ToolCallResult,
    AgentStream,
)

handler = agent.run("What is 1234 * 5678?")
async for ev in handler.stream_events():
    if isinstance(ev, ToolCall):
        print(f"\nTool call: {ev.tool_name}({ev.tool_kwargs})")
    elif isinstance(ev, ToolCallResult):
        print(
            f"\nTool call: {ev.tool_name}({ev.tool_kwargs}) -> {ev.tool_output}"
        )
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

resp = await handler
Tool call: multiply({'a': 1234, 'b': 5678})

Tool call: multiply({'a': 1234, 'b': 5678}) -> 7006652

The product is **7,006,652**.
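Note that the system prompt mentions addition even though only multiply is registered. As a quick extension (a sketch; the add tool below is hypothetical and not part of the original example), you can register multiple tools and let the model chain them:

def add(a: int, b: int) -> int:
    """Add two numbers"""
    return a + b


# Hypothetical extension: the same agent setup with both tools registered.
agent = FunctionAgent(
    tools=[multiply, add],
    llm=llm,
    system_prompt="You are a helpful assistant that can multiply and add numbers. Always rely on tools for math operations.",
)

resp = await agent.run("What is 1234 * 5678, plus 42?")
print(resp)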
Remembering past events with Agents¶
By default, agent runs do not remember past events. However, using the Context, we can maintain state between calls.
from llama_index.core.workflow import Context
ctx = Context(agent)
resp = await agent.run("What is 1234 * 5678?", ctx=ctx)
resp = await agent.run("What was the last question/answer pair?", ctx=ctx)
print(resp.response.content)
**Last question:** *“What is 1234 * 5678?”*
**Answer:** *The product of 1234 and 5678 is 7,006,652.*
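If you need the conversation to survive beyond the current process, the Context can be serialized and restored later. A sketch, assuming the to_dict/from_dict helpers and JsonSerializer available in recent llama-index-core releases:

from llama_index.core.workflow import Context, JsonSerializer

# Persist the conversation state as a JSON-safe dict.
ctx_dict = ctx.to_dict(serializer=JsonSerializer())

# Later: restore the state against the same agent and continue the conversation.
restored_ctx = Context.from_dict(agent, ctx_dict, serializer=JsonSerializer())
resp = await agent.run("And what was the answer again?", ctx=restored_ctx)
print(resp.response.content)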