Ollama + gpt-oss Cookbook¶
OpenAI's latest open-weight models, gpt-oss, have been released. They come in two sizes:

- a 20 billion parameter model
- a 120 billion parameter model
These models are Apache 2.0 licensed and can be run locally on your machine. In this cookbook, we will use Ollama to demonstrate their capabilities and test some claims of agentic and chain-of-thought behavior.
Setup¶
First, follow the Ollama README to set up and run a local Ollama instance.
When the Ollama app is running on your local machine:

- All of your local models are automatically served on localhost:11434 (see the quick check after this list)
- Select your model when constructing the LLM, e.g. llm = Ollama(..., model="<model family>:<version>")
- Increase the default timeout (30 seconds) if needed by setting Ollama(..., request_timeout=300.0)
- If you set llm = Ollama(..., model="<model family>") without a version, it will simply look for the latest
- By default, the maximum context window for your model is used. You can manually set the context_window to limit memory usage.
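Before wiring Ollama into LlamaIndex, you can confirm the server is up by hitting the root endpoint directly. A minimal sketch using only the standard library; Ollama's root endpoint returns a short plain-text status message:

import urllib.request

# Quick sanity check that the local Ollama server is reachable.
with urllib.request.urlopen("http://localhost:11434") as resp:
    print(resp.read().decode())  # expected: "Ollama is running"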
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-ollama
Chain-of-thought / Thinking with gpt-oss¶

Ollama supports configuration for thinking when using gpt-oss models. Let's test this out with a few examples.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=True,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)
resp_gen = await llm.astream_complete("What is 1234 * 5678?")

still_thinking = True
print("====== THINKING ======")
async for chunk in resp_gen:
    if still_thinking and chunk.additional_kwargs.get("thinking_delta"):
        print(chunk.additional_kwargs["thinking_delta"], end="", flush=True)
    elif still_thinking:
        still_thinking = False
        print("\n====== ANSWER ======")
    if not still_thinking:
        print(chunk.delta, end="", flush=True)
====== THINKING ======
We need to multiply 1234 by 5678. Let's compute: 1234 * 5678. Use long multiplication or mental: 1234 * 5678 = ? Compute 5678 * 1234. 5678 * 1000 = 5,678,000. 5678 * 200 = 1,135,600. 5678 * 30 = 170,340. 5678 * 4 = 22,712. Sum: 5,678,000 + 1,135,600 = 6,813,600. +170,340 = 6,983,940. +22,712 = 7,006,652. Let's verify: Another way: 1234*5678 = (1200+34)*(5678) = 1200*5678 + 34*5678. 1200*5678= 5678*12*100 = 68,136*100? Wait 5678*12 = 5678*10 + 5678*2 = 56,780 + 11,356 = 68,136. times 100 = 6,813,600. 34*5678 = 5678*30 + 5678*4 = 170,340 + 22,712 = 193,052. Sum 6,813,600 + 193,052 = 7,006,652. Yes. Thus answer is 7,006,652.
====== ANSWER ======
\(1234 \times 5678 = 7{,}006{,}652\).
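If you don't need streaming, the same reasoning trace is available on the final response object. A minimal sketch, assuming the accumulated trace lands under a "thinking" key in additional_kwargs (mirroring the streamed "thinking_delta" field; verify the key name against your installed version):

resp = await llm.acomplete("What is 1234 * 5678?")

# "thinking" is an assumed key mirroring the streamed "thinking_delta" chunks.
print(resp.additional_kwargs.get("thinking"))
print(resp.text)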
Creating agents with gpt-oss¶

While a direct response to a prompt is fine, we can also incorporate tools to get more precise results and build an agent.
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.ollama import Ollama


def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b


llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=False,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)

agent = FunctionAgent(
    tools=[multiply],
    llm=llm,
    system_prompt="You are a helpful assistant that can multiply and add numbers. Always rely on tools for math operations.",
)
from llama_index.core.agent.workflow import (
    ToolCall,
    ToolCallResult,
    AgentStream,
)

handler = agent.run("What is 1234 * 5678?")
async for ev in handler.stream_events():
    if isinstance(ev, ToolCall):
        print(f"\nTool call: {ev.tool_name}({ev.tool_kwargs})")
    elif isinstance(ev, ToolCallResult):
        print(
            f"\nTool call: {ev.tool_name}({ev.tool_kwargs}) -> {ev.tool_output}"
        )
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

resp = await handler
Tool call: multiply({'a': 1234, 'b': 5678})

Tool call: multiply({'a': 1234, 'b': 5678}) -> 7006652

The product is **7,006,652**.
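Note that the system prompt mentions addition even though only multiply is registered. As a quick extension (a sketch; the add tool below is hypothetical and not part of the original example), you can register multiple tools and let the model chain them:

def add(a: int, b: int) -> int:
    """Add two numbers"""
    return a + b


# Hypothetical extension: the same agent setup with both tools registered.
agent = FunctionAgent(
    tools=[multiply, add],
    llm=llm,
    system_prompt="You are a helpful assistant that can multiply and add numbers. Always rely on tools for math operations.",
)

resp = await agent.run("What is 1234 * 5678, plus 42?")
print(resp)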
Remembering past events with Agents¶
By default, agent runs do not remember past events. However, using the Context, we can maintain state between calls.
from llama_index.core.workflow import Context
ctx = Context(agent)
resp = await agent.run("What is 1234 * 5678?", ctx=ctx)
resp = await agent.run("What was the last question/answer pair?", ctx=ctx)
print(resp.response.content)
**Last question:** *“What is 1234 * 5678?”*
**Answer:** *The product of 1234 and 5678 is 7,006,652.*
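If you need the conversation to survive beyond the current process, the Context can be serialized and restored later. A sketch, assuming the to_dict/from_dict helpers and JsonSerializer available in recent llama-index-core releases:

from llama_index.core.workflow import Context, JsonSerializer

# Persist the conversation state as a JSON-safe dict.
ctx_dict = ctx.to_dict(serializer=JsonSerializer())

# Later: restore the state against the same agent and continue the conversation.
restored_ctx = Context.from_dict(agent, ctx_dict, serializer=JsonSerializer())
resp = await agent.run("And what was the answer again?", ctx=restored_ctx)
print(resp.response.content)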