Configuring Settings#

Settings is a bundle of commonly used resources that a LlamaIndex pipeline or application draws on during the indexing and querying stages.

You can use it to set the global configuration. Local configurations (transformations, LLMs, embedding models) can be passed directly into the interfaces that make use of them.

Settings is a simple singleton object that lives for the duration of your application. Whenever a particular component is not provided, the Settings object supplies it as a global default.
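The fallback behaviour can be pictured with a small, library-free sketch. The names `_GlobalSettings`, `GLOBAL_SETTINGS`, and `resolve_llm` are hypothetical and exist only for this illustration; they are not LlamaIndex internals, and the strings stand in for real LLM objects.

```python
from typing import Optional


class _GlobalSettings:
    """Hypothetical stand-in for a process-wide settings singleton."""

    def __init__(self) -> None:
        self.llm: str = "global-llm"


# One shared instance for the whole application.
GLOBAL_SETTINGS = _GlobalSettings()


def resolve_llm(llm: Optional[str] = None) -> str:
    """Return the locally supplied LLM, or fall back to the global default."""
    return llm if llm is not None else GLOBAL_SETTINGS.llm


# No local value: the global default is used.
default_choice = resolve_llm()

# A local value takes precedence over the global default.
local_choice = resolve_llm("local-llm")

# Changing the global default affects later calls that pass nothing.
GLOBAL_SETTINGS.llm = "new-global-llm"
updated_choice = resolve_llm()
```

The point of the pattern is that callers only need to pass a component when they want something other than the application-wide default.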

The following attributes can be configured on the Settings object:


LLM#

The LLM is used to respond to prompts and queries, and is responsible for writing natural language responses.

from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

Embed Model#

The embedding model converts text to numerical representations, which are used for calculating similarity and top-k retrieval.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small", embed_batch_size=100
)

Node Parser / Text Splitter#

The node parser / text splitter is used to parse documents into smaller chunks, called nodes.

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings

Settings.text_splitter = SentenceSplitter(chunk_size=1024)

If you just want to change the chunk size or chunk overlap without changing the default splitter, this is also possible:

Settings.chunk_size = 512
Settings.chunk_overlap = 20
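To see what these two values control, here is a simplified, library-free sketch of fixed-size chunking with overlap. The function `chunk_tokens` is hypothetical and is not how `SentenceSplitter` is implemented; a real splitter also tries to respect sentence boundaries, which this sketch ignores.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
```

With ten tokens, a chunk size of 4, and an overlap of 1, each chunk begins with the last token of the previous chunk. A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.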


Transformations#

Transformations are applied to Documents during ingestion. By default, the node_parser/text_splitter is used, but this can be overridden and customized further.

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings

Settings.transformations = [SentenceSplitter(chunk_size=1024)]


Tokenizer#

The tokenizer is used to count tokens. This should be set to something that matches the LLM you are using.
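In practice the tokenizer is a callable that takes a string and returns a sequence of tokens, and the token count is the length of that sequence. A minimal sketch of that contract, using a hypothetical whitespace tokenizer (`whitespace_tokenizer` and `count_tokens` are illustrative names, not library functions):

```python
from typing import Callable, List


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer for illustration: split on whitespace."""
    return text.split()


def count_tokens(text: str, tokenizer: Callable[[str], List]) -> int:
    """Count tokens by measuring the length of the tokenizer's output."""
    return len(tokenizer(text))


n_tokens = count_tokens("The quick brown fox", whitespace_tokenizer)
```

Real LLM tokenizers split text into subword units, so their counts differ from a whitespace split. That difference is why the configured tokenizer should match the model.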

from llama_index.core import Settings

# openai
import tiktoken

Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode

# open-source
from transformers import AutoTokenizer

Settings.tokenizer = AutoTokenizer.from_pretrained(
    ...  # replace with the Hugging Face model ID of your LLM
)


Callbacks#

You can set a global callback manager, which can be used to observe and consume events generated throughout the llama-index code.

from llama_index.core.callbacks import TokenCountingHandler, CallbackManager
from llama_index.core import Settings

token_counter = TokenCountingHandler()
Settings.callback_manager = CallbackManager([token_counter])

Prompt Helper Arguments#

A few specific arguments are used during querying to ensure that the input prompt leaves the LLM enough room to generate a certain number of tokens.

Typically these are automatically configured using attributes from the LLM, but they can be overridden in special cases.

from llama_index.core import Settings

# maximum input size to the LLM
Settings.context_window = 4096

# number of tokens to leave room for the LLM to generate
Settings.num_output = 256
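As a rough illustration of how these two values interact, the sketch below computes the token budget left for the prompt. The helper `available_prompt_tokens` is hypothetical, and the subtraction is a simplifying assumption; the library's own accounting may reserve additional tokens, for example for the prompt template.

```python
def available_prompt_tokens(context_window: int, num_output: int) -> int:
    """Tokens left for the input prompt after reserving room for the output."""
    if num_output > context_window:
        raise ValueError("num_output cannot exceed context_window")
    return context_window - num_output


# With the values above: 4096 - 256 tokens remain for the prompt.
budget = available_prompt_tokens(context_window=4096, num_output=256)
```

Under this assumption, retrieved context and the query together have to fit within 3840 tokens.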


Setting local configurations#

Interfaces that use specific parts of the settings can also accept local overrides.

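from llama_index.core import VectorStoreIndex

# `documents`, `embed_model`, `transformations`, and `llm` are assumed to be defined earlier
index = VectorStoreIndex.from_documents(
    documents, embed_model=embed_model, transformations=transformations
)

query_engine = index.as_query_engine(llm=llm)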