Finetuning a model means updating the model itself over a set of data to improve the model in a variety of ways. This can include improving the quality of outputs, reducing hallucinations, memorizing more data holistically, and reducing latency/cost.
The core of our toolkit revolves around in-context learning / retrieval augmentation, which involves using the models in inference mode and not training the models themselves.
While finetuning can be also used to “augment” a model with external data, finetuning can complement retrieval augmentation in a variety of ways:
Embedding Finetuning Benefits#
Finetuning the embedding model can allow for more meaningful embedding representations over a training distribution of data –> leads to better retrieval performance.
LLM Finetuning Benefits#
Allow it to learn a style over a given dataset
Allow it to learn a DSL that might be less represented in the training data (e.g. SQL)
Allow it to correct hallucinations/errors that might be hard to fix through prompt engineering
Allow it to distill a better model (e.g. GPT-4) into a simpler/cheaper model (e.g. gpt-3.5, Llama 2)
Integrations with LlamaIndex#
This is an evolving guide, and there are currently three key integrations with LlamaIndex. Please check out the sections below for more details!
Finetuning embeddings for better retrieval performance
Finetuning Llama 2 for better text-to-SQL
Finetuning gpt-3.5-turbo to distill gpt-4
We’ve created comprehensive guides showing you how to finetune embeddings in different ways, whether that’s the model itself (in this case,
bge) over an unstructured text corpus, or an adapter over any black-box embedding. It consists of the following steps:
Generating a synthetic question/answer dataset using LlamaIndex over any unstructured context.
Finetuning the model
Evaluating the model.
Finetuning gives you a 5-10% increase in retrieval evaluation metrics. You can then plug this fine-tuned model into your RAG application with LlamaIndex.
Fine-tuning GPT-3.5 to distill GPT-4#
We have multiple guides showing how to use OpenAI’s finetuning endpoints to fine-tune gpt-3.5-turbo to output GPT-4 responses for RAG/agents.
We use GPT-4 to automatically generate questions from any unstructured context, and use a GPT-4 query engine pipeline to generate “ground-truth” answers. Our
OpenAIFineTuningHandler callback automatically logs questions/answers to a dataset.
We then launch a finetuning job, and get back a distilled model. We can evaluate this model with Ragas to benchmark against a naive GPT-3.5 pipeline.
Fine-tuning with Retrieval Augmentation#
Here we try fine-tuning an LLM with retrieval-augmented inputs, as referenced from the RA-DIT paper: https://arxiv.org/abs/2310.01352.
The core idea is to allow the LLM to better use the context from a given retriever or ignore it entirely.
Fine-tuning for Better Structured Outputs#
Another use case for fine-tuning is to make the model better at outputting structured data. We can do this for both OpenAI and Llama2.
[WIP] Fine-tuning GPT-3.5 to Memorize Knowledge#
We have a guide experimenting with showing how to use OpenAI fine-tuning to memorize a body of text. Still WIP! Not quite as good as RAG yet.
Fine-tuning Llama 2 for Better Text-to-SQL#
In this tutorial, we show you how you can finetune Llama 2 on a text-to-SQL dataset, and then use it for structured analytics against any SQL database using LlamaIndex abstractions.
The stack includes
sql-create-context as the training dataset, OpenLLaMa as the base model, PEFT for finetuning, Modal for cloud compute, LlamaIndex for inference abstractions.
Fine-tuning An Evaluator#
In these tutorials, we aim to distill a GPT-4 judge (or evaluator) onto a GPT-3.5 judge. It has been recently observed that GPT-4 judges can reach high levels of agreement with human evaluators (e.g., see https://arxiv.org/pdf/2306.05685.pdf).
Thus, by fine-tuning a GPT-3.5 judge, we may be able to reach GPT-4 levels (and by proxy, agreement with humans) at a lower cost.
Fine-tuning Cross-Encoders for Re-Ranking#
By finetuning a cross encoder, we can attempt to improve re-ranking performance on our own private data.
Re-ranking is key step in advanced retrieval, where retrieved nodes from many sources are re-ranked using a separate model, so that the most relevant nodes are first.
In this example, we use the
sentence-transformers package to help finetune a crossencoder model, using a dataset that is generated based on the
Cohere Custom Reranker#
By training a custom reranker with CohereAI, we can attempt to improve re-ranking performance on our own private data.
Re-ranking is a crucial step in advanced retrieval processes. This step involves using a separate model to re-organize nodes retrieved from initial retrieval phase. The goal is to ensure that the most relevant nodes are prioritized and appear first.
In this example, we use the
cohere custom reranker training module to create a reranker on your domain or specific dataset to improve retrieval performance.