Finetuning

Finetuning modules.

class llama_index.finetuning.CohereRerankerFinetuneEngine(train_file_name: str = 'train.jsonl', val_file_name: Optional[str] = None, model_name: str = 'exp_finetune', model_type: str = 'RERANK', base_model: str = 'english', api_key: Optional[str] = None)

Cohere Reranker Finetune Engine.

finetune() → None

Finetune model.

get_finetuned_model(top_n: int = 5) → CohereRerank

Gets the finetuned model as a CohereRerank instance configured with top_n.

class llama_index.finetuning.EmbeddingAdapterFinetuneEngine(dataset: EmbeddingQAFinetuneDataset, embed_model: BaseEmbedding, batch_size: int = 10, epochs: int = 1, adapter_model: Optional[Any] = None, dim: Optional[int] = None, device: Optional[str] = None, model_output_path: str = 'model_output', model_checkpoint_path: Optional[str] = None, checkpoint_save_steps: int = 100, verbose: bool = False, bias: bool = False, **train_kwargs: Any)

Embedding adapter finetune engine.

Parameters
  • dataset (EmbeddingQAFinetuneDataset) – Dataset to finetune on.

  • embed_model (BaseEmbedding) – Embedding model to finetune.

  • batch_size (Optional[int]) – Batch size. Defaults to 10.

  • epochs (Optional[int]) – Number of epochs. Defaults to 1.

  • dim (Optional[int]) – Dimension of embedding. Defaults to None.

  • adapter_model (Optional[BaseAdapter]) – Adapter model. Defaults to None, in which case a linear adapter is used.

  • device (Optional[str]) – Device to use. Defaults to None.

  • model_output_path (str) – Path to save model output. Defaults to “model_output”.

  • model_checkpoint_path (Optional[str]) – Path to save model checkpoints. Defaults to None (don’t save checkpoints).

  • verbose (bool) – Whether to show progress bar. Defaults to False.

  • bias (bool) – Whether to use bias. Defaults to False.
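
As a rough illustration of what a linear adapter with an optional bias term does, the sketch below applies a linear map to an embedding in plain Python. The names LinearAdapterSketch and transform are hypothetical and not part of llama_index, and the identity initialisation is an assumption made here so that an untrained adapter leaves embeddings unchanged; the adapter the engine actually builds may differ.

```python
from typing import List, Optional


class LinearAdapterSketch:
    """Hypothetical stand-in for a linear adapter: y = W x (+ b)."""

    def __init__(self, dim: int, bias: bool = False) -> None:
        # Assumption: start from the identity matrix so an untrained adapter is a no-op.
        self.weight: List[List[float]] = [
            [1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]
        # The bias vector only exists when bias=True, mirroring the `bias` flag above.
        self.bias: Optional[List[float]] = [0.0] * dim if bias else None

    def transform(self, embedding: List[float]) -> List[float]:
        # Matrix-vector product, one output component per weight row.
        out = [sum(w * x for w, x in zip(row, embedding)) for row in self.weight]
        if self.bias is not None:
            out = [o + b for o, b in zip(out, self.bias)]
        return out


adapter = LinearAdapterSketch(dim=3, bias=True)
adapted = adapter.transform([0.1, 0.2, 0.3])
```

Finetuning would then adjust the weight (and bias) entries while the underlying embedding model stays fixed.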

finetune(**train_kwargs: Any) → None

Finetune.

classmethod from_model_path(dataset: EmbeddingQAFinetuneDataset, embed_model: BaseEmbedding, model_path: str, model_cls: Optional[Type[Any]] = None, **kwargs: Any) → EmbeddingAdapterFinetuneEngine

Load from model path.

Parameters
  • dataset (EmbeddingQAFinetuneDataset) – Dataset to finetune on.

  • embed_model (BaseEmbedding) – Embedding model to finetune.

  • model_path (str) – Path to model.

  • model_cls (Optional[Type[Any]]) – Adapter model class. Defaults to None.

  • **kwargs (Any) – Additional kwargs (see __init__).

get_finetuned_model(**model_kwargs: Any) → BaseEmbedding

Get finetuned model.

smart_batching_collate(batch: List) → Tuple[Any, Any]

Smart batching collate.

pydantic model llama_index.finetuning.EmbeddingQAFinetuneDataset

Embedding QA Finetuning Dataset.

Parameters
  • queries (Dict[str, str]) – Dict id -> query.

  • corpus (Dict[str, str]) – Dict id -> string.

  • relevant_docs (Dict[str, List[str]]) – Dict query id -> list of doc ids.

JSON schema
{
   "title": "EmbeddingQAFinetuneDataset",
   "description": "Embedding QA Finetuning Dataset.\n\nArgs:\n    queries (Dict[str, str]): Dict id -> query.\n    corpus (Dict[str, str]): Dict id -> string.\n    relevant_docs (Dict[str, List[str]]): Dict query id -> list of doc ids.",
   "type": "object",
   "properties": {
      "queries": {
         "title": "Queries",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "corpus": {
         "title": "Corpus",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "relevant_docs": {
         "title": "Relevant Docs",
         "type": "object",
         "additionalProperties": {
            "type": "array",
            "items": {
               "type": "string"
            }
         }
      },
      "mode": {
         "title": "Mode",
         "default": "text",
         "type": "string"
      }
   },
   "required": [
      "queries",
      "corpus",
      "relevant_docs"
   ]
}
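
To make the schema concrete, here is a small sketch of a payload with the three required objects plus the optional mode field, together with a light structural check written in plain Python. The ids and texts are made up, and check_payload is a hypothetical helper that only mirrors the schema above; it is not a full JSON Schema validator and not part of llama_index.

```python
import json

# Illustrative payload; every id and text here is invented for the example.
payload = {
    "queries": {"q1": "What is the capital of France?"},
    "corpus": {"d1": "Paris is the capital of France."},
    "relevant_docs": {"q1": ["d1"]},
    "mode": "text",
}

REQUIRED = ("queries", "corpus", "relevant_docs")


def check_payload(data: dict) -> bool:
    """Return True if `data` has the shape described by the schema above."""
    if any(key not in data for key in REQUIRED):
        return False
    # queries and corpus map string ids to strings.
    strings_ok = all(
        isinstance(k, str) and isinstance(v, str)
        for field in ("queries", "corpus")
        for k, v in data[field].items()
    )
    # relevant_docs maps each query id to a list of string doc ids.
    docs_ok = all(
        isinstance(ids, list) and all(isinstance(i, str) for i in ids)
        for ids in data["relevant_docs"].values()
    )
    # mode is optional and defaults to "text".
    return strings_ok and docs_ok and isinstance(data.get("mode", "text"), str)


round_tripped = json.loads(json.dumps(payload))
is_valid = check_payload(round_tripped)
```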

Fields
  • corpus (Dict[str, str])

  • mode (str)

  • queries (Dict[str, str])

  • relevant_docs (Dict[str, List[str]])

field corpus: Dict[str, str] [Required]
field mode: str = 'text'
field queries: Dict[str, str] [Required]
field relevant_docs: Dict[str, List[str]] [Required]
classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) → Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) → Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) → DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_json(path: str) → EmbeddingQAFinetuneDataset

Load json.

classmethod from_orm(obj: Any) → Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) → unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) → Model
classmethod parse_obj(obj: Any) → Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) → Model
save_json(path: str) → None

Save json.

classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') → DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) → unicode
classmethod update_forward_refs(**localns: Any) → None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) → Model
property query_docid_pairs: List[Tuple[str, List[str]]]

Get query, relevant doc ids.
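
A minimal sketch of how such pairs could be derived from the queries and relevant_docs mappings, assuming the property pairs each query text with the doc ids listed under that query's id. The function name query_docid_pairs_sketch and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def query_docid_pairs_sketch(
    queries: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> List[Tuple[str, List[str]]]:
    """Pair each query text with the doc ids registered for its query id."""
    # Assumption: every query id in `queries` also appears in `relevant_docs`.
    return [(text, relevant_docs[query_id]) for query_id, text in queries.items()]


pairs = query_docid_pairs_sketch(
    queries={"q1": "What is the capital of France?"},
    relevant_docs={"q1": ["d1", "d7"]},
)
```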

class llama_index.finetuning.GradientFinetuneEngine(*, access_token: Optional[str] = None, base_model_slug: str, data_path: str, host: Optional[str] = None, learning_rate: Optional[float] = None, name: str, rank: Optional[int] = None, workspace_id: Optional[str] = None)
class llama_index.finetuning.GradientFinetuneEngine(*, access_token: Optional[str] = None, data_path: str, host: Optional[str] = None, model_adapter_id: str, workspace_id: Optional[str] = None)
finetune() → None

Finetune model.

get_finetuned_model(**model_kwargs: Any) → GradientModelAdapterLLM

Gets finetuned model.

class llama_index.finetuning.OpenAIFinetuneEngine(base_model: str, data_path: str, verbose: bool = False, start_job_id: Optional[str] = None, validate_json: bool = True)

OpenAI Finetuning Engine.

finetune() → None

Finetune model.

classmethod from_finetuning_handler(finetuning_handler: OpenAIFineTuningHandler, base_model: str, data_path: str, **kwargs: Any) → OpenAIFinetuneEngine

Initialize from finetuning handler.

Used to finetune one OpenAI model on data collected from another (e.g. gpt-3.5-turbo on GPT-4 outputs).

get_current_job() → FineTuningJob

Get current job.

get_finetuned_model(**model_kwargs: Any) → LLM

Gets finetuned model.

class llama_index.finetuning.SentenceTransformersFinetuneEngine(dataset: EmbeddingQAFinetuneDataset, model_id: str = 'BAAI/bge-small-en', model_output_path: str = 'exp_finetune', batch_size: int = 10, val_dataset: Optional[EmbeddingQAFinetuneDataset] = None, loss: Optional[Any] = None, epochs: int = 2, show_progress_bar: bool = True, evaluation_steps: int = 50, use_all_docs: bool = False)

Sentence Transformers Finetune Engine.

finetune(**train_kwargs: Any) → None

Finetune model.

get_finetuned_model(**model_kwargs: Any) → BaseEmbedding

Gets finetuned model.
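
The three calls documented for this engine can be chained into a short workflow. The sketch below uses only the signatures listed on this page and assumes that llama_index and its finetuning dependencies are installed and that a dataset was previously written with save_json. The helper name finetune_embeddings and its arguments are hypothetical.

```python
def finetune_embeddings(train_path: str, output_path: str = "exp_finetune"):
    """Hypothetical helper: load a dataset, finetune, and return the embedding model."""
    # Imported lazily so that defining this helper does not require llama_index.
    from llama_index.finetuning import (
        EmbeddingQAFinetuneDataset,
        SentenceTransformersFinetuneEngine,
    )

    # Load a dataset previously stored with EmbeddingQAFinetuneDataset.save_json.
    dataset = EmbeddingQAFinetuneDataset.from_json(train_path)

    # model_id and model_output_path are shown with their documented defaults.
    engine = SentenceTransformersFinetuneEngine(
        dataset,
        model_id="BAAI/bge-small-en",
        model_output_path=output_path,
    )
    engine.finetune()
    return engine.get_finetuned_model()
```

Calling finetune_embeddings with a dataset path (for example a hypothetical "train_dataset.json") would run the training and return a BaseEmbedding.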

llama_index.finetuning.generate_qa_embedding_pairs(nodes: List[TextNode], llm: LLM, qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."\n', num_questions_per_chunk: int = 2) → EmbeddingQAFinetuneDataset

Generate examples given a set of nodes.
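
The default prompt template contains two placeholders, {context_str} and {num_questions_per_chunk}. The sketch below shows how the template might be filled for a single node, assuming str.format-style substitution; the sample context text is made up and no LLM is called.

```python
# Default template copied from the signature above (including its stray closing quote).
QA_GENERATE_PROMPT_TMPL = (
    'Context information is below.\n\n'
    '---------------------\n{context_str}\n---------------------\n\n'
    'Given the context information and not prior knowledge.\n'
    'generate only questions based on the below query.\n\n'
    'You are a Teacher/ Professor. Your task is to setup '
    '{num_questions_per_chunk} questions for an upcoming quiz/examination. '
    'The questions should be diverse in nature across the document. '
    'Restrict the questions to the context information provided."\n'
)

# Assumption: each node's text is substituted for {context_str}.
node_text = "Paris is the capital of France."
prompt = QA_GENERATE_PROMPT_TMPL.format(
    context_str=node_text,
    num_questions_per_chunk=2,
)
```

A custom qa_generate_prompt_tmpl would presumably need to keep both placeholders so the same substitution still works.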