Evaluation#
We provide modules for both LLM-based evaluation and retrieval-based evaluation. The available evaluation modules are listed below.
- class llama_index.evaluation.AnswerConsistencyBinaryEvaluator(openai_service: Optional[Any] = None)#
Tonic Validate’s answer consistency binary metric.
The output score is a float that is either 0.0 or 1.0.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
openai_service (OpenAIService) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
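Below is a minimal usage sketch (the query, response, and context strings are hypothetical; it assumes the tonic-validate package is installed and an OpenAI API key is configured, since the default evaluator calls OpenAI):

```python
from llama_index.evaluation import AnswerConsistencyBinaryEvaluator

# Assumes `tonic-validate` is installed and OPENAI_API_KEY is set.
evaluator = AnswerConsistencyBinaryEvaluator()

result = evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    contexts=["Paris is the capital and most populous city of France."],
)
print(result.score)  # 0.0 or 1.0
```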
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.AnswerConsistencyEvaluator(openai_service: Optional[Any] = None)#
Tonic Validate’s answer consistency metric.
The output score is a float between 0.0 and 1.0.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
openai_service (OpenAIService) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.AnswerRelevancyEvaluator(service_context: llama_index.service_context.ServiceContext | None = None, raise_error: bool = False, eval_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, score_threshold: float = 2.0, parser_function: ~typing.Callable[[str], ~typing.Tuple[~typing.Optional[float], ~typing.Optional[str]]] = <function _default_parser_function>)#
Answer relevancy evaluator.
Evaluates the relevancy of the response to a query. This evaluator considers the query string and response string.
- Parameters
service_context (Optional[ServiceContext]) – The service context to use for evaluation.
raise_error (Optional[bool]) – Whether to raise an error if the response is invalid. Defaults to False.
eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.
score_threshold (float) – Numerical threshold for passing the evaluation. Defaults to 2.0.
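A hedged usage sketch (the model choice and strings are illustrative; assumes OPENAI_API_KEY is set):

```python
from llama_index import ServiceContext
from llama_index.evaluation import AnswerRelevancyEvaluator
from llama_index.llms import OpenAI

# The LLM used as the judge; "gpt-4" here is illustrative.
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
evaluator = AnswerRelevancyEvaluator(service_context=service_context)

result = evaluator.evaluate(
    query="What does the report say about revenue growth?",
    response="Revenue grew 12% year over year, driven by subscriptions.",
)
print(result.passing, result.score, result.feedback)
```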
- async aevaluate(query: str | None = None, response: str | None = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Evaluate whether the response is relevant to the query.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.AnswerSimilarityEvaluator(openai_service: Optional[Any] = None)#
Tonic Validate’s answer similarity metric.
The output score is a float between 0.0 and 5.0.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
openai_service (OpenAIService) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
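A minimal async sketch (strings are hypothetical; assumes the tonic-validate package is installed and OPENAI_API_KEY is set). Note that this metric compares the response against a reference_response:

```python
import asyncio

from llama_index.evaluation import AnswerSimilarityEvaluator

# Assumes `tonic-validate` is installed and OPENAI_API_KEY is set.
evaluator = AnswerSimilarityEvaluator()

result = asyncio.run(
    evaluator.aevaluate(
        query="When was the company founded?",
        response="The company was founded in 2004.",
        contexts=["Founded in 2004, the company is headquartered in Berlin."],
        reference_response="It was founded in 2004.",
    )
)
print(result.score)  # float between 0.0 and 5.0
```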
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, reference_response: Optional[str] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.AugmentationAccuracyEvaluator(openai_service: Optional[Any] = None)#
Tonic Validate’s augmentation accuracy metric.
The output score is a float between 0.0 and 1.0.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
openai_service (OpenAIService) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.AugmentationPrecisionEvaluator(openai_service: Optional[Any] = None)#
Tonic Validate’s augmentation precision metric.
The output score is a float between 0.0 and 1.0.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
openai_service (OpenAIService) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.BaseEvaluator#
Base Evaluator class.
- abstract async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
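To illustrate how the base class is typically extended, here is a hedged sketch of a custom evaluator: the keyword-overlap heuristic is purely illustrative, and the prompt-mixin hooks are stubbed out because this toy evaluator uses no prompts (which hooks are required may vary by version):

```python
from typing import Any, Optional, Sequence

from llama_index.evaluation import BaseEvaluator, EvaluationResult


class KeywordOverlapEvaluator(BaseEvaluator):
    """Toy evaluator: passes if any context word appears in the response."""

    # Prompt-mixin hooks; this toy evaluator defines no prompts.
    def _get_prompts(self):
        return {}

    def _get_prompt_modules(self):
        return {}

    def _update_prompts(self, prompts_dict):
        pass

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Count word overlap between the contexts and the response.
        context_words = set(" ".join(contexts or []).lower().split())
        response_words = set((response or "").lower().split())
        overlap = len(context_words & response_words)
        return EvaluationResult(
            query=query,
            response=response,
            contexts=contexts,
            passing=overlap > 0,
            score=float(overlap),
        )
```

The synchronous evaluate and evaluate_response wrappers documented above are inherited from the base class, so only aevaluate needs to be implemented.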
- pydantic model llama_index.evaluation.BaseRetrievalEvaluator#
Base Retrieval Evaluator class.
JSON schema:
{ "title": "BaseRetrievalEvaluator", "description": "Base Retrieval Evaluator class.", "type": "object", "properties": { "metrics": { "title": "Metrics", "description": "List of metrics to evaluate", "type": "array", "items": { "$ref": "#/definitions/BaseRetrievalMetric" } } }, "required": [ "metrics" ], "definitions": { "BaseRetrievalMetric": { "title": "BaseRetrievalMetric", "description": "Base class for retrieval metrics.", "type": "object", "properties": { "metric_name": { "title": "Metric Name", "type": "string" } }, "required": [ "metric_name" ] } } }
- Config
arbitrary_types_allowed: bool = True
- Fields
metrics (List[llama_index.evaluation.retrieval.metrics_base.BaseRetrievalMetric])
- field metrics: List[BaseRetrievalMetric] [Required]#
List of metrics to evaluate
- async aevaluate(query: str, expected_ids: List[str], expected_texts: Optional[List[str]] = None, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_dataset(dataset: EmbeddingQAFinetuneDataset, workers: int = 2, show_progress: bool = False, **kwargs: Any) List[RetrievalEvalResult] #
Run evaluation with dataset.
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- evaluate(query: str, expected_ids: List[str], expected_texts: Optional[List[str]] = None, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult #
Run evaluation with query string and expected ids.
- Parameters
query (str) – Query string
expected_ids (List[str]) – Expected ids
- Returns
Evaluation result
- Return type
RetrievalEvalResult
- classmethod from_metric_names(metric_names: List[str], **kwargs: Any) BaseRetrievalEvaluator #
Create evaluator from metric names.
- Parameters
metric_names (List[str]) – List of metric names
**kwargs – Additional arguments for the evaluator
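A hedged sketch of creating a concrete retrieval evaluator from metric names and running it over a dataset. RetrieverEvaluator is assumed to be a concrete subclass exported by llama_index.evaluation; the data directory, node ids, and dataset path are illustrative:

```python
import asyncio

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import (
    EmbeddingQAFinetuneDataset,
    RetrieverEvaluator,  # concrete subclass of BaseRetrievalEvaluator
)

# Build a retriever to evaluate (assumes ./data exists and OPENAI_API_KEY is set).
documents = SimpleDirectoryReader("./data").load_data()
retriever = VectorStoreIndex.from_documents(documents).as_retriever(similarity_top_k=2)

evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Single-query evaluation against known-relevant node ids (ids are illustrative).
result = evaluator.evaluate(query="What is X?", expected_ids=["node_id_1"])
print(result)

# Batch evaluation over a labelled dataset (path is illustrative).
dataset = EmbeddingQAFinetuneDataset.from_json("qa_dataset.json")
eval_results = asyncio.run(evaluator.aevaluate_dataset(dataset, workers=4))
```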
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- class llama_index.evaluation.BatchEvalRunner(evaluators: Dict[str, BaseEvaluator], workers: int = 2, show_progress: bool = False)#
Batch evaluation runner.
- Parameters
evaluators (Dict[str, BaseEvaluator]) – Dictionary of evaluators.
workers (int) – Number of workers to use for parallelization. Defaults to 2.
show_progress (bool) – Whether to show progress bars. Defaults to False.
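A hedged end-to-end sketch (the data directory, evaluator selection, and queries are illustrative; assumes OPENAI_API_KEY is set):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import (
    AnswerRelevancyEvaluator,
    BatchEvalRunner,
    FaithfulnessEvaluator,
)

service_context = ServiceContext.from_defaults()
runner = BatchEvalRunner(
    evaluators={
        "faithfulness": FaithfulnessEvaluator(service_context=service_context),
        "relevancy": AnswerRelevancyEvaluator(service_context=service_context),
    },
    workers=4,
    show_progress=True,
)

# Query engine to evaluate (assumes ./data exists).
documents = SimpleDirectoryReader("./data").load_data()
query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()

results = runner.evaluate_queries(
    query_engine,
    queries=["What is the main topic?", "Who are the key authors?"],
)
# Results are keyed by evaluator name, one EvaluationResult per query.
print(results["faithfulness"][0].passing)
```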
- async aevaluate_queries(query_engine: BaseQueryEngine, queries: Optional[List[str]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]] #
Evaluate queries.
- Parameters
query_engine (BaseQueryEngine) – Query engine.
queries (Optional[List[str]]) – List of query strings. Defaults to None.
**eval_kwargs_lists (Dict[str, Any]) – Dict of lists of kwargs to pass to evaluator. Defaults to None.
- async aevaluate_response_strs(queries: Optional[List[str]] = None, response_strs: Optional[List[str]] = None, contexts_list: Optional[List[List[str]]] = None, **eval_kwargs_lists: List) Dict[str, List[EvaluationResult]] #
Evaluate query, response pairs.
This evaluates queries, responses, and contexts as string inputs. Additional kwargs can be supplied to the evaluator via eval_kwargs_lists.
- Parameters
queries (Optional[List[str]]) – List of query strings. Defaults to None.
response_strs (Optional[List[str]]) – List of response strings. Defaults to None.
contexts_list (Optional[List[List[str]]]) – List of context lists. Defaults to None.
**eval_kwargs_lists (Dict[str, Any]) – Dict of lists of kwargs to pass to evaluator. Defaults to None.
- async aevaluate_responses(queries: Optional[List[str]] = None, responses: Optional[List[Response]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]] #
Evaluate query, response pairs.
This evaluates queries and response objects.
- Parameters
queries (Optional[List[str]]) – List of query strings. Defaults to None.
responses (Optional[List[Response]]) – List of response objects. Defaults to None.
**eval_kwargs_lists (Dict[str, Any]) – Dict of lists of kwargs to pass to evaluator. Defaults to None.
- evaluate_queries(query_engine: BaseQueryEngine, queries: Optional[List[str]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]] #
Evaluate queries.
Sync version of aevaluate_queries.
- evaluate_response_strs(queries: Optional[List[str]] = None, response_strs: Optional[List[str]] = None, contexts_list: Optional[List[List[str]]] = None, **eval_kwargs_lists: List) Dict[str, List[EvaluationResult]] #
Evaluate query, response pairs.
Sync version of aevaluate_response_strs.
- evaluate_responses(queries: Optional[List[str]] = None, responses: Optional[List[Response]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]] #
Evaluate query, response objs.
Sync version of aevaluate_responses.
- class llama_index.evaluation.ContextRelevancyEvaluator(service_context: llama_index.service_context.ServiceContext | None = None, raise_error: bool = False, eval_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, refine_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, score_threshold: float = 4.0, parser_function: ~typing.Callable[[str], ~typing.Tuple[~typing.Optional[float], ~typing.Optional[str]]] = <function _default_parser_function>)#
Context relevancy evaluator.
Evaluates the relevancy of retrieved contexts to a query. This evaluator considers the query string and retrieved contexts.
- Parameters
service_context (Optional[ServiceContext]) – The service context to use for evaluation.
raise_error (Optional[bool]) – Whether to raise an error if the response is invalid. Defaults to False.
eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.
refine_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for refinement.
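A brief sketch of scoring retrieved contexts against a query (no response string is needed; the strings are hypothetical and OPENAI_API_KEY is assumed to be set):

```python
from llama_index.evaluation import ContextRelevancyEvaluator

evaluator = ContextRelevancyEvaluator()

result = evaluator.evaluate(
    query="What were the quarterly earnings?",
    contexts=[
        "Q3 earnings rose 8% to $1.2B on strong cloud demand.",
        "The company also announced a new office opening in Austin.",
    ],
)
print(result.score, result.feedback)
```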
- async aevaluate(query: str | None = None, response: str | None = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Evaluate whether the contexts are relevant to the query.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.CorrectnessEvaluator(service_context: ~typing.Optional[~llama_index.service_context.ServiceContext] = None, eval_template: ~typing.Optional[~typing.Union[~llama_index.prompts.base.BasePromptTemplate, str]] = None, score_threshold: float = 4.0, parser_function: ~typing.Callable[[str], ~typing.Tuple[~typing.Optional[float], ~typing.Optional[str]]] = <function default_parser>)#
Correctness evaluator.
Evaluates the correctness of a question answering system. This evaluator requires a reference answer to be provided, in addition to the query string and response string.
It outputs a score between 1 and 5, where 1 is the worst and 5 is the best, along with a reasoning for the score. Passing is defined as a score greater than or equal to the given threshold.
- Parameters
service_context (Optional[ServiceContext]) – Service context.
eval_template (Optional[Union[BasePromptTemplate, str]]) – Template for the evaluation prompt.
score_threshold (float) – Numerical threshold for passing the evaluation, defaults to 4.0.
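A minimal sketch showing the reference-based workflow (query, response, and reference strings are hypothetical; assumes OPENAI_API_KEY is set):

```python
from llama_index.evaluation import CorrectnessEvaluator

evaluator = CorrectnessEvaluator(score_threshold=4.0)

result = evaluator.evaluate(
    query="How many moons does Mars have?",
    response="Mars has two moons, Phobos and Deimos.",
    reference="Mars has two moons: Phobos and Deimos.",
)
print(result.score)    # 1.0-5.0
print(result.passing)  # True if score >= score_threshold
```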
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, reference: Optional[str] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.DatasetGenerator(*args, **kwargs)#
Generate a dataset (questions or question-answer pairs) based on the given documents.
NOTE: this is a beta feature, subject to change!
- Parameters
nodes (List[Node]) – List of nodes. (Optional)
service_context (ServiceContext) – Service Context.
num_questions_per_chunk – Number of questions to generate per chunk. Each document is split into chunks of 512 words.
text_question_template – Question generation template.
question_gen_query – Question generation query.
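A short sketch of generating an evaluation dataset from local documents (the data directory and counts are illustrative; assumes OPENAI_API_KEY is set):

```python
from llama_index import SimpleDirectoryReader
from llama_index.evaluation import DatasetGenerator

# Assumes ./data exists.
documents = SimpleDirectoryReader("./data").load_data()

generator = DatasetGenerator.from_documents(
    documents,
    num_questions_per_chunk=3,
)

questions = generator.generate_questions_from_nodes(num=20)   # List[str]
dataset = generator.generate_dataset_from_nodes(num=20)       # QueryResponseDataset
```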
- async agenerate_dataset_from_nodes(num: int | None = None) QueryResponseDataset #
Generate a question/response dataset from the nodes (async).
- async agenerate_questions_from_nodes(num: int | None = None) List[str] #
Generate questions from the nodes (async).
- classmethod from_documents(documents: List[Document], service_context: llama_index.service_context.ServiceContext | None = None, num_questions_per_chunk: int = 10, text_question_template: llama_index.prompts.base.BasePromptTemplate | None = None, text_qa_template: llama_index.prompts.base.BasePromptTemplate | None = None, question_gen_query: str | None = None, required_keywords: Optional[List[str]] = None, exclude_keywords: Optional[List[str]] = None, show_progress: bool = False) DatasetGenerator #
Generate dataset from documents.
- generate_dataset_from_nodes(num: int | None = None) QueryResponseDataset #
Generate a question/response dataset from the nodes.
- generate_questions_from_nodes(num: int | None = None) List[str] #
Generate questions from the nodes.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- pydantic model llama_index.evaluation.EmbeddingQAFinetuneDataset#
Embedding QA Finetuning Dataset.
- Parameters
queries (Dict[str, str]) – Dict id -> query.
corpus (Dict[str, str]) – Dict id -> string.
relevant_docs (Dict[str, List[str]]) – Dict query id -> list of doc ids.
JSON schema:
{ "title": "EmbeddingQAFinetuneDataset", "description": "Embedding QA Finetuning Dataset.\n\nArgs:\n queries (Dict[str, str]): Dict id -> query.\n corpus (Dict[str, str]): Dict id -> string.\n relevant_docs (Dict[str, List[str]]): Dict query id -> list of doc ids.", "type": "object", "properties": { "queries": { "title": "Queries", "type": "object", "additionalProperties": { "type": "string" } }, "corpus": { "title": "Corpus", "type": "object", "additionalProperties": { "type": "string" } }, "relevant_docs": { "title": "Relevant Docs", "type": "object", "additionalProperties": { "type": "array", "items": { "type": "string" } } }, "mode": { "title": "Mode", "default": "text", "type": "string" } }, "required": [ "queries", "corpus", "relevant_docs" ] }
- Fields
corpus (Dict[str, str])
mode (str)
queries (Dict[str, str])
relevant_docs (Dict[str, List[str]])
- field corpus: Dict[str, str] [Required]#
- field mode: str = 'text'#
- field queries: Dict[str, str] [Required]#
- field relevant_docs: Dict[str, List[str]] [Required]#
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_json(path: str) EmbeddingQAFinetuneDataset #
Load json.
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- save_json(path: str) None #
Save json.
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- property query_docid_pairs: List[Tuple[str, List[str]]]#
Get query, relevant doc ids.
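A small sketch of constructing, saving, and reloading a dataset (the ids, text, and file path are illustrative):

```python
from llama_index.evaluation import EmbeddingQAFinetuneDataset

dataset = EmbeddingQAFinetuneDataset(
    queries={"q1": "What is the capital of France?"},
    corpus={"doc1": "Paris is the capital of France."},
    relevant_docs={"q1": ["doc1"]},
)

dataset.save_json("qa_dataset.json")
reloaded = EmbeddingQAFinetuneDataset.from_json("qa_dataset.json")
print(reloaded.query_docid_pairs)  # list of (query, relevant doc ids) pairs
```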
- pydantic model llama_index.evaluation.EvaluationResult#
Evaluation result.
Output of a BaseEvaluator.
JSON schema:
{ "title": "EvaluationResult", "description": "Evaluation result.\n\nOutput of an BaseEvaluator.", "type": "object", "properties": { "query": { "title": "Query", "description": "Query string", "type": "string" }, "contexts": { "title": "Contexts", "description": "Context strings", "type": "array", "items": { "type": "string" } }, "response": { "title": "Response", "description": "Response string", "type": "string" }, "passing": { "title": "Passing", "description": "Binary evaluation result (passing or not)", "type": "boolean" }, "feedback": { "title": "Feedback", "description": "Feedback or reasoning for the response", "type": "string" }, "score": { "title": "Score", "description": "Score for the response", "type": "number" }, "pairwise_source": { "title": "Pairwise Source", "description": "Used only for pairwise and specifies whether it is from original order of presented answers or flipped order", "type": "string" }, "invalid_result": { "title": "Invalid Result", "description": "Whether the evaluation result is an invalid one.", "default": false, "type": "boolean" }, "invalid_reason": { "title": "Invalid Reason", "description": "Reason for invalid evaluation.", "type": "string" } } }
- Fields
contexts (Optional[Sequence[str]])
feedback (Optional[str])
invalid_reason (Optional[str])
invalid_result (bool)
pairwise_source (Optional[str])
passing (Optional[bool])
query (Optional[str])
response (Optional[str])
score (Optional[float])
- field contexts: Optional[Sequence[str]] = None#
Context strings
- field feedback: Optional[str] = None#
Feedback or reasoning for the response
- field invalid_reason: Optional[str] = None#
Reason for invalid evaluation.
- field invalid_result: bool = False#
Whether the evaluation result is an invalid one.
- field pairwise_source: Optional[str] = None#
Used only for pairwise and specifies whether it is from original order of presented answers or flipped order
- field passing: Optional[bool] = None#
Binary evaluation result (passing or not)
- field query: Optional[str] = None#
Query string
- field response: Optional[str] = None#
Response string
- field score: Optional[float] = None#
Score for the response
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- class llama_index.evaluation.FaithfulnessEvaluator(service_context: llama_index.service_context.ServiceContext | None = None, raise_error: bool = False, eval_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, refine_template: str | llama_index.prompts.base.BasePromptTemplate | None = None)#
Faithfulness evaluator.
Evaluates whether a response is faithful to the contexts (i.e., whether the response is supported by the contexts rather than hallucinated).
This evaluator only considers the response string and the list of context strings.
- Parameters
service_context (Optional[ServiceContext]) – The service context to use for evaluation.
raise_error (bool) – Whether to raise an error when the response is invalid. Defaults to False.
eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.
refine_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for refining the evaluation.
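A minimal sketch (the response and context strings are hypothetical; assumes OPENAI_API_KEY is set). Only the response and contexts are judged:

```python
from llama_index.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()

result = evaluator.evaluate(
    query="What is the warranty period?",
    response="The warranty lasts 24 months.",
    contexts=["All devices ship with a 24-month limited warranty."],
)
print(result.passing)  # True if the response is supported by the contexts
```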
- async aevaluate(query: str | None = None, response: str | None = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Evaluate whether the response is faithful to the contexts.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- class llama_index.evaluation.GuidelineEvaluator(service_context: Optional[ServiceContext] = None, guidelines: Optional[str] = None, eval_template: Optional[Union[BasePromptTemplate, str]] = None)#
Guideline evaluator.
Evaluates whether a query and response pair passes the given guidelines.
This evaluator only considers the query string and the response string.
- Parameters
service_context (Optional[ServiceContext]) – The service context to use for evaluation.
guidelines (Optional[str]) – User-added guidelines to use for evaluation. Defaults to None, which uses the default guidelines.
eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.
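A hedged sketch with a custom guidelines string (the guidelines, query, and response are illustrative; assumes OPENAI_API_KEY is set):

```python
from llama_index.evaluation import GuidelineEvaluator

evaluator = GuidelineEvaluator(
    guidelines=(
        "The response should directly answer the question, "
        "cite specific numbers when available, and avoid speculation."
    )
)

result = evaluator.evaluate(
    query="How much did operating costs change?",
    response="Costs went down a bit, probably because of staffing.",
)
print(result.passing, result.feedback)
```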
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Evaluate whether the query and response pair passes the guidelines.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a dictionary of prompts.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
- pydantic model llama_index.evaluation.HitRate#
Hit rate metric.
JSON schema:
{ "title": "HitRate", "description": "Hit rate metric.", "type": "object", "properties": { "metric_name": { "title": "Metric Name", "default": "hit_rate", "type": "string" } } }
- Config
arbitrary_types_allowed: bool = True
- Fields
metric_name (str)
- field metric_name: str = 'hit_rate'#
- compute(query: Optional[str] = None, expected_ids: Optional[List[str]] = None, retrieved_ids: Optional[List[str]] = None, expected_texts: Optional[List[str]] = None, retrieved_texts: Optional[List[str]] = None, **kwargs: Any) RetrievalMetricResult #
Compute metric.
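For illustration, a small sketch of computing the metric directly from id lists (the MRR metric documented later in this section exposes the same compute interface; the ids are hypothetical):

```python
from llama_index.evaluation import MRR, HitRate

expected_ids = ["node_a", "node_b"]
retrieved_ids = ["node_c", "node_a", "node_d"]

hit = HitRate().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
mrr = MRR().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)

print(hit.score)  # 1.0 -- at least one expected id was retrieved
print(mrr.score)  # 0.5 -- the first relevant id appears at rank 2
```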
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- llama_index.evaluation.LabelledQADataset#
alias of EmbeddingQAFinetuneDataset
- pydantic model llama_index.evaluation.MRR#
MRR metric.
JSON schema:
{ "title": "MRR", "description": "MRR metric.", "type": "object", "properties": { "metric_name": { "title": "Metric Name", "default": "mrr", "type": "string" } } }
- Config
arbitrary_types_allowed: bool = True
- Fields
metric_name (str)
- field metric_name: str = 'mrr'#
- compute(query: Optional[str] = None, expected_ids: Optional[List[str]] = None, retrieved_ids: Optional[List[str]] = None, expected_texts: Optional[List[str]] = None, retrieved_texts: Optional[List[str]] = None, **kwargs: Any) RetrievalMetricResult #
Compute metric.
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- pydantic model llama_index.evaluation.MultiModalRetrieverEvaluator#
Multi-modal retriever evaluator.
This class evaluates a retriever using a set of metrics.
- Parameters
metrics (List[BaseRetrievalMetric]) – Sequence of metrics to evaluate
retriever – Retriever to evaluate.
node_postprocessors (Optional[List[BaseNodePostprocessor]]) – Post-processor to apply after retrieval.
JSON schema:
{ "title": "MultiModalRetrieverEvaluator", "description": "Retriever evaluator.\n\nThis module will evaluate a retriever using a set of metrics.\n\nArgs:\n metrics (List[BaseRetrievalMetric]): Sequence of metrics to evaluate\n retriever: Retriever to evaluate.\n node_postprocessors (Optional[List[BaseNodePostprocessor]]): Post-processor to apply after retrieval.", "type": "object", "properties": { "metrics": { "title": "Metrics", "description": "List of metrics to evaluate", "type": "array", "items": { "$ref": "#/definitions/BaseRetrievalMetric" } }, "retriever": { "title": "Retriever" }, "node_postprocessors": { "title": "Node Postprocessors" } }, "required": [ "metrics" ], "definitions": { "BaseRetrievalMetric": { "title": "BaseRetrievalMetric", "description": "Base class for retrieval metrics.", "type": "object", "properties": { "metric_name": { "title": "Metric Name", "type": "string" } }, "required": [ "metric_name" ] } } }
- Config
arbitrary_types_allowed: bool = True
- Fields
metrics (List[llama_index.evaluation.retrieval.metrics_base.BaseRetrievalMetric])
node_postprocessors (Optional[List[llama_index.postprocessor.types.BaseNodePostprocessor]])
retriever (llama_index.core.base_retriever.BaseRetriever)
- field metrics: List[BaseRetrievalMetric] [Required]#
List of metrics to evaluate
- field node_postprocessors: Optional[List[BaseNodePostprocessor]] = None#
Optional post-processor
- field retriever: BaseRetriever [Required]#
Retriever to evaluate
- async aevaluate(query: str, expected_ids: List[str], expected_texts: Optional[List[str]] = None, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_dataset(dataset: EmbeddingQAFinetuneDataset, workers: int = 2, show_progress: bool = False, **kwargs: Any) List[RetrievalEvalResult] #
Run evaluation with dataset.
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- evaluate(query: str, expected_ids: List[str], expected_texts: Optional[List[str]] = None, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult #
Run evaluation with query string and expected ids.
- Parameters
query (str) – Query string
expected_ids (List[str]) – Expected ids
- Returns
Evaluation result
- Return type
RetrievalEvalResult
- classmethod from_metric_names(metric_names: List[str], **kwargs: Any) BaseRetrievalEvaluator #
Create evaluator from metric names.
- Parameters
metric_names (List[str]) – List of metric names
**kwargs – Additional arguments for the evaluator
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- class llama_index.evaluation.PairwiseComparisonEvaluator(service_context: ~typing.Optional[~llama_index.service_context.ServiceContext] = None, eval_template: ~typing.Optional[~typing.Union[~llama_index.prompts.base.BasePromptTemplate, str]] = None, parser_function: ~typing.Callable[[str], ~typing.Tuple[~typing.Optional[bool], ~typing.Optional[float], ~typing.Optional[str]]] = <function _default_parser_function>, enforce_consensus: bool = True)#
Pairwise comparison evaluator.
Evaluates the quality of a response vs. a “reference” response given a question by having an LLM judge which response is better.
Outputs whether the response given is better than the reference response.
- Parameters
service_context (Optional[ServiceContext]) – The service context to use for evaluation.
eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.
enforce_consensus (bool) – Whether to enforce consensus (consistency if we flip the order of the answers). Defaults to True.
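A hedged sketch comparing two candidate answers to the same query (the query and both responses are illustrative; assumes OPENAI_API_KEY is set):

```python
from llama_index.evaluation import PairwiseComparisonEvaluator

evaluator = PairwiseComparisonEvaluator()

result = evaluator.evaluate(
    query="Summarize the refund policy.",
    response="Refunds are available within 30 days with a receipt.",
    second_response="You can maybe get a refund sometimes.",
)
print(result.passing)          # whether `response` was judged better than `second_response`
print(result.pairwise_source)  # which answer ordering (original vs. flipped) the verdict came from
```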
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, second_response: Optional[str] = None, reference: Optional[str] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a prompt.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
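Example (illustrative sketch, not part of the generated reference): a minimal pairwise comparison using PairwiseComparisonEvaluator. It assumes the legacy llama_index layout shown above (ServiceContext) and an OPENAI_API_KEY in the environment; the query and the two candidate answers are placeholders, and extra keywords such as second_response are forwarded from evaluate() to aevaluate().

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import PairwiseComparisonEvaluator

# GPT-4 as the judge; enforce_consensus=True (the default) re-runs the
# judgement with the answer order flipped and only passes on agreement.
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
evaluator = PairwiseComparisonEvaluator(service_context=service_context)

result = evaluator.evaluate(
    query="What is the capital of France?",
    response="Paris is the capital of France.",        # candidate answer
    second_response="The capital of France is Lyon.",  # competing/reference answer
)
print(result.passing)   # whether `response` was judged better than `second_response`
print(result.feedback)  # the judge's reasoning
```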
- pydantic model llama_index.evaluation.QueryResponseDataset#
Query Response Dataset.
The response can be empty if the dataset is generated from documents.
- Parameters
queries (Dict[str, str]) – Query id -> query.
responses (Dict[str, str]) – Query id -> response.
JSON schema:
{ "title": "QueryResponseDataset", "description": "Query Response Dataset.\n\nThe response can be empty if the dataset is generated from documents.\n\nArgs:\n queries (Dict[str, str]): Query id -> query.\n responses (Dict[str, str]): Query id -> response.", "type": "object", "properties": { "queries": { "title": "Queries", "description": "Query id -> query", "type": "object", "additionalProperties": { "type": "string" } }, "responses": { "title": "Responses", "description": "Query id -> response", "type": "object", "additionalProperties": { "type": "string" } } } }
- Fields
queries (Dict[str, str])
responses (Dict[str, str])
- field queries: Dict[str, str] [Optional]#
Query id -> query
- field responses: Dict[str, str] [Optional]#
Query id -> response
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_json(path: str) QueryResponseDataset #
Load json.
- classmethod from_orm(obj: Any) Model #
- classmethod from_qr_pairs(qr_pairs: List[Tuple[str, str]]) QueryResponseDataset #
Create from qr pairs.
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- save_json(path: str) None #
Save json.
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- property qr_pairs: List[Tuple[str, str]]#
Get pairs.
- property questions: List[str]#
Get questions.
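Example (illustrative sketch): building, saving, and reloading a QueryResponseDataset from query/response pairs. The pairs and the file path are placeholders.

```python
from llama_index.evaluation import QueryResponseDataset

# Build a dataset from (query, response) tuples.
dataset = QueryResponseDataset.from_qr_pairs(
    [
        ("What does the report cover?", "It covers Q3 revenue and churn."),
        ("Who wrote the report?", "The finance team."),
    ]
)

dataset.save_json("qr_dataset.json")                  # persist to disk
reloaded = QueryResponseDataset.from_json("qr_dataset.json")

print(reloaded.questions)  # list of query strings
print(reloaded.qr_pairs)   # list of (query, response) tuples
```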
- llama_index.evaluation.QueryResponseEvaluator#
alias of RelevancyEvaluator
- class llama_index.evaluation.RelevancyEvaluator(service_context: llama_index.service_context.ServiceContext | None = None, raise_error: bool = False, eval_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, refine_template: str | llama_index.prompts.base.BasePromptTemplate | None = None)#
Relevancy evaluator.
Evaluates the relevancy of retrieved contexts and response to a query. This evaluator considers the query string, retrieved contexts, and response string.
- Parameters
service_context (Optional[ServiceContext]) – The service context to use for evaluation.
raise_error (Optional[bool]) – Whether to raise an error if the response is invalid. Defaults to False.
eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.
refine_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for refinement.
- async aevaluate(query: str | None = None, response: str | None = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult #
Evaluate whether the contexts and response are relevant to the query.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a prompt.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
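Example (illustrative sketch): running RelevancyEvaluator on raw strings, assuming the legacy ServiceContext setup and an OPENAI_API_KEY. The query, contexts, and response are placeholders; with a real query engine you would typically call evaluate_response on the returned Response object instead.

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import RelevancyEvaluator

service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
evaluator = RelevancyEvaluator(service_context=service_context)

result = evaluator.evaluate(
    query="When was the company founded?",
    contexts=["The company was founded in 1998 in Helsinki."],
    response="It was founded in 1998.",
)
print(result.passing)   # True if the contexts and response are relevant to the query
print(result.feedback)  # the LLM judge's explanation
```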
- llama_index.evaluation.ResponseEvaluator#
alias of FaithfulnessEvaluator
- pydantic model llama_index.evaluation.RetrievalEvalResult#
Retrieval eval result.
NOTE: this abstraction might change in the future.
- query#
Query string
- Type
str
- expected_ids#
Expected ids
- Type
List[str]
- retrieved_ids#
Retrieved ids
- Type
List[str]
- metric_dict#
Metric dictionary for the evaluation
- Type
Dict[str, BaseRetrievalMetric]
JSON schema:
{ "title": "RetrievalEvalResult", "description": "Retrieval eval result.\n\nNOTE: this abstraction might change in the future.\n\nAttributes:\n query (str): Query string\n expected_ids (List[str]): Expected ids\n retrieved_ids (List[str]): Retrieved ids\n metric_dict (Dict[str, BaseRetrievalMetric]): Metric dictionary for the evaluation", "type": "object", "properties": { "query": { "title": "Query", "description": "Query string", "type": "string" }, "expected_ids": { "title": "Expected Ids", "description": "Expected ids", "type": "array", "items": { "type": "string" } }, "expected_texts": { "title": "Expected Texts", "description": "Expected texts associated with nodes provided in `expected_ids`", "type": "array", "items": { "type": "string" } }, "retrieved_ids": { "title": "Retrieved Ids", "description": "Retrieved ids", "type": "array", "items": { "type": "string" } }, "retrieved_texts": { "title": "Retrieved Texts", "description": "Retrieved texts", "type": "array", "items": { "type": "string" } }, "mode": { "description": "text or image", "default": "text", "allOf": [ { "$ref": "#/definitions/RetrievalEvalMode" } ] }, "metric_dict": { "title": "Metric Dict", "description": "Metric dictionary for the evaluation", "type": "object", "additionalProperties": { "$ref": "#/definitions/RetrievalMetricResult" } } }, "required": [ "query", "expected_ids", "retrieved_ids", "retrieved_texts", "metric_dict" ], "definitions": { "RetrievalEvalMode": { "title": "RetrievalEvalMode", "description": "Evaluation of retrieval modality.", "enum": [ "text", "image" ], "type": "string" }, "RetrievalMetricResult": { "title": "RetrievalMetricResult", "description": "Metric result.\n\nAttributes:\n score (float): Score for the metric\n metadata (Dict[str, Any]): Metadata for the metric result", "type": "object", "properties": { "score": { "title": "Score", "description": "Score for the metric", "type": "number" }, "metadata": { "title": "Metadata", "description": "Metadata for the metric result", "type": "object" } }, "required": [ "score" ] } } }
- Config
arbitrary_types_allowed: bool = True
- Fields
expected_ids (List[str])
expected_texts (Optional[List[str]])
metric_dict (Dict[str, llama_index.evaluation.retrieval.metrics_base.RetrievalMetricResult])
mode (llama_index.evaluation.retrieval.base.RetrievalEvalMode)
query (str)
retrieved_ids (List[str])
retrieved_texts (List[str])
- field expected_ids: List[str] [Required]#
Expected ids
- field expected_texts: Optional[List[str]] = None#
Expected texts associated with nodes provided in expected_ids
- field metric_dict: Dict[str, RetrievalMetricResult] [Required]#
Metric dictionary for the evaluation
- field mode: RetrievalEvalMode = RetrievalEvalMode.TEXT#
text or image
- field query: str [Required]#
Query string
- field retrieved_ids: List[str] [Required]#
Retrieved ids
- field retrieved_texts: List[str] [Required]#
Retrieved texts
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- property metric_vals_dict: Dict[str, float]#
Dictionary of metric values.
- pydantic model llama_index.evaluation.RetrievalMetricResult#
Metric result.
- score#
Score for the metric
- Type
float
- metadata#
Metadata for the metric result
- Type
Dict[str, Any]
JSON schema:
{ "title": "RetrievalMetricResult", "description": "Metric result.\n\nAttributes:\n score (float): Score for the metric\n metadata (Dict[str, Any]): Metadata for the metric result", "type": "object", "properties": { "score": { "title": "Score", "description": "Score for the metric", "type": "number" }, "metadata": { "title": "Metadata", "description": "Metadata for the metric result", "type": "object" } }, "required": [ "score" ] }
- Fields
metadata (Dict[str, Any])
score (float)
- field metadata: Dict[str, Any] [Optional]#
Metadata for the metric result
- field score: float [Required]#
Score for the metric
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
- class llama_index.evaluation.RetrievalPrecisionEvaluator(openai_service: Optional[Any] = None)#
Tonic Validate’s retrieval precision metric.
The output score is a float between 0.0 and 1.0.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
openai_service (OpenAIService) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a prompt.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
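Example (illustrative sketch): scoring retrieval precision with RetrievalPrecisionEvaluator, assuming the optional tonic-validate package is installed and OPENAI_API_KEY is set; with no explicit OpenAIService the evaluator falls back to a "gpt-4" judge. The strings are placeholders, and the comment on the score reflects Tonic Validate's definition of retrieval precision (roughly, the share of retrieved context judged relevant to the question).

```python
from llama_index.evaluation import RetrievalPrecisionEvaluator

evaluator = RetrievalPrecisionEvaluator()  # defaults to a "gpt-4" judge

result = evaluator.evaluate(
    query="What is the warranty period?",
    response="The warranty period is two years.",
    contexts=[
        "All devices ship with a two-year limited warranty.",
        "Our office is open Monday through Friday.",
    ],
)
print(result.score)  # float between 0.0 and 1.0; higher means more of the context was relevant
```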
- pydantic model llama_index.evaluation.RetrieverEvaluator#
Retriever evaluator.
This module will evaluate a retriever using a set of metrics.
- Parameters
metrics (List[BaseRetrievalMetric]) – Sequence of metrics to evaluate
retriever – Retriever to evaluate.
node_postprocessors (Optional[List[BaseNodePostprocessor]]) – Post-processor to apply after retrieval.
JSON schema:
{ "title": "RetrieverEvaluator", "description": "Retriever evaluator.\n\nThis module will evaluate a retriever using a set of metrics.\n\nArgs:\n metrics (List[BaseRetrievalMetric]): Sequence of metrics to evaluate\n retriever: Retriever to evaluate.\n node_postprocessors (Optional[List[BaseNodePostprocessor]]): Post-processor to apply after retrieval.", "type": "object", "properties": { "metrics": { "title": "Metrics", "description": "List of metrics to evaluate", "type": "array", "items": { "$ref": "#/definitions/BaseRetrievalMetric" } }, "retriever": { "title": "Retriever" }, "node_postprocessors": { "title": "Node Postprocessors" } }, "required": [ "metrics" ], "definitions": { "BaseRetrievalMetric": { "title": "BaseRetrievalMetric", "description": "Base class for retrieval metrics.", "type": "object", "properties": { "metric_name": { "title": "Metric Name", "type": "string" } }, "required": [ "metric_name" ] } } }
- Config
arbitrary_types_allowed: bool = True
- Fields
metrics (List[llama_index.evaluation.retrieval.metrics_base.BaseRetrievalMetric])
node_postprocessors (Optional[List[llama_index.postprocessor.types.BaseNodePostprocessor]])
retriever (llama_index.core.base_retriever.BaseRetriever)
- field metrics: List[BaseRetrievalMetric] [Required]#
List of metrics to evaluate
- field node_postprocessors: Optional[List[BaseNodePostprocessor]] = None#
Optional post-processor
- field retriever: BaseRetriever [Required]#
Retriever to evaluate
- async aevaluate(query: str, expected_ids: List[str], expected_texts: Optional[List[str]] = None, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult #
Run evaluation with query string and expected ids.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_dataset(dataset: EmbeddingQAFinetuneDataset, workers: int = 2, show_progress: bool = False, **kwargs: Any) List[RetrievalEvalResult] #
Run evaluation with dataset.
- classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model #
Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = ‘allow’ was set since it adds all passed values
- copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model #
Duplicate a model, optionally choose which fields to include, exclude and change.
- Parameters
include – fields to include in new model
exclude – fields to exclude from new model, as with values this takes precedence over include
update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data
deep – set to True to make a deep copy of the model
- Returns
new model instance
- dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny #
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- evaluate(query: str, expected_ids: List[str], expected_texts: Optional[List[str]] = None, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult #
Run evaluation with query string and expected ids.
- Parameters
query (str) – Query string
expected_ids (List[str]) – Expected ids
- Returns
Evaluation result
- Return type
RetrievalEvalResult
- classmethod from_metric_names(metric_names: List[str], **kwargs: Any) BaseRetrievalEvaluator #
Create evaluator from metric names.
- Parameters
metric_names (List[str]) – List of metric names
**kwargs – Additional arguments for the evaluator
- classmethod from_orm(obj: Any) Model #
- json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode #
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod parse_obj(obj: Any) Model #
- classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model #
- classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny #
- classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode #
- classmethod update_forward_refs(**localns: Any) None #
Try to update ForwardRefs on fields based on this Model, globalns and localns.
- classmethod validate(value: Any) Model #
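Example (illustrative sketch): evaluating a retriever with RetrieverEvaluator.from_metric_names. The ./data directory and the expected node ids are placeholders; in practice the expected ids come from a labeled dataset such as one produced by generate_question_context_pairs.

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import RetrieverEvaluator

documents = SimpleDirectoryReader("./data").load_data()  # placeholder path
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=2)

# Build the evaluator from metric names rather than metric instances.
evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

eval_result = evaluator.evaluate(
    query="What is the refund policy?",
    expected_ids=["node_id_1", "node_id_2"],  # ids of the nodes that should be retrieved
)
print(eval_result.metric_vals_dict)  # e.g. {"mrr": 1.0, "hit_rate": 1.0}
```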
- class llama_index.evaluation.SemanticSimilarityEvaluator(service_context: Optional[ServiceContext] = None, similarity_fn: Optional[Callable[[...], float]] = None, similarity_mode: Optional[SimilarityMode] = None, similarity_threshold: float = 0.8)#
Embedding similarity evaluator.
Evaluate the quality of a question answering system by comparing the similarity between embeddings of the generated answer and the reference answer.
Inspired by the paper: Semantic Answer Similarity for Evaluating Question Answering Models.
- Parameters
service_context (Optional[ServiceContext]) – Service context.
similarity_threshold (float) – Embedding similarity threshold for “passing”. Defaults to 0.8.
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, reference: Optional[str] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a prompt.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
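Example (illustrative sketch): comparing a generated answer to a reference answer with SemanticSimilarityEvaluator. It assumes an embedding model is configured (the default ServiceContext uses OpenAI embeddings, so OPENAI_API_KEY must be set); the strings are placeholders and the reference keyword is forwarded to aevaluate().

```python
from llama_index.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(similarity_threshold=0.8)

result = evaluator.evaluate(
    response="The project shipped in March 2021.",
    reference="The project was released in March of 2021.",
)
print(result.score)    # similarity of the two answer embeddings
print(result.passing)  # True if score >= similarity_threshold
```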
- class llama_index.evaluation.TonicValidateEvaluator(metrics: Optional[List[Any]] = None, model_evaluator: str = 'gpt-4')#
Tonic Validate’s validate scorer. Calculates all of Tonic Validate’s metrics.
See https://docs.tonic.ai/validate/ for more details.
- Parameters
metrics (List[Metric]) – The metrics to use. Defaults to all of Tonic Validate’s metrics.
model_evaluator (str) – The OpenAI service to use. Specifies the chat completion model to use as the LLM evaluator. Defaults to “gpt-4”.
- async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, reference_response: Optional[str] = None, **kwargs: Any) TonicValidateEvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- async aevaluate_run(queries: List[str], responses: List[str], contexts_list: List[List[str]], reference_responses: List[str], **kwargs: Any) Any #
Evaluates a batch of responses.
Returns a Tonic Validate Run object, which can be logged to the Tonic Validate UI. See https://docs.tonic.ai/validate/ for more details.
- evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string, retrieved contexts, and generated response string.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult #
Run evaluation with query string and generated Response object.
Subclasses can override this method to provide custom evaluation logic and take in additional arguments.
- evaluate_run(queries: List[str], responses: List[str], contexts_list: List[List[str]], reference_responses: List[str], **kwargs: Any) Any #
Evaluates a batch of responses.
Returns a Tonic Validate Run object, which can be logged to the Tonic Validate UI. See https://docs.tonic.ai/validate/ for more details.
- get_prompts() Dict[str, BasePromptTemplate] #
Get a prompt.
- update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None #
Update prompts.
Other prompts will remain in place.
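Example (illustrative sketch): running all Tonic Validate metrics with TonicValidateEvaluator, assuming tonic-validate is installed and OPENAI_API_KEY is set. The strings are placeholders and the reference_response keyword is forwarded to aevaluate().

```python
from llama_index.evaluation import TonicValidateEvaluator

evaluator = TonicValidateEvaluator()  # all Tonic Validate metrics, "gpt-4" judge

result = evaluator.evaluate(
    query="How long is the trial period?",
    response="The trial period is 30 days.",
    contexts=["New accounts include a 30-day free trial."],
    reference_response="The free trial lasts 30 days.",
)
print(result.score)  # overall score; per-metric values may also be attached depending on version

# Batch scoring: evaluate_run returns a Tonic Validate Run object
# that can be logged to the Tonic Validate UI.
run = evaluator.evaluate_run(
    queries=["How long is the trial period?"],
    responses=["The trial period is 30 days."],
    contexts_list=[["New accounts include a 30-day free trial."]],
    reference_responses=["The free trial lasts 30 days."],
)
```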
- llama_index.evaluation.generate_qa_embedding_pairs(nodes: List[TextNode], llm: LLM, qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."\n', num_questions_per_chunk: int = 2) EmbeddingQAFinetuneDataset #
Generate examples given a set of nodes.
- llama_index.evaluation.generate_question_context_pairs(nodes: List[TextNode], llm: LLM, qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."\n', num_questions_per_chunk: int = 2) EmbeddingQAFinetuneDataset #
Generate examples given a set of nodes.
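Example (illustrative sketch): generating a question/context dataset from document nodes, assuming documents are loaded from a placeholder ./data directory, the SentenceSplitter node parser is available, and an OpenAI LLM is configured. The resulting EmbeddingQAFinetuneDataset can then be passed to RetrieverEvaluator.aevaluate_dataset.

```python
from llama_index import SimpleDirectoryReader
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter
from llama_index.evaluation import generate_question_context_pairs

documents = SimpleDirectoryReader("./data").load_data()  # placeholder path
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

qa_dataset = generate_question_context_pairs(
    nodes,
    llm=OpenAI(model="gpt-4"),
    num_questions_per_chunk=2,
)
print(len(qa_dataset.queries))  # generated questions keyed by question id
```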
- llama_index.evaluation.get_retrieval_results_df(names: List[str], results_arr: List[List[RetrievalEvalResult]], metric_keys: Optional[List[str]] = None) DataFrame #
Display retrieval results.
- llama_index.evaluation.resolve_metrics(metrics: List[str]) List[Type[BaseRetrievalMetric]] #
Resolve metrics from list of metric names.
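Example (illustrative sketch): tying the retrieval utilities together. resolve_metrics maps metric names to metric classes, and get_retrieval_results_df summarizes per-retriever results as a DataFrame. The dataset and the two retrievers are placeholders and assume the RetrieverEvaluator setup sketched above.

```python
from llama_index.evaluation import (
    RetrieverEvaluator,
    get_retrieval_results_df,
    resolve_metrics,
)

print(resolve_metrics(["mrr", "hit_rate"]))  # metric classes for these names

async def compare_retrievers(qa_dataset, base_retriever, rerank_retriever):
    results = {}
    for name, retriever in [("base", base_retriever), ("rerank", rerank_retriever)]:
        evaluator = RetrieverEvaluator.from_metric_names(
            ["mrr", "hit_rate"], retriever=retriever
        )
        results[name] = await evaluator.aevaluate_dataset(qa_dataset, workers=2)

    # One row per retriever with aggregated metric values.
    df = get_retrieval_results_df(
        names=list(results.keys()),
        results_arr=list(results.values()),
        metric_keys=["mrr", "hit_rate"],
    )
    print(df)

# Run inside an event loop, e.g. asyncio.run(compare_retrievers(...)), with real data.
```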