Evaluation

We have modules for both LLM-based evaluation and retrieval-based evaluation.

class llama_index.evaluation.BaseEvaluator

Base Evaluator class.

abstract async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get prompts, keyed by name.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.
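Subclasses typically override only `aevaluate`; the synchronous `evaluate` wrapper delegates to it. The shape of the contract can be sketched with plain-Python stand-ins rather than the real llama_index classes (the `KeywordOverlapEvaluator` below is a made-up toy, not part of the library):

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class EvaluationResult:
    # Plain stand-in mirroring the documented fields of EvaluationResult.
    query: Optional[str] = None
    response: Optional[str] = None
    passing: Optional[bool] = None
    score: Optional[float] = None
    feedback: Optional[str] = None

class KeywordOverlapEvaluator:
    """Toy evaluator: passes if the response mentions any query keyword."""

    def evaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs,
    ) -> EvaluationResult:
        keywords = set((query or "").lower().split())
        words = set((response or "").lower().split())
        overlap = keywords & words
        score = len(overlap) / len(keywords) if keywords else 0.0
        return EvaluationResult(
            query=query,
            response=response,
            passing=bool(overlap),
            score=score,
            feedback=f"matched keywords: {sorted(overlap)}",
        )

result = KeywordOverlapEvaluator().evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
)
```

A real subclass would implement `aevaluate` against the abstract base instead; the point here is only the query/response/contexts in, `EvaluationResult` out contract.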

pydantic model llama_index.evaluation.BaseRetrievalEvaluator

Base Retrieval Evaluator class.

JSON schema:
{
   "title": "BaseRetrievalEvaluator",
   "description": "Base Retrieval Evaluator class.",
   "type": "object",
   "properties": {
      "metrics": {
         "title": "Metrics"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • metrics (List[llama_index.evaluation.retrieval.metrics_base.BaseRetrievalMetric])

field metrics: List[BaseRetrievalMetric] [Required]

List of metrics to evaluate

async aevaluate(query: str, expected_ids: List[str], mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult

Run evaluation with query string and expected ids.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_dataset(dataset: EmbeddingQAFinetuneDataset, workers: int = 2, show_progress: bool = False, **kwargs: Any) List[RetrievalEvalResult]

Run evaluation with dataset.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

evaluate(query: str, expected_ids: List[str], mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult

Run evaluation with query string and expected ids.

Parameters
  • query (str) – Query string

  • expected_ids (List[str]) – Expected ids

Returns

Evaluation result

Return type

RetrievalEvalResult

classmethod from_metric_names(metric_names: List[str], **kwargs: Any) BaseRetrievalEvaluator

Create evaluator from metric names.

Parameters
  • metric_names (List[str]) – List of metric names

  • **kwargs – Additional arguments for the evaluator

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.evaluation.BatchEvalRunner(evaluators: Dict[str, BaseEvaluator], workers: int = 2, show_progress: bool = False)

Batch evaluation runner.

Parameters
  • evaluators (Dict[str, BaseEvaluator]) – Dictionary of evaluators.

  • workers (int) – Number of workers to use for parallelization. Defaults to 2.

  • show_progress (bool) – Whether to show progress bars. Defaults to False.

async aevaluate_queries(query_engine: BaseQueryEngine, queries: Optional[List[str]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]]

Evaluate queries.

Parameters
  • query_engine (BaseQueryEngine) – Query engine.

  • queries (Optional[List[str]]) – List of query strings. Defaults to None.

  • **eval_kwargs_lists (Dict[str, Any]) – Dict of lists of kwargs to pass to evaluator. Defaults to None.

async aevaluate_response_strs(queries: Optional[List[str]] = None, response_strs: Optional[List[str]] = None, contexts_list: Optional[List[List[str]]] = None, **eval_kwargs_lists: List) Dict[str, List[EvaluationResult]]

Evaluate query, response pairs.

This evaluates queries, responses, contexts as string inputs. Can supply additional kwargs to the evaluator in eval_kwargs_lists.

Parameters
  • queries (Optional[List[str]]) – List of query strings. Defaults to None.

  • response_strs (Optional[List[str]]) – List of response strings. Defaults to None.

  • contexts_list (Optional[List[List[str]]]) – List of context lists. Defaults to None.

  • **eval_kwargs_lists (Dict[str, Any]) – Dict of lists of kwargs to pass to evaluator. Defaults to None.

async aevaluate_responses(queries: Optional[List[str]] = None, responses: Optional[List[Response]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]]

Evaluate query, response pairs.

This evaluates queries and response objects.

Parameters
  • queries (Optional[List[str]]) – List of query strings. Defaults to None.

  • responses (Optional[List[Response]]) – List of response objects. Defaults to None.

  • **eval_kwargs_lists (Dict[str, Any]) – Dict of lists of kwargs to pass to evaluator. Defaults to None.

evaluate_queries(query_engine: BaseQueryEngine, queries: Optional[List[str]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]]

Evaluate queries.

Sync version of aevaluate_queries.

evaluate_response_strs(queries: Optional[List[str]] = None, response_strs: Optional[List[str]] = None, contexts_list: Optional[List[List[str]]] = None, **eval_kwargs_lists: List) Dict[str, List[EvaluationResult]]

Evaluate query, response pairs.

Sync version of aevaluate_response_strs.

evaluate_responses(queries: Optional[List[str]] = None, responses: Optional[List[Response]] = None, **eval_kwargs_lists: Dict[str, Any]) Dict[str, List[EvaluationResult]]

Evaluate query, response objs.

Sync version of aevaluate_responses.
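The runner fans each query/response pair out to every named evaluator and groups results by evaluator name, returning a `Dict[str, List[EvaluationResult]]`. A rough pure-Python sketch of that fan-out shape (illustrative only; the real implementation parallelizes the calls with async workers, and the toy lambdas stand in for real evaluators):

```python
from typing import Any, Callable, Dict, List

def batch_evaluate(
    evaluators: Dict[str, Callable[..., Any]],
    queries: List[str],
    response_strs: List[str],
) -> Dict[str, List[Any]]:
    # One result list per evaluator name, aligned with the query order.
    results: Dict[str, List[Any]] = {name: [] for name in evaluators}
    for query, response in zip(queries, response_strs):
        for name, evaluator in evaluators.items():
            results[name].append(evaluator(query=query, response=response))
    return results

# Toy evaluators standing in for e.g. faithfulness / relevancy evaluators.
length_ok = lambda query, response: len(response) > 0
echoes_query = lambda query, response: query.split()[0].lower() in response.lower()

results = batch_evaluate(
    {"length": length_ok, "echo": echoes_query},
    queries=["Paris facts", "Rome facts"],
    response_strs=["Paris is in France.", "Rome is in Italy."],
)
```

Keying the output by evaluator name is what lets you run several evaluators over the same batch and compare their verdicts side by side.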

class llama_index.evaluation.CorrectnessEvaluator(service_context: Optional[ServiceContext] = None, eval_template: Optional[Union[BasePromptTemplate, str]] = None, score_threshold: float = 4.0)

Correctness evaluator.

Evaluates the correctness of a question answering system. This evaluator depends on a reference answer being provided, in addition to the query string and response string.

It outputs a score between 1 and 5, where 1 is the worst and 5 is the best, along with a reasoning for the score. Passing is defined as a score greater than or equal to the given threshold.

Parameters
  • service_context (Optional[ServiceContext]) – Service context.

  • eval_template (Optional[Union[BasePromptTemplate, str]]) – Template for the evaluation prompt.

  • score_threshold (float) – Numerical threshold for passing the evaluation, defaults to 4.0.
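The pass/fail decision reduces to comparing the judged score against `score_threshold`. A sketch of that logic (the score itself comes from the LLM judge; here it is simply a parameter):

```python
def correctness_passing(score: float, score_threshold: float = 4.0) -> bool:
    # Scores run from 1 (worst) to 5 (best); passing means score >= threshold.
    return score >= score_threshold

borderline = correctness_passing(4.0)   # exactly at the default threshold
failing = correctness_passing(3.5)
```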

async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, reference: Optional[str] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get prompts, keyed by name.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.

class llama_index.evaluation.DatasetGenerator(*args, **kwargs)

Generate a dataset (questions or question-answer pairs) from the given documents.

NOTE: this is a beta feature, subject to change!

Parameters
  • nodes (List[Node]) – List of nodes. (Optional)

  • service_context (ServiceContext) – Service Context.

  • num_questions_per_chunk – Number of questions to be generated per chunk. Each document is split into chunks of 512 words.

  • text_question_template – Question generation template.

  • question_gen_query – Question generation query.
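To make the chunking arithmetic concrete: documents are split into 512-word chunks, and `num_questions_per_chunk` questions are requested per chunk. A rough estimate of how many questions a corpus yields (illustrative only; the real splitter may chunk differently):

```python
from typing import List

def estimate_num_questions(
    word_counts: List[int],
    num_questions_per_chunk: int = 10,
    chunk_size: int = 512,
) -> int:
    # Ceil-divide each document's word count into chunk_size-word chunks,
    # then request num_questions_per_chunk questions per chunk.
    total_chunks = sum((n + chunk_size - 1) // chunk_size for n in word_counts)
    return total_chunks * num_questions_per_chunk

# A 1000-word doc (2 chunks) plus a 300-word doc (1 chunk) = 3 chunks.
n = estimate_num_questions([1000, 300])
```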

async agenerate_dataset_from_nodes(num: int | None = None) QueryResponseDataset

Generate a query/response dataset from the nodes.

async agenerate_questions_from_nodes(num: int | None = None) List[str]

Generates questions for each document.

classmethod from_documents(documents: List[Document], service_context: llama_index.service_context.ServiceContext | None = None, num_questions_per_chunk: int = 10, text_question_template: llama_index.prompts.base.BasePromptTemplate | None = None, text_qa_template: llama_index.prompts.base.BasePromptTemplate | None = None, question_gen_query: str | None = None, required_keywords: Optional[List[str]] = None, exclude_keywords: Optional[List[str]] = None, show_progress: bool = False) DatasetGenerator

Generate dataset from documents.

generate_dataset_from_nodes(num: int | None = None) QueryResponseDataset

Generate a query/response dataset from the nodes.

generate_questions_from_nodes(num: int | None = None) List[str]

Generates questions for each document.

get_prompts() Dict[str, BasePromptTemplate]

Get prompts, keyed by name.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.

pydantic model llama_index.evaluation.EmbeddingQAFinetuneDataset

Embedding QA Finetuning Dataset.

Parameters
  • queries (Dict[str, str]) – Dict id -> query.

  • corpus (Dict[str, str]) – Dict id -> string.

  • relevant_docs (Dict[str, List[str]]) – Dict query id -> list of doc ids.

JSON schema:
{
   "title": "EmbeddingQAFinetuneDataset",
   "description": "Embedding QA Finetuning Dataset.\n\nArgs:\n    queries (Dict[str, str]): Dict id -> query.\n    corpus (Dict[str, str]): Dict id -> string.\n    relevant_docs (Dict[str, List[str]]): Dict query id -> list of doc ids.",
   "type": "object",
   "properties": {
      "queries": {
         "title": "Queries",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "corpus": {
         "title": "Corpus",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "relevant_docs": {
         "title": "Relevant Docs",
         "type": "object",
         "additionalProperties": {
            "type": "array",
            "items": {
               "type": "string"
            }
         }
      },
      "mode": {
         "title": "Mode",
         "default": "text",
         "type": "string"
      }
   },
   "required": [
      "queries",
      "corpus",
      "relevant_docs"
   ]
}

Fields
  • corpus (Dict[str, str])

  • mode (str)

  • queries (Dict[str, str])

  • relevant_docs (Dict[str, List[str]])

field corpus: Dict[str, str] [Required]
field mode: str = 'text'
field queries: Dict[str, str] [Required]
field relevant_docs: Dict[str, List[str]] [Required]
classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_json(path: str) EmbeddingQAFinetuneDataset

Load json.

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
save_json(path: str) None

Save json.

classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
property query_docid_pairs: List[Tuple[str, List[str]]]

Get query, relevant doc ids.
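The three dicts are keyed so that `relevant_docs` maps each query id to doc ids in `corpus`. A sketch of the expected shapes, and of what `query_docid_pairs` derives from them, using plain dicts instead of the pydantic model (the ids `q1`, `d1`, `d2` are made up for illustration):

```python
from typing import Dict, List, Tuple

queries: Dict[str, str] = {"q1": "What is the capital of France?"}
corpus: Dict[str, str] = {
    "d1": "Paris is the capital of France.",
    "d2": "Rome is the capital of Italy.",
}
relevant_docs: Dict[str, List[str]] = {"q1": ["d1"]}

def query_docid_pairs(
    queries: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> List[Tuple[str, List[str]]]:
    # Mirrors the documented property: (query text, relevant doc ids) per query.
    return [(query, relevant_docs[qid]) for qid, query in queries.items()]

pairs = query_docid_pairs(queries, relevant_docs)
```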

pydantic model llama_index.evaluation.EvaluationResult

Evaluation result.

Output of a BaseEvaluator.

JSON schema:
{
   "title": "EvaluationResult",
   "description": "Evaluation result.\n\nOutput of an BaseEvaluator.",
   "type": "object",
   "properties": {
      "query": {
         "title": "Query",
         "description": "Query string",
         "type": "string"
      },
      "contexts": {
         "title": "Contexts",
         "description": "Context strings",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "response": {
         "title": "Response",
         "description": "Response string",
         "type": "string"
      },
      "passing": {
         "title": "Passing",
         "description": "Binary evaluation result (passing or not)",
         "type": "boolean"
      },
      "feedback": {
         "title": "Feedback",
         "description": "Feedback or reasoning for the response",
         "type": "string"
      },
      "score": {
         "title": "Score",
         "description": "Score for the response",
         "type": "number"
      },
      "pairwise_source": {
         "title": "Pairwise Source",
         "description": "Used only for pairwise and specifies whether it is from original order of presented answers or flipped order",
         "type": "string"
      }
   }
}

Fields
  • contexts (Optional[Sequence[str]])

  • feedback (Optional[str])

  • pairwise_source (Optional[str])

  • passing (Optional[bool])

  • query (Optional[str])

  • response (Optional[str])

  • score (Optional[float])

field contexts: Optional[Sequence[str]] = None

Context strings

field feedback: Optional[str] = None

Feedback or reasoning for the response

field pairwise_source: Optional[str] = None

Used only for pairwise and specifies whether it is from original order of presented answers or flipped order

field passing: Optional[bool] = None

Binary evaluation result (passing or not)

field query: Optional[str] = None

Query string

field response: Optional[str] = None

Response string

field score: Optional[float] = None

Score for the response

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.evaluation.FaithfulnessEvaluator(service_context: llama_index.service_context.ServiceContext | None = None, raise_error: bool = False, eval_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, refine_template: str | llama_index.prompts.base.BasePromptTemplate | None = None)

Faithfulness evaluator.

Evaluates whether a response is faithful to the contexts (i.e., whether the response is supported by the contexts or hallucinated).

This evaluator only considers the response string and the list of context strings.

Parameters
  • service_context (Optional[ServiceContext]) – The service context to use for evaluation.

  • raise_error (bool) – Whether to raise an error when the response is invalid. Defaults to False.

  • eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.

  • refine_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for refining the evaluation.
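The input/output contract can be illustrated with a crude stand-in for the LLM judgment. The check below substitutes naive verbatim string matching for the prompted LLM call, purely to show what "supported by the contexts" means in terms of inputs and outputs; the real evaluator prompts an LLM with `eval_template` and is far more robust:

```python
from typing import Sequence

def naive_faithfulness(response: str, contexts: Sequence[str]) -> bool:
    # Crude stand-in for the LLM judgment: every sentence of the response
    # must appear verbatim in some context string. Not the library's logic.
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    return all(any(s in ctx for ctx in contexts) for s in sentences)

supported = naive_faithfulness(
    "Paris is the capital of France",
    ["Paris is the capital of France. It lies on the Seine."],
)
hallucinated = naive_faithfulness(
    "Paris has 90 million residents",
    ["Paris is the capital of France."],
)
```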

async aevaluate(query: str | None = None, response: str | None = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult

Evaluate whether the response is faithful to the contexts.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get prompts, keyed by name.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.

class llama_index.evaluation.GuidelineEvaluator(service_context: Optional[ServiceContext] = None, guidelines: Optional[str] = None, eval_template: Optional[Union[BasePromptTemplate, str]] = None)

Guideline evaluator.

Evaluates whether a query and response pair passes the given guidelines.

This evaluator only considers the query string and the response string.

Parameters
  • service_context (Optional[ServiceContext]) – The service context to use for evaluation.

  • guidelines (Optional[str]) – User-added guidelines to use for evaluation. Defaults to None, which uses the default guidelines.

  • eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.

async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult

Evaluate whether the query and response pair passes the guidelines.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get prompts, keyed by name.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.

class llama_index.evaluation.HitRate

Hit rate metric.

compute(query: Optional[str] = None, expected_ids: Optional[List[str]] = None, retrieved_ids: Optional[List[str]] = None, **kwargs: Any) RetrievalMetricResult

Compute metric.

llama_index.evaluation.LabelledQADataset

alias of EmbeddingQAFinetuneDataset

class llama_index.evaluation.MRR

MRR metric.

compute(query: Optional[str] = None, expected_ids: Optional[List[str]] = None, retrieved_ids: Optional[List[str]] = None, **kwargs: Any) RetrievalMetricResult

Compute metric.
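Both metrics compare `expected_ids` against the ranked `retrieved_ids`. A sketch of the conventional definitions (hit rate: did any expected id get retrieved at all; MRR: reciprocal rank of the first retrieved id that is expected), which these metric classes are described as computing:

```python
from typing import List

def hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    # 1.0 if any expected id shows up anywhere in the retrieved list.
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0

def mrr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    # Reciprocal rank (1-indexed) of the first retrieved id that is expected.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0

# First relevant document appears at rank 2.
hit = hit_rate(["d3"], ["d1", "d3", "d7"])
reciprocal = mrr(["d3"], ["d1", "d3", "d7"])
```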

pydantic model llama_index.evaluation.MultiModalRetrieverEvaluator

Retriever evaluator.

This module will evaluate a retriever using a set of metrics.

Parameters
  • metrics (List[BaseRetrievalMetric]) – Sequence of metrics to evaluate

  • retriever – Retriever to evaluate.

JSON schema:
{
   "title": "MultiModalRetrieverEvaluator",
   "description": "Retriever evaluator.\n\nThis module will evaluate a retriever using a set of metrics.\n\nArgs:\n    metrics (List[BaseRetrievalMetric]): Sequence of metrics to evaluate\n    retriever: Retriever to evaluate.",
   "type": "object",
   "properties": {
      "metrics": {
         "title": "Metrics"
      },
      "retriever": {
         "title": "Retriever"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • metrics (List[llama_index.evaluation.retrieval.metrics_base.BaseRetrievalMetric])

  • retriever (llama_index.core.base_retriever.BaseRetriever)

field metrics: List[BaseRetrievalMetric] [Required]

List of metrics to evaluate

field retriever: BaseRetriever [Required]

Retriever to evaluate

async aevaluate(query: str, expected_ids: List[str], mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult

Run evaluation with query string and expected ids.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_dataset(dataset: EmbeddingQAFinetuneDataset, workers: int = 2, show_progress: bool = False, **kwargs: Any) List[RetrievalEvalResult]

Run evaluation with dataset.

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

evaluate(query: str, expected_ids: List[str], mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult

Run evaluation with query string and expected ids.

Parameters
  • query (str) – Query string

  • expected_ids (List[str]) – Expected ids

Returns

Evaluation result

Return type

RetrievalEvalResult

classmethod from_metric_names(metric_names: List[str], **kwargs: Any) BaseRetrievalEvaluator

Create evaluator from metric names.

Parameters
  • metric_names (List[str]) – List of metric names

  • **kwargs – Additional arguments for the evaluator

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.evaluation.PairwiseComparisonEvaluator(service_context: Optional[ServiceContext] = None, eval_template: Optional[Union[BasePromptTemplate, str]] = None, enforce_consensus: bool = True)

Pairwise comparison evaluator.

Evaluates the quality of a response vs. a β€œreference” response given a question by having an LLM judge which response is better.

Outputs whether the given response is better than the reference response.

Parameters
  • service_context (Optional[ServiceContext]) – The service context to use for evaluation.

  • eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.

  • enforce_consensus (bool) – Whether to enforce consensus (consistency if we flip the order of the answers). Defaults to True.

async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, second_response: Optional[str] = None, reference: Optional[str] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get a dictionary of prompts.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.
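The enforce_consensus flag guards against position bias in the LLM judge: the comparison is run twice with the answer order flipped, and a winner is declared only when both orderings agree. A schematic pure-Python sketch of that logic, with a stand-in judge function in place of the actual LLM call:

```python
from typing import Callable, Optional

def pairwise_with_consensus(
    judge: Callable[[str, str], str],  # stand-in for the LLM; returns "first" or "second"
    answer_a: str,
    answer_b: str,
) -> Optional[str]:
    """Return "a", "b", or None (no consensus) using order-flipped judging."""
    verdict_ab = judge(answer_a, answer_b)  # A shown first
    verdict_ba = judge(answer_b, answer_a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"  # both orderings prefer A
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"  # both orderings prefer B
    return None     # the verdict flipped with the order: no consensus

# Toy judge that always prefers the longer answer (order-independent).
length_judge = lambda x, y: "first" if len(x) > len(y) else "second"
print(pairwise_with_consensus(length_judge, "short", "a much longer answer"))  # b
```

A judge that always prefers whichever answer is shown first would return None here, which is exactly the inconsistency the consensus check is meant to surface.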

pydantic model llama_index.evaluation.QueryResponseDataset

Query Response Dataset.

The response can be empty if the dataset is generated from documents.

Parameters
  • queries (Dict[str, str]) – Query id -> query.

  • responses (Dict[str, str]) – Query id -> response.

Show JSON schema
{
   "title": "QueryResponseDataset",
   "description": "Query Response Dataset.\n\nThe response can be empty if the dataset is generated from documents.\n\nArgs:\n    queries (Dict[str, str]): Query id -> query.\n    responses (Dict[str, str]): Query id -> response.",
   "type": "object",
   "properties": {
      "queries": {
         "title": "Queries",
         "description": "Query id -> query",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      },
      "responses": {
         "title": "Responses",
         "description": "Query id -> response",
         "type": "object",
         "additionalProperties": {
            "type": "string"
         }
      }
   }
}

Fields
  • queries (Dict[str, str])

  • responses (Dict[str, str])

field queries: Dict[str, str] [Optional]

Query id -> query

field responses: Dict[str, str] [Optional]

Query id -> response

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model, setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from the new model; as with values, this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model; you should trust this data.

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_json(path: str) QueryResponseDataset

Load the dataset from a JSON file.

classmethod from_orm(obj: Any) Model
classmethod from_qr_pairs(qr_pairs: List[Tuple[str, str]]) QueryResponseDataset

Create a dataset from (query, response) pairs.

json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
save_json(path: str) None

Save the dataset to a JSON file.

classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
property qr_pairs: List[Tuple[str, str]]

Get (query, response) pairs.

property questions: List[str]

Get questions.
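Per the schema above, the dataset is just two id-keyed string mappings. As an illustration of the shape that from_qr_pairs, save_json, and from_json work with, here is a pure-Python sketch (the "query_0" id format is hypothetical, not necessarily the library's key scheme):

```python
import json
from typing import Dict, List, Tuple

def qr_pairs_to_dataset(qr_pairs: List[Tuple[str, str]]) -> Dict[str, Dict[str, str]]:
    """Build the queries/responses mappings from (query, response) pairs."""
    queries = {f"query_{i}": q for i, (q, _) in enumerate(qr_pairs)}
    responses = {f"query_{i}": r for i, (_, r) in enumerate(qr_pairs)}
    return {"queries": queries, "responses": responses}

dataset = qr_pairs_to_dataset([("What is X?", "X is ..."), ("Why Y?", "Because ...")])
blob = json.dumps(dataset)             # roughly what save_json would persist
restored = json.loads(blob)            # roughly what from_json would read back
print(restored["queries"]["query_0"])  # What is X?
```

Because both mappings share the same query-id keys, the qr_pairs property can zip them back into (query, response) tuples.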

llama_index.evaluation.QueryResponseEvaluator

alias of RelevancyEvaluator

class llama_index.evaluation.RelevancyEvaluator(service_context: llama_index.service_context.ServiceContext | None = None, raise_error: bool = False, eval_template: str | llama_index.prompts.base.BasePromptTemplate | None = None, refine_template: str | llama_index.prompts.base.BasePromptTemplate | None = None)

Relevancy evaluator.

Evaluates the relevancy of retrieved contexts and response to a query. This evaluator considers the query string, retrieved contexts, and response string.

Parameters
  • service_context (Optional[ServiceContext]) – The service context to use for evaluation.

  • raise_error (Optional[bool]) – Whether to raise an error if the response is invalid. Defaults to False.

  • eval_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for evaluation.

  • refine_template (Optional[Union[str, BasePromptTemplate]]) – The template to use for refinement.

async aevaluate(query: str | None = None, response: str | None = None, contexts: Optional[Sequence[str]] = None, sleep_time_in_seconds: int = 0, **kwargs: Any) EvaluationResult

Evaluate whether the contexts and response are relevant to the query.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get a dictionary of prompts.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.

llama_index.evaluation.ResponseEvaluator

alias of FaithfulnessEvaluator

pydantic model llama_index.evaluation.RetrievalEvalResult

Retrieval eval result.

NOTE: this abstraction might change in the future.

query

Query string

Type

str

expected_ids

Expected ids

Type

List[str]

retrieved_ids

Retrieved ids

Type

List[str]

metric_dict

Metric dictionary for the evaluation

Type

Dict[str, BaseRetrievalMetric]

Show JSON schema
{
   "title": "RetrievalEvalResult",
   "description": "Retrieval eval result.\n\nNOTE: this abstraction might change in the future.\n\nAttributes:\n    query (str): Query string\n    expected_ids (List[str]): Expected ids\n    retrieved_ids (List[str]): Retrieved ids\n    metric_dict (Dict[str, BaseRetrievalMetric]):             Metric dictionary for the evaluation",
   "type": "object",
   "properties": {
      "query": {
         "title": "Query",
         "description": "Query string",
         "type": "string"
      },
      "expected_ids": {
         "title": "Expected Ids",
         "description": "Expected ids",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "retrieved_ids": {
         "title": "Retrieved Ids",
         "description": "Retrieved ids",
         "type": "array",
         "items": {
            "type": "string"
         }
      },
      "mode": {
         "description": "text or image",
         "default": "text",
         "allOf": [
            {
               "$ref": "#/definitions/RetrievalEvalMode"
            }
         ]
      },
      "metric_dict": {
         "title": "Metric Dict",
         "description": "Metric dictionary for the evaluation",
         "type": "object",
         "additionalProperties": {
            "$ref": "#/definitions/RetrievalMetricResult"
         }
      }
   },
   "required": [
      "query",
      "expected_ids",
      "retrieved_ids",
      "metric_dict"
   ],
   "definitions": {
      "RetrievalEvalMode": {
         "title": "RetrievalEvalMode",
         "description": "Evaluation of retrieval modality.",
         "enum": [
            "text",
            "image"
         ],
         "type": "string"
      },
      "RetrievalMetricResult": {
         "title": "RetrievalMetricResult",
         "description": "Metric result.\n\nAttributes:\n    score (float): Score for the metric\n    metadata (Dict[str, Any]): Metadata for the metric result",
         "type": "object",
         "properties": {
            "score": {
               "title": "Score",
               "description": "Score for the metric",
               "type": "number"
            },
            "metadata": {
               "title": "Metadata",
               "description": "Metadata for the metric result",
               "type": "object"
            }
         },
         "required": [
            "score"
         ]
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • expected_ids (List[str])

  • metric_dict (Dict[str, llama_index.evaluation.retrieval.metrics_base.RetrievalMetricResult])

  • mode (llama_index.evaluation.retrieval.base.RetrievalEvalMode)

  • query (str)

  • retrieved_ids (List[str])

field expected_ids: List[str] [Required]

Expected ids

field metric_dict: Dict[str, RetrievalMetricResult] [Required]

Metric dictionary for the evaluation

field mode: RetrievalEvalMode = RetrievalEvalMode.TEXT

text or image

field query: str [Required]

Query string

field retrieved_ids: List[str] [Required]

Retrieved ids

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model, setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from the new model; as with values, this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model; you should trust this data.

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
property metric_vals_dict: Dict[str, float]

Dictionary of metric values.
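The metric_vals_dict property is a convenience flattening of metric_dict down to raw scores. Conceptually, under the assumption that each metric result exposes a float score field (as RetrievalMetricResult does), it amounts to:

```python
from typing import Dict, Optional

class MetricResult:
    """Stand-in for RetrievalMetricResult: a score plus optional metadata."""
    def __init__(self, score: float, metadata: Optional[Dict] = None):
        self.score = score
        self.metadata = metadata or {}

def metric_vals_dict(metric_dict: Dict[str, MetricResult]) -> Dict[str, float]:
    """Flatten {metric name -> result object} to {metric name -> score}."""
    return {name: result.score for name, result in metric_dict.items()}

results = {"hit_rate": MetricResult(1.0), "mrr": MetricResult(0.5)}
print(metric_vals_dict(results))  # {'hit_rate': 1.0, 'mrr': 0.5}
```

This flat form is what downstream aggregation (e.g. building a results DataFrame) typically consumes.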

pydantic model llama_index.evaluation.RetrievalMetricResult

Metric result.

score

Score for the metric

Type

float

metadata

Metadata for the metric result

Type

Dict[str, Any]

Show JSON schema
{
   "title": "RetrievalMetricResult",
   "description": "Metric result.\n\nAttributes:\n    score (float): Score for the metric\n    metadata (Dict[str, Any]): Metadata for the metric result",
   "type": "object",
   "properties": {
      "score": {
         "title": "Score",
         "description": "Score for the metric",
         "type": "number"
      },
      "metadata": {
         "title": "Metadata",
         "description": "Metadata for the metric result",
         "type": "object"
      }
   },
   "required": [
      "score"
   ]
}

Fields
  • metadata (Dict[str, Any])

  • score (float)

field metadata: Dict[str, Any] [Optional]

Metadata for the metric result

field score: float [Required]

Score for the metric

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model, setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from new model, as with values this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model: you should trust this data

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
pydantic model llama_index.evaluation.RetrieverEvaluator

Retriever evaluator.

This class evaluates a retriever using a set of metrics.

Parameters
  • metrics (List[BaseRetrievalMetric]) – Sequence of metrics to evaluate

  • retriever – Retriever to evaluate.

Show JSON schema
{
   "title": "RetrieverEvaluator",
   "description": "Retriever evaluator.\n\nThis module will evaluate a retriever using a set of metrics.\n\nArgs:\n    metrics (List[BaseRetrievalMetric]): Sequence of metrics to evaluate\n    retriever: Retriever to evaluate.",
   "type": "object",
   "properties": {
      "metrics": {
         "title": "Metrics"
      },
      "retriever": {
         "title": "Retriever"
      }
   }
}

Config
  • arbitrary_types_allowed: bool = True

Fields
  • metrics (List[llama_index.evaluation.retrieval.metrics_base.BaseRetrievalMetric])

  • retriever (llama_index.core.base_retriever.BaseRetriever)

field metrics: List[BaseRetrievalMetric] [Required]

List of metrics to evaluate

field retriever: BaseRetriever [Required]

Retriever to evaluate

async aevaluate(query: str, expected_ids: List[str], mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult

Run evaluation with query string and expected ids.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_dataset(dataset: EmbeddingQAFinetuneDataset, workers: int = 2, show_progress: bool = False, **kwargs: Any) List[RetrievalEvalResult]

Run evaluation with dataset.
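The workers argument bounds how many queries are evaluated concurrently. The pattern is the usual asyncio semaphore fan-out, sketched here with a dummy per-query coroutine standing in for the real retrieval-and-scoring call:

```python
import asyncio
from typing import Dict, List

async def evaluate_one(query: str) -> Dict[str, float]:
    """Stand-in for one retrieval evaluation (would call the retriever)."""
    await asyncio.sleep(0)  # simulate async I/O
    return {"query_len": float(len(query))}

async def evaluate_dataset(queries: List[str], workers: int = 2) -> List[Dict[str, float]]:
    semaphore = asyncio.Semaphore(workers)  # at most `workers` evals in flight

    async def bounded(query: str) -> Dict[str, float]:
        async with semaphore:
            return await evaluate_one(query)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(q) for q in queries))

results = asyncio.run(evaluate_dataset(["a", "bb", "ccc"], workers=2))
print([r["query_len"] for r in results])  # [1.0, 2.0, 3.0]
```

Raising workers trades higher throughput against heavier load on the retriever's backend (e.g. an embedding or vector-store API).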

classmethod construct(_fields_set: Optional[SetStr] = None, **values: Any) Model

Creates a new model, setting __dict__ and __fields_set__ from trusted or pre-validated data. Default values are respected, but no other validation is performed. Behaves as if Config.extra = 'allow' was set, since it adds all passed values.

copy(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, update: Optional[DictStrAny] = None, deep: bool = False) Model

Duplicate a model, optionally choose which fields to include, exclude and change.

Parameters
  • include – fields to include in new model

  • exclude – fields to exclude from the new model; as with values, this takes precedence over include

  • update – values to change/add in the new model. Note: the data is not validated before creating the new model; you should trust this data.

  • deep – set to True to make a deep copy of the model

Returns

new model instance

dict(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny

Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.

evaluate(query: str, expected_ids: List[str], mode: RetrievalEvalMode = RetrievalEvalMode.TEXT, **kwargs: Any) RetrievalEvalResult

Run evaluation with query string and expected ids.

Parameters
  • query (str) – Query string

  • expected_ids (List[str]) – Expected ids

Returns

Evaluation result

Return type

RetrievalEvalResult

classmethod from_metric_names(metric_names: List[str], **kwargs: Any) BaseRetrievalEvaluator

Create evaluator from metric names.

Parameters
  • metric_names (List[str]) – List of metric names

  • **kwargs – Additional arguments for the evaluator

classmethod from_orm(obj: Any) Model
json(*, include: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, exclude: Optional[Union[AbstractSetIntStr, MappingIntStrAny]] = None, by_alias: bool = False, skip_defaults: Optional[bool] = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Optional[Callable[[Any], Any]] = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode

Generate a JSON representation of the model, include and exclude arguments as per dict().

encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().

classmethod parse_file(path: Union[str, Path], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod parse_obj(obj: Any) Model
classmethod parse_raw(b: Union[str, bytes], *, content_type: unicode = None, encoding: unicode = 'utf8', proto: Protocol = None, allow_pickle: bool = False) Model
classmethod schema(by_alias: bool = True, ref_template: unicode = '#/definitions/{model}') DictStrAny
classmethod schema_json(*, by_alias: bool = True, ref_template: unicode = '#/definitions/{model}', **dumps_kwargs: Any) unicode
classmethod update_forward_refs(**localns: Any) None

Try to update ForwardRefs on fields based on this Model, globalns and localns.

classmethod validate(value: Any) Model
class llama_index.evaluation.SemanticSimilarityEvaluator(service_context: Optional[ServiceContext] = None, similarity_fn: Optional[Callable[[...], float]] = None, similarity_mode: Optional[SimilarityMode] = None, similarity_threshold: float = 0.8)

Embedding similarity evaluator.

Evaluate the quality of a question answering system by comparing the similarity between embeddings of the generated answer and the reference answer.

Inspired by the paper "Semantic Answer Similarity for Evaluating Question Answering Models".

Parameters
  • service_context (Optional[ServiceContext]) – Service context.

  • similarity_threshold (float) – Embedding similarity threshold for β€œpassing”. Defaults to 0.8.

async aevaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, reference: Optional[str] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

async aevaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate(query: Optional[str] = None, response: Optional[str] = None, contexts: Optional[Sequence[str]] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string, retrieved contexts, and generated response string.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

evaluate_response(query: Optional[str] = None, response: Optional[Response] = None, **kwargs: Any) EvaluationResult

Run evaluation with query string and generated Response object.

Subclasses can override this method to provide custom evaluation logic and take in additional arguments.

get_prompts() Dict[str, BasePromptTemplate]

Get a dictionary of prompts.

update_prompts(prompts_dict: Dict[str, BasePromptTemplate]) None

Update prompts.

Other prompts will remain in place.
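The pass/fail decision above reduces to comparing an embedding similarity against similarity_threshold. With a cosine-style similarity this looks roughly like the following pure-Python sketch (the real evaluator embeds both answers with the service context's embedding model; here the embeddings are supplied directly):

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def passes(response_emb: List[float], reference_emb: List[float],
           similarity_threshold: float = 0.8) -> bool:
    """Pass iff the embedding similarity meets the threshold."""
    return cosine_similarity(response_emb, reference_emb) >= similarity_threshold

print(passes([1.0, 0.0], [1.0, 0.0]))  # True  (identical direction, similarity 1.0)
print(passes([1.0, 0.0], [0.0, 1.0]))  # False (orthogonal, similarity 0.0)
```

The similarity_fn and similarity_mode constructor arguments let callers swap in a different similarity function or mode in place of this default behavior.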

llama_index.evaluation.generate_qa_embedding_pairs(nodes: List[TextNode], llm: LLM, qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."\n', num_questions_per_chunk: int = 2) EmbeddingQAFinetuneDataset

Generate examples given a set of nodes.

llama_index.evaluation.generate_question_context_pairs(nodes: List[TextNode], llm: LLM, qa_generate_prompt_tmpl: str = 'Context information is below.\n\n---------------------\n{context_str}\n---------------------\n\nGiven the context information and not prior knowledge.\ngenerate only questions based on the below query.\n\nYou are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination. The questions should be diverse in nature across the document. Restrict the questions to the context information provided."\n', num_questions_per_chunk: int = 2) EmbeddingQAFinetuneDataset

Generate examples given a set of nodes.

llama_index.evaluation.get_retrieval_results_df(names: List[str], results_arr: List[List[RetrievalEvalResult]], metric_keys: Optional[List[str]] = None) DataFrame

Aggregate retrieval results into a DataFrame for display.
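This function collapses per-query results into one row per named run. The underlying aggregation is just a per-metric mean over each run's RetrievalEvalResult list, sketched here without pandas or llama_index:

```python
from typing import Dict, List

def mean_metrics(results: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric across per-query result dicts (one dict per query)."""
    keys = results[0].keys()
    return {k: sum(r[k] for r in results) / len(results) for k in keys}

# Two queries' flattened metric scores for a single run
run_results = [
    {"hit_rate": 1.0, "mrr": 0.5},
    {"hit_rate": 0.0, "mrr": 0.0},
]
print(mean_metrics(run_results))  # {'hit_rate': 0.5, 'mrr': 0.25}
```

In the real function, names labels the rows, results_arr supplies one such result list per run, and metric_keys selects which metric columns to include.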

llama_index.evaluation.resolve_metrics(metrics: List[str]) List[BaseRetrievalMetric]

Resolve metrics from list of metric names.