SemanticSplitterNodeParser

pydantic model llama_index.core.node_parser.SemanticSplitterNodeParser

Semantic node parser.

Splits a document into Nodes, with each node being a group of semantically related sentences.

Parameters
  • buffer_size (int) – number of sentences to group together when evaluating semantic similarity

  • embed_model (BaseEmbedding) – embedding model to use

  • sentence_splitter (Optional[Callable]) – splits text into sentences

  • include_metadata (bool) – whether to include metadata in nodes

  • include_prev_next_rel (bool) – whether to include prev/next relationships

JSON schema
{
   "title": "SemanticSplitterNodeParser",
   "description": "Semantic node parser.\n\nSplits a document into Nodes, with each node being a group of semantically related sentences.\n\nArgs:\n    buffer_size (int): number of sentences to group together when evaluating semantic similarity\n    embed_model: (BaseEmbedding): embedding model to use\n    sentence_splitter (Optional[Callable]): splits text into sentences\n    include_metadata (bool): whether to include metadata in nodes\n    include_prev_next_rel (bool): whether to include prev/next relationships",
   "type": "object",
   "properties": {
      "include_metadata": {
         "title": "Include Metadata",
         "description": "Whether or not to consider metadata when splitting.",
         "default": true,
         "type": "boolean"
      },
      "include_prev_next_rel": {
         "title": "Include Prev Next Rel",
         "description": "Include prev/next node relationships.",
         "default": true,
         "type": "boolean"
      },
      "callback_manager": {
         "title": "Callback Manager",
         "type": "object",
         "default": {}
      },
      "embed_model": {
         "title": "Embed Model",
         "description": "The embedding model to use for semantic comparison",
         "allOf": [
            {
               "$ref": "#/definitions/BaseEmbedding"
            }
         ]
      },
      "buffer_size": {
         "title": "Buffer Size",
         "description": "The number of sentences to group together when evaluating semantic similarity. Set to 1 to consider each sentence individually. Set to >1 to group sentences together.",
         "default": 1,
         "type": "integer"
      },
      "breakpoint_percentile_threshold": {
         "title": "Breakpoint Percentile Threshold",
         "description": "The percentile of cosine dissimilarity that must be exceeded between a group of sentences and the next to form a node.  The smaller this number is, the more nodes will be generated",
         "default": 95,
         "type": "integer"
      },
      "class_name": {
         "title": "Class Name",
         "type": "string",
         "default": "SemanticSplitterNodeParser"
      }
   },
   "required": [
      "embed_model"
   ],
   "definitions": {
      "BaseEmbedding": {
         "title": "BaseEmbedding",
         "description": "Base class for embeddings.",
         "type": "object",
         "properties": {
            "model_name": {
               "title": "Model Name",
               "description": "The name of the embedding model.",
               "default": "unknown",
               "type": "string"
            },
            "embed_batch_size": {
               "title": "Embed Batch Size",
               "description": "The batch size for embedding calls.",
               "default": 10,
               "exclusiveMinimum": 0,
               "lte": 2048,
               "type": "integer"
            },
            "callback_manager": {
               "title": "Callback Manager",
               "type": "object",
               "default": {}
            },
            "class_name": {
               "title": "Class Name",
               "type": "string",
               "default": "base_component"
            }
         }
      }
   }
}
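The `breakpoint_percentile_threshold` description in the schema above can be illustrated with a small standalone sketch. It is not the library's implementation: the helper names `percentile` and `find_breakpoints`, the linear-interpolation percentile, and the sample distances are all assumptions made for illustration. It only shows why a smaller threshold tends to produce more breakpoints, and therefore more nodes.

```python
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Return the pct-th percentile using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def find_breakpoints(distances: List[float], threshold_pct: float = 95) -> List[int]:
    """Return indices i where the distance between group i and group i + 1
    exceeds the percentile cutoff (hypothetical helper, not a library function)."""
    cutoff = percentile(distances, threshold_pct)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Hypothetical cosine dissimilarities between consecutive sentence groups.
distances = [0.10, 0.12, 0.80, 0.11, 0.09, 0.65, 0.13]

strict = find_breakpoints(distances, threshold_pct=95)  # only the largest jump
loose = find_breakpoints(distances, threshold_pct=50)   # every above-median jump

print("threshold 95:", strict)
print("threshold 50:", loose)
```

With the sample data, the threshold of 95 keeps a single breakpoint while the threshold of 50 keeps three, so the second setting would split the same text into more nodes.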

Config
  • arbitrary_types_allowed: bool = True

Fields
  • buffer_size (int)

  • embed_model (llama_index.core.base.embeddings.base.BaseEmbedding)

  • sentence_splitter (Callable[[str], List[str]])

Validators
  • _validate_id_func » id_func

field buffer_size: int = 1

The number of sentences to group together when evaluating semantic similarity. Set to 1 to consider each sentence individually. Set to >1 to group sentences together.
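The grouping described above can be sketched as follows. This is a minimal illustration under a stated assumption: each group is a forward-looking window of up to `buffer_size` consecutive sentences. The helper `group_sentences` is hypothetical, and the library's actual windowing may differ (for example, it may include neighbouring sentences on both sides).

```python
from typing import List


def group_sentences(sentences: List[str], buffer_size: int = 1) -> List[str]:
    """Join each sentence with the ones that follow it, so that every group
    spans up to buffer_size sentences (assumed windowing, for illustration)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    return [
        " ".join(sentences[i : i + buffer_size])
        for i in range(len(sentences))
    ]


sentences = ["Cats purr.", "Dogs bark.", "Stocks fell."]

# buffer_size=1: every sentence is compared on its own.
singles = group_sentences(sentences, buffer_size=1)

# buffer_size=2: each comparison unit carries one sentence of extra context.
pairs = group_sentences(sentences, buffer_size=2)

print(singles)
print(pairs)
```

Larger groups smooth out the similarity signal, at the cost of making individual topic shifts harder to localize.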

field embed_model: BaseEmbedding [Required]

The embedding model to use for semantic comparison.

field sentence_splitter: Callable[[str], List[str]] [Optional]

The function used to split text into sentences.

build_semantic_nodes_from_documents(documents: Sequence[Document], show_progress: bool = False) → List[BaseNode]

Build semantic nodes from documents.

classmethod class_name() → str

Get the class name, used as a unique ID in serialization.

This provides a key that makes serialization robust against actual class name changes.

classmethod from_defaults(embed_model: Optional[BaseEmbedding] = None, breakpoint_percentile_threshold: Optional[int] = 95, buffer_size: Optional[int] = 1, sentence_splitter: Optional[Callable[[str], List[str]]] = None, original_text_metadata_key: str = 'original_text', include_metadata: bool = True, include_prev_next_rel: bool = True, callback_manager: Optional[CallbackManager] = None, id_func: Optional[Callable[[int, Document], str]] = None) → SemanticSplitterNodeParser