SimpleWebPageReader

pydantic model llama_index.readers.SimpleWebPageReader

Simple web page reader.

Reads pages from the web.

Parameters

html_to_text (bool) – Whether to convert HTML to text. Requires the html2text package.
metadata_fn (Optional[Callable[[str], Dict]]) – A function that takes a URL and returns a dictionary of metadata. Default is None.
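
As a rough illustration of the metadata_fn parameter, the sketch below defines a callable with the documented shape: a URL string in, a dictionary out. The function name url_metadata and the keys it returns are assumptions made for this example; they are not part of the library.

```python
from typing import Dict
from urllib.parse import urlparse


def url_metadata(url: str) -> Dict[str, str]:
    """Hypothetical metadata_fn: derive simple metadata from the URL itself."""
    parsed = urlparse(url)
    return {
        "source": url,               # the full URL the page came from
        "domain": parsed.netloc,     # host part of the URL
        "path": parsed.path or "/",  # fall back to "/" when the URL has no path
    }


print(url_metadata("https://example.com/docs/intro"))
```

A function like this would presumably be supplied when constructing the reader, for example SimpleWebPageReader(html_to_text=True, metadata_fn=url_metadata), assuming the constructor accepts the two parameters listed above as keyword arguments.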

JSON schema

{
   "title": "SimpleWebPageReader",
   "description": "Simple web page reader.\n\nReads pages from the web.\n\nArgs:\n    html_to_text (bool): Whether to convert HTML to text.\n        Requires `html2text` package.\n    metadata_fn (Optional[Callable[[str], Dict]]): A function that takes in\n        a URL and returns a dictionary of metadata.\n        Default is None.",
   "type": "object",
   "properties": {
      "is_remote": {
         "title": "Is Remote",
         "default": true,
         "type": "boolean"
      },
      "html_to_text": {
         "title": "Html To Text",
         "type": "boolean"
      },
      "class_name": {
         "title": "Class Name",
         "type": "string",
         "default": "SimpleWebPageReader"
      }
   },
   "required": [
      "html_to_text"
   ]
}

Config

arbitrary_types_allowed: bool = True

Fields

html_to_text (bool)
is_remote (bool)

field html_to_text: bool [Required]

field is_remote: bool = True

classmethod class_name() → str

Get the class name, used as a unique ID in serialization.

This provides a key that makes serialization robust against actual class name changes.
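
The idea of a stable serialization key can be sketched as follows. This is a minimal illustration with hypothetical classes (BaseReader, RenamedWebReader) and a hypothetical REGISTRY dictionary; it is not the library's serialization code, only an example of why an explicit class_name() survives a rename of the Python class.

```python
from typing import Dict, Type


class BaseReader:
    """Hypothetical base class standing in for a serializable component."""

    @classmethod
    def class_name(cls) -> str:
        return "BaseReader"

    def to_dict(self) -> Dict[str, str]:
        # Store the stable ID rather than the Python class name.
        return {"class_name": self.class_name()}


class RenamedWebReader(BaseReader):
    """The Python name differs, but the serialized ID is unchanged."""

    @classmethod
    def class_name(cls) -> str:
        return "SimpleWebPageReader"


# Hypothetical registry mapping stable IDs back to classes.
REGISTRY: Dict[str, Type[BaseReader]] = {
    c.class_name(): c for c in (BaseReader, RenamedWebReader)
}

data = RenamedWebReader().to_dict()
restored_cls = REGISTRY[data["class_name"]]
print(data, restored_cls.__name__)
```

Because the lookup uses the string returned by class_name(), previously serialized data would still resolve to the right class after the class is renamed in code.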

load_data(urls: List[str]) → List[Document]

Load data from the given URLs.

Parameters: urls (List[str]) – List of URLs to scrape.
Returns: List of documents.
Return type: List[Document]
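
The contract described above (a list of URLs in, a list of documents out) can be illustrated with a standard-library stand-in. This is a sketch under stated assumptions, not the library's implementation: PageDocument, fetch_text, and load_pages are hypothetical names, the one-document-per-URL behavior is assumed, and the injectable fetch argument exists only so the example can be exercised without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional
from urllib.request import urlopen


@dataclass
class PageDocument:
    """Hypothetical stand-in for the library's Document type."""
    text: str
    metadata: Dict = field(default_factory=dict)


def fetch_text(url: str) -> str:
    """Download a page body with the standard library (UTF-8 assumed)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def load_pages(
    urls: List[str],
    fetch: Callable[[str], str] = fetch_text,
    metadata_fn: Optional[Callable[[str], Dict]] = None,
) -> List[PageDocument]:
    """Return one document per URL, in input order."""
    if not isinstance(urls, list):
        raise ValueError("urls must be a list of strings.")
    documents = []
    for url in urls:
        # Attach metadata only when a metadata function was supplied.
        metadata = metadata_fn(url) if metadata_fn is not None else {}
        documents.append(PageDocument(text=fetch(url), metadata=metadata))
    return documents
```

With the actual reader, the equivalent call would presumably look like SimpleWebPageReader(html_to_text=True).load_data(["https://example.com"]), which requires network access and, per the parameter notes, the html2text package.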