BeautifulSoupWebReader
- pydantic model llama_index.readers.BeautifulSoupWebReader
BeautifulSoup web page reader.
Reads pages from the web. Requires the bs4 and urllib packages.
- Parameters
website_extractor (Optional[Dict[str, Callable]]) – A mapping from website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup object. See DEFAULT_WEBSITE_EXTRACTOR.
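To override extraction for a particular site, pass your own mapping. The sketch below is illustrative only: my_blog_extractor and the example.com key are assumptions, and the exact signature and return shape the reader expects from an extractor varies between versions, so check DEFAULT_WEBSITE_EXTRACTOR in your installed version before copying it:

from llama_index.readers import BeautifulSoupWebReader

# Hypothetical extractor, for illustration only. Some versions expect the
# extractor to return just the page text, others a (text, metadata) tuple;
# mirror whatever DEFAULT_WEBSITE_EXTRACTOR does in your installed version.
def my_blog_extractor(soup, **kwargs):
    article = soup.find("article")
    return article.get_text() if article is not None else soup.get_text()

reader = BeautifulSoupWebReader(
    website_extractor={"example.com": my_blog_extractor}
)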
JSON schema:
{
  "title": "BeautifulSoupWebReader",
  "description": "BeautifulSoup web page reader.\n\nReads pages from the web.\nRequires the `bs4` and `urllib` packages.\n\nArgs:\n website_extractor (Optional[Dict[str, Callable]]): A mapping of website\n hostname (e.g. google.com) to a function that specifies how to\n extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR.",
  "type": "object",
  "properties": {
    "is_remote": {
      "title": "Is Remote",
      "default": true,
      "type": "boolean"
    },
    "class_name": {
      "title": "Class Name",
      "type": "string",
      "default": "BeautifulSoupWebReader"
    }
  }
}
- Config
arbitrary_types_allowed: bool = True
- Fields
is_remote (bool)
- field is_remote: bool = True
- classmethod class_name() → str
Get the class name, used as a unique ID in serialization.
This provides a key that makes serialization robust against actual class name changes.
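As a quick check, the returned value should match the class_name default shown in the JSON schema above:

from llama_index.readers import BeautifulSoupWebReader

# Matches the "class_name" default in the JSON schema above.
print(BeautifulSoupWebReader.class_name())  # BeautifulSoupWebReader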
- load_data(urls: List[str], custom_hostname: Optional[str] = None) → List[Document]
Load data from the URLs.
- Parameters
urls (List[str]) – List of URLs to scrape.
custom_hostname (Optional[str]) – Force a certain hostname when a website is displayed under custom URLs (e.g. Substack blogs).
- Returns
List of documents.
- Return type
List[Document]
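A minimal usage sketch; the URLs are placeholders, and the custom_hostname call assumes your version's DEFAULT_WEBSITE_EXTRACTOR includes a substack.com entry:

from llama_index.readers import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()

# Plain scrape: each URL becomes one Document.
documents = reader.load_data(urls=["https://example.com/post-1"])

# A Substack blog served from a custom domain: force the substack.com
# extractor instead of the one keyed on the page's actual hostname.
substack_docs = reader.load_data(
    urls=["https://someblog.example.com/p/a-post"],
    custom_hostname="substack.com",
)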