Web
AsyncWebPageReader #
Bases: BaseReader
Asynchronous web page reader.
Reads pages from the web asynchronously.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`html_to_text` | `bool` | Whether to convert HTML to text. Requires the `html2text` package. | `False` |
`limit` | `int` | Maximum number of concurrent requests. | `10` |
`dedupe` | `bool` | Whether to deduplicate URLs if there are exact matches in the given list. | `True` |
`fail_on_error` | `bool` | If a requested URL does not return status code 200, raise a `ValueError`. | `False` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/async_web/base.py
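A minimal usage sketch, assuming the class is exported from `llama_index.readers.web` (the package path shown above); the URLs are placeholders:

```python
import asyncio

from llama_index.readers.web import AsyncWebPageReader

# html_to_text=True requires the html2text package
reader = AsyncWebPageReader(html_to_text=True, limit=10, dedupe=True)

documents = asyncio.run(
    reader.aload_data(["https://example.com", "https://example.org"])
)
print(len(documents))
```

The synchronous `load_data` method below accepts the same URL list.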
aload_data async #
aload_data(urls: List[str]) -> List[Document]
Load data from the input urls.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/async_web/base.py
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the input urls.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/async_web/base.py
BeautifulSoupWebReader #
Bases: BasePydanticReader
BeautifulSoup web page reader.
Reads pages from the web.
Requires the `bs4` and `urllib` packages.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`website_extractor` | `Optional[Dict[str, Callable]]` | A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup object. See `DEFAULT_WEBSITE_EXTRACTOR`. | `None` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/beautiful_soup_web/base.py
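A short usage sketch; the URL and hostname below are placeholders:

```python
from llama_index.readers.web import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()
# custom_hostname selects the extractor when a site is served from a custom URL
documents = reader.load_data(
    ["https://someblog.substack.com/p/some-post"],
    custom_hostname="substack.com",
)
```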
class_name classmethod #
class_name() -> str
Get the name identifier of the class.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/beautiful_soup_web/base.py
load_data #
load_data(urls: List[str], custom_hostname: Optional[str] = None, include_url_in_text: Optional[bool] = True) -> List[Document]
Load data from the urls.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |
`custom_hostname` | `Optional[str]` | Force a certain hostname in the case a website is displayed under custom URLs (e.g. Substack blogs). | `None` |
`include_url_in_text` | `Optional[bool]` | Include the reference URL in the text of the document. | `True` |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/beautiful_soup_web/base.py
BrowserbaseWebReader #
Bases: BaseReader
BrowserbaseWebReader.
Load pre-rendered web pages using a headless browser hosted on Browserbase.
Depends on the `browserbase` package.
Get your API key from https://browserbase.com
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/browserbase_web/base.py
lazy_load_data #
lazy_load_data(urls: Sequence[str], text_content: bool = False, session_id: Optional[str] = None, proxy: Optional[bool] = None) -> Iterator[Document]
Load pages from URLs.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/browserbase_web/base.py
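A hedged sketch: the constructor is not documented in this section, so the no-argument form below assumes the API key is picked up from a `BROWSERBASE_API_KEY` environment variable; check the class source for the exact signature.

```python
from llama_index.readers.web import BrowserbaseWebReader

# Assumption: the API key is read from the BROWSERBASE_API_KEY env var
reader = BrowserbaseWebReader()
docs = list(
    reader.lazy_load_data(["https://example.com"], text_content=True)
)
```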
FireCrawlWebReader #
Bases: BasePydanticReader
Turn a URL into LLM-accessible markdown with Firecrawl.dev.
Args:

- api_key: The Firecrawl API key.
- api_url: URL to be passed to FirecrawlApp for local deployment.
- url: The URL to be crawled (or)
- mode: The mode to run the loader in. Default is "crawl". Options include "scrape" (single URL) and "crawl" (all accessible sub-pages).
- params: The parameters to pass to the Firecrawl API. Examples include crawlerOptions. For more details, visit: https://github.com/mendableai/firecrawl-py
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/firecrawl_web/base.py
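A hedged sketch; the API key is a placeholder, and `mode="scrape"` fetches a single URL:

```python
from llama_index.readers.web import FireCrawlWebReader

reader = FireCrawlWebReader(
    api_key="fc-...",  # placeholder key
    mode="scrape",  # fetch a single URL; use "crawl" for sub-pages
)
documents = reader.load_data(url="https://example.com")
```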
load_data #
load_data(url: Optional[str] = None, query: Optional[str] = None) -> List[Document]
Load data from a URL to scrape/crawl, or from a search query.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`url` | `Optional[str]` | URL to scrape or crawl. | `None` |
`query` | `Optional[str]` | Query to search for. | `None` |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |

Raises:

Type | Description |
---|---|
`ValueError` | If neither or both of `url` and `query` are provided. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/firecrawl_web/base.py
KnowledgeBaseWebReader #
Bases: BaseReader
Knowledge base reader.
Crawls and reads articles from a knowledge base/help center with Playwright.
Tested on Zendesk and Intercom CMSes; may work on others.
Can be run in headless mode, but it may be blocked by Cloudflare; run it headed to be safe.
Times out occasionally; if it does, increase the default timeout.
Requires the `playwright` package.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`root_url` | `str` | The base URL of the knowledge base, with no trailing slash, e.g. 'https://support.intercom.com'. | required |
`link_selectors` | `List[str]` | List of CSS selectors to find links to articles while crawling, e.g. ['.article-list a', '.article-list a']. | required |
`article_path` | `str` | The URL path of articles on this domain, so the crawler knows when to stop, e.g. '/articles'. | required |
`title_selector` | `Optional[str]` | CSS selector to find the title of the article, e.g. '.article-title'. | `None` |
`subtitle_selector` | `Optional[str]` | CSS selector to find the subtitle/description of the article, e.g. '.article-subtitle'. | `None` |
`body_selector` | `Optional[str]` | CSS selector to find the body of the article, e.g. '.article-body'. | `None` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
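A usage sketch; the selectors below are illustrative, so inspect the target site to pick real ones:

```python
from llama_index.readers.web import KnowledgeBaseWebReader

reader = KnowledgeBaseWebReader(
    root_url="https://support.intercom.com",
    link_selectors=[".article-list a"],
    article_path="/articles",
    title_selector=".article-title",
    body_selector=".article-body",
)
documents = reader.load_data()
```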
load_data #
load_data() -> List[Document]
Load data from the knowledge base.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
scrape_article #
scrape_article(browser: Any, url: str) -> Dict[str, str]
Scrape a single article url.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`browser` | `Any` | A Playwright Chromium browser. | required |
`url` | `str` | URL of the article to scrape. | required |

Returns:

Type | Description |
---|---|
`Dict[str, str]` | A mapping of article attributes to their values. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
get_article_urls #
get_article_urls(browser: Any, root_url: str, current_url: str) -> List[str]
Recursively crawl through the knowledge base to find a list of articles.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`browser` | `Any` | A Playwright Chromium browser. | required |
`root_url` | `str` | Root URL of the knowledge base. | required |
`current_url` | `str` | Current URL being crawled. | required |

Returns:

Type | Description |
---|---|
`List[str]` | A list of URLs of found articles. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
MainContentExtractorReader #
Bases: BaseReader
MainContentExtractor web page reader.
Reads pages from the web.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text_format` | `str` | The format of the text. Defaults to "markdown". Requires the `MainContentExtractor` package. | `'markdown'` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/main_content_extractor/base.py
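A minimal sketch; the URL is a placeholder:

```python
from llama_index.readers.web import MainContentExtractorReader

reader = MainContentExtractorReader(text_format="markdown")
documents = reader.load_data(["https://example.com"])
```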
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the input URLs.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/main_content_extractor/base.py
NewsArticleReader #
Bases: BaseReader
Simple news article reader.
Reads news articles from the web and parses them using the `newspaper` library.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text_mode` | `bool` | Whether to load a text version or HTML version of the content (default=True). | `True` |
`use_nlp` | `bool` | Whether to use NLP to extract additional summary and keywords (default=True). | `True` |
`newspaper_kwargs` | `Any` | Additional keyword arguments to pass to newspaper.Article. See https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#article | `{}` |
|
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/news/base.py
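A usage sketch; the article URL is a placeholder, and `use_nlp=False` skips the extra summary/keyword extraction step (which needs NLTK data):

```python
from llama_index.readers.web import NewsArticleReader

reader = NewsArticleReader(text_mode=True, use_nlp=False)
documents = reader.load_data(
    ["https://www.example.com/news/some-article"]  # placeholder URL
)
```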
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the list of news article urls.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs of news articles to load. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/news/base.py
ReadabilityWebPageReader #
Bases: BaseReader
Readability Webpage Loader.
Extracts relevant information from a fully rendered web page. It is assumed that web pages used as data sources contain textual content.
- Load the page and wait for it to render (playwright).
- Inject Readability.js to extract the main content.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`proxy` | `Optional[str]` | Proxy server. Defaults to None. | `None` |
`wait_until` | `Optional[Literal['commit', 'domcontentloaded', 'load', 'networkidle']]` | Wait until the page is loaded. Defaults to "domcontentloaded". | `'domcontentloaded'` |
`text_splitter` | `TextSplitter` | Text splitter. Defaults to None. | `None` |
`normalizer` | `Optional[Callable[[str], str]]` | Text normalizer. Defaults to nfkc_normalize. | required |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/readability_web/base.py
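A usage sketch driving the async API documented below; the URL is a placeholder:

```python
import asyncio

from llama_index.readers.web import ReadabilityWebPageReader

reader = ReadabilityWebPageReader(wait_until="networkidle")
documents = asyncio.run(
    reader.async_load_data("https://example.com/article")
)
```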
async_load_data async #
async_load_data(url: str) -> List[Document]
Render and load data content from url.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`url` | `str` | URL to scrape. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/readability_web/base.py
scrape_page async #
scrape_page(browser: Browser, url: str) -> Dict[str, str]
Scrape a single article url.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`browser` | `Any` | A Playwright Chromium browser. | required |
`url` | `str` | URL of the article to scrape. | required |
Returns:

`Dict[str, str]`: a mapping of article attributes, per Readability.js (see https://github.com/mozilla/readability):

Name | Description |
---|---|
`title` | Article title. |
`content` | HTML string of processed article content. |
`textContent` | Text content of the article, with all the HTML tags removed. |
`length` | Length of the article, in characters. |
`excerpt` | Article description, or short excerpt from the content. |
`byline` | Author metadata. |
`dir` | Content direction. |
`siteName` | Name of the site. |
`lang` | Content language. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/readability_web/base.py
RssNewsReader #
Bases: BaseReader
RSS news reader.
Reads news content from RSS feeds and parses with NewsArticleReader.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss_news/base.py
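A minimal sketch; the feed URL is a placeholder:

```python
from llama_index.readers.web import RssNewsReader

reader = RssNewsReader()
documents = reader.load_data(
    urls=["https://rss.nytimes.com/services/xml/rss/nyt/World.xml"]
)
```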
load_data #
load_data(urls: List[str] = None, opml: str = None) -> List[Document]
Load data from either RSS feeds or OPML.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of RSS feed URLs to load. | `None` |
`opml` | `str` | URL to an OPML file, or OPML content as a string or bytes. | `None` |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss_news/base.py
RssReader #
Bases: BasePydanticReader
RSS reader.
Reads content from an RSS feed.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss/base.py
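A minimal sketch; the feed URL is a placeholder:

```python
from llama_index.readers.web import RssReader

reader = RssReader()
documents = reader.load_data(["https://news.ycombinator.com/rss"])
```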
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from RSS feeds.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of RSS feed URLs to load. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss/base.py
ScrapflyReader #
Bases: BasePydanticReader
Turn a URL into LLM-accessible markdown with Scrapfly.io.

Args:

- api_key: The Scrapfly API key.
- scrape_config: The Scrapfly ScrapeConfig object.
- ignore_scrape_failures: Whether to continue on failures.
- urls: List of URLs to scrape.
- scrape_format: Scrape result format ("markdown" or "text").

For further details, visit: https://scrapfly.io/docs/sdk/python
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/scrapfly_web/base.py
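A hedged sketch; the API key and URL are placeholders:

```python
from llama_index.readers.web import ScrapflyReader

reader = ScrapflyReader(
    api_key="scp-...",  # placeholder key
    ignore_scrape_failures=True,
)
documents = reader.load_data(
    urls=["https://example.com"], scrape_format="markdown"
)
```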
load_data #
load_data(urls: List[str], scrape_format: Literal['markdown', 'text'] = 'markdown', scrape_config: Optional[dict] = None) -> List[Document]
Load data from the urls.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |
`scrape_config` | `Optional[dict]` | Dictionary of Scrapfly scrape config options. | `None` |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |

Raises:

Type | Description |
---|---|
`ValueError` | If URLs aren't provided. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/scrapfly_web/base.py
SimpleWebPageReader #
Bases: BasePydanticReader
Simple web page reader.
Reads pages from the web.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`html_to_text` | `bool` | Whether to convert HTML to text. Requires the `html2text` package. | `False` |
`metadata_fn` | `Optional[Callable[[str], Dict]]` | A function that takes in a URL and returns a dictionary of metadata. Default is None. | `None` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py
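A minimal sketch; the URL is a placeholder:

```python
from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(
    html_to_text=True,  # requires the html2text package
    metadata_fn=lambda url: {"source": url},
)
documents = reader.load_data(["https://example.com"])
```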
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the input URLs.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py
SitemapReader #
Bases: BaseReader
Asynchronous sitemap reader for web.
Reads pages from the web based on their sitemap.xml.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`sitemap_url` | `string` | Path to the sitemap.xml, e.g. https://gpt-index.readthedocs.io/sitemap.xml. | required |
`html_to_text` | `bool` | Whether to convert HTML to text. Requires the `html2text` package. | `False` |
`limit` | `int` | Maximum number of concurrent requests. | `10` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/sitemap/base.py
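A hedged sketch: the load method is not documented in this section, so the `load_data(sitemap_url=...)` call below is an assumption; check the class source for the exact signature.

```python
from llama_index.readers.web import SitemapReader

reader = SitemapReader(html_to_text=False, limit=10)
# Assumption: the sitemap URL is passed to load_data
documents = reader.load_data(
    sitemap_url="https://gpt-index.readthedocs.io/sitemap.xml"
)
```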
TrafilaturaWebReader #
Bases: BasePydanticReader
Trafilatura web page reader.
Reads pages from the web.
Requires the `trafilatura` package.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/trafilatura_web/base.py
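A minimal sketch using the `load_data` options documented below; the URL is a placeholder:

```python
from llama_index.readers.web import TrafilaturaWebReader

reader = TrafilaturaWebReader()
documents = reader.load_data(
    ["https://example.com"],
    include_links=False,
    show_progress=True,
)
```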
class_name classmethod #
class_name() -> str
Get the name identifier of the class.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/trafilatura_web/base.py
load_data #
load_data(urls: List[str], include_comments=True, output_format='txt', include_tables=True, include_images=False, include_formatting=False, include_links=False, show_progress=False, no_ssl=False, **kwargs) -> List[Document]
Load data from the urls.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`urls` | `List[str]` | List of URLs to scrape. | required |
`include_comments` | `bool` | Include comments in the output. Defaults to True. | `True` |
`output_format` | `str` | Output format. Defaults to 'txt'. | `'txt'` |
`include_tables` | `bool` | Include tables in the output. Defaults to True. | `True` |
`include_images` | `bool` | Include images in the output. Defaults to False. | `False` |
`include_formatting` | `bool` | Include formatting in the output. Defaults to False. | `False` |
`include_links` | `bool` | Include links in the output. Defaults to False. | `False` |
`show_progress` | `bool` | Show a progress bar. Defaults to False. | `False` |
`no_ssl` | `bool` | Bypass SSL verification. Defaults to False. | `False` |
`kwargs` | | Additional keyword arguments for the underlying trafilatura call. | `{}` |

Returns:

Type | Description |
---|---|
`List[Document]` | List of documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/trafilatura_web/base.py
UnstructuredURLLoader #
Bases: BaseReader
Loader that uses unstructured to load HTML files.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/unstructured_web/base.py
load_data #
load_data() -> List[Document]
Load documents from the configured URLs.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/unstructured_web/base.py
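A hedged sketch: since `load_data()` takes no arguments, the URLs are assumed to be given at construction time; check the class source for the exact constructor signature.

```python
from llama_index.readers.web import UnstructuredURLLoader

# Assumption: URLs are passed to the constructor
loader = UnstructuredURLLoader(urls=["https://example.com"])
documents = loader.load_data()
```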
WholeSiteReader #
Bases: BaseReader
BFS Web Scraper for websites.
This class provides functionality to scrape entire websites using a breadth-first search algorithm. It navigates web pages from a given base URL, following links that match a specified prefix.
Attributes:

Name | Type | Description |
---|---|---|
`prefix` | `str` | URL prefix to focus the scraping. |
`max_depth` | `int` | Maximum depth for the BFS algorithm. |
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`prefix` | `str` | URL prefix for scraping. | required |
`max_depth` | `int` | Maximum depth for BFS. Defaults to 10. | `10` |
`uri_as_id` | `bool` | Whether to use the URI as the document ID. Defaults to False. | `False` |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py
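A usage sketch; the URLs are placeholders, and a local Chrome plus chromedriver setup is needed for Selenium:

```python
from llama_index.readers.web import WholeSiteReader

scraper = WholeSiteReader(
    prefix="https://docs.llamaindex.ai/en/stable/",  # crawl only under this prefix
    max_depth=3,
)
documents = scraper.load_data(
    base_url="https://docs.llamaindex.ai/en/stable/"
)
```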
setup_driver #
setup_driver()
Sets up the Selenium WebDriver for Chrome.
Returns:

Type | Description |
---|---|
`WebDriver` | An instance of Chrome WebDriver. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py
load_data #
load_data(base_url: str) -> List[Document]
Load data from the base URL using BFS algorithm.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`base_url` | `str` | Base URL to start scraping. | required |

Returns:

Type | Description |
---|---|
`List[Document]` | List of scraped documents. |
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py