Web
Init file.
AsyncWebPageReader #
Bases: BaseReader
Asynchronous web page reader.
Reads pages from the web asynchronously.
Parameters:

Name | Type | Description | Default
---|---|---|---
`html_to_text` | `bool` | Whether to convert HTML to text. Requires the `html2text` package. | `False`
`limit` | `int` | Maximum number of concurrent requests. | `10`
`dedupe` | `bool` | Whether to deduplicate URLs when there is an exact match within the given list. | `True`
`fail_on_error` | `bool` | If a requested URL does not return status code 200, raise a `ValueError`. | `False`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/async_web/base.py
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the input urls.
Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of URLs to scrape. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/async_web/base.py
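A minimal usage sketch, assuming the `llama-index-readers-web` package is installed; the URL is a placeholder:

```python
from llama_index.readers.web import AsyncWebPageReader

# Fetch pages concurrently; `limit` caps the number of in-flight requests.
reader = AsyncWebPageReader(html_to_text=True, limit=10, dedupe=True)
documents = reader.load_data(urls=["https://example.com"])
```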
BeautifulSoupWebReader #
Bases: BasePydanticReader
BeautifulSoup web page reader.
Reads pages from the web.
Requires the `bs4` and `urllib` packages.

Parameters:

Name | Type | Description | Default
---|---|---|---
`website_extractor` | `Optional[Dict[str, Callable]]` | A mapping of website hostname (e.g. google.com) to a function that specifies how to extract text from the BeautifulSoup obj. See DEFAULT_WEBSITE_EXTRACTOR. | `None`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/beautiful_soup_web/base.py
class_name classmethod #
class_name() -> str
Get the name identifier of the class.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/beautiful_soup_web/base.py
load_data #
load_data(urls: List[str], custom_hostname: Optional[str] = None, include_url_in_text: Optional[bool] = True) -> List[Document]
Load data from the urls.
Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of URLs to scrape. | *required*
`custom_hostname` | `Optional[str]` | Force a certain hostname in the case a website is displayed under custom URLs (e.g. Substack blogs). | `None`
`include_url_in_text` | `Optional[bool]` | Include the reference URL in the text of the document. | `True`

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/beautiful_soup_web/base.py
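A usage sketch, assuming `llama-index-readers-web` is installed; the Substack URL and hostname are illustrative:

```python
from llama_index.readers.web import BeautifulSoupWebReader

reader = BeautifulSoupWebReader()
# custom_hostname forces a specific extractor, useful when a site
# (e.g. a Substack blog) is served under a custom domain.
documents = reader.load_data(
    urls=["https://someblog.substack.com/p/some-post"],
    custom_hostname="substack.com",
)
```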
KnowledgeBaseWebReader #
Bases: BaseReader
Knowledge base reader.
Crawls and reads articles from a knowledge base/help center with Playwright.
Tested on Zendesk and Intercom CMS; it may work on others.
It can be run in headless mode, but may then be blocked by Cloudflare; run it headed to be safe.
It times out occasionally; if that happens, increase the default timeout.
Requires the `playwright` package.

Parameters:

Name | Type | Description | Default
---|---|---|---
`root_url` | `str` | The base URL of the knowledge base, with no trailing slash, e.g. 'https://support.intercom.com'. | *required*
`link_selectors` | `List[str]` | List of CSS selectors used to find links to articles while crawling, e.g. ['.article-list a', '.article-list a']. | *required*
`article_path` | `str` | The URL path of articles on this domain, so the crawler knows when to stop, e.g. '/articles'. | *required*
`title_selector` | `Optional[str]` | CSS selector to find the title of the article, e.g. '.article-title'. | `None`
`subtitle_selector` | `Optional[str]` | CSS selector to find the subtitle/description of the article, e.g. '.article-subtitle'. | `None`
`body_selector` | `Optional[str]` | CSS selector to find the body of the article, e.g. '.article-body'. | `None`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
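A construction sketch, assuming `llama-index-readers-web` and `playwright` are installed; the selector values are illustrative and must be found by inspecting the target site:

```python
from llama_index.readers.web import KnowledgeBaseWebReader

reader = KnowledgeBaseWebReader(
    root_url="https://support.intercom.com",
    link_selectors=[".article-list a"],
    article_path="/articles",
    title_selector=".article-title",
    body_selector=".article-body",
)
documents = reader.load_data()
```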
load_data #
load_data() -> List[Document]
Load data from the knowledge base.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
scrape_article #
scrape_article(browser: Any, url: str) -> Dict[str, str]
Scrape a single article url.
Parameters:

Name | Type | Description | Default
---|---|---|---
`browser` | `Any` | A Playwright Chromium browser. | *required*
`url` | `str` | URL of the article to scrape. | *required*

Returns:

Type | Description
---|---
`Dict[str, str]` | A mapping of article attributes to their values.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
get_article_urls #
get_article_urls(browser: Any, root_url: str, current_url: str) -> List[str]
Recursively crawl through the knowledge base to find a list of articles.
Parameters:

Name | Type | Description | Default
---|---|---|---
`browser` | `Any` | A Playwright Chromium browser. | *required*
`root_url` | `str` | Root URL of the knowledge base. | *required*
`current_url` | `str` | Current URL that is being crawled. | *required*

Returns:

Type | Description
---|---
`List[str]` | A list of URLs of found articles.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/knowledge_base/base.py
MainContentExtractorReader #
Bases: BaseReader
MainContentExtractor web page reader.
Reads pages from the web.
Parameters:

Name | Type | Description | Default
---|---|---|---
`text_format` | `str` | The format of the text. Defaults to "markdown". Requires the `MainContentExtractor` package. | `'markdown'`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/main_content_extractor/base.py
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the input URLs.

Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of URLs to scrape. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/main_content_extractor/base.py
NewsArticleReader #
Bases: BaseReader
Simple news article reader.
Reads news articles from the web and parses them using the `newspaper` library.

Parameters:

Name | Type | Description | Default
---|---|---|---
`text_mode` | `bool` | Whether to load a text version or HTML version of the content. | `True`
`use_nlp` | `bool` | Whether to use NLP to extract additional summary and keywords. | `True`
`newspaper_kwargs` | `Any` | Additional keyword arguments to pass to newspaper.Article. See https://newspaper.readthedocs.io/en/latest/user_guide/quickstart.html#article | `{}`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/news/base.py
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the list of news article urls.
Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of URLs to load news articles from. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/news/base.py
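A usage sketch, assuming `llama-index-readers-web` and `newspaper3k` are installed; the article URL is a placeholder:

```python
from llama_index.readers.web import NewsArticleReader

# use_nlp adds summary/keyword metadata via newspaper's NLP pass;
# disable it to skip that extra processing step.
reader = NewsArticleReader(use_nlp=False)
documents = reader.load_data(urls=["https://example.com/news/some-article"])
```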
ReadabilityWebPageReader #
Bases: BaseReader
Readability Webpage Loader.
Extracts relevant information from a fully rendered web page. During processing, it is assumed that web pages used as data sources contain textual content.

- Load the page and wait until it is rendered (Playwright).
- Inject Readability.js to extract the main content.

Parameters:

Name | Type | Description | Default
---|---|---|---
`proxy` | `Optional[str]` | Proxy server. Defaults to None. | `None`
`wait_until` | `Optional[Literal['commit', 'domcontentloaded', 'load', 'networkidle']]` | Wait until the page is loaded. Defaults to "domcontentloaded". | `'domcontentloaded'`
`text_splitter` | `TextSplitter` | Text splitter. Defaults to None. | `None`
`normalizer` | `Optional[Callable[[str], str]]` | Text normalizer. Defaults to nfkc_normalize. | *required*
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/readability_web/base.py
load_data #
load_data(url: str) -> List[Document]
Render and load data content from url.
Parameters:

Name | Type | Description | Default
---|---|---|---
`url` | `str` | URL to scrape. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/readability_web/base.py
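A usage sketch, assuming `llama-index-readers-web` and `playwright` are installed; the URL is a placeholder:

```python
from llama_index.readers.web import ReadabilityWebPageReader

reader = ReadabilityWebPageReader(wait_until="networkidle")
# Note: this reader's load_data takes a single URL, not a list.
documents = reader.load_data(url="https://example.com/article")
```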
scrape_page #
scrape_page(browser: Any, url: str) -> Dict[str, str]
Scrape a single article url.
Parameters:

Name | Type | Description | Default
---|---|---|---
`browser` | `Any` | A Playwright Chromium browser. | *required*
`url` | `str` | URL of the article to scrape. | *required*

Returns:

`Dict[str, str]`: a mapping of article attributes to their values, per Readability.js (see https://github.com/mozilla/readability):

Name | Description
---|---
`title` | Article title.
`content` | HTML string of processed article content.
`textContent` | Text content of the article, with all the HTML tags removed.
`length` | Length of the article, in characters.
`excerpt` | Article description, or short excerpt from the content.
`byline` | Author metadata.
`dir` | Content direction.
`siteName` | Name of the site.
`lang` | Content language.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/readability_web/base.py
RssNewsReader #
Bases: BaseReader
RSS news reader.
Reads news content from RSS feeds and parses with NewsArticleReader.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss_news/base.py
load_data #
load_data(urls: List[str] = None, opml: str = None) -> List[Document]
Load data from either RSS feeds or OPML.
Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of RSS URLs to load. | `None`
`opml` | `str` | URL to an OPML file, or a string/bytes of OPML content. | `None`

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss_news/base.py
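A usage sketch, assuming `llama-index-readers-web` is installed; the feed and OPML URLs are placeholders:

```python
from llama_index.readers.web import RssNewsReader

reader = RssNewsReader()

# Either pass feed URLs directly...
documents = reader.load_data(urls=["https://example.com/feed.xml"])

# ...or an OPML file describing multiple feeds.
documents = reader.load_data(opml="https://example.com/feeds.opml")
```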
RssReader #
Bases: BasePydanticReader
RSS reader.
Reads content from an RSS feed.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss/base.py
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from RSS feeds.
Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of RSS URLs to load. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/rss/base.py
SimpleWebPageReader #
Bases: BasePydanticReader
Simple web page reader.
Reads pages from the web.
Parameters:

Name | Type | Description | Default
---|---|---|---
`html_to_text` | `bool` | Whether to convert HTML to text. Requires the `html2text` package. | `False`
`metadata_fn` | `Optional[Callable[[str], Dict]]` | A function that takes in a URL and returns a dictionary of metadata. Default is None. | `None`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py
load_data #
load_data(urls: List[str]) -> List[Document]
Load data from the input URLs.

Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of URLs to scrape. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/simple_web/base.py
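A usage sketch, assuming `llama-index-readers-web` and `html2text` are installed; the URL and metadata function are illustrative:

```python
from llama_index.readers.web import SimpleWebPageReader

reader = SimpleWebPageReader(
    html_to_text=True,
    # metadata_fn receives each URL and returns a metadata dict
    # attached to the resulting Document.
    metadata_fn=lambda url: {"source": url},
)
documents = reader.load_data(urls=["https://example.com"])
```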
SitemapReader #
Bases: BaseReader
Asynchronous sitemap reader for web.
Reads pages from the web based on their sitemap.xml.
Parameters:

Name | Type | Description | Default
---|---|---|---
`sitemap_url` | `str` | Path to the sitemap.xml, e.g. https://gpt-index.readthedocs.io/sitemap.xml. | *required*
`html_to_text` | `bool` | Whether to convert HTML to text. Requires the `html2text` package. | `False`
`limit` | `int` | Maximum number of concurrent requests. | `10`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/sitemap/base.py
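A usage sketch, assuming `llama-index-readers-web` is installed; the sitemap URL is a placeholder:

```python
from llama_index.readers.web import SitemapReader

# Crawls every page listed in the sitemap, with at most `limit`
# concurrent requests (it wraps the async reader internally).
reader = SitemapReader(html_to_text=True, limit=10)
documents = reader.load_data(sitemap_url="https://example.com/sitemap.xml")
```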
TrafilaturaWebReader #
Bases: BasePydanticReader
Trafilatura web page reader.
Reads pages from the web.
Requires the `trafilatura` package.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/trafilatura_web/base.py
class_name classmethod #
class_name() -> str
Get the name identifier of the class.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/trafilatura_web/base.py
load_data #
load_data(urls: List[str], include_comments=True, output_format='txt', include_tables=True, include_images=False, include_formatting=False, include_links=False) -> List[Document]
Load data from the urls.
Parameters:

Name | Type | Description | Default
---|---|---|---
`urls` | `List[str]` | List of URLs to scrape. | *required*
`include_comments` | `bool` | Include comments in the output. Defaults to True. | `True`
`output_format` | `str` | Output format. Defaults to 'txt'. | `'txt'`
`include_tables` | `bool` | Include tables in the output. Defaults to True. | `True`
`include_images` | `bool` | Include images in the output. Defaults to False. | `False`
`include_formatting` | `bool` | Include formatting in the output. Defaults to False. | `False`
`include_links` | `bool` | Include links in the output. Defaults to False. | `False`

Returns:

Type | Description
---|---
`List[Document]` | List of documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/trafilatura_web/base.py
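A usage sketch, assuming `llama-index-readers-web` and `trafilatura` are installed; the URL is a placeholder:

```python
from llama_index.readers.web import TrafilaturaWebReader

reader = TrafilaturaWebReader()
documents = reader.load_data(
    urls=["https://example.com"],
    include_comments=False,  # drop user comments from extracted text
    include_tables=True,
)
```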
UnstructuredURLLoader #
Bases: BaseReader
Loader that uses unstructured to load HTML files.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/unstructured_web/base.py
load_data #
load_data() -> List[Document]
Load file.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/unstructured_web/base.py
WholeSiteReader #
Bases: BaseReader
BFS Web Scraper for websites.
This class provides functionality to scrape entire websites using a breadth-first search algorithm. It navigates web pages from a given base URL, following links that match a specified prefix.
Attributes:

Name | Type | Description
---|---|---
`prefix` | `str` | URL prefix to focus the scraping. 
`max_depth` | `int` | Maximum depth for the BFS algorithm.

Parameters:

Name | Type | Description | Default
---|---|---|---
`prefix` | `str` | URL prefix for scraping. | *required*
`max_depth` | `int` | Maximum depth for BFS. Defaults to 10. | `10`
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py
setup_driver #
setup_driver()
Sets up the Selenium WebDriver for Chrome.
Returns:

Name | Description
---|---
`WebDriver` | An instance of Chrome WebDriver.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py
load_data #
load_data(base_url: str) -> List[Document]
Load data from the base URL using BFS algorithm.
Parameters:

Name | Type | Description | Default
---|---|---|---
`base_url` | `str` | Base URL to start scraping from. | *required*

Returns:

Type | Description
---|---
`List[Document]` | List of scraped documents.
Source code in llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/whole_site/base.py
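The BFS crawl described above can be sketched in plain Python. This is an illustrative model of the algorithm (prefix filtering, depth limit, visited set) over a fake in-memory link graph, not the reader's actual Selenium implementation; all names and URLs here are hypothetical:

```python
from collections import deque
from typing import Dict, List

def bfs_crawl(links: Dict[str, List[str]], base_url: str,
              prefix: str, max_depth: int = 10) -> List[str]:
    """Visit pages breadth-first from base_url, following only links
    that start with `prefix`, down to at most `max_depth` hops."""
    visited = set()
    order = []  # pages in the order they would be scraped
    queue = deque([(base_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth or not url.startswith(prefix):
            continue
        visited.add(url)
        order.append(url)
        for link in links.get(url, []):
            queue.append((link, depth + 1))
    return order

# Fake site: one off-prefix link that the crawler must skip.
site = {
    "https://docs.example.com/": ["https://docs.example.com/a",
                                  "https://other.com/x"],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
}
print(bfs_crawl(site, "https://docs.example.com/", "https://docs.example.com"))
```

The real reader does the same traversal but fetches each URL with a Selenium Chrome driver and emits a Document per visited page.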