Simple Directory Reader#
The SimpleDirectoryReader
is the most commonly used data connector that just works.
Simply pass in a input directory or a list of files.
It will select the best file reader based on the file extensions.
Get Started#
If you’re opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
!pip install llama-index
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay1.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay2.txt'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay3.txt'
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.
--2023-12-21 16:34:53-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay1.txt’
data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.04s
2023-12-21 16:34:53 (1.81 MB/s) - ‘data/paul_graham/paul_graham_essay1.txt’ saved [75042/75042]
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.
--2023-12-21 16:34:53-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay2.txt’
data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.04s
2023-12-21 16:34:53 (1.78 MB/s) - ‘data/paul_graham/paul_graham_essay2.txt’ saved [75042/75042]
Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/loganm/.wget-hsts'. HSTS will be disabled.
--2023-12-21 16:34:54-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay3.txt’
data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.04s
2023-12-21 16:34:54 (1.80 MB/s) - ‘data/paul_graham/paul_graham_essay3.txt’ saved [75042/75042]
from llama_index import SimpleDirectoryReader
Load specific files
reader = SimpleDirectoryReader(
input_files=["./data/paul_graham/paul_graham_essay1.txt"]
)
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 1 docs
Load all (top-level) files from directory
reader = SimpleDirectoryReader(input_dir="./data/paul_graham/")
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 3 docs
Load all (recursive) files from directory
!mkdir -p 'data/paul_graham/nested'
!echo "This is a nested file" > 'data/paul_graham/nested/nested_file.md'
# only load markdown files
required_exts = [".md"]
reader = SimpleDirectoryReader(
input_dir="./data",
required_exts=required_exts,
recursive=True,
)
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
Loaded 1 docs
Create an iterator to load files and process them as they load
reader = SimpleDirectoryReader(
input_dir="./data",
recursive=True,
)
all_docs = []
for docs in reader.iter_data():
for doc in docs:
# do something with the doc
doc.text = doc.text.upper()
all_docs.append(doc)
print(len(all_docs))
4
Full Configuration#
This is the full list of arguments that can be passed to the SimpleDirectoryReader
:
class SimpleDirectoryReader(BaseReader):
"""Simple directory reader.
Load files from file directory.
Automatically select the best file reader given file extensions.
Args:
input_dir (str): Path to the directory.
input_files (List): List of file paths to read
(Optional; overrides input_dir, exclude)
exclude (List): glob of python file paths to exclude (Optional)
exclude_hidden (bool): Whether to exclude hidden files (dotfiles).
encoding (str): Encoding of the files.
Default is utf-8.
errors (str): how encoding and decoding errors are to be handled,
see https://docs.python.org/3/library/functions.html#open
recursive (bool): Whether to recursively search in subdirectories.
False by default.
filename_as_id (bool): Whether to use the filename as the document id.
False by default.
required_exts (Optional[List[str]]): List of required extensions.
Default is None.
file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file
extension to a BaseReader class that specifies how to convert that file
to text. If not specified, use default from DEFAULT_FILE_READER_CLS.
num_files_limit (Optional[int]): Maximum number of files to read.
Default is None.
file_metadata (Optional[Callable[str, Dict]]): A function that takes
in a filename and returns a Dict of metadata for the Document.
Default is None.
"""