SimpleDirectoryReader

class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, encoding: str = 'utf-8', filename_as_id: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)

Bases: BaseReader

Simple directory reader.

Loads files from a directory, automatically selecting the best file reader for each file based on its extension.

Parameters

input_dir (str) – Path to the directory.
input_files (List) – List of file paths to read (Optional; overrides input_dir, exclude)
exclude (List) – Python glob patterns of file paths to exclude (Optional)
exclude_hidden (bool) – Whether to exclude hidden files (dotfiles).
encoding (str) – Encoding of the files. Default is utf-8.
errors (str) – How encoding and decoding errors are to be handled; see https://docs.python.org/3/library/functions.html#open
recursive (bool) – Whether to recursively search in subdirectories. False by default.
filename_as_id (bool) – Whether to use the filename as the document id. False by default.
required_exts (Optional[List[str]]) – List of required extensions. Default is None.
file_extractor (Optional[Dict[str, BaseReader]]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.
num_files_limit (Optional[int]) – Maximum number of files to read. Default is None.
file_metadata (Optional[Callable[[str], Dict]]) – A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.
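
As a minimal sketch of how these parameters fit together, the snippet below builds a reader restricted to Markdown and PDF files and attaches per-file metadata. The helper names `basic_file_metadata` and `build_reader`, the `*.tmp` exclude pattern, and the limit of 100 files are illustrative assumptions, not part of the library; the import path follows the class path shown above and may differ between releases.

```python
import os
from typing import Dict


def basic_file_metadata(file_path: str) -> Dict:
    """Hypothetical helper with the Callable[[str], Dict] shape of file_metadata."""
    return {
        "file_name": os.path.basename(file_path),
        "extension": os.path.splitext(file_path)[1].lower(),
    }


def build_reader(input_dir: str):
    """Hypothetical helper: configure a reader for Markdown and PDF files."""
    # Imported lazily so basic_file_metadata can be used without llama_index.
    from llama_index.readers import SimpleDirectoryReader

    return SimpleDirectoryReader(
        input_dir=input_dir,
        recursive=True,                    # also search subdirectories
        required_exts=[".md", ".pdf"],     # only these extensions
        exclude=["*.tmp"],                 # illustrative glob pattern
        file_metadata=basic_file_metadata,
        num_files_limit=100,               # illustrative cap
    )
```

Calling `build_reader("./docs").load_data()` would then return the documents, assuming `./docs` exists and the library is installed.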

Attributes Summary

supported_suffix

Methods Summary

`is_hidden`(path)
`iter_data`([show_progress])	Load data iteratively from the input directory.
`load_data`([show_progress, num_workers])	Load data from the input directory.
`load_file`(input_file, file_metadata, ...[, ...])	Static method for loading file.

Attributes Documentation

supported_suffix = ['.hwp', '.pdf', '.docx', '.pptx', '.ppt', '.pptm', '.jpg', '.png', '.jpeg', '.mp3', '.mp4', '.csv', '.epub', '.md', '.mbox', '.ipynb']
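
The list above can be used to check ahead of time whether a file has one of the listed suffixes. The sketch below copies the list into a local constant; the helper name `has_listed_suffix` is hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior. How the reader treats files with other suffixes is not stated here.

```python
from pathlib import Path

# Copied from the supported_suffix attribute listed above.
SUPPORTED_SUFFIX = [
    ".hwp", ".pdf", ".docx", ".pptx", ".ppt", ".pptm", ".jpg", ".png",
    ".jpeg", ".mp3", ".mp4", ".csv", ".epub", ".md", ".mbox", ".ipynb",
]


def has_listed_suffix(path: str) -> bool:
    """Return True if the file's suffix appears in SUPPORTED_SUFFIX."""
    # Assumption: suffixes are compared case-insensitively.
    return Path(path).suffix.lower() in SUPPORTED_SUFFIX
```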

Methods Documentation

is_hidden(path: Path) → bool
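
This page gives no description for `is_hidden`. Since `exclude_hidden` refers to dotfiles, one plausible check is sketched below: a path counts as hidden when any of its components starts with a dot. This is an illustration under that assumption, with the hypothetical name `looks_hidden`, and not necessarily the library's implementation.

```python
from pathlib import Path


def looks_hidden(path: Path) -> bool:
    """Return True if any path component is a dotfile or dot-directory."""
    return any(
        part.startswith(".") and part not in (".", "..")
        for part in path.parts
    )
```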

iter_data(show_progress: bool = False) → Generator[List[Document], Any, Any]

Load data iteratively from the input directory.

Parameters: show_progress (bool) – Whether to show tqdm progress bars. Defaults to False.
Returns: A generator that yields lists of documents.
Return type: Generator[List[Document]]
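
Because `iter_data` returns a generator of document lists, results can be consumed one list at a time rather than all at once. The sketch below uses a hypothetical helper, `iter_batch_sizes`, that accepts any object exposing `iter_data`. What each yielded list corresponds to (for example, one input file) is not stated here.

```python
from typing import Any, Iterator


def iter_batch_sizes(reader: Any) -> Iterator[int]:
    """Yield the number of documents in each list produced by reader.iter_data()."""
    for docs in reader.iter_data(show_progress=False):
        # Each item is a list of Document objects; handle it, then move on.
        yield len(docs)
```

With a `SimpleDirectoryReader` instance named `reader`, `sum(iter_batch_sizes(reader))` would give the total number of documents without building the full list first.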

load_data(show_progress: bool = False, num_workers: Optional[int] = None) → List[Document]

Load data from the input directory.

Parameters

show_progress (bool) – Whether to show tqdm progress bars. Defaults to False.
num_workers (Optional[int]) – Number of workers to parallelize data loading over. Defaults to None.

Returns: A list of documents.
Return type: List[Document]
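
A minimal usage sketch for `load_data` follows. The wrapper name `load_directory` and its early directory check are assumptions added for illustration, not library behavior; the import path follows the class path shown at the top of this page.

```python
import os
from typing import Optional


def load_directory(
    input_dir: str,
    show_progress: bool = False,
    num_workers: Optional[int] = None,
):
    """Hypothetical wrapper: load every readable file under input_dir."""
    # Our own guard (not library behavior): fail early on a bad path.
    if not os.path.isdir(input_dir):
        raise NotADirectoryError(input_dir)

    # Imported lazily; requires llama_index to be installed.
    from llama_index.readers import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=input_dir)
    # Passing num_workers=None keeps the default behavior.
    return reader.load_data(show_progress=show_progress, num_workers=num_workers)
```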

static load_file(input_file: Path, file_metadata: Callable[[str], Dict], file_extractor: Dict[str, BaseReader], filename_as_id: bool = False, encoding: str = 'utf-8', errors: str = 'ignore') → List[Document]

Static method for loading file.

NOTE: This is necessarily a static method so that it can be used for parallel processing.

Parameters

input_file (Path) – File path to read.
file_metadata (Callable[[str], Dict]) – A function that takes in a filename and returns a Dict of metadata for the Document.
file_extractor (Dict[str, BaseReader]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text.
filename_as_id (bool, optional) – Whether to use the filename as the document id. Defaults to False.
encoding (str, optional) – Encoding of the files. Defaults to “utf-8”.
errors (str, optional) – How encoding and decoding errors are to be handled; see https://docs.python.org/3/library/functions.html#open. Defaults to “ignore”.

Returns: The loaded documents.
Return type: List[Document]
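
A sketch of calling `load_file` directly on a single path follows. The helper names `name_only_metadata` and `load_single_file` are hypothetical. Passing an empty `file_extractor` mapping assumes the method falls back to its default readers, which this page does not confirm.

```python
import os
from pathlib import Path
from typing import Dict


def name_only_metadata(file_path: str) -> Dict:
    """Hypothetical metadata function: record only the file's base name."""
    return {"file_name": os.path.basename(file_path)}


def load_single_file(path: str):
    """Hypothetical wrapper around the static load_file method."""
    # Imported lazily; requires llama_index to be installed.
    from llama_index.readers import SimpleDirectoryReader

    return SimpleDirectoryReader.load_file(
        input_file=Path(path),
        file_metadata=name_only_metadata,
        file_extractor={},      # assumption: defaults are used when empty
        filename_as_id=True,    # use the filename as the document id
        encoding="utf-8",
        errors="ignore",
    )
```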