SimpleDirectoryReader
- class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, encoding: str = 'utf-8', filename_as_id: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)
Bases: BaseReader
Simple directory reader.
Loads files from a directory and automatically selects the best file reader for each file based on its extension.
- Parameters
input_dir (str) – Path to the directory.
input_files (List) – List of file paths to read (Optional; overrides input_dir, exclude).
exclude (List) – Glob patterns for file paths to exclude (Optional).
exclude_hidden (bool) – Whether to exclude hidden files (dotfiles).
encoding (str) – Encoding of the files. Default is utf-8.
errors (str) – How encoding and decoding errors are to be handled; see https://docs.python.org/3/library/functions.html#open
recursive (bool) – Whether to recursively search in subdirectories. False by default.
filename_as_id (bool) – Whether to use the filename as the document id. False by default.
required_exts (Optional[List[str]]) – List of required extensions. Default is None.
file_extractor (Optional[Dict[str, BaseReader]]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, the default from DEFAULT_FILE_READER_CLS is used.
num_files_limit (Optional[int]) – Maximum number of files to read. Default is None.
file_metadata (Optional[Callable[[str], Dict]]) – A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.
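The constructor parameters above can be combined as in the following minimal sketch. The `file_metadata` callback and the `build_reader` helper are hypothetical names introduced for illustration, and the chosen extensions and file limit are arbitrary. The import is deferred into the helper so the callback can be exercised without llama_index installed.

```python
from pathlib import Path
from typing import Dict


def file_metadata(filename: str) -> Dict:
    """Hypothetical callback: receives a filename and returns a metadata dict."""
    path = Path(filename)
    return {"file_name": path.name, "extension": path.suffix.lower()}


def build_reader(input_dir: str):
    """Hypothetical helper that configures a reader for one directory."""
    # Lazy import: only needed when a reader is actually built.
    from llama_index.readers import SimpleDirectoryReader

    return SimpleDirectoryReader(
        input_dir=input_dir,
        recursive=True,                 # also search subdirectories
        required_exts=[".md", ".pdf"],  # illustrative choice of extensions
        num_files_limit=100,            # illustrative cap on files read
        file_metadata=file_metadata,    # attach metadata to each Document
    )
```

Calling `build_reader("./data")` would return a configured reader, where `./data` stands in for any existing directory.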
Attributes Summary
supported_suffix
Methods Summary
is_hidden(path)
iter_data([show_progress]) – Load data iteratively from the input directory.
load_data([show_progress, num_workers]) – Load data from the input directory.
load_file(input_file, file_metadata, ...[, ...]) – Static method for loading file.
Attributes Documentation
- supported_suffix = ['.hwp', '.pdf', '.docx', '.pptx', '.ppt', '.pptm', '.jpg', '.png', '.jpeg', '.mp3', '.mp4', '.csv', '.epub', '.md', '.mbox', '.ipynb']
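As a rough illustration of how the suffix list can be used, the sketch below splits a set of paths into those whose extension appears in `supported_suffix` and those whose extension does not. The list is copied from the attribute above. The `partition_by_suffix` helper is hypothetical, and lowercasing the suffix before comparison is an assumption of this sketch, not documented behavior.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Copied from the supported_suffix attribute documented above.
SUPPORTED_SUFFIX = [
    ".hwp", ".pdf", ".docx", ".pptx", ".ppt", ".pptm", ".jpg", ".png",
    ".jpeg", ".mp3", ".mp4", ".csv", ".epub", ".md", ".mbox", ".ipynb",
]


def partition_by_suffix(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split paths by whether their suffix is listed."""
    listed: List[str] = []
    unlisted: List[str] = []
    for p in paths:
        # Assumption: compare suffixes case-insensitively.
        if Path(p).suffix.lower() in SUPPORTED_SUFFIX:
            listed.append(str(p))
        else:
            unlisted.append(str(p))
    return listed, unlisted
```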
Methods Documentation
- iter_data(show_progress: bool = False) → Generator[List[Document], Any, Any]
Load data iteratively from the input directory.
- Parameters
show_progress (bool) – Whether to show tqdm progress bars. Defaults to False.
- Returns
A generator that yields lists of documents.
- Return type
Generator[List[Document]]
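Because iter_data returns a generator, documents can be consumed batch by batch instead of being collected into one list first. The sketch below counts documents that way. Both `count_documents` and `count_in_directory` are hypothetical helpers, and the first accepts any iterable of lists so it can be tried without llama_index installed.

```python
from typing import Any, Iterable, List


def count_documents(batches: Iterable[List[Any]]) -> int:
    """Consume batches one at a time, as yielded by iter_data(), and count items."""
    total = 0
    for batch in batches:
        total += len(batch)
    return total


def count_in_directory(input_dir: str) -> int:
    """Hypothetical helper: count the documents loaded from a directory."""
    # Lazy import: only needed when a real directory is read.
    from llama_index.readers import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=input_dir)
    return count_documents(reader.iter_data(show_progress=False))
```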
- load_data(show_progress: bool = False, num_workers: Optional[int] = None) → List[Document]
Load data from the input directory.
- Parameters
show_progress (bool) – Whether to show tqdm progress bars. Defaults to False.
num_workers (Optional[int]) – Number of workers to use for parallel loading. Defaults to None.
- Returns
A list of documents.
- Return type
List[Document]
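A minimal sketch of calling load_data, assuming a directory of readable files exists at the given path. The `load_directory` wrapper is a hypothetical name. It passes `num_workers` through unchanged, so the default of None matches the signature above.

```python
from typing import Optional


def load_directory(input_dir: str, num_workers: Optional[int] = None):
    """Hypothetical wrapper: load every file in input_dir into Document objects."""
    # Lazy import: only needed when the function is called.
    from llama_index.readers import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=input_dir, filename_as_id=True)
    # show_progress=True displays tqdm progress bars while loading.
    return reader.load_data(show_progress=True, num_workers=num_workers)
```

For example, `load_directory("./data", num_workers=4)` would request four workers, where `./data` is a placeholder path.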
- static load_file(input_file: Path, file_metadata: Callable[[str], Dict], file_extractor: Dict[str, BaseReader], filename_as_id: bool = False, encoding: str = 'utf-8', errors: str = 'ignore') → List[Document]
Static method for loading a file.
NOTE: This is necessarily a static method so that it can be used for parallel processing.
- Parameters
input_file (Path) – File path to read.
file_metadata (Callable[[str], Dict]) – A function that takes in a filename and returns a Dict of metadata for the Document.
file_extractor (Dict[str, BaseReader]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text.
filename_as_id (bool, optional) – Whether to use the filename as the document id. Defaults to False.
encoding (str, optional) – Encoding of the files. Defaults to 'utf-8'.
errors (str, optional) – How encoding and decoding errors are to be handled; see https://docs.python.org/3/library/functions.html#open. Defaults to 'ignore'.
- Returns
The loaded documents.
- Return type
List[Document]
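Since load_file is static, it can be called on the class without constructing a reader. The sketch below shows one way to do that. The `source_metadata` callback and `load_single` helper are hypothetical, and passing an empty `file_extractor` dict is an assumption that the method falls back to its default readers for known extensions.

```python
from pathlib import Path
from typing import Dict


def source_metadata(filename: str) -> Dict:
    """Hypothetical callback: record the originating path as metadata."""
    return {"source": filename}


def load_single(path: str):
    """Hypothetical helper: load one file via the static method."""
    # Lazy import: only needed when a file is actually loaded.
    from llama_index.readers import SimpleDirectoryReader

    return SimpleDirectoryReader.load_file(
        input_file=Path(path),
        file_metadata=source_metadata,
        file_extractor={},      # assumption: empty mapping, defaults apply
        filename_as_id=True,
        encoding="utf-8",
        errors="ignore",
    )
```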