SimpleDirectoryReader

class llama_index.readers.SimpleDirectoryReader(input_dir: Optional[str] = None, input_files: Optional[List] = None, exclude: Optional[List] = None, exclude_hidden: bool = True, errors: str = 'ignore', recursive: bool = False, encoding: str = 'utf-8', filename_as_id: bool = False, required_exts: Optional[List[str]] = None, file_extractor: Optional[Dict[str, BaseReader]] = None, num_files_limit: Optional[int] = None, file_metadata: Optional[Callable[[str], Dict]] = None)

Bases: BaseReader

Simple directory reader.

Load files from a directory. The reader for each file is selected automatically based on its file extension.

Parameters
  • input_dir (str) – Path to the directory.

  • input_files (List) – List of file paths to read (Optional; overrides input_dir, exclude)

  • exclude (List) – List of glob patterns for file paths to exclude (Optional)

  • exclude_hidden (bool) – Whether to exclude hidden files (dotfiles).

  • encoding (str) – Encoding of the files. Default is utf-8.

  • errors (str) – How encoding and decoding errors are to be handled; see https://docs.python.org/3/library/functions.html#open

  • recursive (bool) – Whether to recursively search in subdirectories. False by default.

  • filename_as_id (bool) – Whether to use the filename as the document id. False by default.

  • required_exts (Optional[List[str]]) – List of required extensions. Default is None.

  • file_extractor (Optional[Dict[str, BaseReader]]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text. If not specified, use default from DEFAULT_FILE_READER_CLS.

  • num_files_limit (Optional[int]) – Maximum number of files to read. Default is None.

  • file_metadata (Optional[Callable[[str], Dict]]) – A function that takes in a filename and returns a Dict of metadata for the Document. Default is None.
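A minimal sketch of how the constructor parameters above might be combined. The helper names basic_file_metadata and build_reader are hypothetical, and the dotted form of the required_exts entries (".md" rather than "md") is an assumption; only the keyword names come from the signature above.

```python
from pathlib import Path
from typing import Any, Dict


def basic_file_metadata(filename: str) -> Dict[str, Any]:
    """Hypothetical file_metadata callback: takes a filename, returns a dict."""
    path = Path(filename)
    return {"file_name": path.name, "extension": path.suffix.lower()}


def build_reader(input_dir: str):
    """Build a reader for input_dir (hypothetical helper, needs llama_index)."""
    # Imported lazily so basic_file_metadata can be used without llama_index.
    from llama_index.readers import SimpleDirectoryReader

    return SimpleDirectoryReader(
        input_dir=input_dir,
        recursive=True,                  # also search subdirectories
        required_exts=[".md", ".pdf"],   # assumption: extensions include the dot
        num_files_limit=100,             # stop after at most 100 files
        file_metadata=basic_file_metadata,
    )
```

The callback can be checked on its own before it is passed to the reader, for example by calling basic_file_metadata("docs/Guide.MD").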

Attributes Summary

Methods Summary

is_hidden(path)

iter_data([show_progress])

Load data iteratively from the input directory.

load_data([show_progress, num_workers])

Load data from the input directory.

load_file(input_file, file_metadata, ...[, ...])

Static method for loading file.

Attributes Documentation

supported_suffix = ['.hwp', '.pdf', '.docx', '.pptx', '.ppt', '.pptm', '.jpg', '.png', '.jpeg', '.mp3', '.mp4', '.csv', '.epub', '.md', '.mbox', '.ipynb']
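The following sketch shows one way to check in advance which files in a listing have a suffix from the list above. It does not call the library: SUPPORTED_SUFFIX is a copy of the documented value, split_by_supported_suffix is a hypothetical helper, and lowercasing the suffix before comparison is an assumption.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Copied from the supported_suffix attribute documented above.
SUPPORTED_SUFFIX = [
    ".hwp", ".pdf", ".docx", ".pptx", ".ppt", ".pptm", ".jpg", ".png",
    ".jpeg", ".mp3", ".mp4", ".csv", ".epub", ".md", ".mbox", ".ipynb",
]


def split_by_supported_suffix(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by whether their suffix appears in SUPPORTED_SUFFIX."""
    groups: Dict[str, List[str]] = {"listed": [], "other": []}
    for p in paths:
        # Assumption: comparison is case-insensitive, so lowercase the suffix.
        suffix = Path(p).suffix.lower()
        key = "listed" if suffix in SUPPORTED_SUFFIX else "other"
        groups[key].append(p)
    return groups


groups = split_by_supported_suffix(["report.pdf", "notes.txt", "README.MD"])
```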

Methods Documentation

is_hidden(path: Path) → bool

iter_data(show_progress: bool = False) → Generator[List[Document], Any, Any]

Load data iteratively from the input directory.

Parameters

show_progress (bool) – Whether to show tqdm progress bars. Defaults to False.

Returns

A generator that yields lists of documents.

Return type

Generator[List[Document]]
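Because iter_data yields lists of documents rather than single documents, callers usually iterate over two levels. The sketch below illustrates that pattern with a stand-in generator; flatten_batches and fake_iter_data are hypothetical names, and plain strings stand in for Document objects.

```python
from typing import Any, Iterable, Iterator, List


def flatten_batches(batches: Iterable[List[Any]]) -> Iterator[Any]:
    """Yield items one at a time from an iterable of lists."""
    for batch in batches:
        yield from batch


def fake_iter_data() -> Iterator[List[str]]:
    """Stand-in for reader.iter_data(); yields lists of placeholder documents."""
    yield ["doc-a1", "doc-a2"]
    yield ["doc-b1"]


docs = list(flatten_batches(fake_iter_data()))
```

With a real reader, the same helper would presumably be called as flatten_batches(reader.iter_data()), so documents can be processed as they are produced instead of after everything has loaded.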

load_data(show_progress: bool = False, num_workers: Optional[int] = None) → List[Document]

Load data from the input directory.

Parameters

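  • show_progress (bool) – Whether to show tqdm progress bars. Defaults to False.

  • num_workers (Optional[int]) – Number of workers to use for loading data in parallel. Defaults to None.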

Returns

A list of documents.

Return type

List[Document]
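A minimal sketch of calling load_data with the keyword arguments from the signature above. load_directory is a hypothetical wrapper, and the import path follows the class path shown at the top of this page; num_workers is passed through unchanged, with None as the documented default.

```python
from typing import Any, List, Optional


def load_directory(input_dir: str, num_workers: Optional[int] = None) -> List[Any]:
    """Load every readable file in input_dir (hypothetical helper, needs llama_index)."""
    # Imported lazily so this module can be imported without llama_index.
    from llama_index.readers import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=input_dir)
    # show_progress=True enables the tqdm progress bars described above.
    return reader.load_data(show_progress=True, num_workers=num_workers)
```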

static load_file(input_file: Path, file_metadata: Callable[[str], Dict], file_extractor: Dict[str, BaseReader], filename_as_id: bool = False, encoding: str = 'utf-8', errors: str = 'ignore') → List[Document]

Static method for loading file.

NOTE: This is necessarily a static method so that it can be used for parallel processing.

Parameters
  • input_file (Path) – File path to read.

  • file_metadata (Callable[[str], Dict]) – A function that takes in a filename and returns a Dict of metadata for the Document.

  • file_extractor (Dict[str, BaseReader]) – A mapping of file extension to a BaseReader class that specifies how to convert that file to text.

  • filename_as_id (bool, optional) – Whether to use the filename as the document id. Defaults to False.

  • encoding (str, optional) – Encoding of the files. Defaults to "utf-8".

  • errors (str, optional) – How encoding and decoding errors are to be handled. Defaults to "ignore". See https://docs.python.org/3/library/functions.html#open

Returns

The loaded documents.

Return type

List[Document]