Parallel Processing SimpleDirectoryReader¶
In this notebook, we demonstrate how to use parallel processing when loading data with SimpleDirectoryReader. Parallel processing can be useful with heavier workloads, i.e., loading from a directory consisting of many files. (NOTE: if you are using Windows, you may see smaller gains when using parallel processing to load data. This has to do with differences in how multiprocessing works on Linux/macOS versus Windows; e.g., see here or here.)
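Because parallel loading spins up a multiprocessing pool under the hood (as the profile output further below confirms), scripts run on Windows, or on any platform where the spawn start method is the default, generally need the standard entry-point guard. A minimal sketch, assuming the same data directory used below:
# minimal sketch: guard the parallel load behind __main__ so that spawned
# worker processes (the default start method on Windows) don't re-run it
from llama_index.core import SimpleDirectoryReader

def main():
    reader = SimpleDirectoryReader(input_dir="./data/source_files")
    documents = reader.load_data(num_workers=10)
    print(len(documents))

if __name__ == "__main__":
    main()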
import cProfile, pstats
from pstats import SortKey
In this demo, we'll use the PatronusAIFinanceBenchDataset llama-dataset from LlamaHub. This dataset is based on a set of 32 PDF files, which are included in the download from LlamaHub.
!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data
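Alternatively, the same dataset can be fetched from Python; a sketch assuming the download_llama_dataset helper shipped with recent llama-index-core releases:
# sketch: download the dataset from Python instead of the CLI
from llama_index.core.llama_dataset import download_llama_dataset

# returns the labelled RAG dataset along with the source documents
rag_dataset, documents = download_llama_dataset(
    "PatronusAIFinanceBenchDataset", "./data"
)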
from llama_index.core import SimpleDirectoryReader
# define our reader with the directory containing the 32 pdf files
reader = SimpleDirectoryReader(input_dir="./data/source_files")
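If the directory held more than just the PDFs, the reader could be narrowed down; a sketch using the optional required_exts and recursive parameters of SimpleDirectoryReader:
# sketch: only load PDFs and recurse into any subdirectories
reader = SimpleDirectoryReader(
    input_dir="./data/source_files",
    required_exts=[".pdf"],  # skip any non-PDF files
    recursive=True,  # also walk nested directories
)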
Sequential Load¶
Sequential loading is the default behaviour and can be executed via the load_data() method.
documents = reader.load_data()
len(documents)
4306
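Loading all 4,306 documents at once keeps everything in memory. If that is a concern, SimpleDirectoryReader also exposes an iter_data() generator that yields the documents for one file at a time; a minimal sketch with the same reader:
# sketch: stream documents file by file instead of materializing them all
for docs in reader.iter_data():
    # docs holds the Document objects parsed from a single file
    print(docs[0].metadata.get("file_name"), len(docs))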
cProfile.run("reader.load_data()", "oldstats")
p = pstats.Stats("oldstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
Wed Jan 10 12:40:50 2024    oldstats

         1857432165 function calls (1853977584 primitive calls) in 391.159 seconds

   Ordered by: cumulative time
   List reduced from 292 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000  391.159  391.159 {built-in method builtins.exec}
        1    0.003    0.003  391.158  391.158 <string>:1(<module>)
        1    0.000    0.000  391.156  391.156 base.py:367(load_data)
       32    0.000    0.000  391.153   12.224 base.py:256(load_file)
       32    0.127    0.004  391.149   12.223 docs_reader.py:24(load_data)
     4306    1.285    0.000  387.685    0.090 _page.py:2195(extract_text)
 4444/4306   5.984    0.001  386.399    0.090 _page.py:1861(_extract_text)
     4444    0.006    0.000  270.543    0.061 _data_structures.py:1220(operations)
     4444   43.270    0.010  270.536    0.061 _data_structures.py:1084(_parse_content_stream)
 36489963/33454574   32.688    0.000  167.817    0.000 _data_structures.py:1248(read_object)
 23470599   19.764    0.000  100.843    0.000 _page.py:1944(process_operation)
 48258569   37.205    0.000   75.145    0.000 _utils.py:200(read_until_regex)
 25208954   11.215    0.000   64.272    0.000 _base.py:481(read_from_stream)
 18016574   23.488    0.000   49.305    0.000 __init__.py:88(crlf_space_check)
  8642699   20.779    0.000   48.224    0.000 _utils.py:14(read_hex_string_from_stream)
<pstats.Stats at 0x16bb3d300>
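cProfile adds measurement overhead of its own; for a quick wall-clock number you could also time the call directly, a minimal sketch using only the standard library:
# sketch: simple wall-clock timing without profiler overhead
import time

start = time.perf_counter()
documents = reader.load_data()
print(f"sequential load took {time.perf_counter() - start:.1f}s")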
Parallel Load¶
To load using parallel processes, we set num_workers to a positive integer value.
documents = reader.load_data(num_workers=10)
len(documents)
4306
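A reasonable value for num_workers depends on your machine; a common heuristic (an assumption here, not something the library mandates) is to size the pool by the number of available CPU cores:
# sketch: size the worker pool from the available CPU cores
import os

num_workers = min(10, os.cpu_count() or 1)
documents = reader.load_data(num_workers=num_workers)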
cProfile.run("reader.load_data(num_workers=10)", "newstats")
p = pstats.Stats("newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)
Wed Jan 10 13:05:13 2024    newstats

         12539 function calls in 31.319 seconds

   Ordered by: cumulative time
   List reduced from 212 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   31.319   31.319 {built-in method builtins.exec}
        1    0.003    0.003   31.319   31.319 <string>:1(<module>)
        1    0.000    0.000   31.316   31.316 base.py:367(load_data)
       24    0.000    0.000   31.139    1.297 threading.py:589(wait)
       23    0.000    0.000   31.139    1.354 threading.py:288(wait)
      155   31.138    0.201   31.138    0.201 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000   31.133   31.133 pool.py:369(starmap)
        1    0.000    0.000   31.133   31.133 pool.py:767(get)
        1    0.000    0.000   31.133   31.133 pool.py:764(wait)
        1    0.000    0.000    0.155    0.155 context.py:115(Pool)
        1    0.000    0.000    0.155    0.155 pool.py:183(__init__)
        1    0.000    0.000    0.153    0.153 pool.py:305(_repopulate_pool)
        1    0.001    0.001    0.153    0.153 pool.py:314(_repopulate_pool_static)
       10    0.001    0.000    0.152    0.015 process.py:110(start)
       10    0.001    0.000    0.150    0.015 context.py:285(_Popen)
<pstats.Stats at 0x29408ab30>
In Conclusion¶
391 / 30
13.033333333333333
As one can observe from the results above, there is a ~13x speed-up (i.e., roughly a 1200% increase in speed) when using parallel processing to load data from a directory containing many files.
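If you want the exact ratio rather than the rounded figure above, the two saved profile dumps can provide it; a sketch that relies on the total_tt attribute of pstats.Stats:
# sketch: compute the exact speed-up from the saved profile dumps
old_total = pstats.Stats("oldstats").total_tt
new_total = pstats.Stats("newstats").total_tt
print(f"speed-up: {old_total / new_total:.1f}x")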