Legacy Office Reader
The LegacyOfficeReader is the reader for Word 97 (.doc) files. Under the hood, it uses Apache Tika to parse the file.
Get Started
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙 and the legacy office reader.
Note: Apache Tika is a dependency of the legacy office reader. It requires Java to be installed and callable via
java --version
For instance, on Colab you can install it with
!apt-get install default-jdk
or on macOS with
brew install openjdk
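Before installing the reader, it can help to confirm from Python that Java is actually reachable. The following is a minimal sketch using only the standard library; the helper name java_available is hypothetical and not part of LlamaIndex or Tika, and the fallback to -version is an assumption to cover older Java 8 runtimes.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable on PATH reports its version successfully."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    # Newer JDKs accept `--version`; Java 8 only understands `-version`.
    for flag in ("--version", "-version"):
        try:
            result = subprocess.run(
                [java_path, flag], capture_output=True, timeout=30
            )
        except (OSError, subprocess.TimeoutExpired):
            continue
        if result.returncode == 0:
            return True
    return False


print("Java found" if java_available() else "Java not found; install a JDK first")
```

If this reports that Java is missing, run one of the install commands above and re-run the check.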
%pip install llama-index-readers-legacy-office
Prepare Data
We need a .doc file for testing. The examples below assume it is located at test_dir/harry_potter_lagacy.doc
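A file with a .doc extension is not always a genuine legacy Word file (a renamed .docx, for example, is a ZIP archive). As a rough sanity check, the sketch below tests for the OLE2 compound-file signature that Word 97-2003 documents start with. The helper name looks_like_legacy_doc is hypothetical, and the check is necessary but not sufficient, since legacy .xls and .ppt files share the same signature.

```python
from pathlib import Path

# Signature that opens every OLE2 compound file, the container used by Word 97-2003.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_legacy_doc(path) -> bool:
    """Return True if the file exists and starts with the OLE2 signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

For example, looks_like_legacy_doc("./test_dir/harry_potter_lagacy.doc") should return True for a real Word 97 file and False if the file is missing or is actually a .docx.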
from llama_index.readers.legacy_office import LegacyOfficeReader
Option 1: Load the file with LegacyOfficeReader
file_path = "./test_dir/harry_potter_lagacy.doc"
reader = LegacyOfficeReader(
    excluded_embed_metadata_keys=["file_path", "file_name"],
    excluded_llm_metadata_keys=["file_type"],
)
docs = reader.load_data(file=file_path)
print(f"Loaded {len(docs)} docs")
Loaded 1 docs
Option 2: Load the file with SimpleDirectoryReader
Use this option when the .doc files sit alongside other file types in the same directory.
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(
    input_dir="./test_dir/",
    file_extractor={
        ".doc": LegacyOfficeReader(),
    },
)
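The file_extractor mapping above is keyed by file extension, so it can be useful to preview which files in a mixed directory carry the .doc suffix. The sketch below is a simplified, standard-library illustration of grouping files by suffix; it is not SimpleDirectoryReader's actual implementation, and the helper name group_files_by_suffix is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_files_by_suffix(input_dir):
    """Map each lowercase file suffix in input_dir to its sorted file names."""
    groups = defaultdict(list)
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            groups[path.suffix.lower()].append(path.name)
    return dict(groups)
```

Calling group_files_by_suffix("./test_dir/").get(".doc", []) would list the files expected to match the ".doc" key, assuming the directory exists. Subdirectories are skipped in this sketch.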