Contributing a LlamaDataset To LlamaHub¶
A `LlamaDataset`'s storage is managed through a Git repository. Contributing a dataset requires making a pull request to the `llama_index/llama_datasets` GitHub (LFS) repository.
To contribute a `LabelledRagDataset` (a subclass of `BaseLlamaDataset`), two sets of files are required:

- The `LabelledRagDataset` saved as JSON, named `rag_dataset.json`
- The source document files used to create the `LabelledRagDataset`
This brief notebook provides a quick example using the Paul Graham Essay text file.
%pip install llama-index-llms-openai
import nest_asyncio
nest_asyncio.apply()
Load Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
from llama_index.core import SimpleDirectoryReader
# Load documents and build index
documents = SimpleDirectoryReader(
    input_files=["data/paul_graham/paul_graham_essay.txt"]
).load_data()
# generate questions against chunks
from llama_index.core.llama_dataset.generator import RagDatasetGenerator
from llama_index.llms.openai import OpenAI
# set the LLM used to generate the examples
llm_gpt35 = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
# instantiate a DatasetGenerator
dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    llm=llm_gpt35,
    num_questions_per_chunk=2,  # set the number of questions per node
    show_progress=True,
)
rag_dataset = dataset_generator.generate_dataset_from_nodes()
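Before saving, it's worth eyeballing a few of the generated examples; a minimal sketch using the dataset's built-in pandas conversion:

```python
# inspect the generated query / reference-answer pairs
df = rag_dataset.to_pandas()
print(df.head())
```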
Now that we have our `LabelledRagDataset` generated (by the way, it's totally fine to manually create one with human-generated queries and reference answers!), we store it in the required JSON file.
rag_dataset.save_json("rag_dataset.json")
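As a quick sanity check, the saved file can be loaded back with `from_json`; a minimal round-trip sketch:

```python
from llama_index.core.llama_dataset import LabelledRagDataset

# reload the saved dataset and confirm the round-trip worked
reloaded = LabelledRagDataset.from_json("rag_dataset.json")
print(f"{len(reloaded.examples)} examples reloaded")
```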
Generating Baseline Results¶
In addition to adding just a `LlamaDataset`, we also encourage adding baseline benchmark results for others to use as a measuring stick against their own RAG pipelines.
from llama_index.core import VectorStoreIndex
# a basic RAG pipeline, uses defaults
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()
# generate predictions for each example in the dataset
prediction_dataset = await rag_dataset.amake_predictions_with(
    query_engine=query_engine, show_progress=True
)
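To turn these predictions into baseline metrics like those reported in `card.json` below, each prediction is scored against its reference. A minimal sketch for the correctness metric, assuming a GPT-4 judge (the `RagEvaluatorPack` shown later automates this for all four metrics):

```python
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# the choice of GPT-4 as the judge LLM is an assumption here
judge = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))

scores = []
for example, prediction in zip(
    rag_dataset.examples, prediction_dataset.predictions
):
    result = judge.evaluate(
        query=example.query,
        response=prediction.response,
        reference=example.reference_answer,
    )
    scores.append(result.score)

print(f"mean correctness: {sum(scores) / len(scores):.3f}")
```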
Submitting The Pull Requests¶
With the `rag_dataset.json` and the source file `paul_graham_essay.txt` (note that in this case there is only one source document, but there can be several), we can perform the two steps for contributing a `LlamaDataset` to `LlamaHub`:
1. Similar to how contributions are made for `loader`s, `agent`s, and `pack`s, create a pull request into the `llama_hub` repository that adds a new folder for the new `LlamaDataset`. This step uploads the information about the new `LlamaDataset` so that it can be presented in the `LlamaHub` UI.
2. Create a pull request into the `llama_datasets` repository to actually upload the data files.
Step 0 (Pre-requisites)¶
Fork and then clone (onto your local machine) both the `llama_hub` GitHub repository and the `llama_datasets` one. You'll be submitting a pull request into each of these repos from a new branch off of your forked version.
Step 1¶
Create a new folder in `llama_datasets/` of the `llama_hub` GitHub repository. For example, in this case we would create a new folder `llama_datasets/paul_graham_essay`.
In that folder, two files are required:

- `card.json`
- `README.md`
In particular, on your local machine:
cd llama_datasets/
mkdir paul_graham_essay
cd paul_graham_essay
touch card.json
touch README.md
The suggestion here is to look at previously submitted `LlamaDataset`s and modify their respective files as needed for your new dataset.
In our current example, we need the `card.json` to look as follows:
{
    "name": "Paul Graham Essay",
    "description": "A labelled RAG dataset based off an essay by Paul Graham, consisting of queries, reference answers, and reference contexts.",
    "numberObservations": 44,
    "containsExamplesByHumans": false,
    "containsExamplesByAI": true,
    "sourceUrls": [
        "http://www.paulgraham.com/articles.html"
    ],
    "baselines": [
        {
            "name": "llamaindex",
            "config": {
                "chunkSize": 1024,
                "llm": "gpt-3.5-turbo",
                "similarityTopK": 2,
                "embedModel": "text-embedding-ada-002"
            },
            "metrics": {
                "contextSimilarity": 0.934,
                "correctness": 4.239,
                "faithfulness": 0.977,
                "relevancy": 0.977
            },
            "codeUrl": "https://github.com/run-llama/llama_datasets/blob/main/baselines/paul_graham_essay/llamaindex_baseline.py"
        }
    ]
}
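Before opening the pull request, it can save a review round-trip to confirm that the card parses as valid JSON; a minimal sketch (the path assumes the Step 1 folder layout created above):

```python
import json

# path assumes the folder created in Step 1
with open("llama_datasets/paul_graham_essay/card.json") as f:
    card = json.load(f)  # raises ValueError if the JSON is malformed

print(card["name"], "-", card["numberObservations"], "observations")
```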
As for the `README.md`, these are pretty standard; the main change needed is the name of the dataset passed to the `download_llama_dataset()` function call:
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core import VectorStoreIndex

# download the benchmark dataset and its source documents
rag_dataset, documents = download_llama_dataset(
    "PaulGrahamEssayDataset", "./data"
)

# build a basic RAG pipeline over the source documents
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# download and install dependencies for the rag evaluator pack
RagEvaluatorPack = download_llama_pack(
    "RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset, query_engine=query_engine
)

# evaluate the pipeline against the benchmark
benchmark_df = rag_evaluator_pack.run()
Finally, the last item for Step 1 is to add an entry to the `llama_datasets/library.json` file. In this case:
...,
"PaulGrahamEssayDataset": {
    "id": "llama_datasets/paul_graham_essay",
    "author": "andrei-fajardo",
    "keywords": ["rag"],
    "extra_files": ["paul_graham_essay.txt"]
}
Note: the `extra_files` field is reserved for the source files.
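Once both pull requests are merged, the key used in `library.json` (here, `PaulGrahamEssayDataset`) becomes the name users pass to `download_llama_dataset()` to fetch both the `rag_dataset.json` and the extra source files.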
Step 2: Uploading The Actual Datasets¶
In this step, since we use GitHub LFS on our `llama_datasets` repo, making a contribution works exactly the same way as for any of our other open GitHub repos. That is, submit a pull request.
Make a fork of the `llama_datasets` repo, and create a new folder in the `llama_datasets/` directory that matches the `id` field of the entry made in the `library.json` file. So, for this example, we'll create a new folder `llama_datasets/paul_graham_essay/`. It is here that we will add the data files and make the pull request.
To this folder, add `rag_dataset.json` (it must be named exactly that), as well as the rest of the source documents, which in our case is the `paul_graham_essay.txt` file.
llama_datasets/paul_graham_essay/
├── paul_graham_essay.txt
└── rag_dataset.json
Now, simply `git add`, `git commit`, and `git push` your branch, and make your PR.