GitHub Issue Analysis¶
Setup¶
In [ ]:
%pip install llama-index-readers-github
%pip install llama-index-llms-openai
%pip install llama-index-program-openai
In [ ]:
import os
os.environ["GITHUB_TOKEN"] = "<your github token>"
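If you prefer not to hard-code the token, you can read it from your shell environment instead. A minimal sketch (assuming `GITHUB_TOKEN` is already exported in your shell):

In [ ]:
import os

# Assumes the token was exported in your shell, e.g. `export GITHUB_TOKEN=...`
if os.getenv("GITHUB_TOKEN") is None:
    raise RuntimeError("Please set the GITHUB_TOKEN environment variable")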
Load GitHub Issue tickets¶
In [ ]:
import os

from llama_index.readers.github import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)

github_client = GitHubIssuesClient()
loader = GitHubRepositoryIssuesReader(
    github_client,
    owner="jerryjliu",
    repo="llama_index",
    verbose=True,
)

docs = loader.load_data()
Found 100 issues in the repo page 1
Resulted in 100 documents
Found 100 issues in the repo page 2
Resulted in 200 documents
Found 100 issues in the repo page 3
Resulted in 300 documents
Found 100 issues in the repo page 4
Resulted in 400 documents
Found 4 issues in the repo page 5
Resulted in 404 documents
No more issues found, stopping
Quick inspection
In [ ]:
docs[10].text
Out[ ]:
"feat(context length): QnA Summarization as a relevant information extractor\n### Feature Description\r\n\r\nSummarizer can help in cases where the information is evenly distributed in the document i.e. a large amount of context is required but the language is verbose or there are many irrelevant details. Summarization specific to the query can help.\r\n\r\nEither cheap local model or even LLM are options; the latter for reducing latency due to large context window in RAG. \r\n\r\nAnother place where it helps is that percentile and top_k don't account for variable information density. (However, this may be solved with inter-node sub-node reranking). \r\n"
In [ ]:
docs[10].metadata
Out[ ]:
{'state': 'open', 'created_at': '2023-07-13T11:16:30Z', 'url': 'https://api.github.com/repos/jerryjliu/llama_index/issues/6889', 'source': 'https://github.com/jerryjliu/llama_index/issues/6889'}
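Each document carries the issue state and source URLs in its metadata, so the corpus can be sliced before any LLM calls. A quick sketch using the `state` key shown above:

In [ ]:
# Partition the corpus by issue state using the metadata shown above
open_issues = [d for d in docs if d.metadata.get("state") == "open"]
closed_issues = [d for d in docs if d.metadata.get("state") == "closed"]
print(f"{len(open_issues)} open / {len(closed_issues)} closed issues")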
Extract themes¶
In [ ]:
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
In [ ]:
from pydantic import BaseModel
from typing import List

from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.llms.openai import OpenAI
from llama_index.core.async_utils import batch_gather
In [ ]:
prompt_template_str = """\
Here is a GitHub Issue ticket.

{ticket}

Please extract central themes and output a list of tags.\
"""
In [ ]:
class TagList(BaseModel):
    """A list of tags corresponding to central themes of an issue."""

    tags: List[str]
In [ ]:
program = OpenAIPydanticProgram.from_defaults(
    prompt_template_str=prompt_template_str,
    output_cls=TagList,
)
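It's worth testing the program synchronously on a single ticket before launching hundreds of async calls. A minimal sketch (this makes one real OpenAI call with the program's default model):

In [ ]:
# Quick synchronous sanity check on one ticket
tag_list = program(ticket=docs[10].text)
print(tag_list.tags)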
In [ ]:
tasks = [program.acall(ticket=doc.text) for doc in docs]
In [ ]:
tag_lists = await batch_gather(tasks, batch_size=10, verbose=True)
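`batch_gather` awaits the coroutines in batches of 10, which avoids firing all 404 OpenAI requests at once. Each result is a `TagList`, so a few can be spot-checked before saving:

In [ ]:
# Spot-check the first few extracted tag lists
for tag_list in tag_lists[:3]:
    print(tag_list.tags)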
[Optional] Save/Load Extracted Themes¶
In [ ]:
import pickle
In [ ]:
with open("github_issue_analysis_data.pkl", "wb") as f:
    pickle.dump(tag_lists, f)
In [ ]:
with open("github_issue_analysis_data.pkl", "rb") as f:
    tag_lists = pickle.load(f)

print(f"Loaded tag lists for {len(tag_lists)} tickets")
Summarize Themes¶
Build prompt
In [ ]:
prompt = """
Here is a list of central themes (in the form of tags) extracted from a list of Github Issue tickets.
Tags for each ticket is separated by 2 newlines.
{tag_lists_str}
Please summarize the key takeaways and what we should prioritize to fix.
"""
tag_lists_str = "\n\n".join([str(tag_list) for tag_list in tag_lists])
prompt = prompt.format(tag_lists_str=tag_lists_str)
prompt = """
Here is a list of central themes (in the form of tags) extracted from a list of Github Issue tickets.
Tags for each ticket is separated by 2 newlines.
{tag_lists_str}
Please summarize the key takeaways and what we should prioritize to fix.
"""
tag_lists_str = "\n\n".join([str(tag_list) for tag_list in tag_lists])
prompt = prompt.format(tag_lists_str=tag_lists_str)
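All 404 tag lists go into a single prompt, so it's worth a rough check that it fits in GPT-4's 8k-token context window before sending it. A sketch using the crude ~4-characters-per-token heuristic:

In [ ]:
# Rough token estimate: ~4 characters per token for English text
print(f"~{len(prompt) // 4} tokens (gpt-4 allows ~8k tokens per request)")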
Summarize with GPT-4
In [ ]:
from llama_index.llms.openai import OpenAI
response = OpenAI(model="gpt-4").stream_complete(prompt)
In [ ]:
for r in response:
    print(r.delta, end="")
1. Bug Fixes: There are numerous bugs reported across different components such as 'Updating/Refreshing documents', 'Supabase Vector Store', 'Parsing', 'Qdrant', 'LLM event', 'Service context', 'Chroma db', 'Markdown Reader', 'Search_params', 'Index_params', 'MilvusVectorStore', 'SentenceSplitter', 'Embedding timeouts', 'PGVectorStore', 'NotionPageReader', 'VectorIndexRetriever', 'Knowledge Graph', 'LLM content', and 'Query engine'. These issues need to be prioritized and resolved to ensure smooth functioning of the system.
2. Feature Requests: There are several feature requests like 'QnA Summarization', 'BEIR evaluation', 'Cross-Node Ranking', 'Node content', 'PruningMode', 'RelevanceMode', 'Local-model defaults', 'Dynamically selecting from multiple prompts', 'Human-In-The-Loop Multistep Query', 'Explore Tree-of-Thought', 'Postprocessing', 'Relevant Section Extraction', 'Original Source Reconstruction', 'Varied Latency in Retrieval', and 'MLFlow'. These features can enhance the capabilities of the system and should be considered for future development.
3. Code Refactoring and Testing: There are mentions of code refactoring, testing, and code review. This indicates a need for improving code quality and ensuring robustness through comprehensive testing.
4. Documentation: There are several mentions of documentation updates, indicating a need for better documentation to help users understand and use the system effectively.
5. Integration: There are mentions of integration with other systems like 'BEIR', 'Langflow', 'Hugging Face', 'OpenAI', 'DynamoDB', and 'CometML'. This suggests a need for better interoperability with other systems.
6. Performance and Efficiency: There are mentions of 'Parallelize sync APIs', 'Average query time', 'Efficiency', 'Upgrade', and 'Execution Plan'. This indicates a need for improving the performance and efficiency of the system.
7. User Experience (UX): There are mentions of 'UX', 'Varied Latency in Retrieval', and 'Human-In-The-Loop Multistep Query'. This suggests a need for improving the user experience.
8. Error Handling: There are several mentions of error handling, indicating a need for better error handling mechanisms to ensure system robustness.
9. Authentication: There are mentions of 'authentication' and 'API key', indicating a need for secure access mechanisms.
10. Multilingual Support: There is a mention of 'LLM中文应用交流微信群' (an LLM Chinese-application WeChat discussion group), indicating a need for multilingual support.
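The summary above is qualitative; a raw frequency count over the same tags is a useful quantitative complement. A minimal sketch using the standard library:

In [ ]:
# Count tag frequencies across all tickets
from collections import Counter

tag_counts = Counter(tag for tag_list in tag_lists for tag in tag_list.tags)
print(tag_counts.most_common(20))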