Installation¶
First, let's install LlamaIndex 🦙 and the Unify integration.
%pip install llama-index-llms-unify llama-index
Environment Setup¶
Make sure to set the UNIFY_API_KEY
environment variable. You can get a key from the Unify Console.
import os
os.environ["UNIFY_API_KEY"] = "<YOUR API KEY>"
Using LlamaIndex with Unify¶
Basic Usage¶
Below, we initialize and query a chat model using the llama-3-70b-chat endpoint from together-ai.
from llama_index.llms.unify import Unify
llm = Unify(model="llama-3-70b-chat@together-ai")
llm.complete("How are you today, llama?")
CompletionResponse(text="I'm not actually a llama, but I'm doing great, thanks for asking! I'm a large language model, so I don't have feelings like humans do, but I'm always happy to chat with you and help with any questions or topics you'd like to discuss. How about you? How's your day going?", additional_kwargs={}, raw={'id': '88b5fcf02e259527-LHR', 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I'm not actually a llama, but I'm doing great, thanks for asking! I'm a large language model, so I don't have feelings like humans do, but I'm always happy to chat with you and help with any questions or topics you'd like to discuss. How about you? How's your day going?", role='assistant', function_call=None, tool_calls=None))], 'created': 1716980504, 'model': 'llama-3-70b-chat@together-ai', 'object': 'chat.completion', 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=67, prompt_tokens=17, total_tokens=84, cost=7.56e-05)}, logprobs=None, delta=None)
Single Sign-On¶
You can use Unify's SSO to query endpoints from different providers without creating an account with each of them. For example, all of these are valid endpoints:
llm = Unify(model="llama-2-70b-chat@together-ai")
llm = Unify(model="gpt-3.5-turbo@openai")
llm = Unify(model="mixtral-8x7b-instruct-v0.1@mistral-ai")
This allows you to quickly switch and test different models and providers. You can look at all the available models/providers here!
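For instance, here is a minimal sketch (reusing the endpoints above and the complete call from earlier) that sends the same prompt through several endpoints so you can compare providers side by side; the prompt is just a placeholder:

from llama_index.llms.unify import Unify

# Same prompt sent through a few different model@provider endpoints.
endpoints = [
    "llama-2-70b-chat@together-ai",
    "gpt-3.5-turbo@openai",
    "mixtral-8x7b-instruct-v0.1@mistral-ai",
]

for endpoint in endpoints:
    llm = Unify(model=endpoint)
    response = llm.complete("Name one thing llamas are known for.")
    print(f"{endpoint}: {response.text}\n")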
Runtime Dynamic Routing¶
As evidenced by our benchmarks, the optimal provider for each model varies by geographic location and time of day due to fluctuating API performance. To circumvent this, we automatically direct your requests to the "top performing provider" at runtime. To enable this feature, simply replace your query's provider with one of the available routing modes. Let's look at some examples:
llm = Unify(
    model="llama-2-70b-chat@input-cost"
)  # route to lowest input cost provider

llm = Unify(
    model="gpt-3.5-turbo@itl"
)  # route to provider with lowest inter-token latency

llm = Unify(
    model="mixtral-8x7b-instruct-v0.1@ttft"
)  # route to provider with lowest time to first token
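Routing modes are queried like any other endpoint. As a quick sanity check, the minimal sketch below calls complete through a routing mode and inspects response.raw['model'] (used the same way in the streaming example further down) to see which concrete provider served the request; the prompt is only illustrative:

llm = Unify(model="llama-2-70b-chat@input-cost")

response = llm.complete("Give me a one-sentence fun fact about llamas.")

# The raw payload reports the model@provider that actually served the request.
print(f"Served by: {response.raw['model']}")
print(response.text)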
Quality Routing¶
Unify routes each prompt to the best LLM for the task, consistently achieving better-quality outputs than a single, all-purpose, powerful model, at a fraction of the cost. This is achieved by using smaller models for simpler tasks and only using larger ones to handle complex queries.
The router is benchmarked on various datasets such as Open Hermes, GSM8K, HellaSwag, MMLU and MT-Bench, revealing that it can perform better than individual endpoints on average, as explained here. You can choose different configurations of the router for a particular dataset from the chat interface, as shown below:
llm = Unify(model="router_2.58e-01_9.51e-04_3.91e-03@unify")
llm = Unify(model="router_2.12e-01_5.00e-04_2.78e-04@unify")
llm = Unify(model="router_2.12e-01_5.00e-04_2.78e-04@unify")
To learn more about quality routing, please refer to this video.
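A router configuration is queried just like any other endpoint. Below is a minimal sketch, assuming (as in the other examples) that response.raw['model'] reports which underlying endpoint the router selected for the prompt:

llm = Unify(model="router_2.58e-01_9.51e-04_3.91e-03@unify")

response = llm.complete("What is the capital of Australia?")

# Assumption: the raw payload reports the endpoint the router picked for this prompt.
print(f"Routed to: {response.raw['model']}")
print(response.text)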
Streaming and optimizing for latency¶
If you are building an application where responsiveness is key, you most likely want a streaming response. On top of that, you would ideally use the provider with the lowest Time to First Token to reduce the time your users wait for a response. With Unify, this would look something like:
llm = Unify(model="mixtral-8x7b-instruct-v0.1@ttft")
response = llm.stream_complete(
    "Translate the following to German: "
    "Hey, there's an emergency in translation street, "
    "please send help asap!"
)

show_provider = True
for r in response:
    if show_provider:
        print(f"Model and provider are : {r.raw['model']}\n")
        show_provider = False
    print(r.delta, end="", flush=True)
Model and provider are : mixtral-8x7b-instruct-v0.1@mistral-ai Hallo, es gibt einen Notfall in der Übersetzungsstraße, bitte senden Sie Hilfe so schnell wie möglich! (Note: This is a loose translation and the phrase "Übersetzungsstraße" does not literally exist, but I tried to convey the same meaning as the original message.)
Async calls and Lowest Input Cost¶
Last but not least, you can also run multiple requests asynchronously. For tasks such as document summarization, optimizing for input cost is crucial. We can use the input-cost dynamic routing mode to route our queries to the cheapest provider.
llm = Unify(model="mixtral-8x7b-instruct-v0.1@input-cost")
response = await llm.acomplete(
    "Summarize this in 10 words or less. OpenAI is a U.S. based artificial intelligence "
    "(AI) research organization founded in December 2015, researching artificial intelligence "
    "with the goal of developing 'safe and beneficial' artificial general intelligence, "
    "which it defines as 'highly autonomous systems that outperform humans at most economically "
    "valuable work'. As one of the leading organizations of the AI spring, it has developed "
    "several large language models, advanced image generation models, and previously, released "
    "open-source models. Its release of ChatGPT has been credited with starting the AI spring"
)
print(f"Model and provider are : {response.raw['model']}\n")
print(response)
Model and provider are : mixtral-8x7b-instruct-v0.1@deepinfra OpenAI: Pioneering 'safe' artificial general intelligence.
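To actually run several requests concurrently, one option is to gather multiple acomplete calls with asyncio. The sketch below assumes a notebook environment (top-level await) and uses placeholder documents:

import asyncio

llm = Unify(model="mixtral-8x7b-instruct-v0.1@input-cost")

# Placeholder documents; substitute your own texts.
documents = [
    "LlamaIndex is a data framework for building LLM applications.",
    "Unify routes each request to an endpoint based on your chosen metric.",
]

# Fire all summarization requests concurrently and wait for the results.
responses = await asyncio.gather(
    *[llm.acomplete(f"Summarize in 10 words or less: {doc}") for doc in documents]
)

for response in responses:
    print(f"{response.raw['model']}: {response.text}")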