Nvidia TensorRT-LLM#

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

TensorRT-LLM GitHub

TensorRT-LLM Environment Setup#

Since TensorRT-LLM is an SDK for running local models in-process, a few environment setup steps must be followed before TensorRT-LLM can be used.

  1. NVIDIA CUDA 12.2 or higher is currently required to run TensorRT-LLM

  2. Install tensorrt_llm via pip with pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com

  3. For this example we will use Llama2. The Llama2 model files need to be created via scripts following the instructions here

    • The following files will be created from following the steps above

    • Llama_float16_tp1_rank0.engine: The main output of the build script, containing the executable graph of operations with the model weights embedded.

    • config.json: Includes detailed information about the model, like its general structure and precision, as well as information about which plug-ins were incorporated into the engine.

    • model.cache: Caches some of the timing and optimization information from model compilation, making successive builds quicker.

  4. mkdir model

  5. Move all of the files mentioned above to the model directory.
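Steps 4 and 5 can be sanity-checked from Python. Below is a minimal sketch (not part of the TensorRT-LLM API); the directory name and expected filenames are taken from the steps above:

```python
from pathlib import Path

# Artifacts the Llama2 build step should produce (names from the steps above).
EXPECTED_FILES = [
    "Llama_float16_tp1_rank0.engine",
    "config.json",
    "model.cache",
]


def missing_model_files(model_dir: str) -> list[str]:
    """Return which expected engine artifacts are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

If missing_model_files("./model") returns an empty list, the engine directory is ready for the usage example below.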

%pip install llama-index-llms-nvidia-tensorrt
!pip install tensorrt_llm==0.7.0 --extra-index-url https://pypi.nvidia.com

Basic Usage#

Call complete with a prompt#

from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM

def completion_to_prompt(completion: str) -> str:
    """Given a completion, return the prompt using the Llama2 format."""
    return f"<s> [INST] {completion} [/INST] "
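As a quick sanity check (the helper is restated here so the snippet runs on its own), the function simply wraps the user text in Llama2's instruction tags:

```python
def completion_to_prompt(completion: str) -> str:
    """Wrap user text in the Llama2 [INST] instruction format."""
    return f"<s> [INST] {completion} [/INST] "


print(completion_to_prompt("Who is Paul Graham?"))
# <s> [INST] Who is Paul Graham? [/INST]
```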

llm = LocalTensorRTLLM(
    model_path="./model",
    engine_name="Llama_float16_tp1_rank0.engine",
    tokenizer_dir="meta-llama/Llama-2-13b-chat",  # tokenizer matching the built engine
    completion_to_prompt=completion_to_prompt,
)

resp = llm.complete("Who is Paul Graham?")
print(str(resp))