
What it means to serve an LLM and which serving technology to choose from

January 9, 2024


The rapidly growing interest in leveraging large language models (LLMs) for a wide range of applications has led to extensive exploration within both the industry and the open-source community. As the demand for these models in production environments continues to rise, the need to understand the available tools and frameworks, along with their offerings and differences, becomes increasingly significant.

This blogpost takes a deep dive into what model serving is, the relevant components, parameters, and evaluation metrics to pay attention to, and the model-serving frameworks themselves, with a specific focus on their offerings and differences.

Together with this blogpost, we also release a whitepaper where we share the performance analysis for each tool, measuring throughput for various input/output lengths, batch sizes, and request rates per second. By evaluating inference engines such as TensorRT-LLM and vLLM, and inference servers such as RayLLM with RayServe, TGI, and TensorRT-LLM + Triton, our goal is to provide ML practitioners with practical insights. We share our main learnings at the end of this blogpost as well. Please refer to the whitepaper for more information about the experimental setting and findings.

Text Generation Process with LLMs

Before jumping into serving LLMs, let’s refresh our memory about how they actually work. The generation process of LLMs has two main steps: tokenization and decoding. Before explaining both steps, here are some terms to clarify:

Input sequence: The text that the user sends to the language model to be processed.

Output sequence: The text that the language model generates after receiving the input sequence from the users.

Input length: The number of tokens that the input sequence contains.

Output length: The number of tokens that the output sequence contains.

Token: The unit of text that the model reads to generate the next token. This can be a whole word or part of a word, depending on how the tokenizer splits the text (see Figure 1).


Figure 1: The tokenization of GPT-3. Credit: Russ Kohn & OpenAI.


Now that we have gone through all the necessary terms, let’s start with an example: we have a chatbot powered by an LLM, and we receive an input sequence from one of the users. Can an LLM, or any other machine learning model, work directly with words when computing weights and performing matrix multiplications? Of course not. Since machines, and therefore language models, can’t work with raw strings, the input sequence is turned into tokens, and each of those tokens is converted into an embedding (see Figure 2) that can be fed to the LLM. Each token is represented by an embedding vector.


Figure 2: Turning tokens into embeddings, inspired by [1]
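To make the tokenization step concrete, here is a minimal sketch using the HuggingFace transformers tokenizer; GPT-2 and the prompt are only examples, and the exact token splits depend on the tokenizer you use.

```python
# A minimal sketch of tokenization, assuming the `transformers` library is installed
# and using GPT-2's tokenizer as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_sequence = "Serving large language models is fun!"

# Split the text into tokens (sub-word strings) ...
tokens = tokenizer.tokenize(input_sequence)
# ... and map each token to its integer ID in the vocabulary.
token_ids = tokenizer.encode(input_sequence)

print(tokens)                                  # sub-word pieces
print(token_ids)                               # IDs fed to the model's embedding layer
print(len(token_ids), "tokens -> the input length")
```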


In the decoding part, the LLM generates the next token in an autoregressive manner. In each iteration, the newly generated token is appended to the input sequence, and generation continues until the LLM hits a stopping criterion (e.g. the maximum number of tokens, or the generation of a special <end> token). For more information about the generation process, refer to this blog.
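To illustrate this autoregressive loop, here is a simplified greedy-decoding sketch with a small HuggingFace model (GPT-2, as an example); production engines implement the same loop with many optimizations such as KV caching, batching, and sampling, which we discuss below.

```python
# A simplified greedy decoding loop, assuming `transformers` and `torch` are installed.
# Real inference engines add KV caching, batching, sampling strategies, etc.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")

max_new_tokens = 20
with torch.no_grad():
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                             # [batch, seq_len, vocab]
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)       # append and repeat
        if next_token.item() == tokenizer.eos_token_id:              # stopping criterion
            break

print(tokenizer.decode(input_ids[0]))
```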


Metrics for LLM Serving

When it comes to serving these models, there are two important metrics that need to be underlined: throughput and latency.

Throughput

Throughput tells us how many users our system can handle effectively. It stands for the number of tokens generated per second by the inference server across the multiple requests made by users. The higher the throughput, the better our system can accommodate and respond to user requests.

Latency

Latency, on the other hand, reflects the time it takes for the server and model to generate the complete output sequence. If we're streaming the generated output to the end user, latency specifically refers to the time taken by the inference server to generate the very first token. This initial token generation time is also known as the "time to first token" (TTFT). Essentially, latency gives us insight into how responsive and swift our system is in delivering results to the end user.

In a nutshell, let's simplify these concepts: latency is what users feel—the time it takes to receive an answer from the chatbot. Meanwhile, throughput is not only about how many users our system can effectively handle simultaneously but also impacts the user experience in stream mode—users feel the speed at which new words are generated. These metrics collectively shape our understanding of the system's performance and user experience.
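As a rough illustration, the sketch below shows how these two metrics could be measured around a streaming client; `stream_tokens` is a hypothetical generator standing in for whatever streaming API your inference server exposes.

```python
# A minimal sketch for measuring time-to-first-token (TTFT) and throughput.
# `stream_tokens` is a hypothetical placeholder for your server's streaming API.
import time

def measure(stream_tokens, prompt: str):
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in stream_tokens(prompt):          # iterate over streamed tokens
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start                    # latency until the first token
        n_tokens += 1
    total = time.perf_counter() - start
    throughput = n_tokens / total if total > 0 else 0.0   # tokens per second
    return ttft, throughput

# Dummy generator so the sketch is runnable on its own.
def dummy_stream(prompt):
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)                          # simulate per-token generation delay
        yield tok

print(measure(dummy_stream, "Hi"))
```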

What you need to serve an LLM

Figure 3: Architecture of servers and engines


When it comes to serving LLM-based applications, there are two main components: the engine and the server. The engine handles everything related to the model and request batching, while the server handles forwarding the user requests.

Engines

Engines are what run the model and everything we have covered so far about the generation process, applying different types of optimization techniques. At their core, these are Python libraries. They handle the batching of requests coming from the users of our chatbot and the generation of responses to those requests.

Servers

Servers are responsible for orchestrating the HTTP/gRPC requests coming in from users. In real-world applications, we will have many users asking questions to our chatbot at different times of the day. Servers queue these requests and forward them to the engine for response generation. Servers also expose metrics such as throughput and latency, which are important to track for model serving.
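To make the server/engine split concrete, here is a hedged sketch of a tiny HTTP server that queues incoming requests and hands them to an engine worker; FastAPI is used only as an example, and `MyEngine` is a hypothetical stand-in for any of the engines discussed later.

```python
# A minimal sketch of the server/engine split, assuming FastAPI + uvicorn are installed.
# `MyEngine` is a hypothetical placeholder for a real engine (vLLM, TensorRT-LLM, ...).
import asyncio
from fastapi import FastAPI

app = FastAPI()
request_queue: asyncio.Queue = asyncio.Queue()

class MyEngine:
    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0.1)                  # pretend to run the model
        return prompt + " ... generated text"

engine = MyEngine()

async def engine_worker():
    # The "engine side": pull queued requests and generate responses.
    while True:
        prompt, future = await request_queue.get()
        future.set_result(await engine.generate(prompt))

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(engine_worker())

@app.post("/generate")
async def generate(payload: dict):
    # The "server side": accept the HTTP request, queue it, wait for the result.
    future = asyncio.get_running_loop().create_future()
    await request_queue.put((payload["prompt"], future))
    return {"text": await future}
```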

Capabilities

Engines

  • Memory optimization
  • Model specific optimization
  • Batching support

Servers

  • HTTP/gRPC APIs
  • Request queuing
  • Multi-model serving
  • Multi-engine support

Table 1: A comparison of engine and server capabilities

So far, we discussed a straightforward scenario where the model handles a single request. However, real-life applications demand the ability to serve hundreds, even thousands of users concurrently. Now, our focus shifts to optimizing costs and throughput, leading us to the next critical considerations: request batching and memory optimization with PagedAttention. These optimizations are pivotal for hosting the model efficiently, ensuring both cost-effectiveness and high throughput in the case of substantial user demand.

Request batching

One important aspect of LLM serving is batching the user requests. Rather than reloading parameters for each new request, an efficient approach involves loading the parameters onto the GPU once and using them to process as many input sequences as possible in one go. This method not only boosts server throughput and optimizes compute utilization but also significantly contributes to cost-effectiveness. However, a naive approach, such as waiting for a fixed number of user requests to accumulate before processing the batch, presents challenges: each request in a batch generates its end-of-sequence token at a different time. Consequently, the batch computation speed is limited by the longest generation, resulting in undesirable waiting times (latency) for users. These variations in completion times among sequences lead to GPU underutilization, diminishing the efficiency gains expected from batching.

Figure 4a: Static batching overview [2]

Because of all the challenges we just discussed, continuous batching was introduced to solve these problems.

Continuous batching

Continuous batching is a type of batch scheduling specifically designed for LLMs. In comparison to dynamic batching, where the batch size is determined dynamically according to a configured time threshold and maximum batch size, continuous batching lets new requests join the current batch at the next decoder iteration instead of waiting for the current batch to end. Due to the autoregressive generation process of LLMs, this method fits them naturally and greatly increases the model's throughput.
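The sketch below illustrates only the scheduling idea, not any specific framework's implementation: at every decoder step, finished sequences leave the batch immediately and waiting requests join it, up to a maximum batch size. `decode_step` is a hypothetical placeholder for one iteration of the model.

```python
# A conceptual sketch of continuous (iteration-level) batching.
# `decode_step` stands in for generating one token for a single active request.
from collections import deque

MAX_BATCH_SIZE = 4

def decode_step(request):
    # Placeholder: pretend each request needs `remaining` more tokens.
    request["remaining"] -= 1
    return request["remaining"] == 0             # True if the request just finished

def serve(waiting: deque):
    running = []
    step = 0
    while waiting or running:
        # New requests join at the next decoder iteration, no waiting for the batch to drain.
        while waiting and len(running) < MAX_BATCH_SIZE:
            running.append(waiting.popleft())
        # One decoder iteration over the whole batch; finished requests leave immediately.
        running = [r for r in running if not decode_step(r)]
        step += 1
    return step

requests = deque({"id": i, "remaining": n} for i, n in enumerate([3, 8, 2, 5, 4]))
print("decoder iterations:", serve(requests))
```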

Figure 4b: Continuous batching overview, inspired by [2]

Optimizing memory consumption

Continuous batching is great for batching the requests dynamically. However, we also face another problem: memory boundedness. Consider our chatbot scenario: one user might ask a question in a single sentence while another sends an entire paragraph to our application, so it is impossible to assume the length of the input (and output) sequences. This uncertainty brings us to the critical problem of memory consumption. Without knowing the exact memory requirements of a sequence, one is compelled to adopt the worst-case scenario, reserving the highest possible memory for the entire batch. Here's the issue: GPUs have finite memory, which needs to hold both 1) the model parameters and 2) the user-request computations (the KV cache) for the whole batch. Without optimization, these take up a lot of room, forcing us to shrink the batch size and, unfortunately, decrease throughput. But we want high throughput. How do we optimize this? Memory is the key.

Let’s take a deeper look at what happens in the decoding process from a memory point of view. The generation process of an LLM starts by processing the input sequence and then generating the next tokens one by one in an autoregressive manner (see Figure 5). This generation process includes a self-attention calculation, which needs the key-value (KV) scores of every token processed so far. To illustrate, to generate token t, we need the keys and values computed for tokens t-1, t-2, ..., 1.

Figure 5: Autoregressive generation process of LLMs [3]

To optimize this recurrent calculation, the concept of KV caching was introduced. This method stores the previously computed K and V tensors of the tokens in the decoder and reuses them in subsequent iterations. However, this optimization comes at the expense of increased memory consumption, which becomes critical when the batch size is also kept high to increase throughput. The challenge escalates due to unpredictable sequence lengths, which lead traditional attention mechanisms to waste a significant amount of memory, ranging from 60% to 80%, due to fragmentation and over-allocation.
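The sketch below shows the idea of KV caching using the HuggingFace transformers API (`past_key_values`); the engines described later implement the same idea in far more optimized ways.

```python
# A minimal sketch of KV caching with `transformers`: keys/values of already-processed
# tokens are cached and reused, so each step only feeds the newest token to the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Paged attention helps because", return_tensors="pt")

with torch.no_grad():
    out = model(input_ids, use_cache=True)        # prefill: compute KV for the whole prompt
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]
    for _ in range(10):
        out = model(next_token, past_key_values=past, use_cache=True)  # only the new token
        past = out.past_key_values                # the cache grows by one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```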

PagedAttention: A Memory-Centric Solution

To overcome this challenge, PagedAttention was proposed. Drawing inspiration from the way traditional operating systems (OS) manage memory fragmentation and sharing, PagedAttention uses a virtual-memory-style approach with paging. It allows the key and value vectors to reside in non-contiguous memory space, organized into blocks. Each block accommodates the attention keys and values for a fixed number of tokens. While performing the computation, the PagedAttention kernel identifies and fetches these blocks efficiently. For a deeper dive into KV caching and PagedAttention, please refer to this paper and this blog.
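As a purely conceptual illustration (not vLLM's actual implementation), the sketch below shows the bookkeeping idea: logical token positions of a sequence are mapped to fixed-size physical blocks that are allocated on demand, so memory is reserved block by block instead of for the worst-case sequence length.

```python
# A conceptual sketch of PagedAttention-style block allocation; real kernels operate
# on GPU tensors, this only illustrates the block-table bookkeeping.
BLOCK_SIZE = 16                          # tokens per KV block

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}           # sequence id -> list of physical block ids

    def append_token(self, seq_id: int, position: int):
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:                # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks, must preempt or wait")
            table.append(self.free_blocks.pop())      # allocate one more block on demand
        return table[position // BLOCK_SIZE]          # physical block holding this token's KV

    def free(self, seq_id: int):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

manager = BlockManager(num_physical_blocks=8)
for pos in range(40):                                 # a 40-token sequence uses only 3 blocks
    manager.append_token(seq_id=0, position=pos)
print(manager.block_tables[0])
```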

Putting Knowledge into Action: Selecting Frameworks for LLM Serving

Now that we've covered important metrics, trade-offs, and techniques to handle critical challenges in LLM serving, the big question is: How do we put these techniques into action? Which tools are the best fit for our needs, and what should we know about the frameworks before diving in?

In this section, we delve into the details of the key frameworks, sharing the primary findings derived from our benchmarking experiments. We picked popular and widely used frameworks in the industry. Each framework brings unique value to optimizing and enhancing the performance of large language models (LLMs) during inference. We categorize these frameworks into two groups: servers and engines. By the end, you will have a clear picture of the available tools and their potential fit for your specific LLM serving requirements.

Engines

TensorRT-LLM

  • An open-source library designed to accelerate and optimize inference performance on the latest LLMs using NVIDIA Tensor Core GPUs [4].
  • Wraps TensorRT’s Deep Learning Compiler, optimized kernels from FasterTransformer, pre- and post-processing, and multi-GPU/multi-node communication in a simple, open-source Python API for defining, optimizing, and executing LLMs in production.
  • Utilizing tensor parallelism, TensorRT-LLM allows for efficient inference at scale across multiple GPUs and servers without the need for extensive developer intervention.
  • Includes highly optimized, ready-to-run versions of popular LLMs, such as Meta Llama 2, OpenAI GPT-2 and GPT-3, Falcon, Mosaic MPT, and more.
  • Provides a C++ runtime for executing LLM engines, offering features like token sampling and KV cache management, further enhancing the efficiency of inference.
  • Supports in-flight batching, also known as continuous batching or iteration-level batching. This is a technique that aims at reducing wait times in queues, eliminating the need for padding requests, and making higher GPU utilization possible.
  • Aims to simplify the process of building and experimenting with new LLMs, providing peak performance and customization without requiring deep knowledge of C++ or NVIDIA CUDA.

Important notes:

  • TensorRT-LLM builds the engine specifically for the flags used when running the build.py script. Therefore, it is important to specify the maximum input length, maximum output length, and maximum batch size before building the engine. If you want to change any of these parameters, you will need to rebuild the engine, which takes a couple of minutes.
  • TensorRT-LLM doesn’t tokenize the input. Users need to tokenize the input themselves and send the result to the TensorRT-LLM engine, which only accepts token IDs.
  • The management of KV cache allocation is not open sourced, so it is not entirely clear how it is handled. However, we observed that TensorRT-LLM calculates the required memory depending on the batch size and the input/output lengths and pre-allocates the KV cache memory accordingly. This memory is then managed during runtime.

vLLM

  • A high-performance library tailored for LLM inference and serving, emphasizing state-of-the-art serving throughput and efficient management of attention [6].
  • Memory efficiency and high throughput are at the core of vLLM, thanks to its innovative PagedAttention mechanism. This approach optimizes memory allocation and allows for non-contiguous KV cache, translating into higher batch sizes and cost-effective serving [2].
  • Includes support for continuous batching, GPU parallelism, streaming output, and OpenAI compatibility.
  • Provides a Python API for conducting offline batched inference on datasets, establishing API servers for LLMs, and launching OpenAI-compatible API servers.
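Below is a minimal sketch of vLLM's offline batched inference API, roughly following its quickstart [7]; the model name is just an example.

```python
# A minimal sketch of offline batched inference with vLLM (model name is an example);
# see the vLLM quickstart [7] for the full API.
from vllm import LLM, SamplingParams

prompts = [
    "The future of LLM serving is",
    "PagedAttention improves throughput because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")               # loads the model and builds the engine
outputs = llm.generate(prompts, sampling_params)   # requests are batched internally

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```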

Important notes:

  • When there is a high request rate with large batch sizes, memory can become a bottleneck. In this case, vLLM starts to preempt already computed paged-attention blocks due to the memory shortage. This leads to more compute in the long run, since the preempted computations have to be redone when further tokens are generated for the same request.

Servers

RayLLM with RayServe

  • Built on Ray Serve, RayLLM benefits from a distributed compute framework that provides specialized libraries for data streaming, training, fine-tuning, hyperparameter tuning, and serving, simplifying the development and deployment of large-scale AI models [8].
  • Supports deployment of multi-model endpoints.
  • It provides server capabilities, while engine capabilities such as continuous batching, paged attention, and other optimization techniques are provided through its TGI and vLLM integrations.

Triton with TensorRT-LLM (Triton backend for TensorRT-LLM)

  • An open-source inference serving software that provides the ability to deploy models at scale in production environments. It supports various machine learning frameworks and is designed for high throughput and low latency inference workloads.
  • The Triton backend for TensorRT-LLM provides a solution for optimizing, deploying, and running LLMs efficiently. This combination leverages techniques like in-flight batching and paged KV-caching to enhance performance, while leveraging the advantages of TensorRT-LLM for rapid inference execution [9].
  • It works as an ensemble of models, with multiple pipeline stages: the first handles tokenization, using the HuggingFace Tokenizers library, and the next is TensorRT-LLM, which provides the engine capabilities.
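As a hedged example, the snippet below queries the ensemble model over Triton's HTTP generate endpoint, following the pattern shown in the tensorrtllm_backend documentation [9]; the host, port, model name (`ensemble`), and request/response field names are assumptions that depend on how your model repository was configured.

```python
# A hedged sketch of querying Triton's generate endpoint for the TensorRT-LLM backend.
# Host, port, model name and field names are assumptions based on [9]; adjust them
# to match your deployed model repository.
import requests

url = "http://localhost:8000/v2/models/ensemble/generate"   # assumed endpoint/model name
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["text_output"])                       # assumed response field
```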

Server and Engine

Text Generation Inference (TGI)

  • A Rust, Python, and gRPC server, used at HuggingFace to power HuggingChat, the Inference API, and Inference Endpoints.
  • Utilizes tensor parallelism (Accelerate) for faster inference on multiple GPUs.
  • Supports continuous batching for increased throughput, quantization, Paged and FlashAttention, token streaming using Server-Sent Events (SSE), and more.
  • Provides logits warpers (parameters such as temperature, repetition penalty, top-k, top-n, etc.).
  • Supports an optimized set of specific LLMs.
  • The usage license has been changed; it is no longer free of charge for commercial use.
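As a hedged example, a running TGI instance can be queried over its HTTP /generate endpoint roughly as follows; the host and port are assumptions, and the parameter names follow TGI's documented request format.

```python
# A minimal sketch of calling a running TGI instance over HTTP; host/port are
# assumptions, parameters follow TGI's /generate request schema.
import requests

url = "http://localhost:8080/generate"
payload = {
    "inputs": "What is continuous batching?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7, "top_k": 50},
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```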

Main Findings

We assess the performance of these frameworks and their offerings in our whitepaper using different setups. Each framework, be it engines like TensorRT-LLM and vLLM, or servers like RayLLM with RayServe, Triton with TensorRT-LLM, and Text Generation Inference (TGI), brings unique capabilities to the table, which are valuable for different use cases. Our benchmarking study uncovered nuanced findings, from memory allocation challenges to the strategic trade-offs of preemptions and the influence of sequence length on throughput. Here is a short overview of what we learned from the experiments:

  • Memory is the key. Management of memory allocation is critical for optimizing LLM performance.
  • Preemptions are a strategic trade-off for engines like vLLM, since the generation operation is memory-bound while the GPU is underutilized.
  • Sequence length insight reveals vLLM's efficiency in handling concurrent requests, particularly with shorter outputs.
  • Model size significantly affects throughput. However, beyond a certain point, additional GPU memory no longer contributes to higher throughput.
  • Server selection plays a vital role, as demonstrated by TensorRT-LLM with Triton outperforming standalone TensorRT-LLM in the whitepaper.

For a more detailed overview of these findings, refer to our comprehensive benchmarking study white paper.

Final words

To understand text generation with LLMs, we have walked through tokenization, decoding, the challenges of serving these models effectively, and the current state-of-the-art techniques and tools that try to solve those challenges. Metrics like throughput and latency emerged as critical benchmarks that teams need to pay attention to, influencing both how many users a system can handle and the streaming experience.

The landscape of LLMs and LLM serving is growing every day. This blogpost and our white paper reflect the current state at the time of writing, but the field is constantly evolving, introducing new tools and techniques. Stay informed, embrace change, and explore new technologies to stay at the forefront of this transformative domain.

References

[1]: MunichNLP x TUM.AI LLM workshop

[2]: https://www.anyscale.com/blog/continuous-batching-llm-inference

[3]: https://huggingface.co/docs/transformers/llm_tutorial

[4]: https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/

[5]: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/

[6]: https://github.com/vllm-project/vllm

[7]: https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html

[8]: https://docs.ray.io/en/latest/ray-overview/use-cases.html

[9]: https://github.com/triton-inference-server/tensorrtllm_backend

[10]: https://nvidia.github.io/TensorRT-LLM/architecture.html