Introduction to Inference Engines

Many optimization techniques have been developed to mitigate the inefficiencies that occur at different stages of the inference process, and it is difficult to run inference at scale with vanilla transformer code alone. Inference engines wrap these optimizations into one package and simplify the inference process for us.

For a very small set of ad hoc tests or a quick reference, we can use vanilla transformers code to run inference.
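For reference, below is a minimal sketch of such ad hoc inference with plain Hugging Face transformers; the model name, prompt, and generation parameters are illustrative rather than our exact setup.

```python
# Minimal sketch: ad hoc inference with plain transformers (no inference engine).
# The model path/name and generation settings here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"  # or a local fine-tuned checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)

prompt = "Explain paged attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=400, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```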

The landscape of inference engines is evolving quickly. With multiple choices available, it is important to test them and shortlist the best ones for specific use cases. Below are the inference engine experiments we ran and the reasons we found for what worked in our case.

For our fine-tuned Vicuna-7B model, we tried several inference engines, discussed below.

We went through each engine's GitHub page and quick-start guide to set it up. PowerInfer, llama.cpp, and CTranslate2 are not very flexible, do not support many optimization techniques such as continuous batching and paged attention, and showed sub-par performance compared to the other engines mentioned here.

To obtain higher throughput, the inference engine/server should maximize the GPU's memory and compute capabilities, and the client and server must work in a parallel/asynchronous way so that the server always has requests to process. As mentioned earlier, without optimization techniques like PagedAttention, Flash Attention, and continuous batching, this will always lead to suboptimal performance.

TGI, vLLM, and Aphrodite are more suitable candidates in this regard, and by running the experiments described below we found the optimal configuration to squeeze maximum performance out of inference. Techniques like continuous batching and paged attention are enabled by default; speculative decoding had to be enabled manually in the inference engine for the tests below.

Comparative Analysis of Inference Engines

TGI

To use TGI, we can go through the 'Get Started' section of its GitHub page; Docker is the simplest way to configure and run the TGI engine.

The text-generation-launcher arguments list the different settings we can use on the server side. A few important ones:

  • --max-input-length: determines the maximum input length to the model; this usually needs changing, as the default is 1024.
  • --max-total-tokens: the maximum total tokens, i.e. input + output token length.
  • --speculate, --quantize, --max-concurrent-requests: the default for concurrent requests is only 128, which is often too low.

To start a local fine tuned model, 

```shell
docker run --gpus device=1 --shm-size 1g -p 9091:80 -v /path/to/fine_tuned_v1:/model ghcr.io/huggingface/text-generation-inference:1.4.4 --model-id /model --dtype float16 --num-shard 1 --max-input-length 3600 --max-total-tokens 4000 --speculate 2
```

To start a model from hub,

```shell
model="lmsys/vicuna-7b-v1.5"; volume=$PWD/data; token="<hf_token>"

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 9091:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4.4 --model-id $model --dtype float16 --num-shard 1 --max-input-length 3600 --max-total-tokens 4000 --speculate 2
```

You can ask ChatGPT to explain the above command in more detail. Here we are starting the inference server on port 9091, and we can use a client in any language to post requests to it. The Text Generation Inference API documentation lists all the endpoints and payload parameters.

E.g. 

```shell
payload="<prompt here>"

curl -X POST "0.0.0.0:9091/generate" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$payload\", \"parameters\": {\"max_new_tokens\": 400, \"do_sample\": false, \"best_of\": null, \"repetition_penalty\": 1, \"return_full_text\": false, \"seed\": null, \"stop_sequences\": null, \"temperature\": 0.1, \"top_k\": 100, \"top_p\": 0.3, \"truncate\": null, \"typical_p\": null, \"watermark\": false, \"decoder_input_details\": false}}"
```
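Since the client can be written in any language, here is a minimal Python sketch equivalent to the curl request above, using the requests library against the same /generate endpoint; the prompt and parameter values simply mirror the example.

```python
# A minimal Python equivalent of the curl request above (TGI /generate endpoint).
import requests

payload = "<prompt here>"

resp = requests.post(
    "http://0.0.0.0:9091/generate",
    json={
        "inputs": payload,
        "parameters": {
            "max_new_tokens": 400,
            "do_sample": False,
            "repetition_penalty": 1.0,
            "temperature": 0.1,
            "top_k": 100,
            "top_p": 0.3,
        },
    },
    timeout=1200,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```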

A few observations:

  • Latency increases with max-total-tokens; obviously, processing longer text increases the overall time.
  • Speculation helps, but its benefit depends on the use case and the input/output distribution.
  • EETQ quantization helps the most in increasing throughput.
  • If you have multiple GPUs, running one API per GPU and putting these APIs behind a load balancer gives higher throughput than sharding across GPUs with TGI itself (see the sketch below).
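As a rough illustration of that last point, the sketch below round-robins requests across two hypothetical TGI replicas, one per GPU. In practice a proper load balancer (e.g. nginx) would sit in front of the replicas; the ports and endpoint paths here are assumptions, not our exact deployment.

```python
# Hypothetical sketch: dispatch requests round-robin across two TGI replicas,
# one per GPU, instead of sharding a single TGI instance across GPUs.
import itertools
import requests

# Assumed endpoints: one TGI container per GPU (ports are illustrative).
ENDPOINTS = ["http://0.0.0.0:9091/generate", "http://0.0.0.0:9092/generate"]
_next_endpoint = itertools.cycle(ENDPOINTS)

def generate(prompt: str, max_new_tokens: int = 400) -> str:
    """Send the prompt to the next replica in round-robin order."""
    url = next(_next_endpoint)
    resp = requests.post(
        url,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=1200,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]
```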

vLLM

To start a vLLM server, we can use its OpenAI-compatible REST API server via Docker. It is very simple to start; follow the 'Deploying with Docker' guide in the vLLM docs. If you are going to use a local model, attach the volume and use the path as the model name:

```shell
docker run --runtime nvidia --gpus device=1 --shm-size 1g -v /path/to/fine_tuned_v1:/model -v ~/.cache/ -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model /model
```

The above will start a vLLM server on port 8000; as always, you can play with the arguments.

Make a POST request with:

```shell
payload="<prompt here>"

curl -X POST -m 1200 "0.0.0.0:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"$payload\", \"model\": \"/model\", \"max_tokens\": 400, \"top_p\": 0.3, \"top_k\": 100, \"temperature\": 0.1}"
```
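Because the vLLM server exposes an OpenAI-compatible API, the same completion can also be requested from Python with the official openai client. This is a minimal sketch assuming the server started above; the api_key value is just a placeholder, since vLLM does not require a real key by default.

```python
# Sketch: the same completion request via the OpenAI Python client,
# pointed at the local vLLM server (OpenAI-compatible endpoint).
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")  # placeholder key

completion = client.completions.create(
    model="/model",          # the path/name the server was started with
    prompt="<prompt here>",
    max_tokens=400,
    temperature=0.1,
    top_p=0.3,
    # Engine-specific sampling parameters such as top_k can typically be
    # passed via extra_body, e.g. extra_body={"top_k": 100}.
)
print(completion.choices[0].text)
```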

Aphrodite

```shell
pip install aphrodite-engine

python -m aphrodite.endpoints.openai.api_server --model PygmalionAI/pygmalion-2-7b
```

Or 

```shell
docker run -v /path/to/fine_tuned_v1:/model -d -e MODEL_NAME="/model" -p 2242:7860 --gpus device=1 --ipc host alpindale/aphrodite-engine
```

Aphrodite provides both pip and Docker installation, as described in its getting-started section. Docker is generally easier to spin up and test. The usage and server options documentation explains how to make requests.

  • Aphrodite and vLLM both use OpenAI-server-style payloads, so you can check the OpenAI API documentation.
  • We tried DeepSpeed-MII; since it was in a transitional state (at the time we tried it) from the legacy to the new codebase, it did not look reliable or easy to use.
  • Optimum-NVIDIA does not support the other major optimizations and results in suboptimal performance (ref link).
  • We added a gist with the code we used to make ad hoc parallel requests; a minimal illustrative sketch also follows below.
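The gist itself is not reproduced here; the following is only a minimal sketch of what such ad hoc parallel requesting can look like in Python with a thread pool, assuming the TGI /generate endpoint shown earlier. The prompts, thread count, and endpoint are illustrative.

```python
# Sketch: fire ad hoc parallel requests at a TGI endpoint with a thread pool.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://0.0.0.0:9091/generate"  # assumed endpoint
prompts = [f"Question {i}: summarize paged attention." for i in range(256)]

def send(prompt: str) -> str:
    """Send one generation request and return the generated text."""
    resp = requests.post(
        URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 400}},
        timeout=1200,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

# 128 client threads, matching the thread counts explored in the experiments below.
with ThreadPoolExecutor(max_workers=128) as pool:
    results = list(pool.map(send, prompts))

print(f"Completed {len(results)} requests")
```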

Metrics and Measurements

We want to experiment and find:

  1. The optimal number of threads for the client/inference-engine server.
  2. How throughput grows with respect to an increase in memory.
  3. How throughput grows with respect to tensor cores.
  4. The effect of threads vs. parallel requests from the client.

A very basic way to observe utilization is to watch it via the Linux utilities nvidia-smi and nvtop; these show the memory occupied, compute utilization, data transfer rate, etc.
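If you prefer polling these numbers programmatically rather than watching nvidia-smi/nvtop, a small sketch using the NVML Python bindings (the nvidia-ml-py package) is shown below; the GPU index, sample count, and interval are arbitrary.

```python
# Sketch: poll GPU memory and utilization via NVML (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

for _ in range(10):  # sample for ~10 seconds
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(
        f"mem: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB  "
        f"gpu: {util.gpu}%  mem-bw: {util.memory}%"
    )
    time.sleep(1)

pynvml.nvmlShutdown()
```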

Another way is to profile the GPU process with nsys (Nsight Systems).

| S.No | GPU | vRAM (used/total) | Inference engine | Threads | Time (s) | Speculate |
|------|-----|-------------------|------------------|---------|----------|-----------|
| 1 | A6000 | 48/48 GB | TGI | 24 | 664 | - |
| 2 | A6000 | 48/48 GB | TGI | 64 | 561 | - |
| 3 | A6000 | 48/48 GB | TGI | 128 | 554 | - |
| 4 | A6000 | 48/48 GB | TGI | 256 | 568 | - |

Based on the above experiments, 128/256 threads perform better than lower thread counts, and beyond 256 the overhead starts reducing throughput. This depends on the CPU and GPU and needs one's own experiments.
| S.No | GPU | vRAM (used/total) | Inference engine | Threads | Time (s) | Speculate |
|------|-----|-------------------|------------------|---------|----------|-----------|
| 5 | A6000 | 48/48 GB | TGI | 128 | 596 | 2 |
| 6 | A6000 | 48/48 GB | TGI | 128 | 945 | 8 |

A higher speculate value causes more rejections for our fine-tuned model and thus reduces throughput. A speculate value of 1 or 2 is fine; this is subject to the model and is not guaranteed to behave the same across use cases. The overall conclusion is that speculative decoding improves throughput.
| S.No | GPU | vRAM (used/total) | Inference engine | Threads | Time (s) | Speculate |
|------|-----|-------------------|------------------|---------|----------|-----------|
| 7 | 3090 | 24/24 GB | TGI | 128 | 741 | 2 |
| 7 | 4090 | 24/24 GB | TGI | 128 | 481 | 2 |

Even though the 4090 has less vRAM than the A6000, it outperforms it due to its higher tensor core count and faster memory bandwidth.
| S.No | GPU | vRAM (used/total) | Inference engine | Threads | Time (s) | Speculate |
|------|-----|-------------------|------------------|---------|----------|-----------|
| 8 | A6000 | 24/48 GB | TGI | 128 | 707 | 2 |
| 9 | A6000 | 2 x 24/48 GB | TGI | 128 | 1205 | 2 |

Setting Up and Configuring TGI for High Throughput

We set up asynchronous requesting in a scripting language of choice, such as Python or Ruby, and, using the same file for configuration, found the following (a minimal async client sketch follows the list):

  1. Time taken increases with the maximum output length of the generated sequence.
  2. 128/256 threads on the client and server are better than 24, 64, or 512. With fewer threads the compute is under-utilized, and beyond a threshold around 128 the overhead grows and throughput drops.
  3. There is a 6% improvement when jumping from asynchronous to parallel requests, using GNU parallel instead of threading in languages like Go, Python, or Ruby.
  4. The 4090 has 12% higher throughput than the A6000. Even though it has less vRAM, it outperforms the A6000 due to its higher tensor core count and faster memory bandwidth.
  5. Since the A6000 has 48 GB of vRAM, we tried using fractions of the GPU memory (experiment 8 in the table) to check whether the extra RAM improves throughput; it helps, but not linearly. Also, when we split the GPU, i.e. hosted two APIs on the same GPU with half the memory each, it behaved like two sequential APIs instead of accepting requests in parallel.
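As a reference for the asynchronous setup described above, here is a minimal Python sketch using asyncio and aiohttp against the TGI /generate endpoint; the concurrency limit plays the role of the client-side thread count, and the endpoint, prompts, and timeout are illustrative.

```python
# Sketch: asynchronous client for the TGI /generate endpoint using aiohttp.
import asyncio
import aiohttp

URL = "http://0.0.0.0:9091/generate"  # assumed endpoint
CONCURRENCY = 128  # analogous to the client thread counts in the experiments

async def send(session: aiohttp.ClientSession, sem: asyncio.Semaphore, prompt: str) -> str:
    """Send one request, limited by the shared semaphore."""
    async with sem:
        async with session.post(
            URL, json={"inputs": prompt, "parameters": {"max_new_tokens": 400}}
        ) as resp:
            data = await resp.json()
            return data["generated_text"]

async def main(prompts: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY)
    timeout = aiohttp.ClientTimeout(total=1200)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(send(session, sem, p) for p in prompts))

if __name__ == "__main__":
    results = asyncio.run(main([f"Prompt {i}" for i in range(256)]))
    print(f"Completed {len(results)} requests")
```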

Observations and Metrics

Below are graphs for some of the experiments, showing the time taken to complete a fixed input set; the lower the time taken, the better.

  • The thread counts mentioned are client-side threads. On the server side, they need to be specified when starting the inference engine.

Speculate testing:

Multiple Inference Engines testing:

We ran the same kind of experiments with other engines like vLLM and Aphrodite and observed similar results. As of writing this article, vLLM and Aphrodite do not support speculative decoding yet, which leads us to pick TGI, as it gives higher throughput than the rest thanks to speculative decoding.

Additionally, you can configure GPU profilers to enhance observability, which aids in identifying areas with excessive resource usage and in optimizing performance. Further reading: Nvidia Nsight Developer Tools — Max Katz.

Conclusion 

The landscape of LLM inference is constantly evolving, and improving throughput requires a good understanding of GPUs, performance metrics, optimization techniques, and the challenges associated with text generation tasks. This helps in choosing the right tools for the job. By understanding GPU internals and how they relate to LLM inference, such as leveraging tensor cores and maximizing memory bandwidth, developers can choose a cost-efficient GPU and optimize performance effectively.

Different GPU cards offer varying capabilities, and understanding the differences is crucial for selecting the most suitable hardware for specific tasks. Techniques like continuous batching, paged attention, kernel fusion, and flash attention offer promising solutions to these challenges and improve efficiency. Based on the experiments and results we obtained, TGI looks like the best choice for our use case.

Read other articles related to large language models:

Understanding GPU Architecture for LLM Inference Optimization

Advanced Techniques for Enhancing LLM Throughput
