LLM Benchmark#
Benchmark suites for measuring LLM serving performance with vLLM, SGLang, and TensorRT-LLM. All use similar methodology — same test categories, workloads, and metrics — for easy comparison between the three inference engines.
- vLLM: vllm bench serve (via bench.sh)
- SGLang: python -m sglang.bench_serving (via bench.sh)
- TensorRT-LLM: python -m tensorrt_llm.serve.scripts.benchmark_serving (via bench.sh)
The scripts handle Docker image loading and container management automatically. If the
CLI is not available on the host, they load the Docker image and re-execute inside the
container. When running under a SLURM allocation, they use srun to dispatch to the
compute node.
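That fallback can be pictured with a short sketch. This is a hypothetical excerpt, not the actual bench.sh; IMAGE and BENCH_ON_NODE are placeholder names:
# Hypothetical sketch of bench.sh's dispatch logic (not verbatim).
if [ -n "${SLURM_JOB_ID:-}" ] && [ -z "${BENCH_ON_NODE:-}" ]; then
  # Under a SLURM allocation: re-run this script once on the compute node.
  exec srun --ntasks=1 env BENCH_ON_NODE=1 bash "$0" "$@"
fi
if ! command -v vllm >/dev/null 2>&1; then
  # CLI missing on the host: load the image and re-execute inside the container.
  exec docker run --rm --gpus all --network host \
    -v "$PWD":/work -w /work "$IMAGE" bash bench.sh "$@"
fi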
Quick Start#
Launch a server in one terminal, then run benchmarks from another. The benchmark script auto-detects the model from the server.
vLLM:
# Terminal 1: start server
vllm serve Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
# Terminal 2: run benchmarks
bash bench.sh -H localhost -i vllm-serve:latest
bash bench.sh -H localhost --type throughput,prefill
SGLang:
# Terminal 1: start server
python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
# Terminal 2: run benchmarks
bash bench.sh -H localhost -i sglang-serve:latest
bash bench.sh -H localhost --type throughput,prefill
TensorRT-LLM:
# Terminal 1: start server
trtllm-serve /path/to/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000
# Terminal 2: run benchmarks (requires -m for tokenizer loading)
bash bench.sh -H localhost -m /path/to/Qwen2.5-7B-Instruct -i tensorrt-llm-serve:latest
bash bench.sh -H localhost -m /path/to/Qwen2.5-7B-Instruct --type throughput,prefill
Multi-Node with Slurm#
For benchmarking larger models (e.g., DeepSeek-V3, Llama-3.1-405B) that cannot fit on a
single node, refer to Distributed Serving on SLURM
for how to deploy multi-node serving with different parallelism strategies. Once the
server is running, benchmark using bench.sh as shown in the Quick Start above.
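The client side is unchanged; point bench.sh at whichever node hosts the API server (hostname here is illustrative):
bash bench.sh -H head-node-001 --type throughput,prefill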
Throughput#
Measures peak output tokens/sec by saturating the server with requests. Uses
request-rate=inf to send all prompts immediately, forcing the scheduler to batch
aggressively. This reveals the server’s maximum throughput under full load.
The 512-in/256-out shape is a moderate workload that exercises both the prefill phase
(processing the input) and the decode phase (generating tokens).
# vLLM
vllm bench serve --dataset-name random --random-input-len 512 --random-output-len 256 \
--num-prompts 100 --request-rate inf
# SGLang
python -m sglang.bench_serving --dataset-name random --random-input 512 --random-output 256 \
--num-prompts 100 --request-rate inf
# TensorRT-LLM
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--dataset-name random --random-input-len 512 --random-output-len 256 \
--num-prompts 100 --max-concurrency 100
Prefill (TTFT)#
Measures Time to First Token — how fast the server processes the input prompt before
generating the first output token. output-len=1 isolates prefill from decode since
nearly all compute goes to processing the input.
Sweeping input length (128→16K) reveals how TTFT scales with context size. Per layer,
prefill attention compute is O(n²) in input length while the MLP compute is O(n), so
TTFT grows roughly linearly at short lengths and superlinearly once the attention term
dominates at long contexts. rate=4 keeps the server lightly loaded so TTFT reflects
compute time, not queueing delay.
# vLLM - sweeps input length: 128, 512, 2048, 4096, 16384
vllm bench serve --dataset-name random --random-input-len $LEN --random-output-len 1 \
--num-prompts 100 --request-rate 4
# SGLang
python -m sglang.bench_serving --dataset-name random --random-input $LEN --random-output 1 \
--num-prompts 100 --request-rate 4
# TensorRT-LLM
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--dataset-name random --random-input-len $LEN --random-output-len 1 \
--num-prompts 100 --max-concurrency 4
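The $LEN placeholder above is driven by a simple loop; for example, with the vLLM command:
# Sweep the input lengths listed in the vLLM comment above.
for LEN in 128 512 2048 4096 16384; do
  vllm bench serve --dataset-name random --random-input-len "$LEN" --random-output-len 1 \
    --num-prompts 100 --request-rate 4
done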
Decode (ITL)#
Measures Inter-Token Latency — the time between consecutive output tokens during
autoregressive generation. input-len=128 keeps prefill minimal so the benchmark
focuses on the decode phase.
Sweeping output length (128→1024) reveals how ITL changes as the KV cache grows. Longer
sequences increase memory pressure and may trigger PagedAttention block allocation or
preemption. rate=4 avoids batching interference so ITL reflects single-request
decode speed.
# vLLM - sweeps output length: 128, 256, 512, 1024
vllm bench serve --dataset-name random --random-input-len 128 --random-output-len $LEN \
--num-prompts 100 --request-rate 4
# SGLang
python -m sglang.bench_serving --dataset-name random --random-input 128 --random-output $LEN \
--num-prompts 100 --request-rate 4
# TensorRT-LLM
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--dataset-name random --random-input-len 128 --random-output-len $LEN \
--num-prompts 100 --max-concurrency 4
Latency (E2E)#
Measures end-to-end request latency under minimal load — the “single user” experience.
rate=1 ensures requests are mostly processed alone with no batching, giving the
baseline best-case latency (similar to ChatGPT-style usage where one user waits for a
complete response).
Three size classes (short/medium/long) show how total latency scales with request size. E2E latency ≈ TTFT + (output_tokens - 1) × ITL; for example, 200 ms TTFT plus 255 further tokens at 20 ms ITL gives roughly 5.3 s for a 256-token response.
# vLLM - tests short (128/128), medium (512/256), long (4096/512)
vllm bench serve --dataset-name random --random-input-len $IN --random-output-len $OUT \
--num-prompts 100 --request-rate 1
# SGLang
python -m sglang.bench_serving --dataset-name random --random-input $IN --random-output $OUT \
--num-prompts 100 --request-rate 1
# TensorRT-LLM
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--dataset-name random --random-input-len $IN --random-output-len $OUT \
--num-prompts 100 --max-concurrency 1
Concurrency#
Finds the server’s saturation point by sweeping the number of concurrent requests.
request-rate=inf with max-concurrency=N caps how many requests run in parallel,
decoupling arrival rate from concurrency.
At low concurrency (1–4), latency is good but throughput is low (GPU underutilized). At high concurrency (64–256), throughput plateaus and latency degrades (queueing). The “knee” where throughput stops improving is the optimal operating point — it tells you how many concurrent users the server can handle before quality degrades.
# vLLM - sweeps concurrency: 1, 4, 16, 64, 256
vllm bench serve --dataset-name random --random-input-len 512 --random-output-len 256 \
--num-prompts 100 --request-rate inf --max-concurrency $C
# SGLang
python -m sglang.bench_serving --dataset-name random --random-input 512 --random-output 256 \
--num-prompts 100 --request-rate inf --max-concurrency $C
# TensorRT-LLM
python -m tensorrt_llm.serve.scripts.benchmark_serving \
--dataset-name random --random-input-len 512 --random-output-len 256 \
--num-prompts 100 --max-concurrency $C
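To locate the knee, the sweep can save per-level results and compare throughput. A sketch with vLLM: the --save-result/--result-filename flags and the output_throughput JSON field are assumed from vLLM's benchmark tooling; verify against your version.
# Sweep concurrency, saving one result file per level (flags assumed, see above).
for C in 1 4 16 64 256; do
  vllm bench serve --dataset-name random --random-input-len 512 --random-output-len 256 \
    --num-prompts 100 --request-rate inf --max-concurrency "$C" \
    --save-result --result-filename "concurrency-${C}.json"
done
# The knee is where output tokens/sec stops improving as C grows.
for C in 1 4 16 64 256; do
  echo "C=$C tok/s=$(jq '.output_throughput' "concurrency-${C}.json")"
done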
Sonnet (Prefix Caching)#
The sonnet dataset uses Shakespeare’s sonnets with a shared prefix across all prompts. This tests prefix caching — if enabled, the shared prefix KV cache is computed once and reused across requests, reducing TTFT.
# Download sonnet dataset
wget -q https://raw.githubusercontent.com/vllm-project/vllm/main/benchmarks/sonnet.txt
# vLLM - prefill-heavy (short output isolates prefill)
vllm bench serve --dataset-name sonnet --dataset-path sonnet.txt \
--sonnet-input-len 550 --sonnet-output-len 150 --sonnet-prefix-len 200 \
--num-prompts 100 --request-rate inf
# vLLM - realistic load
vllm bench serve --dataset-name sonnet --dataset-path sonnet.txt \
--sonnet-input-len 550 --sonnet-output-len 150 --sonnet-prefix-len 200 \
--num-prompts 100 --request-rate 4
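To attribute a TTFT gain to prefix caching rather than load, run the same workload against the server with caching toggled. A sketch for vLLM, assuming its prefix-caching flags (recent versions enable it by default; check vllm serve --help):
# Baseline server: prefix caching off (flag assumed; verify for your version).
vllm serve Qwen/Qwen2.5-7B-Instruct --no-enable-prefix-caching
# Comparison server: prefix caching on.
vllm serve Qwen/Qwen2.5-7B-Instruct --enable-prefix-caching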
Both ShareGPT and sonnet are used by the vLLM team’s
v0.6.0 performance blog to
benchmark serving engines. To learn more about the methodology, see the
reproduction steps and SGLang’s
counter-benchmark,
which uses sglang.bench_serving to compare both engines:
# Launch servers
# vLLM with multi-step scheduling
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--disable-log-requests --num-scheduler-steps 10 --max_model_len 4096
# SGLang
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--enable-torch-compile --disable-radix-cache
# Online benchmark (realistic load)
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt \
--num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt \
--num-prompts 1200 --request-rate 4
# Offline benchmark (max throughput)
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt \
--num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt \
--num-prompts 5000
Key Metrics#
TTFT (Time to First Token): Time from request arrival to first generated token. Dominated by prefill compute. Lower is better for interactive use.
ITL (Inter-Token Latency): Time between consecutive tokens. Reflects decode speed and consistency.
TPOT (Time Per Output Token): Average time per generated token. Similar to ITL but averaged across all tokens.
E2E Latency: Total time from request arrival to completion. E2E ≈ TTFT + (output_tokens - 1) × ITL.
Throughput: Output tokens/sec across all requests. Higher is better for batch workloads.
CLI Differences#
| Parameter | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Input length | --random-input-len | --random-input | --random-input-len |
| Output length | --random-output-len | --random-output | --random-output-len |
| Max rate | --request-rate (with --max-concurrency) | --request-rate (with --max-concurrency) | --max-concurrency only |
| Random dataset | works by default | works by default | requires model path for the tokenizer |
| Model flag | auto-detected | auto-detected | required (-m in bench.sh) |
| Results | --save-result | --output-file | verify with --help |
Profiling#
Benchmark runs can be combined with profiling to correlate performance metrics with GPU-level traces. Two profiling approaches are available for vLLM:
PyTorch profiler — vLLM’s built-in profiler triggered via REST endpoints. Start the
server with --profile (or --profiler-config), then pass --profile to the
benchmark client to call /start_profile and /stop_profile around the workload.
# Server — start with profiling enabled
bash run.sbatch --profile \
Qwen/Qwen3-30B-A3B-FP8 \
--tensor-parallel-size 8
# Client — benchmark with profiling
bash bench.sh -H <server-host> --type throughput --profile
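Under the hood, the client brackets the workload with vLLM's profiling endpoints; the same can be done manually (host and port illustrative):
# Start the torch profiler, run any workload, then stop it.
curl -X POST http://<server-host>:8000/start_profile
vllm bench serve --dataset-name random --random-input-len 512 --random-output-len 256 \
    --num-prompts 100 --request-rate inf
curl -X POST http://<server-host>:8000/stop_profile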
View traces at https://ui.perfetto.dev/ (supports .gz files directly).
Nsight Systems — wraps vllm serve with nsys profile for CUDA kernel,
NVTX, and memory tracing. Combine with --profiler-config '{"profiler": "cuda"}'
to also capture vLLM’s internal CUDA profiler markers.
# Server — enable nsys + CUDA profiler (terminal 0)
bash run.sbatch --nsys \
Qwen/Qwen3-30B-A3B-FP8 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--profiler-config '{"profiler": "cuda"}'
# Client — benchmark with profiling (terminal 1)
bash bench.sh -H <server-host> --type throughput --profile
# Stop server with Ctrl+C (terminal 0)
# Nsys finalizes profiles (~30s)
# Profile files: nsys-vllm/profile-node*.nsys-rep
Open .nsys-rep files with Nsight Systems.
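For reference, the wrapper reduces to an invocation of this shape (trace flags illustrative; see the nsys documentation):
# Approximate shape of what run.sbatch --nsys wraps around the server.
nsys profile --trace=cuda,nvtx,osrt --output=nsys-vllm/profile-node0 \
    vllm serve Qwen/Qwen3-30B-A3B-FP8 --tensor-parallel-size 8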
See the vLLM Serving Guide
for full run.sbatch flag reference and the
vLLM Profiling Guide
for more details.
Offline Benchmarking#
vLLM also supports offline benchmarking to measure raw inference performance without API server overhead. This is useful for:
- Measuring peak throughput without network/serialization overhead
- Multi-node distributed inference testing
- Profiling with Nsight Systems or PyTorch profiler
- Testing with custom datasets (ShareGPT, random prompts)
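A minimal starting point using vLLM's bundled offline throughput benchmark (flag names assumed from vllm bench throughput; verify with --help):
# Offline throughput: direct engine inference, no API server in the path.
vllm bench throughput --model Qwen/Qwen2.5-7B-Instruct \
    --input-len 512 --output-len 256 --num-prompts 100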
For complete offline benchmarking documentation, see the vLLM Offline Benchmark Guide.