vLLM Serving#

vLLM is a high-performance inference engine designed for production-grade serving of large language models (LLMs). It delivers high throughput and low latency through optimizations such as PagedAttention for efficient KV-cache memory management, continuous batching for maximizing GPU utilization, and highly optimized CUDA kernels. vLLM exposes an OpenAI-compatible API, making it a drop-in replacement for the OpenAI API, and it supports distributed inference across multiple GPUs and nodes for serving the largest models.

This guide covers everything from basic single-GPU deployment to advanced multi-node distributed serving with tensor parallelism, pipeline parallelism, data parallelism, and expert parallelism for Mixture-of-Experts (MoE) models. All scripts and examples are located in the src/llm/vllm/ directory.

Quick Start#

Get started with vLLM in minutes. Install the package and launch a server with a pre-trained model. The server exposes an OpenAI-compatible API endpoint that you can query with standard HTTP requests or the OpenAI Python client.

# Install vLLM
pip install vllm

# Start server with Qwen 7B model
# The server will automatically download the model from HuggingFace
vllm serve Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8000

# Test the server with a completion request
# The API is compatible with OpenAI's completion endpoint
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 50
  }'
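
The same request can be sent with the OpenAI Python client. A minimal sketch, assuming the server above is running on localhost:8000 (any placeholder API key works unless the server was started with an API key):

# Query the vLLM server with the OpenAI Python client (pip install openai)
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Same request as the curl example above
response = client.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    prompt="Hello, my name is",
    max_tokens=50,
)
print(response.choices[0].text)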

Common API Examples#

vLLM provides an OpenAI-compatible API with support for completions, chat completions, and embeddings. Here are the most common usage patterns:

Completions - Generate text from a prompt:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "Explain quantum computing in simple terms:",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Chat Completions - Multi-turn conversations:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 50
  }'

Streaming Responses - Get tokens as they’re generated:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Write a short poem"}],
    "stream": true
  }'
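
With the OpenAI Python client, a streaming response is consumed as an iterator of chunks. A minimal sketch, assuming the Quick Start server on localhost:8000:

# Stream a chat completion from the vLLM server (pip install openai)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# stream=True returns an iterator of chunks instead of a single response
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the generated text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()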

List Models - Check available models:

curl http://localhost:8000/v1/models

For more API examples including batch completions, sampling parameters, logprobs, stop sequences, and other advanced features, refer to test.sh.

Tensor Parallel (TP)#

Tensor parallelism splits individual model layers across multiple GPUs, with each GPU holding a portion of the weight matrices. This is essential for models that don’t fit in a single GPU’s memory. All GPUs participate in every forward pass, communicating via all-reduce operations. Best for large dense models like Llama 70B or Qwen 72B.

Use when: Model doesn’t fit on a single GPU, or you need to reduce per-GPU memory usage.

# Serve 14B model across 8 GPUs using tensor parallelism
# Each GPU holds 1/8 of the model weights
vllm serve Qwen/Qwen2.5-14B-Instruct --tensor-parallel-size 8

# Formula: Each GPU holds 1/TP of model weights
# Communication: All-reduce on every forward pass
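
The same tensor-parallel setting is available in vLLM's offline Python API. A minimal sketch mirroring the serve command above (adjust tensor_parallel_size to your GPU count):

# Offline inference with tensor parallelism via the vLLM Python API
from vllm import LLM, SamplingParams

# tensor_parallel_size mirrors --tensor-parallel-size on the CLI
llm = LLM(model="Qwen/Qwen2.5-14B-Instruct", tensor_parallel_size=8)

sampling = SamplingParams(max_tokens=50, temperature=0.7)
outputs = llm.generate(["Hello, my name is"], sampling)
print(outputs[0].outputs[0].text)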

Pipeline Parallel (PP)#

Pipeline parallelism divides the model into sequential stages, with each stage assigned to different GPUs. Unlike tensor parallelism where all GPUs work on every layer, pipeline parallelism processes different parts of the model on different GPUs. This reduces communication overhead since GPUs only need to pass activations between stages rather than synchronizing on every operation. Ideal for very deep models or when network bandwidth is limited.

Use when: You want to reduce inter-GPU communication or scale across nodes with slower interconnects.

# Serve 14B model with pipeline parallelism across 8 GPUs
# PP=2 splits model into 2 stages, TP=4 within each stage
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2

# Formula: TOTAL_GPUS = TP × PP
# Stage 0: Layers 0-N/2 on GPUs 0-3 (TP=4)
# Stage 1: Layers N/2-N on GPUs 4-7 (TP=4)
# Communication: Only between pipeline stages
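
To make the layout above concrete, here is a small illustrative helper (not part of vLLM) that maps GPUs to pipeline stages for a given TP and PP; the 48-layer count is an assumption for the example:

# Illustrative only: how 8 GPUs are grouped under TP=4, PP=2
def pipeline_layout(tp: int, pp: int, num_layers: int) -> None:
    total_gpus = tp * pp  # TOTAL_GPUS = TP x PP
    layers_per_stage = num_layers // pp
    for stage in range(pp):
        gpus = list(range(stage * tp, (stage + 1) * tp))
        first = stage * layers_per_stage
        last = num_layers if stage == pp - 1 else first + layers_per_stage
        print(f"Stage {stage}: layers {first}-{last - 1} on GPUs {gpus}")
    print(f"Total GPUs: {total_gpus}")

pipeline_layout(tp=4, pp=2, num_layers=48)
# Stage 0: layers 0-23 on GPUs [0, 1, 2, 3]
# Stage 1: layers 24-47 on GPUs [4, 5, 6, 7]
# Total GPUs: 8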

Data Parallel (DP)#

Data parallelism creates multiple independent replicas of the model, each processing different requests simultaneously. This is the most effective way to increase throughput when you have sufficient GPU memory for multiple model copies. Each replica can use tensor parallelism internally. Perfect for high-traffic production deployments where you need to serve many concurrent requests.

Use when: You need higher request throughput and have enough GPUs to replicate the model.

# Create 2 model replicas, each using 8 GPUs with tensor parallelism
# Total: 16 GPUs serving 2x the requests
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --tensor-parallel-size 8 \
  --data-parallel-size 2 \
  --data-parallel-backend ray

# Formula: TOTAL_GPUS = DP × TP
# Replica 0: GPUs 0-7 (serves requests independently)
# Replica 1: GPUs 8-15 (serves requests independently)
# Each replica can handle different requests in parallel
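
Data parallelism pays off when many requests are in flight at once. A minimal sketch using the async OpenAI client to issue concurrent requests against the single endpoint, which vLLM then spreads across the replicas:

# Send many requests concurrently so the DP replicas are kept busy
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.completions.create(
        model="Qwen/Qwen2.5-14B-Instruct",
        prompt=f"Request {i}: tell me a fun fact.",
        max_tokens=30,
    )
    return resp.choices[0].text

async def main() -> None:
    # 16 in-flight requests; each replica serves a share of them in parallel
    results = await asyncio.gather(*(one_request(i) for i in range(16)))
    for i, text in enumerate(results):
        print(i, text.strip())

asyncio.run(main())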

Expert Parallel (EP)#

Expert parallelism is specifically designed for Mixture-of-Experts (MoE) models, where the model contains multiple expert sub-networks and a gating mechanism routes tokens to different experts. EP shards the experts across GPUs, allowing each GPU to hold a subset of experts. This is crucial for MoE models like Qwen2.5-MoE or DeepSeek-V3, which can have hundreds of experts. vLLM automatically computes the expert parallelism degree based on your DP and TP configuration.

Use when: Serving MoE models that have too many experts to fit on a single GPU.

# Serve 14B MoE model with expert parallelism across 8 GPUs
# Experts are automatically sharded across all available GPUs
vllm serve Qwen/Qwen2.5-14B-MoE \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --all2all-backend allgather_reducescatter

# EP is computed automatically: EP = DP × TP
# With TP=8, EP=8, each GPU holds 1/8 of the experts
# All-to-all communication routes tokens to the right experts
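
To make the sizing rule concrete, a short illustrative calculation (the 64-expert count is an assumption for the example, not a property of any specific model):

# Illustrative arithmetic for expert parallelism sizing
dp, tp = 1, 8               # matches the serve command above (no DP)
num_experts = 64            # assumed per-layer expert count for illustration

ep = dp * tp                # EP degree is derived as DP x TP
experts_per_gpu = num_experts // ep
print(f"EP degree: {ep}")                     # EP degree: 8
print(f"Experts per GPU: {experts_per_gpu}")  # Experts per GPU: 8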