The vLLM project has released v0.19.0, a major update comprising 448 commits from 197 contributors. The release introduces support for Google’s Gemma 4 architecture alongside substantial performance optimizations for AI inference workloads that practitioners have been anticipating.
Gemma 4 Support Lands
The headline feature is full support for Google’s Gemma 4 architecture, including MoE (Mixture of Experts), multimodal capabilities, reasoning models, and tool-use functionality. vLLM v0.19.0 requires transformers>=5.5.0 for Gemma 4 support, and the project recommends its pre-built Docker image vllm/vllm-openai:gemma4 for immediate production use.
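For deployment, the release notes point to the pre-built image. A minimal invocation might look like the following deployment fragment; the model ID is a placeholder (the exact Gemma 4 checkpoint names are not given here), and the flags follow vLLM's standard serve options.

```shell
# Pull the Docker image recommended in the release notes.
docker pull vllm/vllm-openai:gemma4

# Serve a Gemma 4 checkpoint behind the OpenAI-compatible API on port 8000.
# <gemma-4-model-id> is a placeholder; substitute the actual
# Hugging Face repo name of the checkpoint you intend to serve.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:gemma4 \
  --model <gemma-4-model-id>
```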
This positions vLLM as one of the first inference engines to fully support Gemma 4’s complex architecture. For organizations evaluating Google’s latest open weights models, v0.19.0 removes a significant integration barrier.
Zero-Bubble Async Scheduling + Speculative Decoding
Perhaps more consequential for production deployments is the maturation of vLLM’s async scheduling system. Version 0.19.0 adds zero-bubble async scheduling with speculative decoding—meaning the inference pipeline can overlap prefill and decode operations without gaps, significantly improving throughput for latency-sensitive workloads.
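The benefit is easiest to see with a toy two-stage timing model (an illustration only, not vLLM's actual scheduler): when prefill for the next request overlaps decode for the previous one, the idle "bubbles" between stages disappear.

```python
def sequential_makespan(jobs):
    """Total time when each request's prefill and decode run back-to-back,
    with no overlap between requests.

    jobs is a list of (prefill, decode) durations in arbitrary time units.
    """
    return sum(p + d for p, d in jobs)


def pipelined_makespan(jobs):
    """Total time when prefill of request i overlaps decode of request i-1,
    as in a bubble-free two-stage pipeline."""
    prefill_done = 0.0
    decode_done = 0.0
    for p, d in jobs:
        prefill_done += p  # prefill stage runs back-to-back, no gaps
        # decode waits only for its own prefill (or the previous decode)
        decode_done = max(decode_done, prefill_done) + d
    return decode_done
```

For three identical requests with a 3-unit prefill and a 2-unit decode, the sequential schedule takes 15 units while the overlapped schedule takes 11, and the gap widens as batch depth grows.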
The Model Runner V2 (MRV2) subsystem sees continued investment, adding:
- Piecewise CUDA graphs for pipeline parallelism
- Spec decode rejection sampler with greedy/logprobs support
- Multi-modal embeddings for speculative decoding
- Streaming inputs support
- EPLB (Expert Parallel Load Balancing) support
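The greedy path of a spec decode rejection sampler can be sketched as follows (a simplified illustration; vLLM's actual sampler also handles probabilistic acceptance and logprob reporting): the target model scores all of the draft model's proposed tokens in one forward pass, accepts the longest agreeing prefix, and substitutes its own token at the first disagreement.

```python
def greedy_verify(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification (simplified sketch).

    draft_tokens: k tokens proposed by the draft model.
    target_argmax: the target model's greedy choice at each of those k
        positions plus one extra position (k + 1 entries), all obtained
        from a single verification forward pass.
    Returns the tokens actually emitted this step.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if target_argmax[i] == tok:
            accepted.append(tok)  # target agrees: accept the draft token
        else:
            # disagreement: take the target's token and stop verifying
            accepted.append(target_argmax[i])
            return accepted
    # every draft token accepted: the target's extra position yields a
    # "bonus" token for free
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted
```

Each step therefore emits between 1 and k + 1 tokens while remaining exactly equivalent to greedy decoding with the target model alone.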
ViT Full CUDA Graph Capture
Vision Transformer (ViT) encoders now support full CUDA graph capture, reducing overhead for multimodal models that process both text and images. This addresses a long-standing performance gap where vision components added unpredictability to inference latency. With full CUDA graph capture, vision encoders now exhibit the same predictable latency characteristics as text-only models.
CPU KV Cache Offloading and Generalized DBO
v0.19.0 introduces a general CPU KV cache offloading mechanism for V1, complete with pluggable cache policies and block-level preemption handling. This lets deployments trade a modest latency cost for GPU memory headroom, which is useful for serving large models in hardware-constrained environments or for maximizing throughput on shared infrastructure.
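A pluggable offload policy can be pictured with a toy LRU scheme (an illustration only; vLLM's block manager and cache policies are considerably more involved): KV blocks evicted from GPU memory move to a CPU pool instead of being dropped, so a later hit costs a host-to-device copy rather than a full prefill recomputation.

```python
from collections import OrderedDict


class BlockOffloadCache:
    """Toy LRU offload policy: hot KV blocks live on the 'GPU',
    evicted blocks are parked on the 'CPU' rather than discarded."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> data, in LRU order
        self.cpu = {}             # block_id -> offloaded data

    def put(self, block_id, data):
        self._make_room()
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)  # refresh LRU position
            return self.gpu[block_id]
        if block_id in self.cpu:
            # offloaded hit: copy the block back to the GPU pool
            data = self.cpu.pop(block_id)
            self._make_room()
            self.gpu[block_id] = data
            return data
        return None  # true miss: the block must be recomputed

    def _make_room(self):
        while len(self.gpu) >= self.gpu_capacity:
            # evict the least recently used block to CPU memory
            victim, data = self.gpu.popitem(last=False)
            self.cpu[victim] = data
```

The policy interface is the interesting part: swapping `_make_room` for a different eviction strategy changes which blocks pay the transfer cost, without touching the serving path.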
The Dual-Batch Overlap (DBO) microbatch optimization now works with general models, not just specific architectures. Previously limited to particular model configurations, DBO can now accelerate inference across the supported model catalog.
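The intuition behind dual-batch overlap can be captured with a simple timing model (hypothetical numbers, not measurements from vLLM): splitting a batch into two microbatches lets one microbatch's compute hide the other's communication.

```python
def no_overlap_time(compute, comm):
    """One monolithic batch: communication starts only after compute ends."""
    return compute + comm


def dbo_time(compute, comm):
    """Two equal microbatches: while microbatch 0 communicates,
    microbatch 1 computes, so the middle phase is the max of the two."""
    c, m = compute / 2, comm / 2
    return c + max(c, m) + m
```

With 10 units of compute and 10 of communication, the monolithic batch takes 20 units while the overlapped schedule takes 15; the saving shrinks as one phase dominates the other, which is why DBO matters most for communication-heavy parallel configurations.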
Hardware Support Expansions
The release adds support for NVIDIA B300/GB300 (SM 10.3), with all-reduce fusion enabled by default and tuned communicators. Intel XPU gains MLA model support and CompressedTensor W4A8 quantization. AMD ROCm moves to ROCm 7.2.1 with torch 2.10 and triton 3.6, adds DeepEP as an all2all backend, and introduces persistent MLA kernels via AITER.
Transformers v5 Compatibility
With HuggingFace Transformers v5 recently released, vLLM v0.19.0 includes extensive compatibility fixes across the model catalog. This prevents the class of integration failures that typically accompany major Transformers releases, allowing teams to upgrade their model ecosystem components without coordination friction.
Model Support Additions
New architectures supported include Cohere ASR, Cohere Transcribe, ColQwen3.5 4.5B, LFM2-ColBERT-350M, Granite 4.0 1B Speech, and Qwen3-ForcedAligner. The speculative decoding system gains Eagle3 support for Pixtral and various stability improvements.
Security Hardening
The release adds VLLM_MAX_N_SEQUENCES environment variable enforcement and frame limits in VideoMediaIO to prevent resource exhaustion attacks. Teams operating vLLM in multi-tenant or exposed environments should prioritize this update.
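The enforcement pattern itself is straightforward; the sketch below shows how a server might read and apply such a cap (the default value and the `admit` helper are illustrative assumptions, not vLLM code, though the environment variable name comes from the release notes):

```python
import os


def max_sequences_limit(default=256):
    """Read the sequence cap from VLLM_MAX_N_SEQUENCES.

    The default of 256 is an assumption for illustration, not
    vLLM's actual default.
    """
    raw = os.environ.get("VLLM_MAX_N_SEQUENCES")
    if raw is None:
        return default
    value = int(raw)
    if value <= 0:
        raise ValueError("VLLM_MAX_N_SEQUENCES must be positive")
    return value


def admit(pending, limit):
    """Hypothetical admission control: cap concurrent sequences instead of
    letting an attacker queue unbounded work. Returns (accepted, rejected)."""
    return pending[:limit], pending[limit:]
```

Capping both sequence counts and decoded video frames bounds the resources any single client can consume, which is the core of the hardening described above.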
Sources
- vLLM v0.19.0 Release Notes — vLLM Project
