vLLM v0.19.0 Ships with Gemma 4 Support and Zero-Bubble Speculative Decoding

vLLM v0.19.0 landed this week with a massive update: 448 commits from 197 contributors, including 54 first-time contributors. This represents one of the largest release cycles in the project’s history and signals significant maturation across the entire inference stack. The headline feature is comprehensive Google Gemma 4 support, covering MoE variants, multimodal inputs, reasoning capabilities, and tool-use workflows.

Gemma 4: Full Architecture Support

The Gemma 4 integration is complete across all variants, making vLLM one of the first inference engines to fully support Google’s latest open model family. This includes mixture-of-experts (MoE) models that route tokens to specialized sub-networks, vision-language capabilities for multimodal workloads combining text and images, and the reasoning and tool-use variants that make Gemma 4 competitive with much larger proprietary models. The architecture support extends to the full parameter range from small edge models up to larger production deployments.

Requirements: You will need transformers>=5.5.0 in your environment. The vLLM team recommends the pre-built Docker image vllm/vllm-openai:gemma4 for immediate out-of-the-box usage; it bundles all necessary dependencies and optimizations.
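As a quick-start sketch, the launch follows vLLM's usual OpenAI-compatible server invocation; note that the model identifier below is a placeholder for whichever Gemma 4 checkpoint you intend to serve, not a real repository name:

```shell
# Pull the release image that bundles the Gemma 4 dependencies
docker pull vllm/vllm-openai:gemma4

# Launch the OpenAI-compatible server; replace the placeholder
# model ID with the actual Gemma 4 checkpoint you want to serve.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:gemma4 \
    --model <gemma-4-model-id>
```

Outside Docker, the same setup requires installing transformers>=5.5.0 alongside vLLM yourself, as noted above.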

Performance Breakthrough: Zero-Bubble Speculative Decoding

The async scheduling engine now supports speculative decoding with zero-bubble overlap. This eliminates the latency gaps ("bubbles") that traditionally occur when switching between draft and target models. By keeping the pipeline fully utilized through the draft-target transition, the engine effectively hides the overhead of speculative token generation, improving throughput for latency-sensitive workloads.

This is particularly impactful for high-throughput serving scenarios where every millisecond matters. Early benchmarks show meaningful throughput improvements without sacrificing latency guarantees.
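To make the scheduling idea concrete, here is a toy timeline model of the bubble elimination. This is purely illustrative: the function names and millisecond figures are invented, and this is a back-of-the-envelope sketch, not vLLM's scheduler.

```python
# Toy timeline model of zero-bubble speculative decoding (illustrative
# only, not vLLM's implementation). Each round drafts a batch of tokens
# with the small draft model, then verifies them with the target model.

def sequential_time(rounds, draft_ms, verify_ms):
    """Draft and verify strictly alternate: the pipeline idles
    ('bubbles') while control passes between the two models."""
    return rounds * (draft_ms + verify_ms)

def overlapped_time(rounds, draft_ms, verify_ms):
    """Zero-bubble schedule: while round i is being verified, the draft
    model already speculates round i+1, so only the first draft sits on
    the critical path (assuming verify_ms >= draft_ms)."""
    return draft_ms + rounds * verify_ms

if __name__ == "__main__":
    seq = sequential_time(rounds=100, draft_ms=2.0, verify_ms=5.0)
    ovl = overlapped_time(rounds=100, draft_ms=2.0, verify_ms=5.0)
    print(seq, ovl)  # 700.0 502.0 in this toy setting
```

In this toy model the draft cost disappears from the steady state entirely, which is why the technique matters most when every millisecond of serving latency counts.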

Model Runner V2 Maturation

The V2 Model Runner continues to mature with several production-ready enhancements:

  • Piecewise CUDA graphs for pipeline parallelism, enabling more efficient multi-GPU deployments
  • Spec decode rejection sampler with greedy sampling and logprobs support for more flexible decoding strategies
  • Multi-modal embeddings for speculative decoding, extending speculation to vision-language models
  • Streaming input support for processing long contexts incrementally
  • EPLB (Expert Parallel Load Balancing) support for efficient MoE model distribution
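The rejection sampler in the list above follows the standard speculative-sampling recipe: accept a drafted token with probability min(1, p_target/p_draft), and on rejection resample from the normalized residual distribution. A minimal sketch (illustrative only; the helper names are invented and this is not vLLM's CUDA kernel):

```python
import random

def accept_draft_token(token, target_probs, draft_probs, rng=random.random):
    """Speculative-decoding rejection test (sketch): accept a token
    drawn from the draft model with probability
    min(1, p_target(token) / p_draft(token))."""
    p = target_probs[token]
    q = draft_probs[token]
    if q <= 0.0:
        return False  # draft could not have proposed this token
    return rng() < min(1.0, p / q)

def residual_distribution(target_probs, draft_probs):
    """On rejection, resample from the normalized residual
    max(0, p_target - p_draft); this correction keeps the overall
    output distribution exactly equal to the target model's."""
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    total = sum(residual)
    return [r / total for r in residual] if total > 0 else target_probs
```

Greedy sampling is the degenerate case where both distributions put all mass on their argmax, so a draft token is accepted exactly when the two models agree.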

ViT Full CUDA Graph Capture

Vision Transformer encoders now support complete CUDA graph capture, dramatically reducing CPU overhead for multimodal workloads that process images alongside text. This optimization is especially impactful for models like LLaVA and Qwen-VL variants where vision encoding can become a bottleneck. By capturing the entire vision forward pass into a CUDA graph, the engine eliminates Python overhead and kernel launch latency.

General CPU KV Cache Offloading

The V1 engine gains a pluggable CPU KV cache offloading mechanism with configurable cache policies and block-level preemption handling. This enables serving larger models on limited GPU memory without manual sharding or model parallelism configuration. The offloading mechanism is transparent to model execution and can be tuned for different latency-throughput tradeoffs.
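A toy sketch of the block-level idea, assuming a simple LRU policy (the class and the policy here are invented for illustration; vLLM's actual offloading policies are pluggable and configurable):

```python
from collections import OrderedDict

class OffloadingKVCache:
    """Toy block-level KV cache with CPU offload (illustrative sketch,
    not vLLM's implementation). When GPU capacity is exceeded, the
    least-recently-used block is evicted to a CPU-side store instead
    of being dropped and recomputed."""

    def __init__(self, gpu_blocks):
        self.gpu_blocks = gpu_blocks
        self.gpu = OrderedDict()  # block_id -> KV data (stand-in object)
        self.cpu = {}             # blocks offloaded to host memory

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)  # mark as recently used
            return self.gpu[block_id]
        if block_id in self.cpu:
            # CPU hit: swap the block back onto the GPU
            self.put(block_id, self.cpu.pop(block_id))
            return self.gpu[block_id]
        return None  # true miss: caller must recompute the block

    def put(self, block_id, kv):
        if len(self.gpu) >= self.gpu_blocks:
            victim_id, victim_kv = self.gpu.popitem(last=False)  # evict LRU
            self.cpu[victim_id] = victim_kv  # offload rather than discard
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)
```

The real mechanism is transparent to model execution; a sketch like this only shows why block-level eviction lets a GPU hold the hot working set while cold context spills to host memory.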

Additional Model Support

Beyond Gemma 4, new architectures added include Cohere’s ASR and Transcribe speech models, ColQwen3.5 4.5B for retrieval, LFM2-ColBERT-350M, Granite 4.0 1B Speech, and Qwen3-ForcedAligner for audio alignment. LoRA support continues expanding across tower and connector modules, making fine-tuned model serving more accessible.

Bottom Line

vLLM v0.19.0 is a significant step forward for inference performance and model coverage. The combination of Gemma 4 support and zero-bubble speculative decoding makes this release worth prioritizing for anyone running production LLM workloads. With nearly 200 contributors and hundreds of commits, the momentum behind vLLM continues to accelerate.


Sources