vLLM v0.19.0 released this week with 448 commits from 197 contributors, including 54 new to the project. This version brings substantial improvements for production LLM inference: full Google Gemma 4 architecture support, breakthrough async scheduling enhancements, and continued maturation of the Model Runner V2 framework.
For teams running inference at scale, these changes deliver meaningful throughput improvements and broader model coverage. Here is what is new and why it matters for production deployments.
Gemma 4: Full Architecture Support
Google’s Gemma 4 release represents a significant evolution of open weights models. vLLM v0.19.0 provides comprehensive support, including:
- Mixture of Experts: routing and expert computation for the sparse MoE variants
- Multimodal capabilities: vision-language processing in the multimodal Gemma variants
- Reasoning mode: support for models trained with chain-of-thought reasoning
- Tool use: function calling and external tool integration
Requirements: transformers >= 5.5.0. What this means in practice: you can now run the latest Gemma 4 open weights models through vLLM with full feature parity with Google’s reference implementation. This includes the instruct-tuned variants suitable for conversational applications and the base models for further fine-tuning.
Gemma 4 introduces architectural improvements that push efficiency boundaries for open models. The MoE variants achieve larger effective parameter counts while maintaining inference speed comparable to smaller dense models. For production deployments, this translates to lower per-token costs without sacrificing quality.
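To make the sparse MoE idea concrete, here is a minimal top-k routing sketch in plain Python. This is a toy illustration of the general technique, not vLLM’s or Gemma 4’s actual implementation: a router scores all experts for a token, but only the k highest-scoring experts run, which is why effective parameter count can grow without a proportional increase in per-token compute.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts for one token and
    renormalize their gate weights (toy MoE routing)."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# 8 experts available, but only 2 are activated for this token:
gates = route_top_k([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

Only the selected experts’ feed-forward weights are touched for this token; the rest of the parameters sit idle, which is the source of the dense-model-like inference speed mentioned above.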
The multimodal support is particularly notable—vLLM’s implementation handles the vision encoder and language model as an integrated pipeline, simplifying deployment compared to stitching together separate services.
Zero-Bubble Async Scheduling
v0.19.0’s headline performance feature is the combination of zero-bubble async scheduling with speculative decoding. This is a significant throughput optimization for high-volume inference serving.
Traditional scheduling for autoregressive LLMs leaves GPU cycles idle while waiting for the next token generation step to complete. These idle periods, called bubbles, represent wasted compute capacity that could be serving other requests.
Zero-bubble async scheduling overlaps computation for different requests so that when one request is waiting for memory operations or token generation, another request’s computation fills the gap. The GPU stays fully utilized rather than sitting idle.
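The effect can be illustrated with a toy timing model (the numbers here are illustrative assumptions, not vLLM benchmarks): if each step consists of GPU compute plus host-side scheduling work, a serial scheduler exposes the scheduling time as a bubble on every step, while an async scheduler overlaps the next step’s scheduling with the current step’s compute.

```python
def serial_time(steps, compute_ms, schedule_ms):
    """Serial scheduling: the GPU idles (a bubble) during every
    host-side scheduling phase."""
    return steps * (compute_ms + schedule_ms)

def async_time(steps, compute_ms, schedule_ms):
    """Async scheduling: scheduling for step i+1 overlaps with GPU
    compute for step i, so only the first scheduling phase is exposed."""
    per_step = max(compute_ms, schedule_ms)
    return schedule_ms + steps * per_step

# 100 decode steps, 20 ms of compute and 5 ms of scheduling per step:
serial = serial_time(steps=100, compute_ms=20, schedule_ms=5)
overlapped = async_time(steps=100, compute_ms=20, schedule_ms=5)
```

In this toy model the bubbles account for 5 ms out of every 25 ms step, so hiding them recovers roughly 20 percent of wall-clock time, which is in the same ballpark as the community numbers cited below.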
When combined with speculative decoding, which drafts multiple candidate tokens cheaply and verifies them in parallel against the target model, the throughput improvements compound. Speculative decoding reduces the number of forward passes needed per actual output token, while zero-bubble scheduling ensures the GPU is never waiting.
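The forward-pass reduction can be sketched with a back-of-the-envelope model (draft length and acceptance rate are assumed values for illustration): each verification pass of the target model emits, on average, the accepted draft tokens plus one token from the target model itself.

```python
import math

def plain_passes(total_tokens):
    """Plain autoregressive decoding: one target-model forward pass
    per output token."""
    return total_tokens

def speculative_passes(total_tokens, draft_len, acceptance_rate):
    """Toy estimate: each verify pass emits (accepted drafts + 1)
    tokens on average, so fewer target-model passes are needed."""
    tokens_per_pass = acceptance_rate * draft_len + 1
    return math.ceil(total_tokens / tokens_per_pass)

# Emitting 1000 tokens, drafting 4 at a time with 70% acceptance:
baseline = plain_passes(1000)
spec = speculative_passes(1000, draft_len=4, acceptance_rate=0.7)
```

Under these assumed numbers the target model runs roughly a quarter as many passes, which is the headroom that zero-bubble scheduling then keeps filled.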
Early benchmarks from the vLLM community show 15 to 40 percent throughput improvements for typical serving workloads, with higher gains for variable-length sequences where bubbles were previously more pronounced. The gains are most noticeable in multi-tenant serving with mixed request lengths, chat completions with streaming responses, and batch processing where total latency matters less than throughput.
Model Runner V2 Maturation
The Model Runner V2 framework continues to evolve, with v0.19.0 adding piecewise CUDA graphs for pipeline parallelism and a speculative-decoding rejection sampler with greedy and logprobs support.
Model Runner V2 represents vLLM’s next-generation model execution framework designed for cleaner separation between model architecture definitions and runtime optimizations, enabling faster addition of new models and features.
If you are deploying vLLM at scale, the V2 framework offers better extensibility but may require migration from custom V1 integrations. The vLLM team maintains compatibility for existing deployments while encouraging new features on the V2 path.
Upgrade Considerations
Before upgrading to v0.19.0, review these considerations:
- Gemma 4 requires transformers 5.5.0 or higher; earlier versions will fail to load Gemma 4 checkpoints.
- Certain V1 custom model implementations may require minor adjustments for V2 compatibility.
Zero-bubble async scheduling is enabled through existing async scheduling flags, but the speculative decoding integration requires explicit configuration. Start with speculative length of 3 to 5 tokens and tune based on your specific model and workload characteristics.
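As a starting point, a launch might look something like the sketch below. Treat the flag names and JSON keys as assumptions to verify against your installed version’s `vllm serve --help` and the release notes; the model ID is a placeholder, and the ngram draft method shown here is just one speculative-decoding option.

```shell
# Hypothetical launch sketch -- verify flags against your vLLM version.
vllm serve <gemma-4-model-id> \
  --async-scheduling \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}'
```

Tuning `num_speculative_tokens` within the 3-to-5 range suggested above, and watching acceptance rates in the server metrics, is the quickest way to find the sweet spot for a given model and workload.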
The Gemma 4 MoE variants require significant GPU memory for the full parameter set. The async scheduling improvements benefit any supported GPU but are most impactful on A100 and H100 class hardware where compute is abundant relative to memory bandwidth.
The Bottom Line
vLLM v0.19.0 is a substantial upgrade for production inference. The combination of Gemma 4 support and throughput improvements from zero-bubble scheduling with speculative decoding addresses two common pain points: access to state-of-the-art open models and efficient hardware utilization.
If you are running production LLM workloads, the throughput improvements alone justify the upgrade. The Gemma 4 support broadens your model options with Google’s latest open weights releases. As always with vLLM, test thoroughly in staging before production deployment particularly if you are using custom model implementations or advanced scheduling configurations.
Sources
- vLLM v0.19.0 Release Notes (April 3, 2026)
- Google Gemma 4 Announcement
- vLLM Documentation
- vLLM GitHub Repository
