The vLLM project has released version 0.18.0, a substantial update comprising 445 commits from 213 contributors, 61 of them new. The release expands deployment flexibility for production LLM serving with new protocol support, architectural improvements for multimodal workloads, and memory-management enhancements that directly address operational pain points in large-scale inference deployments.
Headline Features
The v0.18.0 release introduces several major capabilities requested by production operators:
gRPC Serving Support: The biggest infrastructure addition is native gRPC serving via the new --grpc flag. This enables high-performance RPC alongside the existing HTTP/REST interface, addressing latency and throughput requirements for internal service-to-service communication. gRPC's binary protocol and bidirectional streaming make it well-suited to high-volume inference scenarios.
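Assuming the usual `vllm serve` entrypoint, enabling the new endpoint might look like the following; the model name and port are illustrative, and the exact flag spelling should be confirmed against `vllm serve --help` on your build:

```shell
# Serve a model with the gRPC endpoint enabled (flag per the release notes;
# model name and port below are illustrative, not prescriptive).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --grpc \
  --port 8000
```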
GPU-Less Render Serving: The new vllm launch render command enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference. For vision-language models and other multimodal workloads, this means CPU resources can handle image decoding, resizing, and tokenization while GPU memory remains dedicated to model inference. This architectural separation can improve overall throughput and resource utilization.
NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler, significantly reducing speculative decoding overhead. This optimization helps reduce latency for token generation by predicting and validating multiple tokens simultaneously.
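At its core, n-gram speculative decoding proposes draft tokens by matching the most recent n-gram against earlier occurrences in the context and copying the tokens that followed; the target model then verifies the draft in one pass. A minimal CPU sketch of the matching step (illustrative only, not vLLM's GPU implementation):

```python
def ngram_propose(tokens, n=3, k=4):
    """Propose up to k draft tokens by finding an earlier occurrence of
    the last n tokens and copying whatever followed that occurrence."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan right-to-left over earlier positions for a matching n-gram,
    # excluding the tail's own position at the end of the sequence.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []

# The context repeats [1, 2, 3], so the draft copies the tokens that
# followed the earlier occurrence.
context = [1, 2, 3, 9, 9, 1, 2, 3]
print(ngram_propose(context, n=3, k=2))  # → [9, 9]
```

The GPU version in the release does this matching on-device, so draft generation no longer stalls the async scheduler on a CPU round-trip.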
Ray Dependency Change: Ray has been removed as a default dependency. Users requiring Ray for distributed serving will need to install it explicitly. This reduces the default installation footprint for users who don't need distributed capabilities.
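For deployments that do need Ray, the explicit install is a one-liner (package name per the Ray project; whether vLLM also ships an optional extra for this is not stated in the release notes):

```shell
# Ray is no longer pulled in by default; add it back explicitly
# before using Ray-based distributed serving.
pip install ray
```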
KV Cache Offloading Gets Smarter
The KV cache management received substantial attention in this release with three significant improvements:
Smart CPU Offloading: The new smart offloading feature stores only frequently-reused KV blocks rather than the entire cache. Traditional offloading moves all KV cache blocks to CPU memory when GPU memory is constrained, but not all blocks have equal reuse probability. The smart approach analyzes access patterns and prioritizes high-value blocks, potentially maintaining more of the working set at GPU speed while still accommodating longer contexts.
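The idea can be sketched as a frequency-aware offload policy: count accesses per KV block and spend CPU-transfer bandwidth only on blocks whose reuse history clears a threshold. This is an illustrative sketch under assumed names, not vLLM's actual implementation:

```python
from collections import Counter

class SmartOffloader:
    """Illustrative policy: offload a KV block to CPU memory only if it
    has been reused often enough to justify the transfer cost."""

    def __init__(self, min_hits=2):
        self.min_hits = min_hits
        self.hits = Counter()

    def record_access(self, block_id):
        self.hits[block_id] += 1

    def should_offload(self, block_id):
        # Cold blocks are simply dropped on eviction; warm blocks (e.g. a
        # shared system prompt) are worth keeping for future prefix reuse.
        return self.hits[block_id] >= self.min_hits

off = SmartOffloader(min_hits=2)
for blk in ["sys_prompt", "sys_prompt", "user_turn"]:
    off.record_access(blk)
print(off.should_offload("sys_prompt"))  # → True
print(off.should_offload("user_turn"))   # → False
```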
FlexKV Backend: FlexKV joins the list of supported offloading backends, providing additional flexibility for different hardware and workload configurations. The modular backend approach allows vLLM to adapt to various system architectures.
Multiple KV Groups: Support for multiple KV groups in offloading configurations enables more granular memory management. This is particularly relevant for larger models where different attention heads or layers might have different caching requirements.
These changes directly address one of the primary challenges in long-context inference: efficiently managing the KV cache memory footprint without sacrificing performance for context-heavy workloads. For production deployments running models with 128K+ context windows, these optimizations can translate to serving more concurrent requests with the same hardware.
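To see why this matters, a back-of-the-envelope KV cache estimate for a hypothetical 8B-class model with grouped-query attention (assumed shape: 32 layers, 8 KV heads, head dimension 128, FP16) at a 128K-token context:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for the separate key and value tensors stored per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical model shape; real models vary.
gib = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                     seq_len=128 * 1024) / 2**30
print(f"{gib:.1f} GiB per 128K-token sequence")  # → 16.0 GiB
```

At roughly 16 GiB of cache per fully-extended sequence, keeping only the hot blocks on the GPU and spilling the rest intelligently is the difference between serving one long-context request and serving several.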
Elastic Expert Parallelism Advances
For Mixture-of-Experts (MoE) models, v0.18.0 completes Milestone 2 of the Elastic Expert Parallelism (EP) initiative. The NIXL-EP integration enables dynamic GPU scaling for MoE experts—meaning the system can adjust expert parallelism based on actual load rather than being fixed at deployment time.
A new --enable-ep-weight-filter CLI option accelerates EP model loading by filtering expert weights during the loading process. For MoE architectures like Mixtral, DeepSeek, and other popular families, these improvements translate to more efficient resource utilization and faster startup times.
The elasticity aspect is particularly interesting for cloud deployments where GPU availability fluctuates. Rather than failing when the expected expert count can't be satisfied, the system can adapt to available resources while maintaining correctness.
Expanded Model Support
The release adds support for several new architectures, expanding vLLM's coverage of the open-source model landscape:
- Sarvam MoE: an Indian-language-focused MoE architecture
- OLMo Hybrid: the Allen Institute's open language model
- HyperCLOVAX-SEED-Think: 32B VLM and 14B language models from Naver
- Kimi-Audio-7B-Instruct: Moonshot AI's audio-capable model
- ColPali: late-interaction retrieval models
- ERNIE Pooling Models: Baidu's embedding architectures
Eagle3 speculative decoding is now available for Qwen3.5 models as well, providing additional latency reductions for that popular architecture.
API and Tooling Improvements
The OpenAI-compatible Responses API now supports streaming tool calls (function calling), enabling real-time tool use in streaming contexts. This is essential for agentic workflows where tools are invoked mid-generation.
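In streaming mode, a tool call's arguments arrive as incremental text deltas that the client must concatenate before parsing; a minimal client-side accumulator (illustrative and SDK-agnostic; the fragment list stands in for the per-chunk argument deltas a streaming API would emit):

```python
import json

def accumulate_tool_call(deltas):
    """Concatenate streamed argument fragments for a single tool call,
    then parse the completed JSON payload."""
    buf = "".join(deltas)
    return json.loads(buf)

# Argument fragments as they might arrive over a stream; note that JSON
# keys and values can be split across chunk boundaries.
chunks = ['{"city": "Se', 'oul", "unit', '": "celsius"}']
print(accumulate_tool_call(chunks))  # → {'city': 'Seoul', 'unit': 'celsius'}
```

The practical implication for agent frameworks: never parse a partial delta, and dispatch the tool only once the stream signals the call is complete.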
For audio workloads, vLLM adds beam search support for encoder-decoder models in both offline and online transcription scenarios. This improves accuracy for speech recognition use cases by considering multiple hypotheses during decoding.
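Beam search keeps the top-k partial hypotheses at each step instead of committing to a single greedy choice. A toy sketch over a prefix-conditioned log-probability table (a stand-in for model outputs, not vLLM's decoder) shows why this helps transcription accuracy:

```python
import math

def beam_search(next_logprobs, steps, beam_width=2):
    """next_logprobs(prefix) -> {token: log_prob} plays the role of the
    model. Returns the highest-scoring sequence after `steps` tokens."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in next_logprobs(seq).items()
        ]
        # Keep only the top beam_width hypotheses by total score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return list(beams[0][0])

# Greedy would pick "a" first (p=0.6), but the kept runner-up "b" leads
# to the better full hypothesis: 0.4 * 0.9 > 0.6 * 0.5.
table = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
    ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
}
print(beam_search(lambda p: table[p], steps=2))  # → ['b', 'x']
```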
FlashInfer has been updated to version 0.6.6, bringing performance and correctness improvements to the attention backend.
Known Issues and Deployment Notes
Operators should note degraded accuracy when serving Qwen3.5 with FP8 KV cache on NVIDIA B200 GPUs. This is a known issue being tracked by the project.
Users who previously encountered CUBLAS_STATUS_INVALID_VALUE errors in v0.17.0 should reinstall PyTorch 2.10.0. An updated wheel addressing this issue has been published, so the previous workarounds should no longer be necessary.
Sources
- GitHub: vLLM v0.18.0 Release Notes (March 20, 2026)
- vLLM Project Documentation
