Tag: vLLM

vLLM 0.16.0 Raises the Bar for Open-Source Inference Serving

vLLM 0.16.0 lands with async scheduling and pipeline parallelism, a new WebSocket-based Realtime API, speculative decoding improvements, and major platform work, including an overhaul of XPU support. Here’s why those details matter to teams building reliable, cost-efficient inference stacks.
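
For readers who want to map those release notes onto concrete knobs, here is a minimal, illustrative sketch using vLLM's offline Python API with both tensor and pipeline parallelism turned on. The model ID and parallel sizes are placeholders, and whether pipeline parallelism is available in the offline engine (as opposed to the online server) depends on the vLLM release you run; nothing here is specific to 0.16.0.

```python
from vllm import LLM, SamplingParams

# Split each layer across 2 GPUs (tensor parallelism) and the layer stack
# across 2 pipeline stages, for 4 GPUs total. Model ID and sizes are
# placeholders; adjust to your hardware and release.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model ID
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
result = llm.generate(["Explain pipeline parallelism in one sentence."], params)
print(result[0].outputs[0].text)
```
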

Multi-LoRA at Scale: How vLLM + AWS Aim to Stop Paying for Idle GPUs

AWS and the vLLM community describe multi-LoRA serving for Mixture-of-Experts models, with kernel- and execution-level optimizations that let many fine-tuned adapter variants share one base model on a single GPU. The pitch: higher utilization, better latency, and a clearer path to serving ‘dozens of models’ without dozens of endpoints.
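
As a rough illustration of what multi-LoRA serving looks like from the application side, the sketch below uses vLLM's existing LoRA API: one shared base model, with a per-request adapter selected via `LoRARequest`. The model ID, adapter name, and adapter path are placeholders, and the MoE- and kernel-level optimizations described in the post are not reflected here.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model loaded once; LoRA adapters are attached per request.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    enable_lora=True,
    max_loras=8,       # how many adapters may be active in a batch
    max_lora_rank=16,
)

params = SamplingParams(temperature=0.2, max_tokens=64)

# Each request names the fine-tuned variant it wants; base weights are shared,
# so many variants can be served without many endpoints.
outputs = llm.generate(
    ["Summarize this support ticket: ..."],
    params,
    lora_request=LoRARequest("support-adapter", 1, "/path/to/support_lora"),  # placeholder adapter
)
print(outputs[0].outputs[0].text)
```
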