vLLM 0.17: PyTorch 2.10 Upgrade and FlashAttention 4 Integration

vLLM has released version 0.17, featuring 699 commits from 272 contributors, 48 of them new. This release brings major infrastructure upgrades and performance improvements for LLM inference.

PyTorch 2.10 Upgrade

vLLM 0.17 upgrades its core dependency to PyTorch 2.10.0. This is a breaking change for existing environments, so users should rebuild or update their installations accordingly.

Note for CUDA 12.9+ users: CUBLAS_STATUS_INVALID_VALUE errors are typically caused by a mismatch between system CUDA libraries and the CUDA libraries bundled with PyTorch. Workarounds include removing system CUDA paths from LD_LIBRARY_PATH or reinstalling vLLM against your specific CUDA version.
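One way to apply the first workaround is to clear the system CUDA paths from the dynamic loader's search path before launching vLLM. This is a minimal sketch; the reinstall index URL in the comment is an assumption to verify against the vLLM installation docs for your CUDA version:

```shell
# Let the loader resolve CUDA libraries from the PyTorch wheel instead of
# a mismatched system install (avoids CUBLAS_STATUS_INVALID_VALUE).
unset LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH is now: ${LD_LIBRARY_PATH:-<unset>}"

# Alternatively, reinstall against a matching CUDA build (the index URL is
# an assumption -- check the vLLM install docs for your CUDA version):
# pip install -U vllm --extra-index-url https://download.pytorch.org/whl/cu129
```

Unsetting the variable is the least invasive fix, since PyTorch wheels ship their own CUDA runtime libraries and only fall back to system paths when LD_LIBRARY_PATH points at them.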

FlashAttention 4 Integration

vLLM now supports the FlashAttention 4 backend, bringing next-generation attention performance. FlashAttention 4 delivers improved memory efficiency and faster attention computation, particularly beneficial for long-context workloads.
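vLLM normally picks an attention backend automatically, but it can be pinned via the VLLM_ATTENTION_BACKEND environment variable. A hedged sketch: the backend identifier for FlashAttention 4 and the model ID below are assumptions, so check the vLLM docs for the canonical value before relying on them:

```shell
# Pin the attention backend explicitly. The value shown is vLLM's generic
# FlashAttention identifier -- the exact FlashAttention 4 name is an
# assumption to verify in the 0.17 docs.
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 32768
```

Leaving the variable unset lets vLLM fall back to its automatic backend selection, which is the safer default if FlashAttention 4 is not supported on your GPU architecture.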

Model Runner V2 Improvements

Model Runner V2 has reached a major milestone with Pipeline Parallel support, enabling more efficient distributed inference across multiple GPUs.
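Pipeline parallelism composes with tensor parallelism at serve time. A minimal sketch, assuming an 8-GPU node and using a 70B model purely as an illustration (the flags are vLLM's existing --pipeline-parallel-size and --tensor-parallel-size options):

```shell
# 2 pipeline stages x 4-way tensor parallelism = 8 GPUs total.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 4
```

Pipeline parallelism splits the model's layers into sequential stages, so it is most useful when a model does not fit even with tensor parallelism alone, or when spanning multiple nodes with limited interconnect bandwidth.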

New Model Support

vLLM 0.17.1 adds support for Nemotron 3 Super, NVIDIA’s 122B-parameter model with strong reasoning capabilities. The model is also available through Ollama for local deployment.
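Serving a newly supported model follows the usual pattern. The model ID below is a placeholder, since the exact Hugging Face repository name should be taken from the Nemotron 3 Super model card:

```shell
# <model-id> is a placeholder -- substitute the Nemotron 3 Super repository
# name from its Hugging Face model card. A 122B model typically needs
# multiple GPUs, hence the tensor-parallel flag.
vllm serve <model-id> --tensor-parallel-size 8
```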

Bug Fixes and Improvements

  • Fixed activation_type passing for TensorRT-LLM fused MoE
  • Re-enabled expert parallelism for TensorRT-LLM MoE FP8 backend
  • Fixed DeepSeek-V3.2 MTP indexer handling
  • Zero freed SSM cache blocks on GPU for Mamba and Qwen3.5
