NVIDIA Blackwell Sweeps MLPerf Training 6.0 as Open-Source Inference Engines Race to Agentic Readiness

The AI infrastructure landscape is undergoing its most consequential shift since the original Transformer paper. While headline-grabbing model releases still dominate tech news, the real story of mid-2026 is happening one layer down: in the inference engines, training clusters, and MLOps pipelines that actually deliver AI to users. This week brought a flurry of major updates across the stack, from NVIDIA’s benchmark-crushing Blackwell platform to open-source inference engines shipping production-grade agentic optimizations.

NVIDIA Blackwell Sweeps MLPerf Training 6.0

NVIDIA delivered what can only be described as a clean sweep in MLPerf Training v6.0, the latest round of industry-standard AI training benchmarks from the MLCommons consortium. The company achieved the fastest time-to-train at scale and the highest per-accelerator performance on every single benchmark submitted. It was also the only platform to submit results on every test in the suite.

The crown jewel of NVIDIA’s submission was the GB300 NVL72 system, which links 72 Blackwell Ultra GPUs and 36 Grace CPUs as a single unified compute domain via NVLink and NVLink Switch. This architecture proved particularly dominant on the two new pretraining benchmarks introduced in this round: DeepSeek-V3 (a 671B-parameter Mixture of Experts model) and GPT-OSS-20B (a compact but capable MoE). On DeepSeek-V3, an 8,192-GPU cluster trained the model in just 2.02 minutes. On GPT-OSS-20B, a 2,048-GPU configuration finished in 7.43 minutes.

What makes these numbers meaningful is not just the raw speed, but the scale-out efficiency. NVIDIA cloud partners scaled production clusters to 8,192 Blackwell GPUs working in unison across diverse hyperscale data centers, demonstrating strong scaling trends in real-world fleet conditions. The company credits several software innovations for these results:

  • Full-iteration CUDA graphs for token-dropless MoEs: Historically, dynamic routing in MoE architectures forced continuous CPU-GPU synchronizations that broke CUDA graph optimization. NVIDIA eliminated these sync points by deriving input shapes directly from GPU values and managing device memory via “paged stashing,” removing the CPU entirely from the critical path.
  • CuTe DSL kernel fusions: Advanced kernel fusion via CuTe DSL combined memory-bandwidth-bound layers with grouped GEMM operations, keeping data local to registers and avoiding expensive round-trips to global memory. NVIDIA reports more than 8% end-to-end benefit on DeepSeek-V3 and a 93% speedup on GPT-OSS.
  • MXFP8 attention blocks: A new mixed-precision attention recipe improved performance without impacting model quality, a critical optimization for memory-bound training workloads.

The networking story is equally important. NVIDIA’s Spectrum-X Ethernet uses Advanced Adaptive Routing to distribute traffic packet-by-packet across all available paths according to real-time link load, while its congestion control detects incast patterns early and paces senders before buffers overflow. For anyone building large training clusters, this fabric-level intelligence is becoming table stakes.

Dynamo: Building an Agent-Native Inference Stack

While Blackwell wins the training benchmarks, NVIDIA’s Dynamo project is quietly becoming one of the most important pieces of inference infrastructure for the agentic era. In a detailed technical post, the Dynamo team outlined how they are optimizing the entire stack for agentic workloads, where inference patterns look fundamentally different from traditional chat completions.

The core insight is simple but profound: agentic inference is write-once-read-many (WORM). After the first API call writes the conversation prefix to KV cache, every subsequent call hits 85-97% cache reuse. Agent teams (or “swarms”) push this even further, with NVIDIA reporting 97.2% aggregate cache hit rates across multi-agent setups. This creates an 11.7x read-to-write ratio, meaning the system reads from cache nearly 12 times for every token it writes.

Dynamo attacks this problem at three layers:

1. Multi-Protocol Frontend

Dynamo serves v1/chat/completions, v1/messages, and v1/responses through a common internal representation. This matters because the newer responses API uses typed content blocks, allowing the orchestrator to see boundaries between thinking, tool calls, and text, and apply different cache and scheduling policies per block type.

2. Agent Hints

Dynamo’s new agent_hints extension lets agent harnesses attach structured metadata to requests: estimated output length, priority levels, and speculative prefill signals. A harness can tell the orchestrator “this tool call is about to return, warm the cache now,” enabling proactive cache warming across tool-call gaps.

3. KV Cache Retention

With sessions running for minutes to days and long tool-call pauses in between, Dynamo supports cache pinning with TTL controls, protecting computed prefixes from eviction during idle periods. This is infrastructure that simply did not exist for open-source deployments until recently.

NVIDIA is running internal Dynamo deployments of GLM-5 and MiniMax-M2.5 to benchmark against closed-source inference, targeting parity on cache reuse performance. For teams running open models on their own GPUs, this is a significant development.

vLLM 0.23.0: The Open Inference Engine Grows Up

The vLLM project shipped v0.23.0 this week with 408 commits from 200 contributors, and the release notes read like a roadmap for where open-source inference is heading. Several developments stand out:

DeepSeek-V4 maturation: Following its introduction in v0.22.0, DeepSeek-V4 received a major hardening pass with sparse MLA metadata decoupled from DeepSeek-V3.2, TRTLLM-gen attention kernels, EPLB support for the Mega-MoE, selective prefix-cache retention for sliding-window KV cache, and an index-share feature for DSA MTP.

Model Runner V2 expansion: MRv2 is now the default for Llama and Mistral dense models (in addition to Qwen3), with FlashInfer sampler integration, breakable CUDA graphs, pipeline-parallel bubble elimination, and Gemma 4 MTP support.

Multi-tier KV cache offloading: The offloading framework gained an object-store secondary tier, HMA enabled by default for capable connectors, tiering support for HMA models, and per-request offloading policies via lifecycle hooks. This is a direct response to the same memory pressure that Dynamo is solving at the NVIDIA layer.

Rust frontend maturation: The experimental Rust frontend added streaming generate endpoints, dynamic LoRA endpoints, tool parsers for InternLM2, hy_v3, Phi-4-mini, and Gemma4, plus request-ID headers and server-router extension hooks.

vLLM also added support for Transformers v5, new models including Step-3.7-Flash, Cosmos3 Reasoner, JetBrains Mellum v2, Granite Speech Plus, and Cohere Mini Code. The project is clearly positioning itself as the universal inference runtime for the open model ecosystem.

Ollama 0.30: Local Inference Gets Serious

On the local deployment front, Ollama 0.30 continues its rapid evolution. The headline feature is improved compatibility and performance using llama.cpp, which augments the existing MLX engine on Apple Silicon and brings broader hardware support including faster performance on NVIDIA GPUs.

Key updates include:

  • Gemma 4 QAT weights: Quantization-Aware Training dramatically reduces memory requirements for on-device performance, with optimized tags for the full Gemma 4 family.
  • Cohere2Moe architecture support: Expanding Ollama’s MoE model coverage.
  • Hermes Desktop integration: A native desktop interface for the Hermes agent, runnable via ollama launch hermes-desktop.
  • Improved prompt caching: Decoupled from context shift for better KV cache reuse, directly addressing the same WORM pattern that Dynamo targets at scale.
  • MLX embedding layers with NVFP4 global scale: Improved quantization on Apple Silicon.

Ollama also launched integration with Oh My Pi (an AI coding agent) and Cline CLI, plus support for ollama launch commands for Qwen Code, Codex, and Pi. The project is evolving from a simple model runner into a full local AI agent platform.

LiteLLM 1.90: The Gateway Standard Tightens

LiteLLM v1.90.0-rc.1 shipped with a notable security enhancement: all Docker images are now signed with cosign, with verification instructions published for every release. In an era where supply chain attacks on AI infrastructure are a growing concern, this is a welcome development.

Other significant updates include standardized rate limit errors with structured fields (category, rate_limit_type, model, llm_provider), MCP tool configuration improvements, Azure AI MAI-Image-2.5 image generation support, and continued UI refinements for team management and guardrails. LiteLLM remains the de facto standard for organizations that need to route inference across multiple providers and models.

OpenClaw and Mistral: Agent Security and Scale

Two other developments merit attention. OpenClaw announced a collaboration with NVIDIA for stronger agent skill security, with every ClawHub skill now shipping with a Skill Card documenting capabilities and provenance, plus scanning by SkillSpector for hidden instructions and agentic risks. The project also introduced a Skill Workshop for reviewing and applying proposed skills before they change agent behavior.

Mistral continues its rapid product expansion with the launch of Vibe (a unified agent for productivity and coding), Forge (a system for enterprises to build frontier-grade models on proprietary knowledge), and a deepening partnership with NVIDIA to accelerate open frontier models. Mistral also published a notable engineering post on debugging a memory leak in vLLM, contributing back to the open-source inference ecosystem.

What This Means for Infrastructure Teams

Three trends are converging that every AI infrastructure team should be tracking:

1. Agentic inference is becoming a first-class workload. The WORM access pattern, KV cache pressure, and multi-turn tool-call loops are no longer edge cases, they are the primary use case. Infrastructure that optimizes for chat completions but chokes on agentic workloads will need significant rework.

2. Scale-out training and inference are merging into a unified fabric story. NVIDIA’s Blackwell dominance in training and its Dynamo push in inference are not separate stories, they are two sides of the same coin: a unified compute fabric that can handle both massive training runs and low-latency agentic serving.

3. The open-source inference stack is hitting production maturity. vLLM, Ollama, and LiteLLM are no longer experimental projects. They are shipping the features that enterprise teams need: multi-tier caching, protocol flexibility, security signing, and model coverage that rivals closed-source alternatives.

The next six months will likely see these trends accelerate as more organizations move from AI experiments to production agentic systems. The infrastructure is ready. The question is whether the teams building on top of it are.

Sources