AgentPerf Benchmark Launches, vLLM v0.23.0 Ships: AI Infrastructure This Week

The AI infrastructure conversation has shifted. For years, the industry optimized for one-shot chat completion: a user sends a prompt, the model responds, done. But agentic AI — systems that reason across dozens of tool calls, maintain state across hours, and spawn parallel subagents — has exposed the cracks in that architecture. The inference stack is being rebuilt, and the rebuild is happening fast.

This week brought three major signals that the shift is accelerating: the first-ever AgentPerf benchmark for agentic workloads went live, vLLM v0.23.0 shipped with DeepSeek-V4 and multi-tier KV cache improvements, and NVIDIA detailed how its Dynamo framework and DOCA security stack are being rebuilt for agents.

The Problem: Inference Was Not Built for Agents

Traditional inference assumes each request is independent. Agentic workloads break that assumption completely. A coding agent like Claude Code or Codex makes hundreds of API calls per session, each carrying the full conversation history. Once the first call has written the system prompt and context to KV cache, every subsequent call sees 85–97% cache reuse. The system reads from cache nearly 12 times for every token it writes, a write-once-read-many (WORM) pattern that legacy serving stacks were never designed to handle.
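To see how a read-to-write ratio of about 12 to 1 lines up with the 85–97% reuse range, here is a small arithmetic sketch. It is a toy model, not a measurement: the function names and the token counts in the usage note are assumptions made up for illustration, and it ignores output tokens and cache eviction.

```python
def cache_hit_fraction(reads_per_write: float) -> float:
    """Share of processed tokens served from cache for a given read:write ratio."""
    return reads_per_write / (reads_per_write + 1.0)


def session_token_counts(context_tokens: int, new_tokens_per_call: int, calls: int):
    """Return (tokens_read_from_cache, tokens_written) for one session.

    Toy assumption: every call resends the full history, the whole prior
    history is a cache hit, and only the newly appended suffix is computed.
    """
    read = 0
    written = 0
    history = 0
    for i in range(calls):
        new = context_tokens if i == 0 else new_tokens_per_call
        read += history      # prior history is read back from cache
        written += new       # only the new suffix is computed and written
        history += new
    return read, written
```

With 12 reads per write, cache_hit_fraction(12) is 12/13, or about 92%, which sits inside the quoted range. In this toy model, session_token_counts shows the ratio climbing as a session gets longer, because every additional call re-reads the entire history but writes only its own suffix.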

The cost of getting this wrong is steep. With N workers behind default round-robin routing, turn two of a conversation has roughly a 1/N chance of landing on the same worker as turn one. Every miss forces a full prefix recomputation: expensive, slow, and invisible to users until their agent hangs for seconds between tool calls.
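The 1/N figure can be checked with a quick simulation. The sketch below assumes that, with many interleaved sessions, round-robin placement of any one conversation's turns is effectively uniform across workers. It also shows one simple alternative, hashing a session ID to a fixed worker. Both function names are hypothetical and neither reflects any particular router's implementation.

```python
import hashlib
import random


def same_worker_rate(num_workers: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate how often turn two lands on the same worker as turn one.

    Assumption: interleaved traffic makes each turn's placement uniform.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        turn_one = rng.randrange(num_workers)
        turn_two = rng.randrange(num_workers)
        if turn_one == turn_two:
            hits += 1
    return hits / trials


def sticky_worker(session_id: str, num_workers: int) -> int:
    """Map a session to a fixed worker so later turns find their cached prefix."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

For eight workers, same_worker_rate(8) comes out close to 0.125, so roughly seven turns in eight would miss the cache. Sticky hashing removes that lottery but does nothing for load balance or for prefixes shared across sessions, which is the gap cache-aware routers aim to close.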

AgentPerf: The First Real Benchmark for Agentic Workloads

Benchmarking agentic workloads is notoriously hard — trajectories are non-deterministic, tool calls introduce variable latency, and traditional throughput metrics miss the point. AA-AgentPerf, launched by Artificial Analysis this week, is the first attempt to solve this at scale. It measures how many concurrent agents an inference system can support while meeting strict service-level objectives for output speed and time-to-first-token, using private prerecorded trajectories from real coding tasks across 12+ programming languages.

On launch day, NVIDIA GB300 NVL72 achieved up to 20x higher concurrent agents per megawatt than NVIDIA H200, with 61.4K agents supported per megawatt at SLO tier 30. The benchmark uses DeepSeek-V4-Pro across multiple SLO tiers and simulates representative CPU-side tool-call latency with a one-second median delay. This is the first time the industry has an apples-to-apples way to compare hardware for agentic inference.
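The shape of a "concurrent agents within SLO" metric can be sketched as a search for the largest concurrency that still meets both thresholds. This is not the published AgentPerf methodology: the function names, the thresholds, and the assumption that performance only degrades as concurrency rises are illustrative.

```python
from typing import Callable, Tuple

# (output tokens per second per agent, time-to-first-token in seconds)
Measurement = Tuple[float, float]


def max_agents_within_slo(
    measure: Callable[[int], Measurement],
    min_speed: float,
    max_ttft: float,
    upper_bound: int,
) -> int:
    """Binary-search the largest concurrency that meets both SLO thresholds.

    Assumes that once a concurrency level violates the SLO, every higher
    level does too.
    """
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        speed, ttft = measure(mid)
        if speed >= min_speed and ttft <= max_ttft:
            lo = mid          # mid agents fit; try more
        else:
            hi = mid - 1      # mid agents violate the SLO; try fewer
    return lo


def agents_per_megawatt(agents: int, power_watts: float) -> float:
    """Normalize a concurrency figure by power draw."""
    return agents / (power_watts / 1_000_000)


def toy_system(concurrency: int) -> Measurement:
    """Made-up system: speed is shared evenly, TTFT grows linearly."""
    return 3000 / concurrency, 0.1 + 0.01 * concurrency
```

Calling max_agents_within_slo(toy_system, min_speed=25, max_ttft=2.0, upper_bound=1000) returns 120 for this toy system, where the speed threshold binds before the TTFT threshold does. Dividing by power makes results comparable across systems of different sizes.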

NVIDIA Dynamo: Three Layers of Agent-Native Infrastructure

NVIDIA’s Dynamo framework is a full-stack redesign across three layers. The frontend introduces agent hints: structured metadata that harnesses (Claude Code, Codex, OpenClaw, etc.) can attach to requests to signal priority, estimated output length, and cache prefetching needs. The router maintains a global index of KV cache blocks across workers, with the Flash Indexer operating at 170 million ops/second. Early integrations from the NeMo Agent Toolkit show a 4x reduction in p50 TTFT and a 1.5x increase in tokens per second versus default routing.
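The idea of a router consulting a global index of cached blocks can be illustrated with a minimal sketch. Nothing here is Dynamo's API: PrefixIndex, block_hashes, and the four-token block size are hypothetical, chosen only to show chained block hashing and longest-prefix routing.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

BLOCK_SIZE = 4  # tokens per block; tiny on purpose for illustration


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash full blocks so each hash identifies the whole prefix so far."""
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixIndex:
    """Hypothetical global index: block hash -> workers holding that block."""

    def __init__(self) -> None:
        self._owners: Dict[str, Set[str]] = defaultdict(set)

    def record(self, worker: str, tokens: List[int]) -> None:
        for h in block_hashes(tokens):
            self._owners[h].add(worker)

    def matched_blocks(self, worker: str, tokens: List[int]) -> int:
        """Count leading blocks of the request already cached on this worker."""
        count = 0
        for h in block_hashes(tokens):
            if worker not in self._owners.get(h, ()):
                break
            count += 1
        return count

    def route(self, tokens: List[int], workers: List[str]) -> str:
        """Pick the worker with the longest cached prefix (first wins on ties)."""
        return max(workers, key=lambda w: self.matched_blocks(w, tokens))
```

If worker w1 has served a 12-token prompt, a follow-up request that extends the same prompt matches three blocks on w1 and none elsewhere, so route sends it to w1. A production router would also weigh load and queue depth; this sketch scores on overlap alone.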

The most radical change is KV cache management. Dynamo builds a four-tier memory hierarchy (GPU → CPU → local NVMe → remote storage) with global deduplication by sequence hash. When a lead agent computes its system prompt and tool definitions, those blocks write through to shared storage. When subagents spawn on different workers, they load the prefix via RDMA instead of recomputing it. For a lead agent and three subagents, four redundant prefill computations become one compute and three loads. Selective cache retention lets harnesses assign priority per block type: system prompts (highest), conversation history (high), reasoning tokens (low).
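The "one compute and three loads" arithmetic and the per-type retention priorities can be sketched in a few lines. This is a simplified stand-in, not Dynamo's design: SharedPrefixStore models only the shared tier as a dictionary, the numeric priorities are assumed values that preserve the ordering described above, and no RDMA or real tiering is involved.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

# Assumed numeric weights; only their ordering comes from the text above.
RETENTION_PRIORITY = {"system_prompt": 3, "conversation_history": 2, "reasoning": 1}


def sequence_hash(tokens: List[int]) -> str:
    """Identical token sequences hash to the same key, enabling deduplication."""
    return hashlib.sha256(",".join(map(str, tokens)).encode("utf-8")).hexdigest()


@dataclass
class Block:
    seq_hash: str
    kind: str


class SharedPrefixStore:
    """Hypothetical shared tier keyed by sequence hash."""

    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}
        self.computes = 0
        self.loads = 0

    def get_or_compute(self, tokens: List[int], kind: str) -> Block:
        key = sequence_hash(tokens)
        if key in self.blocks:
            self.loads += 1       # prefix already stored: load it
        else:
            self.computes += 1    # first request pays for prefill
            self.blocks[key] = Block(key, kind)
        return self.blocks[key]


def eviction_order(blocks: List[Block]) -> List[Block]:
    """Lowest-priority blocks are evicted first."""
    return sorted(blocks, key=lambda b: RETENTION_PRIORITY[b.kind])
```

Requesting the same system prompt from four agents through one store yields one compute, three loads, and a single stored block. Under memory pressure, eviction_order gives up reasoning tokens before conversation history, and system prompts last.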

vLLM v0.23.0: The Open Source Stack Catches Up

While NVIDIA builds the high-performance end, the open-source inference stack is not standing still. vLLM v0.23.0 shipped this week with 408 commits from 200 contributors, including:

DeepSeek-V4 maturation across backends: sparse MLA metadata decoupled from V3.2, TRTLLM-gen attention kernel, EPLB support for Mega-MoE, selective prefix-cache retention for sliding-window KV cache, and an XPU attention decode path.
Model Runner V2 expanding to Llama and Mistral dense models by default, in addition to Qwen3, with FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, and Gemma 4 MTP support.
Multi-tier KV cache offloading gaining an object-store secondary tier, HMA enabled by default for capable connectors, and tiering support for HMA models.
Transformers v5 compatibility, deprecating v4 support.
A growing Rust frontend with streaming generate endpoints, dynamic LoRA endpoints, and version/server_info APIs.

These are not marginal updates. vLLM’s multi-tier KV cache offloading is converging on the same insight NVIDIA is betting billions on: inference infrastructure must be rebuilt around reuse, not just throughput.

Ollama v0.30.x: Local Inference Gets Smarter

Ollama v0.30.8 shipped this week with prompt caching decoupled from context shift for better KV cache reuse, hardened MLX inference on Apple Silicon with NVFP4 global scale for improved quantization, and more stable recurrent model support. Ollama also added Gemma 4 QAT (Quantization-Aware Training) weights and Nemotron-3-Ultra for high-throughput reasoning. For developers running agents locally, these caching improvements directly translate to faster multi-turn sessions and better memory efficiency.

Security at Agentic Scale: NVIDIA DOCA

The infrastructure story is not complete without security. NVIDIA’s DOCA in-silicon security platform, running on BlueField-4 DPUs, introduces runtime threat detection (DOCA Argus), zero-trust data access control (DOCA Vault), and hardware-accelerated network policy enforcement (DOCA Flow). The pitch is straightforward: security must operate at AI speed and scale, without consuming host CPU resources or competing with inference workloads. With agents gaining increasing authority to act autonomously, securing the infrastructure layer becomes non-negotiable.

What It Means for Builders

If you are running agents today on legacy inference stacks, you are likely leaving performance on the table — and not by small margins. The WORM access pattern, KV cache misses, and stateless routing are all solvable problems, but solving them requires infrastructure that understands agents as first-class citizens, not afterthoughts.

The good news: the tools are arriving. AgentPerf gives you a way to measure whether your stack is actually delivering. Dynamo is open source. vLLM and Ollama are shipping agent-aware improvements weekly. The rebuild is underway. The question is whether your infrastructure is keeping up.