Agentic AI Is Rewriting the Rules of Inference Infrastructure

The agentic AI wave is no longer theoretical. Stripe’s coding agents generate over 1,300 pull requests per week. Ramp attributes 30% of merged PRs to autonomous agents. Spotify reports 650+ agent-generated PRs monthly. Behind every one of these workflows sits an inference stack under unprecedented pressure — and the infrastructure layer is evolving rapidly to meet it.

Over the past two weeks, the AI infrastructure ecosystem has seen significant momentum across hardware benchmarks, inference engines, and orchestration tools. From NVIDIA’s first agentic AI benchmark to vLLM’s massive v0.23.0 release and Ollama’s continued expansion into desktop agents, the story is clear: inference infrastructure is being rebuilt from the ground up for agent-native workloads.

Benchmarking Agentic Inference: NVIDIA Sets the Standard

One of the most significant developments in AI infrastructure this month is the introduction of AA-AgentPerf, the industry’s first multi-vendor open benchmark for profiling agentic coding trajectories. Created by Artificial Analysis, this benchmark measures how many concurrent AI agents an inference system can support while meeting predefined service level objectives (SLOs) for output token speed and time-to-first-token (TTFT).

What makes AA-AgentPerf different from traditional LLM benchmarks is its focus on trajectories — the complete sequence of actions, decisions, and observations an agent makes as it traverses a task from beginning to end. These trajectories include interleaved reasoning and tool calls across 12+ programming languages, with sequence lengths ranging from 5K to 131K tokens and a mean of approximately 27K. Tool calls are simulated with a one-second median delay, reflecting realistic CPU-side task execution.

The benchmark’s launch-day results are striking. NVIDIA GB300 NVL72 delivers up to 20x more concurrent agents per megawatt than the previous-generation H200. At SLO tier 30 (30 tokens/second), GB300 NVL72 supports 61,400 concurrent agents per megawatt compared to H200’s 2,600 — and 57.5 concurrent agents per GPU versus 1.4. These numbers reflect what NVIDIA calls “extreme co-design” across hardware, software, and networking:

WideEP and DeepEP optimizations spread Mixture-of-Experts (MoE) execution across the full NVL72 domain
DeepGEMM and Mega MoE with MXFP4/MXFP8 kernels fuse communication with tensor core compute
NVLink scale-up links 72 GPUs into a single fabric for rapid KV cache and parameter sharing

Looking ahead, the NVIDIA Vera Rubin platform is expected to extend these gains further with 50 PFLOPs of NVFP4 compute, continuing the trajectory of agentic-optimized hardware.

Dynamo: Building the Agent-Native Inference Stack

While benchmarks measure performance, NVIDIA’s Dynamo project is building the infrastructure to actually achieve it. Dynamo is a full-stack inference engine designed specifically for agentic workloads — and its architecture reveals just how different agentic inference is from traditional chatbot serving.

The key insight driving Dynamo’s design is what NVIDIA calls the WORM pattern (Write-Once-Read-Many). In agentic coding sessions, the system prompt and conversation prefix are computed once, then read from cache on every subsequent call. Claude Code, for example, achieves an 85-97% cache hit rate after the first API call, with agent teams pushing this to 97.2% across 4 Opus teammates. The cumulative read-to-write ratio reaches 11.7x — the system reads from cache nearly 12 times for every token it writes.

Dynamo addresses this at three layers:

Layer 1: The Frontend API

Dynamo serves all three major API endpoints — v1/chat/completions, v1/responses, and v1/messages — through a common internal representation. This matters because modern agent harnesses increasingly use v1/responses and v1/messages for their typed content blocks, which allow the orchestrator to see block boundaries and apply different cache and scheduling policies for thinking, tool calls, and text.

Layer 2: Agent Hints

Dynamo introduces an agent hints extension that allows harnesses to attach structured metadata to requests — signals like expected output sequence length, priority levels, and speculative prefill requests. This bridges the gap between what the inference infrastructure sees (anonymous tokenized requests) and what the agent harness knows (which agents are blocked on tool calls, how many turns remain, and whether a call is a quick lookup or long synthesis).

Layer 3: KV Cache Management

The orchestrator implements cache control with TTL-based eviction protection, keeping KV blocks warm during tool-call gaps that can last minutes or even days. This is critical because cold starts in agent sessions are expensive — every prefix must be recomputed from scratch.

vLLM v0.23.0: The Open Engine Expands

While NVIDIA pushes proprietary optimizations, the open-source inference engine vLLM continues its relentless pace with the v0.23.0 release — featuring 408 commits from 200 contributors, 63 of them new. This release is a major milestone for production inference infrastructure.

The headline improvement is DeepSeek-V4 maturation. Following its introduction in v0.22.0, the model received a comprehensive hardening pass including decoupled sparse MLA metadata, a TRTLLM-gen attention kernel, EPLB support for the Mega-MoE architecture, selective prefix-cache retention for sliding-window KV cache, and an index-share feature for DSA MTP. The model was also detached from torch.compile for broader compatibility, with refactored attention and RoPE paths.

Model Runner V2 — vLLM’s next-generation execution engine — is now selected by default for Llama and Mistral dense models in addition to Qwen3. It brings FlashInfer sampling, breakable CUDA graphs, pipeline-parallel bubble elimination, and improved kernel block-size support for hybrid models.

Other notable infrastructure improvements in v0.23.0 include:

Multi-tier KV cache offloading with object-store secondary tier support and per-request offloading policies
Transformers v5 compatibility with vendored processors and model-specific fixes
Rust frontend expansion with streaming generate endpoints, dynamic LoRA support, and new tool parsers
Unified parser architecture for reasoning and tool-call parsing behind a single interface
New model support for Gemma 4 Unified, Step-3.7-Flash, Cosmos3 Reasoner, and Cohere Mini Code

The Local Inference Ecosystem: Ollama and llama.cpp

While datacenter inference grabs headlines, the local and edge inference stack is equally active. Ollama released v0.30.8 with improved prompt caching decoupled from context shift for better KV cache reuse, more stable MLX inference with hardened linear and embedding layers, and improved recurrent model support.

Perhaps more interestingly, Ollama Launch now supports Hermes Desktop — a native desktop interface for the Hermes agent. Users can run ollama launch hermes-desktop to get a visual interface for managing conversations, integrations, and messaging apps. This signals Ollama’s evolution from a simple model runner to a full agent platform.

llama.cpp continues its daily release cadence with build b9642, featuring WebGPU improvements for I-quants matmul performance, expanded platform support including Ubuntu s390x, and continued refinement of the iOS XCFramework and Android builds. The project’s relentless pace — daily builds across dozens of platforms — keeps it the reference implementation for quantized local inference.

LiteLLM and OpenClaw: The Glue Layer

Infrastructure is not just about inference engines — it’s about how everything connects. LiteLLM continues to serve as the universal router, with v1.88.2 adding Fable 5 support, CrowdStrike AIDR integration, Mantle Responses SigV4 authentication, and Docker image signing via cosign for supply-chain security.

OpenClaw — the open-source personal AI agent runtime — has been particularly active, collaborating with NVIDIA on agent skill security through SkillSpector scanning and Skill Card documentation. Its recent releases (v2026.6.8-beta.1) bring richer Telegram and WhatsApp channel delivery, improved gateway recovery, and the new Skill Workshop for turning agent work into reusable skills.

What This Means for Infrastructure Teams

The convergence of these developments points to several clear trends for teams building AI infrastructure:

First, KV cache is the new bottleneck. As agentic workloads produce 11.7x read-to-write ratios, infrastructure teams must optimize for cache retention, routing, and tiered storage rather than raw throughput. The WORM pattern demands fundamentally different system design.

Second, benchmarks are maturing. AA-AgentPerf establishes a much-needed standard for measuring agentic inference performance. Teams can now compare hardware and software configurations using realistic, trajectory-based workloads rather than synthetic token-per-second metrics.

Third, the open stack is keeping pace. vLLM, Ollama, and llama.cpp are shipping production-ready features for agentic workloads — multi-tier KV offloading, improved cache management, desktop agent interfaces — at a cadence that rivals proprietary solutions.

Fourth, orchestration is becoming agent-aware. Dynamo’s agent hints, LiteLLM’s universal routing, and OpenClaw’s skill framework all point to infrastructure that understands the semantic structure of agent workflows, not just the tokens flowing through them.