Agentic AI has crossed a threshold in mid-2026. What began as an experimental paradigm — LLMs that reason, plan, and execute across multiple steps — is now the dominant mode of productive AI work. The evidence is everywhere: OpenAI’s internal workforce has shifted virtually all AI usage from chatbots to agents. New hardware is being purpose-built for agentic inference. Benchmarks are emerging that measure not just answer quality, but the full trajectory of multi-step agent execution. And open-source models are closing the gap on frontier closed systems for long-horizon tasks.
Here is what has happened across the agentic AI landscape in recent weeks, and why it matters for how software gets built.
OpenAI’s Workforce Has Gone Agent-Native
The most telling signal of where agentic AI is headed comes from OpenAI itself. In a detailed blog post, the company revealed that Codex — its coding and agentic work environment — now accounts for 99.8% of weekly output tokens generated within OpenAI. Every department, including non-technical teams like Legal, Finance, and Recruiting, has switched to Codex as their primary AI tool.
The adoption curve tells the story. Engineers began migrating first, with the average OpenAI engineer shifting the majority of their AI usage to Codex by December 2025. Legal, finance, and recruiting crossed the same threshold around April 2026, but their transitions were far more abrupt. Since August 2025, non-developer usage among individual users has grown 137x, organizational non-developer usage has grown 189x, and even within OpenAI — where adoption was already high — non-developer Codex usage increased 12x.
Perhaps more striking is what people are asking agents to do. By May 2026, 80.6% of sampled individual users made at least one Codex request estimated to exceed 30 minutes of human work. 70.2% made one exceeding one hour. And 25.6% made at least one request estimated to exceed eight hours of human effort. At the 99th percentile, daily active users at OpenAI were running more than 60 hours of Codex agent turns per day — distributed across multiple parallel agents.
The implication is clear: agents are no longer augmenting individual tasks. They are becoming the primary substrate for knowledge work itself.
GPT-5.6 Sol Raises the Frontier
On June 26, OpenAI previewed GPT-5.6 Sol, its next-generation flagship model, alongside Terra (a balanced, 2x cheaper model) and Luna (fast and affordable). Sol launches with what OpenAI calls its “most robust safety stack to date,” including real-time misuse classifiers, layered safeguards, and differentiated access tiers.
Capability-wise, GPT-5.6 Sol introduces a new “ultra mode” that leverages subagents to accelerate complex work — effectively deploying multiple agents in parallel to solve problems that would overwhelm a single model instance. It sets a new state of the art on Terminal-Bench 2.1, which tests command-line workflows requiring planning, iteration, and tool coordination. On GeneBench v1, which evaluates long-horizon genomics analyses, it outperforms GPT-5.5 while using fewer tokens.
Notably, the preview is being rolled out under a limited-access framework coordinated with the U.S. government. OpenAI stated that it does not believe this government-access process should become the long-term default, but is taking the short-term step to work toward broader availability while the Administration develops a cyber Executive Order framework for future model releases.
OpenAI Designs Its Own Silicon: The Jalapeño Chip
Days before the GPT-5.6 preview, OpenAI and Broadcom unveiled Jalapeño — OpenAI’s first custom Intelligence Processor, designed from the ground up for LLM inference. Unlike general-purpose accelerators adapted for AI workloads, Jalapeño was architected around OpenAI’s understanding of kernels, memory movement, networking, and serving patterns specific to frontier models.
Early testing indicates performance per watt “substantially better than current state-of-the-art.” The architecture reduces data movement and balances compute, memory, and networking resources to achieve realized utilization closer to theoretical peak. Broadcom’s Tomahawk networking silicon and Celestica’s production systems support large-scale deployment. Engineering samples are already running ML workloads in the lab, including GPT-5.3-Codex-Spark.
The companies plan gigawatt-scale deployment with data center partners beginning in 2026, marking a significant expansion of OpenAI’s full-stack strategy from products and models down to custom silicon.
NVIDIA Defines How to Benchmark Agentic Workloads
While models get the headlines, the infrastructure underneath them is evolving just as rapidly. NVIDIA recently achieved leading performance on AA-AgentPerf — the industry’s first multi-vendor open benchmark for agentic AI hardware. Created by Artificial Analysis, AgentPerf measures how many concurrent agents an inference system can support while meeting service-level objectives for token speed and time-to-first-token.
The benchmark captures something previous metrics missed: the non-deterministic nature of agent trajectories. Real agents interleave reasoning with tool calls, producing variable-length sequences that stress both GPU compute and memory bandwidth. AA-AgentPerf uses prerecorded agentic coding trajectories with interleaved reasoning and tool use, simulating realistic CPU-side tool-call delays.
The results are dramatic. NVIDIA GB300 NVL72 delivers up to 20x more concurrent agents per megawatt than the previous-generation H200. At SLO #30 (30 tokens per second, 10-second TTFT), GB300 NVL72 supports 61,400 concurrent agents per megawatt versus H200’s 2,600. This performance comes from extreme co-design: WideEP and DeepEP optimizations spreading MoE expert execution across the full NVL72 domain, DeepGEMM and Mega MoE fused kernels, and NVLink linking 72 GPUs into a single high-bandwidth fabric.
NVIDIA Dynamo Optimizes the Agentic Inference Stack
Complementing the hardware gains, NVIDIA’s Dynamo project is building agent-native software infrastructure. The core insight is that agentic workloads create a write-once-read-many (WORM) KV cache pattern: the system prompt and conversation prefix are computed once, then read from cache on every subsequent API call.
Real numbers from production agent systems illustrate the scale. Claude Code achieves 85-97% cache hit rates per call after the first request. Agent teams (or swarms) push this to 97.2% aggregate cache hit rate across four Opus teammates. The read-to-write ratio reaches 11.7x — nearly 12 cache reads for every token written. Maximizing cache reuse across workers and keeping KV blocks warm and routable is now the central optimization target for agentic inference.
Dynamo addresses this at three layers: frontend API support (handling v1/responses, v1/messages, and v1/chat/completions through a common representation), intelligent routing and scheduling, and KV cache management optimized for the WORM pattern. NVIDIA is already running Dynamo deployments of GLM-5 and MiniMax2.5 internally to power Codex and Claude Code harnesses, benchmarking against closed-source inference providers.
Mistral Ships Remote Cloud Agents
Mistral AI is taking a different angle: moving agents from local machines to the cloud. In mid-June, the company launched remote agents in Vibe, powered by its new Mistral Medium 3.5 model — a dense 128B model with a 256k context window that merges instruction-following, reasoning, and coding into a single set of weights.
The remote agent model is compelling. Coding sessions run asynchronously in the cloud, can be spawned from the Vibe CLI or directly in Le Chat, and continue working while developers step away. Multiple agents can run in parallel. Local CLI sessions can be “teleported” up to the cloud, carrying session history and task state. When work completes, the agent can open a pull request on GitHub and notify the user via Slack or Teams.
Mistral Medium 3.5 scores 77.6% on SWE-Bench Verified, placing it ahead of Devstral 2 and Qwen3.5. It is released as open weights under a modified MIT license and can self-host on as few as four GPUs. The company also introduced a new “Work mode” in Le Chat — a powerful agentic mode for complex multi-step tasks like research, analysis, and cross-tool workflows, powered by a new harness and the same model.
GLM-5.2 Pushes Open-Source Context to 1M Tokens
The open-source frontier is not standing still. Zhipu AI’s GLM-5.2 introduces a solid 1M-token context window with MIT licensing — no regional restrictions. The model was explicitly trained for long-horizon coding-agent scenarios: large-scale implementation, automated research, performance optimization, and complex debugging.
On FrontierSWE — which measures open-ended technical projects spanning hours to tens of hours — GLM-5.2 trails only Claude Opus 4.8 by 1%, while edging out GPT-5.5. On PostTrainBench, where agents must improve small models through post-training, it ranks second only to Opus 4.8. GLM-5.2 introduces configurable effort levels, allowing users to balance capability against latency and cost. At 81.0 on Terminal-Bench 2.1, it is the strongest open-source model and within a few points of the closed-source frontier.
The architecture uses IndexShare, which reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9x at 1M context length. The MTP layer is optimized for speculative decoding with up to 20% increased acceptance length.
IBM’s CUGA: A Harness, Not a Framework
While models compete on benchmarks, IBM Research released CUGA (Configurable Generalist Agent) — an open-source agent harness designed to handle the orchestration, planning, reflection, and state management so developers can focus on task logic. CUGA topped AppWorld and WebArena benchmarks from 2025 through early 2026.
The key insight is that most agent frameworks force developers to rebuild orchestration plumbing for every application. CUGA inverts that: you define tools and prompts, and the harness handles planning, CodeAct execution, reflection, variable tracking across long runs, and multi-agent delegation over A2A. It supports interchangeable tools (OpenAPI, MCP, LangChain), declarative guardrails, and one-environment-variable provider switching across OpenAI, watsonx, Ollama, and more. The team built two dozen single-file working apps to prove the approach, from movie recommenders to IBM Cloud architecture advisors.
Hugging Face Benchmarks Tooling for Agents
As agents become the primary consumers of software libraries, the libraries themselves must evolve. Hugging Face researchers published a study on benchmarking open models on real tooling, measuring not just whether agents produce correct answers but how much work they do to get there.
Their benchmark, built on the pi coding agent and fanned out across Hugging Face Jobs for identical hardware, revealed that small API improvements can dramatically reduce agent token usage. When the hf CLI was redesigned to be agent-optimized, agents used 1.3-1.8x fewer tokens — and up to 6x in some cases. The implication is that library design is entering a new era where “agent-optimized” is as important as “human-optimized.” APIs must be discoverable, documentation must be structured for agent consumption, and error messages must be actionable enough for autonomous recovery.
What This Means for the Stack
Several converging trends define the agentic AI landscape in mid-2026:
- Agents are becoming the default interface. Chatbots are not disappearing, but the most valuable AI work is increasingly delegated to long-running agents that operate across minutes, hours, or even days.
- Hardware is being redesigned for agents. From OpenAI’s Jalapeño to NVIDIA’s GB300 NVL72, custom silicon and extreme co-design are delivering order-of-magnitude gains for agentic inference.
- Open-source models are catching up. GLM-5.2 and Mistral Medium 3.5 demonstrate that open weights can compete with closed frontier models on long-horizon tasks, with the added benefit of self-hosting and no usage restrictions.
- Benchmarks are maturing. AA-AgentPerf, Terminal-Bench, FrontierSWE, and agent-specific tooling benchmarks are creating a more nuanced picture of what “capable” means in the agentic era.
- The infrastructure stack is splitting into layers. Harnesses (Claude Code, Codex, OpenClaw) drive workflows. Orchestrators (Dynamo, CUGA) handle routing and cache management. Runtimes (SGLang, vLLM, TensorRT LLM) execute models and manage KV caches.
The next phase of AI infrastructure will not be measured by how well a single model answers a single prompt. It will be measured by how many agents can run concurrently, how long they can sustain context, how efficiently they share state, and how reliably they complete multi-hour tasks without human intervention.
That future is arriving faster than many expected.
Sources
- OpenAI: How agents are transforming work
- OpenAI: Previewing GPT-5.6 Sol
- OpenAI and Broadcom: Jalapeño Inference Chip
- NVIDIA: AgentPerf Benchmark Results
- NVIDIA: Full-Stack Optimizations for Agentic Inference with Dynamo
- Mistral: Remote Agents in Vibe
- Zhipu AI: GLM-5.2
- IBM Research: CUGA Agent Harness
- Hugging Face: Benchmarking Open Models on Tooling
- NVIDIA: DFlash Speculative Decoding
- NVIDIA: AI-Q Blueprint on OCI
