Cloudflare is officially entering the frontier model race with an announcement that expands its AI platform beyond small, efficient models into the territory of large-scale open-source LLMs. The company revealed that Workers AI now supports large frontier models, starting with Moonshot AI's Kimi K2.5. This marks a strategic pivot for Workers AI, which has historically prioritized smaller, faster models optimized for edge deployment. The move comes with compelling economics that could reshape how organizations think about AI inference infrastructure.
The Economics That Caught Cloudflare's Attention
For the past two years, Workers AI maintained a disciplined focus on smaller models that could run efficiently on its global edge infrastructure. The reasoning was sound: most use cases didn't need frontier-scale reasoning, and smaller models offered better latency and cost profiles. But the landscape has shifted dramatically as open-source models like Kimi K2.5 began offering frontier-level capabilities at a fraction of the cost of proprietary alternatives.
Cloudflare's internal testing reveals exactly why this shift matters at scale. The company runs an internal security review agent that processes over 7 billion tokens daily across its codebases. Running this workload on a mid-tier proprietary model would cost an estimated $2.4 million annually for that single use case. Switching to Kimi K2.5 on Workers AI reduced that cost to just 23% of the original, a 77% cost reduction that fundamentally changes the economics of AI at scale.
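As a back-of-the-envelope check, the figures above work out as follows. The 7 billion tokens/day and $2.4 million annual figures come from Cloudflare's example; the derived numbers below are simple arithmetic, not published rates:

```python
# Illustrative cost math for the figures quoted above.
# TOKENS_PER_DAY and PROPRIETARY_ANNUAL_COST are from Cloudflare's example;
# everything derived from them is our own arithmetic.

TOKENS_PER_DAY = 7_000_000_000
PROPRIETARY_ANNUAL_COST = 2_400_000   # mid-tier proprietary model, USD/year
KIMI_COST_FRACTION = 0.23             # Kimi K2.5 cost as a fraction of the above

kimi_annual_cost = PROPRIETARY_ANNUAL_COST * KIMI_COST_FRACTION
annual_savings = PROPRIETARY_ANNUAL_COST - kimi_annual_cost
savings_pct = annual_savings / PROPRIETARY_ANNUAL_COST

# Implied blended rate of the proprietary baseline, given the stated volume
tokens_per_year = TOKENS_PER_DAY * 365
implied_rate_per_million = PROPRIETARY_ANNUAL_COST / (tokens_per_year / 1_000_000)

print(f"Kimi K2.5 annual cost: ${kimi_annual_cost:,.0f}")
print(f"Annual savings:        ${annual_savings:,.0f} ({savings_pct:.0%})")
print(f"Implied baseline rate: ${implied_rate_per_million:.4f} per 1M tokens")
```

The 23% figure corresponds to roughly $552,000/year, for about $1.85 million in annual savings on that one workload.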
This isn't just about Cloudflare's internal tooling. The company points to the rise of personal agents like OpenClaw, always-on AI assistants that individuals run continuously. When every employee has multiple agents processing hundreds of thousands of tokens per hour, the cost math for proprietary models breaks down quickly. Organizations need alternatives, and Workers AI is positioning itself as the platform for this transition.
Technical Specifications and Capabilities
The Kimi K2.5 deployment on Workers AI brings substantial technical capabilities that make it suitable for serious agentic workloads:
- 256K context window, enabling processing of extensive codebases, long documents, and complex multi-turn conversations without truncation
- Native tool calling with support for multi-turn tool interactions and structured JSON outputs
- Vision input support for multimodal agent workflows that can process and reason about images
- Session affinity headers (x-session-affinity) that route requests to the same model instance for improved cache performance
These specifications position Kimi K2.5 as a legitimate alternative to proprietary frontier models for agent development, code review, security analysis, and complex reasoning tasks.
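To make the tool-calling capability concrete, here is a minimal sketch of a chat request with one tool definition, in the OpenAI-style schema commonly used for tool calling. The REST endpoint shape follows Cloudflare's account-scoped `/ai/run/` API; the model slug and the `get_file` tool are placeholders, not verified identifiers, so check the Workers AI model catalog before use:

```python
import json

# Sketch of a Workers AI chat request with a tool the model may call.
# ACCOUNT_ID, MODEL, and the get_file tool are illustrative placeholders.
ACCOUNT_ID = "YOUR_ACCOUNT_ID"
MODEL = "@cf/moonshotai/kimi-k2.5"  # placeholder slug; confirm in the catalog
URL = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"

def build_request(user_msg: str) -> dict:
    """Build a chat payload with a system prompt and one callable tool."""
    return {
        "messages": [
            {"role": "system", "content": "You are a code-review assistant."},
            {"role": "user", "content": user_msg},
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_file",  # hypothetical tool for illustration
                    "description": "Fetch a file from the repo under review.",
                    "parameters": {
                        "type": "object",
                        "properties": {"path": {"type": "string"}},
                        "required": ["path"],
                    },
                },
            }
        ],
    }

payload = build_request("Review src/auth.ts for injection risks.")
print(json.dumps(payload, indent=2))
```

In a multi-turn tool loop, the model's tool-call response and the tool's result would be appended to `messages` before the next request.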
Platform Innovations for Agentic Workloads
Beyond the model itself, Cloudflare is introducing several platform-level improvements specifically designed for agentic workloads:
Prefix Caching: Workers AI has always cached prefixes, but now the platform surfaces cached token metrics and offers discounted pricing on cache hits. This is critical for agent use cases where the same system prompt, tool definitions, and context documents are sent repeatedly. With context windows up to 256K tokens, avoiding repeated prefill computation on identical prefixes saves substantial time and money.
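Prefix caching rewards requests whose leading content is byte-identical. A simple way to exploit it is to keep the static parts (system prompt, tool definitions, reference documents) first and unchanged, appending only the per-turn content. A minimal sketch of that prompt structure:

```python
# Prefix caching rewards byte-identical leading content across requests.
# Keep static content (system prompt, tool defs, reference docs) first and
# unchanged; append only the varying per-turn message at the end.

STATIC_PREFIX = [
    {"role": "system", "content": "You are a security-review agent."},
    # Tool definitions and long reference documents would also live here,
    # serialized identically on every request.
]

def make_messages(turn: str) -> list:
    """Reuse the same prefix so serialized bytes match across requests."""
    return STATIC_PREFIX + [{"role": "user", "content": turn}]

a = make_messages("Check commit abc123 for secrets.")
b = make_messages("Check commit def456 for secrets.")

# Everything before the final user turn is identical, so the prefill for
# that span is eligible for a cache hit (and the discounted pricing).
assert a[:-1] == b[:-1]
```

The inverse pattern, such as injecting a timestamp or request ID at the top of the system prompt, silently defeats the cache on every call.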
Session Affinity Headers: The new x-session-affinity header routes requests from the same session to the same model instance. This dramatically improves the probability of cache hits across multi-turn conversations, reducing both time-to-first-token latency and overall costs. Cloudflare's Agents SDK starter already implements this automatically.
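Attaching the header is straightforward. The header name comes from the announcement; the value scheme below (a stable hash of your own session identifier) is simply one reasonable convention, not a documented requirement:

```python
import hashlib

# Attach x-session-affinity so multi-turn requests from one conversation
# are routed to the same model instance. The header name is from the
# announcement; deriving the value from a session id is our own convention.

def affinity_headers(api_token: str, session_id: str) -> dict:
    """Headers for a Workers AI request pinned to one logical session."""
    return {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
        "x-session-affinity": hashlib.sha256(session_id.encode()).hexdigest()[:16],
    }

h1 = affinity_headers("API_TOKEN", "user-42/conversation-7")
h2 = affinity_headers("API_TOKEN", "user-42/conversation-7")
h3 = affinity_headers("API_TOKEN", "user-42/conversation-8")

# Same session -> same affinity value; different session -> different value.
assert h1["x-session-affinity"] == h2["x-session-affinity"]
assert h1["x-session-affinity"] != h3["x-session-affinity"]
```

The key property is stability: every turn of one conversation must send the same value, so derive it from something that persists for the session's lifetime.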
Async API Redesign: The revamped asynchronous inference API uses pull-based queuing instead of the historical push-based system. This means async requests are processed as capacity becomes available rather than being dropped during high load. For non-real-time workloads such as code review agents, research agents, and batch processing, this provides durable execution without capacity errors. Internal testing shows async requests typically complete within 5 minutes.
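The client side of a pull-based queue is a submit-then-poll loop. The announcement describes the queuing behavior but not the wire format, so the status-check callback and response fields below are illustrative; the backoff logic is a generic pattern sized to the roughly 5-minute completion window mentioned above:

```python
import time

# Generic submit-and-poll client for a pull-based async inference API.
# Response fields ({'status': ...}) are illustrative, not a documented schema.

def backoff_schedule(max_wait_s: float = 300.0, base_s: float = 2.0) -> list:
    """Exponential backoff delays, capped at 30s, totaling <= max_wait_s."""
    delays, total, d = [], 0.0, base_s
    while total + d <= max_wait_s:
        delays.append(d)
        total += d
        d = min(d * 2, 30.0)
    return delays

def poll_until_done(fetch_status, delays):
    """Poll fetch_status() until it reports completion or delays run out.

    fetch_status is any callable returning a dict like
    {'status': 'queued' | 'done', ...} for the submitted request.
    """
    for d in delays:
        result = fetch_status()
        if result.get("status") == "done":
            return result
        time.sleep(d)
    raise TimeoutError("async request still pending after backoff window")
```

Because the queue is durable, a timeout here means "check again later," not "the request was dropped," so a caller can safely persist the request ID and resume polling.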
The Inference Stack: Under the Hood
Serving a model like Kimi K2.5 at scale requires significant infrastructure investment. Cloudflare developed custom kernels specifically for this model, building on top of its proprietary Infire inference engine.
The optimizations include disaggregated prefill: separating the input-processing phase (prefill) from token generation onto different hardware. This maximizes GPU utilization by allowing the prefill stage to process large context windows while generation proceeds in parallel on separate resources. It's a technique used by frontier model providers to achieve efficient throughput on large models.
Cloudflare emphasizes that achieving this level of performance requires dedicated ML engineering expertise that most organizations lack. The platform abstracts this complexity, offering the benefit of these optimizations through a simple API call rather than requiring teams to build and maintain custom inference infrastructure.
Developer Integration and Access
Developers can access Kimi K2.5 through the standard Workers AI API. The model is available through Cloudflare's developer platform with per-token pricing that reflects the platform's cost advantages. For teams already using Workers AI, the transition is transparent: the same API works for both small models and the new large model offerings.
The combination of edge deployment, aggressive pricing, frontier model quality, and purpose-built agent features positions Workers AI as a compelling alternative to proprietary AI platforms. With personal agents becoming standard infrastructure and organizations running billions of tokens daily, the cost savings could be transformative.
Sources
- Cloudflare Blog: Powering the agents: Workers AI now runs large models, starting with Kimi K2.5 (March 19, 2026)
- Developers.cloudflare.com Workers AI documentation
