Cloudflare has officially entered the large model inference market with a major announcement: Kimi K2.5 is now available on Workers AI, marking the platform’s expansion beyond smaller models into frontier-scale territory.
The move represents a significant shift in Cloudflare’s AI strategy. While Workers AI has served models for two years, the platform historically focused on smaller deployments optimized for edge inference. That changes with Kimi K2.5 — a model boasting a 256,000-token context window, multi-turn tool calling capabilities, vision input support, and structured outputs. This expansion positions Workers AI as a credible alternative to traditional cloud AI providers for demanding enterprise workloads.
The Cost Breakthrough
Cloudflare revealed compelling internal metrics that demonstrate why this matters for production workloads. Their security review agent, which processes over 7 billion tokens daily across codebases, caught more than 15 confirmed vulnerabilities in a single large codebase using Kimi K2.5. The kicker? Running this same workload on a mid-tier proprietary model would cost approximately $2.4 million annually. With Kimi K2.5 on Workers AI, Cloudflare achieved a 77% cost reduction — bringing the annual inference cost down to roughly $550,000.
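The arithmetic behind those figures checks out, as a quick sanity calculation shows (using only the numbers stated above):

```python
# Sanity-check the quoted cost figures.
proprietary_annual_cost = 2_400_000  # ~$2.4M/year on a mid-tier proprietary model
reduction = 0.77                     # claimed 77% cost reduction

workers_ai_annual_cost = proprietary_annual_cost * (1 - reduction)
print(f"${workers_ai_annual_cost:,.0f}")  # → $552,000, i.e. roughly $550K
```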
This cost efficiency stems from custom inference kernels built on Cloudflare’s proprietary Infire inference engine, optimized specifically for the Kimi architecture. The platform implements disaggregated prefill, separating the prefill and generation stages onto different machines to maximize GPU utilization. This architectural choice allows Cloudflare to batch requests more efficiently, reducing idle GPU time and passing those savings to customers.
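Conceptually, disaggregated prefill splits the two stages of inference into separate worker pools: prefill (processing the full prompt to build the KV cache) is compute-bound, while token-by-token generation is memory-bound, so batching each kind of work separately keeps both pools busy. The sketch below is purely illustrative of that idea; the queue structure and names are assumptions, not Cloudflare's actual Infire implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt_tokens: int

# Illustrative only: separate queues standing in for the two machine pools.
prefill_queue: deque[Request] = deque()  # compute-bound: process the whole prompt
decode_queue: deque[Request] = deque()   # memory-bound: generate tokens one by one

def schedule(req: Request) -> None:
    """New requests start in the prefill pool."""
    prefill_queue.append(req)

def run_prefill_step() -> None:
    """Batch all queued prefill work, then hand requests off to the decode pool
    (in a real system the KV cache would be transferred between machines here)."""
    while prefill_queue:
        decode_queue.append(prefill_queue.popleft())

schedule(Request("r1", 1_024))
schedule(Request("r2", 4_096))
run_prefill_step()
print(len(decode_queue))  # → 2
```

The payoff is that a burst of long prompts never stalls ongoing generation: the decode pool keeps emitting tokens while the prefill pool absorbs the batch.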
Platform Enhancements for Agentic Workloads
Alongside the Kimi launch, Cloudflare introduced several features specifically targeting developers building agentic AI applications:
- Prefix caching with token surfacing: Previously invisible cache hits are now exposed as usage metrics with discounted pricing, significantly reducing costs for multi-turn conversations with extensive context. This is particularly valuable for coding assistants and documentation tools that maintain long-running sessions.
- Session affinity header: The new `x-session-affinity` header routes requests to the same model instance, improving cache hit ratios and reducing time-to-first-token latency by up to 40% for repeated interactions.
- Redesigned asynchronous APIs: A pull-based queue system durably processes batched inference requests without "Out of Capacity" errors, making it ideal for non-real-time workloads like code scanning agents, document summarization pipelines, and security analysis tools.
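Adopting the session affinity header is a one-line change per request. The header name comes from the announcement; the endpoint placeholder, token placeholder, and the use of an opaque UUID as the header value are illustrative assumptions.

```python
import json
import uuid

# Hypothetical endpoint placeholder; substitute your account's Workers AI route.
ENDPOINT = "https://api.cloudflare.com/<ACCOUNT_ID>/ai/run/<MODEL>"

def build_request(session_id: str, messages: list[dict]) -> tuple[dict, bytes]:
    """Build headers and body for a chat request pinned to one model instance."""
    headers = {
        "Authorization": "Bearer <API_TOKEN>",  # standard Cloudflare API token
        "Content-Type": "application/json",
        "x-session-affinity": session_id,       # route repeat turns to the same instance
    }
    body = json.dumps({"messages": messages}).encode()
    return headers, body

session = str(uuid.uuid4())  # reuse this value across turns to maximize cache hits
headers, body = build_request(session, [{"role": "user", "content": "Hi"}])
```

Reusing the same `session` value for every turn of a conversation is what lets the prefix cache on that instance pay off.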
Technical Architecture and Performance
Kimi K2.5 on Workers AI leverages Cloudflare’s global network of 330+ data centers, bringing inference closer to end users than centralized cloud regions can achieve. The model supports function calling, JSON mode for structured outputs, and vision capabilities for multimodal applications. For developers building retrieval-augmented generation (RAG) pipelines, the 256K context window enables processing entire codebases or extensive document collections in a single request.
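For RAG planning, a rough fit check against the 256K window is often useful before sending a request. The 4-characters-per-token heuristic below is a common approximation, not Kimi's actual tokenizer, and the headroom reserved for output is an arbitrary illustrative choice.

```python
CONTEXT_WINDOW = 256_000   # Kimi K2.5 context window, in tokens
CHARS_PER_TOKEN = 4        # rough heuristic; real tokenizers vary by language and content

def fits_in_context(texts: list[str], reserve_for_output: int = 8_000) -> bool:
    """Estimate whether a document set fits in one request,
    leaving headroom for the model's generated output."""
    est_tokens = sum(len(t) for t in texts) // CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= CONTEXT_WINDOW

docs = ["x" * 400_000, "y" * 300_000]  # ~175K estimated tokens in total
print(fits_in_context(docs))           # → True
```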
The platform exposes models through a familiar OpenAI-compatible API, simplifying migration from other providers. Authentication uses standard Cloudflare API tokens, and the serverless pricing model means no capacity planning or reservation management — customers pay only for tokens consumed.
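A request body therefore looks like a standard OpenAI-style chat completion. The payload shape follows that convention; the exact model identifier string is a placeholder, not confirmed by the announcement.

```python
import json

# Model identifier is an illustrative placeholder.
payload = {
    "model": "kimi-k2.5",
    "messages": [
        {"role": "system", "content": "You are a code-review assistant."},
        {"role": "user", "content": "Summarize the risky changes in this diff."},
    ],
    "response_format": {"type": "json_object"},  # JSON mode for structured output
}

body = json.dumps(payload)
print(json.loads(body)["response_format"]["type"])  # → json_object
```

Because the shape matches the OpenAI convention, existing client code typically needs only a new base URL and API token to migrate.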
Implications for the AI Infrastructure Market
Cloudflare’s entry into large model serving signals growing market maturity around open-weight alternatives to proprietary APIs. As organizations deploy personal agents running 24/7 and automated systems processing billions of tokens, cost becomes the primary scaling constraint. By positioning Workers AI as an infrastructure bridge between the complexity of self-hosting and the pricing of proprietary APIs, Cloudflare could reshape how enterprises approach AI deployment.
For developers and platform teams, this launch represents a genuine third option: the simplicity of API-based inference with economics approaching self-hosted deployments, backed by Cloudflare’s edge infrastructure. The 77% cost savings Cloudflare demonstrated aren’t theoretical — they’re being realized today in production security scanning workloads.
Sources
- Cloudflare Blog — “Powering the agents: Workers AI now runs large models, starting with Kimi K2.5” (March 19, 2026)
- Moonshot AI — Kimi K2.5 model documentation
- Cloudflare Workers AI Documentation
