OpenAI Builds Its Own Chip, NVIDIA Hits 15x Inference Speedup, and an 18-Year-Old Bug Gets Squashed

This week in AI infrastructure, the biggest story isn’t a model release — it’s a chip. OpenAI unveiled Jalapeño, its first custom AI accelerator, marking a strategic pivot from GPU tenant to silicon designer. NVIDIA wasn’t sitting still either, shipping DFlash speculative decoding that hits 15x on Blackwell, TensorRT 11.0 with native multi-GPU inference, and Dynamo — a new serving stack purpose-built for agentic workloads. And in a masterclass of infrastructure debugging, OpenAI engineers used large-scale core dump analysis to squash an 18-year-old bug hiding in GNU libunwind.

OpenAI’s Jalapeño: A Blank-Slate Inference Accelerator

OpenAI and Broadcom unveiled Jalapeño on June 24, and it’s not a repurposed GPU or a lightly modified AI accelerator. It’s a from-scratch ASIC designed explicitly for LLM inference — what OpenAI calls an “Intelligence Processor.” The company claims early testing shows “performance per watt substantially better than current state-of-the-art,” with a detailed technical report promised in the coming months.

The numbers that matter: nine months from design to tape-out. That’s being called the fastest ASIC development cycle in high-performance semiconductors, accelerated by — fittingly — OpenAI’s own models helping with parts of the design and optimization process. Broadcom handled silicon implementation, networking (including Tomahawk silicon), and board integration, while Celestica contributed rack and system expertise.

Why inference specifically? Because inference is where AI reaches users. Training gets the headlines, but inference is the recurring cost that scales with adoption. Greg Brockman, OpenAI’s president, framed it directly: “By designing more of the stack ourselves, we can serve more intelligence with greater efficiency and keep pushing advanced AI toward broader access.”

The architecture targets what OpenAI calls the “full-stack advantage.” Because the company controls the model, the kernels, the serving system, and now the silicon, each layer can be optimized around the same goal: reducing latency and cost for interactive LLM products. Jalapeño is designed to combine “the power and throughput of today’s leading AI accelerators with latency closer to the fastest specialized inference systems.”

The deployment roadmap is aggressive. Engineering samples are already running GPT-5.3-Codex-Spark in the lab. Initial deployment begins late 2026, scaling to gigawatt-class data centers with Microsoft and other partners over multiple generations. That’s city-scale power consumption — and it signals OpenAI’s bet that inference demand will grow faster than general-purpose GPU efficiency can keep up.

For infrastructure engineers, the implications are clear: the major AI labs are no longer content to rent compute. They’re building their own. Jalapeño is the first in a multi-generation platform, and it puts OpenAI in the same conversation as Google (TPU), Amazon (Trainium/Inferentia), and Meta (MTIA). The era of one-size-fits-all GPU inference is ending.

NVIDIA’s Inference Acceleration Blitz

While OpenAI builds its own chips, NVIDIA is making existing hardware go faster. Three major releases this week target the inference bottleneck from different angles.

DFlash: 15x Faster on Blackwell

NVIDIA published benchmarks for DFlash, a block-diffusion speculative decoding model that serves as a lightweight “drafter” for larger target models. On an eight-GPU DGX B300 system running gpt-oss-120b, DFlash delivered over 15x throughput improvement at high interactivity targets (500-600 tokens/sec per user) compared to autoregressive decoding — and 1.5x better than EAGLE-3, the previous state of the art.

At batch size 1, DFlash more than doubles interactivity. The technique works by generating blocks of candidate tokens in parallel rather than one at a time, then having the target model verify them. The research team has released 20 checkpoints on Hugging Face covering Qwen, Kimi K2.6, Llama, Gemma, and gpt-oss families, with integrations for SGLang and vLLM.

TensorRT 11.0: Native Multi-GPU Inference

NVIDIA TensorRT 11.0 introduced multi-device inference support, bringing native high-performance multi-GPU acceleration to the TensorRT runtime. The integration uses NCCL for transport across NVLink, NVSwitch, PCIe, and InfiniBand, and supports both tensor parallelism (sharding layer weights across GPUs) and context parallelism (partitioning sequences across devices).

Context parallelism is particularly relevant for diffusion and DiT models, where bidirectional attention over long sequences is the dominant cost. Combined with Torch-TensorRT, developers can convert massive PyTorch models out-of-framework and deploy them across multiple GPUs without losing the kernel fusions and quantization optimizations that make TensorRT fast.

Dynamo: Purpose-Built for Agentic Workloads

NVIDIA’s Dynamo serving framework is getting agent-native features. The big insight: coding agents like Claude Code and Codex generate workloads with 85-97% KV cache hit rates and 11.7x read/write ratios. That’s a WORM (write-once-read-many) pattern — and Dynamo is being redesigned around it.

New features include “agent hints” (structured metadata from agent frameworks that let the router optimize scheduling), cache control with TTL pinning, speculative prefill warming, and multi-protocol support for v1/responses and v1/messages APIs. For teams running open-source models on their own GPUs, Dynamo is pitched as the missing orchestration layer that closed API providers already have.

OpenAI Engineers Debug an 18-Year-Old Linux Bug

In a fascinating engineering post, OpenAI described how its infrastructure team tracked down seemingly impossible crashes in the Rockset service (the data system acquired in 2024 that powers ChatGPT’s data plugins and conversation search). The crashes were bizarre: functions would return to NULL addresses, or the stack pointer would mysteriously shift by 8 bytes mid-execution.

The breakthrough came from treating the problem epidemiologically rather than debugging individual cores. By building a high-quality dataset of the entire crash population, they identified two separate issues: silent hardware corruption on one Azure host where the CPU simply failed at arithmetic, and — more surprisingly — an 18-year-old race condition in GNU libunwind, a widely-used open-source library for stack unwinding.

The libunwind bug manifested during signal delivery in C++ services, corrupting saved return addresses under specific timing conditions. The team built tooling to classify crashes at scale, distinguishing hardware faults from software bugs by looking at failure patterns across the fleet. The fix has been upstreamed. It’s a reminder that at hyperscale, even ancient, battle-tested libraries harbor edge cases that only surface when you’re running enough instances to hit the statistical tail.

The Open Source Serving Stack Keeps Maturing

vLLM 0.24.0

The vLLM project shipped version 0.24.0 with 571 commits from 256 contributors. Highlights include MiniMax-M3 support with BF16/FP8 indexing, MXFP4 support, FP8 sparse GQA, extensive AMD/ROCm tuning for MI300X and gfx950, and continued DeepSeek-V4 optimization passes including a FlashInfer sparse index cache and prefill chunk-planning that improves end-to-end throughput by 4%.

Hugging Face Jobs + vLLM

Hugging Face introduced one-command vLLM deployment on its Jobs infrastructure. Run hf jobs run --flavor a10g-large --expose 8000 with the official vLLM image and you get a private, OpenAI-compatible endpoint billed per second. It scales from single-GPU Qwen3-4B to tensor-parallel Qwen3.5-122B on 2x H200. For teams that need ad-hoc inference without managing Kubernetes clusters, it’s a compelling middle ground between self-hosting and managed Inference Endpoints.

Ollama 0.31.1: Gemma 4 on Apple Silicon

Ollama’s v0.31.1 release focuses on speed for Gemma 4 on Apple Silicon, leveraging multi-token prediction (MTP) for up to 90% faster token generation across coding benchmarks. The speedup is automatic — Ollama auto-tunes draft token count at runtime without changing model output. The release also bumps the MLX engine and underlying llama.cpp build.

For local inference enthusiasts, it’s another data point that Apple Silicon is becoming a serious platform for running frontier models — not just for demos, but for actual agentic workloads.

Sources: OpenAI, Broadcom, NVIDIA Developer Blog, TechCrunch, CNBC, VentureBeat, Tom’s Hardware, vLLM Project, Hugging Face, Ollama