Can Your NPU Run DOOM? Chimera Can.
Is your NPU DOOMed? Quadric's Chimera GPNPU runs every AI model — and a complete DOOM engine. Find out why Quadric is different.

"But can it run DOOM?" is the universal litmus test for new silicon, and for good reason. The renderer is a real-time system: raycasting, texture mapping, perspective-correct floor projection, dynamic lighting, depth-buffered sprite compositing, palette-indexed shading. It needs branching, pointer arithmetic, trig, irregular memory access, and complex control flow, all under a hard frame-time deadline.
It requires a computer.
What John Carmack Actually Wrote
The original DOOM engine tells you everything about why this workload is kryptonite for AI accelerators.
Carmack wrote DOOM in ANSI C, with inner loops hand-tuned in x86 assembly. Target hardware was a 386 or 486, machines that often lacked a floating-point unit. The entire engine runs on 16.16 fixed-point integer arithmetic. No floats. No doubles. No FPU. id Software confirmed that DOOM doesn't touch the math coprocessor. Adding a 387 to your 386 did nothing for your framerate.
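For readers unfamiliar with the format, here is a minimal portable-C++ sketch of 16.16 fixed-point arithmetic. DOOM's `m_fixed.h` defines the same layout (`FRACBITS`, `FRACUNIT`, `FixedMul`, `FixedDiv`); the function names below are our own.

```cpp
#include <cstdint>

// 16.16 fixed-point: 16 integer bits, 16 fractional bits.
using fixed_t = std::int32_t;
constexpr int FRACBITS = 16;
constexpr fixed_t FRACUNIT = 1 << FRACBITS;   // represents 1.0

constexpr fixed_t fix_from_int(int n) { return n * FRACUNIT; }

// Multiply through a 64-bit intermediate, then shift back down.
constexpr fixed_t fix_mul(fixed_t a, fixed_t b) {
    return fixed_t((std::int64_t(a) * b) >> FRACBITS);
}

// Divide by pre-shifting the numerator up 16 bits.
constexpr fixed_t fix_div(fixed_t a, fixed_t b) {
    return fixed_t((std::int64_t(a) << FRACBITS) / b);
}
```

Every multiply is an integer multiply plus a shift; every divide is a shift plus an integer divide. Nothing here needs an FPU, which is the whole point.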
There are no matrix multiplies in the renderer. Not one. The pipeline is integer adds, shifts, comparisons, table lookups, and the occasional fixed-point multiply. Walls are vertical texture columns indexed by fixed-point coordinates. Floors are horizontal spans with per-row perspective lookups. Lighting is a 256-entry palette remap (the COLORMAP) indexed by distance. Distance calculations skip square roots entirely: Carmack used an octagonal approximation built from absolute values, a comparison, and a shift.
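The distance approximation described above survives in the released DOOM source as `P_AproxDistance` in `p_maputl.c`; a standalone C++ rendition looks like this:

```cpp
#include <cstdint>
#include <cstdlib>

using fixed_t = std::int32_t;

// Octagonal distance approximation, after DOOM's P_AproxDistance:
// no square root — just two absolute values, one compare, one shift.
fixed_t approx_distance(fixed_t dx, fixed_t dy) {
    dx = std::abs(dx);
    dy = std::abs(dy);
    if (dx < dy)
        return dx + dy - (dx >> 1);
    return dx + dy - (dy >> 1);
}
```

For (dx, dy) = (3, 4) it returns 6 against a true distance of 5 — coarse, but cheap, and accurate enough for distance-based light banding.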
DOOM's computational vocabulary: adds, subtracts, shifts, bitwise ops, comparisons, table lookups, scalar integer multiplies. No GEMMs. No convolutions. No matrix anything. Data-dependent scalar decisions, one after another, through a tight memory access pattern.
If your chip can only do matrix math, this workload will tell you.
What We Built on the GPNPU
For our demo, we implemented a raycasting-based DOOM-style renderer on Quadric. Original DOOM used a BSP-based software renderer rather than pure raycasting, but our version preserves several of the workload characteristics that matter here: irregular control flow, texture sampling, and fixed-point math.
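As a point of reference, the scalar core of a DDA grid raycaster — the textbook algorithm our kernel vectorizes — can be sketched in plain C++. The map layout and names here are illustrative, not taken from our kernel:

```cpp
#include <cmath>
#include <cstdint>

constexpr int MAP_W = 8, MAP_H = 8;
// 1 = wall, 0 = empty; border cells are walls.
const std::uint8_t MAP[MAP_H][MAP_W] = {
    {1,1,1,1,1,1,1,1},
    {1,0,0,0,0,0,0,1},
    {1,0,0,0,0,0,0,1},
    {1,0,0,1,0,0,0,1},
    {1,0,0,0,0,0,0,1},
    {1,0,0,0,0,0,0,1},
    {1,0,0,0,0,0,0,1},
    {1,1,1,1,1,1,1,1},
};

struct Hit { double dist; int side; };

// March one ray from (posX, posY) along (dirX, dirY) cell by cell
// until it lands in a wall cell.
Hit cast_ray(double posX, double posY, double dirX, double dirY) {
    int mapX = int(posX), mapY = int(posY);
    double deltaX = std::abs(1.0 / dirX), deltaY = std::abs(1.0 / dirY);
    int stepX = dirX < 0 ? -1 : 1, stepY = dirY < 0 ? -1 : 1;
    double sideX = (dirX < 0 ? posX - mapX : mapX + 1.0 - posX) * deltaX;
    double sideY = (dirY < 0 ? posY - mapY : mapY + 1.0 - posY) * deltaY;
    int side = 0;
    while (true) {
        if (sideX < sideY) { sideX += deltaX; mapX += stepX; side = 0; }
        else               { sideY += deltaY; mapY += stepY; side = 1; }
        if (MAP[mapY][mapX]) break;   // data-dependent exit: every ray differs
    }
    double dist = (side == 0) ? sideX - deltaX : sideY - deltaY;
    return {dist, side};
}
```

Note the `break`: each ray terminates after a different number of steps, at a runtime-determined map cell. That loop is the part a fixed-function matrix pipeline cannot express.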
We compiled the renderer as a single kernel targeting the Quadric Chimera GPNPU. Every pixel of a 224×168 frame is computed on-chip, in one shot. No host-side rendering. No frame decomposition across a CPU.
We wrote a renderer that speaks the same language DOOM speaks: integer ALU ops, computed memory accesses, data-dependent control flow. Chimera is fluent in all of it. The kernel completes a full 224×168 frame in 560K cycles on QC-N. At a 1 GHz silicon clock, that's ~1,785 frames per second. In 3nm silicon, the entire chip runs at under 1 W.

Running live on an FPGA prototyping the QC-N (Nano) — Quadric's edge-class GPNPU.
Why Other AI Accelerators Can't Even Attempt This
Most AI accelerators are systolic arrays bolted to a DMA engine. They stream matrix multiplies through a fixed datapath. Fine for GEMM-dominated CNN and transformer layers. But a raycasting renderer has zero matrix multiplies.
What it has instead: data-dependent branching where every ray terminates at a different step. Computed memory accesses where texture coordinates come from runtime fixed-point math. Scalar logic for depth comparison, transparency tests, conditional overwrites. Producer-consumer dependencies between the column prepass and the tile renderer.
A typical NPU has no instruction for "branch on wall hit." No mechanism to compute a texture address at runtime and load one byte from an arbitrary SRAM location. The hardware can't do it, the toolchain can't express it, and operator fusion won't save you. This workload was never in the accelerator's design brief to begin with.
On-Chip Memory Autonomy
Our kernel loads the entire rendering context (map, three texture atlases, colormap, column cache, sprite metadata, framebuffer) into under a megabyte of L2 memory. The GPNPU computes addresses and issues single-element loads through the RAU (Random Access Unit). No DMA descriptors, no host intervention, no round-trips to DRAM mid-render.
Most NPU architectures treat on-chip SRAM as a scratchpad managed by a host-programmed DMA controller, tile by tile, layer by layer. A computed lookup into a 512×1024 texture atlas at a runtime-calculated address? Not something the hardware supports.
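In ordinary C++ terms, the access pattern in question is a one-byte load at an address derived from runtime math — trivial for a processor, unexpressible for a DMA-fed datapath. A sketch (atlas dimensions match the text; the naming is ours):

```cpp
#include <cstdint>
#include <vector>

constexpr int ATLAS_W = 1024;   // texels per row
constexpr int ATLAS_H = 512;

// One byte per texel, palette-indexed, resident in on-chip SRAM.
std::vector<std::uint8_t> atlas(std::size_t(ATLAS_W) * ATLAS_H);

// Fetch a single texel at coordinates computed from fixed-point ray math.
// This is the load a host-programmed, tile-by-tile DMA scheme cannot issue.
inline std::uint8_t sample(int u, int v) {
    return atlas[std::size_t(v) * ATLAS_W + u];
}
```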
The Mega-Kernel
The entire renderer compiles and executes as a single GPNPU kernel invocation: raycasting, wall texturing, floor/ceiling projection, lighting, sprite compositing, framebuffer writeback. Chimera's programming model gives you C++ through CCL (Chimera Compute Language), compiled by our LLVM backend targeting the GPNPU. You write C++ with SIMD semantics, fixed-point arithmetic, on-chip memory allocation, and random-access loads. The compiler generates native GPNPU machine code from a real ISA.
No operator library to limit you. No "unsupported op" errors. If you can write it in C++, the GPNPU can run it. Try expressing a DDA raycaster as a sequence of Conv2D and MatMul ops. We'll wait.
Inside the Kernel
Here's what some of this actually looks like in CCL.
Branchless DDA Raycasting. The raycaster doesn't branch on wall hits. Instead, it computes all 32 candidate map cell addresses in a pure ALU phase, fetches them in a single batched RAU load, then finds the first hit via conditional-move accumulation:
// Phase 1: pure ALU — collect all 32 candidate addresses
for (std::int32_t step = 0; step < MAX_STEPS; ++step) {
    auto stepInX = (sideDistX < sideDistY);
    mapX = stepInX ? (mapX + stepDirX) : mapX;
    mapY = stepInX ? mapY : (mapY + stepDirY);
    crossDists[step] = stepInX ? sideDistX : sideDistY;
    hitSides[step] = stepInX ? (std::int32_t)0 : (std::int32_t)1;
    sideDistX = stepInX ? (sideDistX + deltaDistX) : sideDistX;
    sideDistY = stepInX ? sideDistY : (sideDistY + deltaDistY);
    // ... clamp and compute L2 address for map[mapY][mapX]
    addrs[step] = rau::computeAddr<OcmMap>(z0, z0, gy, gx);
}
// Phase 2: one batched RAU load for all 32 cells
rau::config(ocmMap);
rau::load::tiles(addrs, cells, ocmMap, MAX_STEPS);
// Phase 3: branchless first-hit scan
for (std::int32_t step = 0; step < MAX_STEPS; ++step) {
    auto wallHit = (cells[step] != 0) | (!inBounds);
    auto firstHit = wallHit & (!hit);
    perpDist = firstHit ? crossDists[step] : perpDist;
    hitMeta = firstHit ? cells[step] : hitMeta;
    hit = hit | firstHit;
}
Every lane in the SIMD array casts its own ray through a different screen column. Every ray hits a different wall at a different step. The control flow is uniform. No divergence, no serialization. A GEMM engine has no vocabulary for this.
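The same conditional-move accumulation, written as scalar standard C++ you can run anywhere (names mirror the CCL snippet above, but this is a simplified illustration, not kernel code):

```cpp
#include <cstdint>

struct HitResult { std::int32_t perpDist; std::int32_t meta; bool hit; };

// Branchless first-hit scan over precomputed candidates:
// selects, not branches, carry the result forward.
HitResult first_hit(const std::int32_t* cells,
                    const std::int32_t* crossDists, int n) {
    HitResult r{0, 0, false};
    for (int step = 0; step < n; ++step) {
        bool wallHit  = cells[step] != 0;
        bool firstHit = wallHit && !r.hit;   // true at most once per ray
        r.perpDist = firstHit ? crossDists[step] : r.perpDist;
        r.meta     = firstHit ? cells[step]      : r.meta;
        r.hit      = r.hit || firstHit;
    }
    return r;
}
```

Because later iterations can no longer change the result once `hit` latches, every lane can run the full loop in lockstep and still report its own first intersection.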
SIMD-Predicated Sprite Compositing. When compositing sprites, the kernel checks whether any SIMD lane in the current tile actually intersects the sprite before paying for a texture fetch:
auto inSprite = sprActive &
                (myX >= left) & (myX < right) &
                (myY >= top) & (myY < bottom);
if (!chimera::anyOf(inSprite)) continue;
// Only now do we fetch the texel and shade it
addrArr[0] = spriteTexelAddrFromBase(ocmSpriteTex, atlasBaseX, atlasBaseY, texU, texV);
rau::config(ocmSpriteTex);
rau::load::tiles(addrArr, texelArr, ocmSpriteTex, 1);
anyOf() reduces across the entire core array: does any core in this tile care about this sprite? If not, skip the texture fetch, colormap lookup, and depth test. Per-sprite, per-tile, zero wasted memory traffic.
Tile Rendering with Predication. The screen is decomposed into tiles matching the Chimera core array. Tiles entirely above or below the horizon skip the irrelevant plane. Only tiles straddling the horizon pay for both floor and ceiling. Wall pixels are textured from a pre-computed column cache. Four directional lights are dotted against each wall's surface normal, summed with a distance fog term, and applied per-pixel in fixed-point, inline in the kernel.
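A sketch of that per-pixel lighting math in 16.16 fixed-point — dot each light direction against the wall normal, clamp back-facing contributions to zero, sum, and subtract a linear distance fog term. The constants and names here are illustrative, not the kernel's:

```cpp
#include <algorithm>
#include <cstdint>

using fixed_t = std::int32_t;
constexpr int FRACBITS = 16;
constexpr fixed_t FRACUNIT = 1 << FRACBITS;   // 1.0

inline fixed_t fmul(fixed_t a, fixed_t b) {
    return fixed_t((std::int64_t(a) * b) >> FRACBITS);
}

struct Vec2 { fixed_t x, y; };

// Sum clamped dot products of four light directions against the wall
// normal, then attenuate by a linear distance fog term.
fixed_t shade(Vec2 normal, const Vec2 (&lights)[4],
              fixed_t dist, fixed_t fogScale) {
    fixed_t light = 0;
    for (const Vec2& l : lights) {
        fixed_t d = fmul(normal.x, l.x) + fmul(normal.y, l.y);
        light += std::max(d, fixed_t(0));   // back-facing lights add nothing
    }
    fixed_t fog = fmul(dist, fogScale);
    return std::clamp(light - fog, fixed_t(0), FRACUNIT);  // intensity in [0, 1]
}
```

The clamped result indexes the COLORMAP-style palette remap, so the whole lighting path stays in integer arithmetic end to end.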
So What?
Carmack built DOOM to run on the cheapest integer hardware of 1993. Three decades and billions of R&D dollars later, most AI accelerators still can't handle the same class of workload.
Chimera can. The same chip that runs transformer attention and quantized convolutions at full throughput also runs a raycasting engine, because it was designed as a programmable computer from the start.
Your NPU runs neural networks. Ours runs programs.
And yes, it runs DOOM. IDKFA.
The Quadric GPNPU (General-Purpose Neural Processing Unit) is built on the Chimera architecture with native C++ programmability via CCL and a custom LLVM backend. To learn more about deploying real workloads on Chimera silicon, contact us.