From Transistor to Token

01 · The marketing number

Apple’s M4 Neural Engine is marketed at 38 TOPS. A reverse-engineering project called maderix, with 5,300 GitHub stars and growing, measured 19 TFLOPS. The reconciliation is straightforward: Apple counts INT8 operations at twice the FP16 rate, following industry convention. But the hardware dequantizes INT8 inputs to FP16 before compute. The doubling is notational, not physical.
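The reconciliation is pure arithmetic, and worth making explicit. A minimal sketch, assuming only the counting convention described above:

```python
# The marketing number, recovered notationally from the measurement.
MEASURED_FP16_TFLOPS = 19.0   # maderix's measured FP16 throughput
INT8_NOTATIONAL_FACTOR = 2    # convention: INT8 ops counted at 2x the FP16 rate

marketed_tops = MEASURED_FP16_TFLOPS * INT8_NOTATIONAL_FACTOR
print(marketed_tops)  # 38.0 -- the spec-sheet figure, with no extra silicon involved
```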

Call it a lens: a forensic approach to the stack. At every layer of the inference stack, from silicon to framework to system to model to application, the number you are given and the architecture underneath it diverge. Understanding on-device AI capability requires seeing through three layers of abstraction where presentation and architecture come apart. The gap is not deception. It is the distance between a spec sheet optimized for comparison shopping and an architecture optimized for computation.

The ANE as convolution engine

The most fundamental finding from the maderix reverse engineering: the Apple Neural Engine does not execute individual instructions. It accepts a compiled neural network graph and runs the entire thing atomically. There is no instruction-level parallelism to exploit, no pipeline to stall, no branch predictor to confuse. You hand it a graph; it runs the graph. This has a consequence that no benchmark captures. Expressing matrix multiplication as 1×1 convolution yields 3× throughput — because the hardware IS a convolution engine, and matmul is the shim. The hardware was not designed to perform arbitrary neural computation flexibly. It was designed to perform specific neural computations extremely efficiently.

The ANE was built for the workloads Apple needed in 2017: Face ID at the moment you glance at your phone, computational photography operating on every frame before you see it, Live Text recognizing characters in a camera feed. Dense, structured inference at zero watts idle. The key phrase is “at zero watts idle” — when you are not using Face ID, the ANE draws no power at all. For these workloads, the ANE is not merely good. It is arguably the most efficient neural compute hardware ever deployed at consumer scale. Making invisible structures visible starts here — at the silicon layer where the marketing number and the architectural reality diverge.

What the ANE was not designed for: token-by-token autoregressive decode, where each token generation requires a separate pass through the model weights. The graph-atomic execution model means there is no way to stream partial results or interleave generation with other work. Each token is a complete graph execution, and the dispatch overhead that is negligible for a camera pipeline processing 30 frames per second becomes dominant when you need hundreds of dispatch calls per second.

The architecture has a hard boundary. Below 32 MB of working set, the ANE’s on-chip SRAM delivers peak throughput. Above it, performance degrades by roughly 30% as data spills to DRAM. This SRAM cliff shapes what the ANE can and cannot do efficiently, and it appears in no published specification.

Sixteen neural engine cores are present in every M5 variant, unchanged from M4. Apple did not scale the ANE. They scaled something else entirely.

Power efficiency at the silicon layer

Before the scaling story, the efficiency story. At 2.8 watts peak power, the M4 ANE achieves 6.6 TFLOPS/W. The M4 base GPU manages 1.0 TFLOPS/W. For datacenter comparison: the A100 delivers 0.08 TFLOPS/W, and the H100 0.13 TFLOPS/W. The ANE is 50–80× more energy-efficient per floating-point operation than the best datacenter GPUs on the planet.

Not marketing. Physics. But physics applied to a specific computation pattern, accessed through CoreML, which wastes most of it.

CoreML, Apple’s public machine learning framework, adds 2–4× overhead for small operations compared to direct ANE access. The dispatch floor is roughly 0.095 milliseconds per operation, which means that for a 256×256 matrix multiplication taking 0.006ms of actual ANE compute, CoreML’s XPC and IOKit overhead consumes 94% of wall-clock time. There is a 119-compile-per-process ceiling. The scheduler is a black box: the Orion paper (arXiv:2603.06728) cataloged 20 ANE restrictions, 14 of them previously undocumented. Developers cannot force ANE execution, cannot inspect ANE programs, and cannot perform gradient computation on the hardware. And if a single layer in a model is not ANE-compatible, CoreML may fall back to the CPU for the entire model.
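The overhead arithmetic, using the figures above (the split between XPC and IOKit within the dispatch floor is not public):

```python
DISPATCH_FLOOR_MS = 0.095   # CoreML's per-operation dispatch floor
ANE_COMPUTE_MS = 0.006      # actual ANE time for a 256x256 matmul

wall_clock_ms = DISPATCH_FLOOR_MS + ANE_COMPUTE_MS
overhead_fraction = DISPATCH_FLOOR_MS / wall_clock_ms
print(f"{overhead_fraction:.0%}")  # 94% -- dispatch dominates small ops
```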

The hardware is extraordinary. The software surface is the bottleneck.

···

The private API question

Beneath CoreML’s public surface lie more than 40 private Objective-C classes. _ANEClient bypasses CoreML entirely. In-memory model descriptors enable runtime program generation without the compile-to-disk pipeline that CoreML requires. The maderix project demonstrated training on what Apple markets as inference-only hardware — a proof of concept showing that the gap between “inference accelerator” and “general compute” is narrower than Apple’s API surface suggests. The autoresearch-ANE project operationalized this proof of concept into an autonomous experiment loop, Karpathy’s autoresearch ratchet running directly on the Neural Engine via _ANEClient and IOSurface. A dynamic weight pipeline packs weights into IOSurface inputs alongside activations; kernels compile once at startup, and subsequent weight updates require only memory copying rather than recompilation. A 67.6M-parameter GPT model trains at 99ms per step on the ANE, 8× faster than the MPS path on the same silicon.

But building on private APIs is building on sand. Apple can change internal interfaces without notice, and any application relying on _ANEClient is ineligible for the App Store. The private API surface proves capability; it does not constitute a development platform. The engineering reality is a hardware accelerator whose measured performance is world-class and whose accessible performance is mediated by CoreML, which imposes order-of-magnitude overhead on the operations most relevant to modern AI workloads.

The tension is structural, not incidental. Apple designed the ANE for its own consumption (the camera pipeline, Siri, on-device dictation) and exposed CoreML as a managed interface that prioritizes system stability over developer control. For Apple’s own workloads, where the full model graph is known at compile time and optimized end-to-end, the ANE is exactly right. For the open-ended, rapidly iterating world of LLM inference, CoreML becomes the constraint.

What Apple says:          38 TOPS  (INT8)
What the hardware does:   19 TFLOPS (FP16)

INT8 inputs are dequantized to FP16 before compute. The 2× is notational.

Compute efficiency (TFLOPS/W)
  M4 ANE     6.6
  M4 GPU     1.0
  H100 GPU   0.13
  A100 GPU   0.08

The ANE is 50–80× more efficient per FLOP than datacenter GPUs. But CoreML's 2–4× overhead means developers access only a fraction of it.

The ANE tells you what Apple’s silicon CAN do. The M5 tells you what Apple has decided it SHOULD do.

02 · The system

The M5 is an architectural statement disguised as a spec bump. Apple’s Fusion Architecture bonds two 3nm dies into a single system-on-chip. Die 1 carries the CPU, Neural Engine, and I/O controllers. Die 2 carries the GPU, media engines, and memory controllers. The M5 Max doubles Die 2: 40 GPU cores, 614 GB/s of memory bandwidth.

The architectural pivot is not in the die count. It is in what lives inside the GPU cores. Every M5 GPU core contains a Neural Accelerator, dedicated neural compute hardware wired directly into the graphics pipeline and programmable via Metal 4’s new Tensor APIs (MTLTensor, MTL4MachineLearningCommandEncoder, Metal Performance Primitives). AI compute now scales with GPU core count: 10 Neural Accelerators on the M5, 20 on the M5 Pro, 40 on the M5 Max. The fixed-size sidecar era is over: neural compute now grows with the chip.

Apple’s own marketing tells the story. The M5 press release mentions the 16-core Neural Engine exactly once, in a single sentence tied to “Apple Intelligence” consumer features. The Neural Accelerators in GPU cores get the performance claims, the LM Studio name-drop, and the explicit connection between 614 GB/s bandwidth and “higher token generation for LLMs.” The unit of measurement itself shifted: no TOPS figure at all (the metric Apple used from M1 through M4), replaced by “over 4× peak GPU compute for AI.”

The measurement changed because the hardware target changed. And the shift reveals Apple’s strategic bet: the future of on-device AI compute is not a dedicated accelerator sitting idle between camera activations. It is neural compute woven into the GPU fabric, scaling with the silicon, programmable through Metal’s existing developer ecosystem.

This is a bet against Apple’s own prior architecture. The ANE was the answer to “how do we run neural networks on phones without killing the battery?” The GPU Neural Accelerators are the answer to a different question: “how do we make a laptop the primary inference platform for models that did not exist when the ANE was designed?”

Two regimes of inference

This is the conceptual hinge of the entire system-level story, and it is the distinction that most benchmark summaries collapse into a single number.

Prefill is the first phase of LLM inference: processing the entire input prompt at once. It is compute-bound. More FLOPS means faster prefill. The M5 Max’s Neural Accelerators deliver 3.5–4× improvement over the M4 Max. A 10,000-token prompt that took 81 seconds on M4 Max drops to 18 seconds on M5 Max, according to MacStories benchmarks. This transforms long-context workflows from “wait minutes” to “wait seconds.”

Decode is the second phase: generating tokens one at a time. It is bandwidth-bound. Each token requires reading the model weights from memory, and the speed at which you can read determines the speed at which you can generate. M5 Max’s 614 GB/s versus M4 Max’s 546 GB/s yields 12% improvement. Llama 70B at Q4 quantization: roughly 10 tokens per second, up from roughly 7. Real. But modest.
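The decode ceiling falls out of one division: every generated token must stream the active weights through memory once. A hedged estimate — the ~40 GB figure for Llama 70B at Q4 is a round-number assumption (about 4.5 bits per parameter including quantization overhead), and real runtimes land well below the theoretical ceiling:

```python
def decode_ceiling_toks(bandwidth_gbs: float, weight_gb: float) -> float:
    """Theoretical decode ceiling: every token streams all active weights once."""
    return bandwidth_gbs / weight_gb

LLAMA_70B_Q4_GB = 40.0  # assumption: ~4.5 bits/param with quantization overhead

m4_max = decode_ceiling_toks(546, LLAMA_70B_Q4_GB)  # ~13.7 tok/s ceiling
m5_max = decode_ceiling_toks(614, LLAMA_70B_Q4_GB)  # ~15.4 tok/s ceiling
# Measured: ~7 and ~10 tok/s. Runtimes sit well below the ceiling, but the
# bandwidth ratio bounds the attainable improvement regardless of compute.
print(round(m5_max / m4_max - 1, 3))  # 0.125 -- the ~12% from the text
```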

The marketing headline — “4× faster AI” — leads with prefill because the number is bigger. But prefill happens once per prompt. Decode determines the ongoing experience. For agentic workflows that process large context windows and RAG pipelines that ingest long documents, the prefill improvement is transformative: a 10,000-token system prompt that used to take over a minute now processes in under twenty seconds, making tool-use patterns with rich context practical for the first time on local hardware. For conversational chat, the 12% decode bump is what you feel. The difference between 7 tok/s and 10 tok/s is real but barely perceptible — both are below the roughly 15 tok/s threshold where output feels fluid rather than labored.

The distinction matters for purchasing decisions, for benchmark interpretation, and for understanding where the next meaningful improvement will come from. If you are compute-bound (prefill-heavy workloads), the M5 is a generational leap. If you are bandwidth-bound (decode-heavy workloads), the M5 is an incremental step, and the next meaningful jump requires either more memory bandwidth (M5 Ultra, projected late 2026) or architectural changes in the models themselves that reduce per-token memory reads.

Prefill (compute-bound)
  Key metric: TFLOPS
  M4 → M5: 81s → 18s for a 10K-token prompt
  Happens once per prompt

Decode (bandwidth-bound)
  Key metric: GB/s
  M4 → M5: ~12% (7 → 10 tok/s on a 70B model)
  Happens for every token generated

The marketing leads with prefill. You feel decode.

03 · The hidden variable

On identical M5 Max hardware, running the same model at the same quantization level, runtime choice produces 2–3× variation in measured performance. MLX, Apple’s open-source research framework, runs 20–30% faster than llama.cpp on Apple Silicon via zero-copy unified memory access and optimized Metal compute shaders. CoreML adds 2–4× overhead for small operations. The projected 22–32 tok/s via MLX versus the measured 10 tok/s via llama.cpp’s GGUF path: same chip, same model, same quantization — the framework is the variable.

This variation is invisible to anyone reading a single benchmark number. “The M5 Max runs Llama 70B at 10 tok/s” is a statement about one runtime’s performance on one hardware configuration. It is not a statement about the hardware’s capability.

···

Independent validation

The Ziskind benchmark suite provides the most rigorous independent measurements of M5 Max performance. Stream Triad testing measured 351 GB/s sustained memory throughput, 13% above the M4 Max and exceeding the M3 Ultra desktop chip’s 337 GB/s. A laptop outperforming Apple’s own desktop silicon on sustained bandwidth.

On prefill, the 4× claim holds. Gemma 34B at Q4 quantization: 4,468 tok/s on M5 Max versus 1,855 on M4 Max versus 2,959 on M3 Ultra. The M5 Max in a laptop beats the M3 Ultra desktop on compute-bound inference.

On decode, the bandwidth-bound phase, the hierarchy reasserts itself exactly as the physics predicts. Token generation on dense models: 65 tok/s on M5 Max versus 61 on M4 Max versus 82 on M3 Ultra. More bandwidth, more tokens. The M3 Ultra’s 819 GB/s gives it the edge that no amount of compute optimization can overcome when the bottleneck is memory reads.

The competitive landscape

The comparison between Apple Silicon and the competition is a function of model size, and the picture inverts at 30 billion parameters.

Below 30B, NVIDIA discrete GPUs dominate. The RTX 5090 delivers 1,790 GB/s of memory bandwidth and generates tokens 3–5× faster than the M5 Max. If the model fits in 32GB of VRAM, Apple’s value proposition on price-performance collapses.

Above 70B, the M5 Max has no laptop-class competition. Its 128GB of unified memory loads Llama 70B at Q6 quantization (roughly 55GB) entirely in fast memory, with room to spare. NVIDIA’s DGX Spark matches on capacity at 128GB but delivers less than half the bandwidth at 273 GB/s. AMD’s Strix Halo offers the same memory, less than half the bandwidth, and the lowest price of the group at $2,348, the price-performance dark horse.

The crossover zone, 30–50B parameters, is where the architectural decision becomes contingent on workload rather than settled by specs. A 35B model at Q4 requires 18GB of memory. It fits on an RTX 5090 with room to spare, and the 5090’s raw bandwidth advantage delivers meaningfully faster decode. But add a 128K context window and the KV cache pushes total memory requirements past 32GB. Suddenly the model that “fits” on the GPU no longer fits with the context it needs. Unified memory architectures do not have this cliff: 128GB is 128GB, shared flexibly between weights, KV cache, and operating system overhead.
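The cliff is easy to reproduce with a back-of-envelope KV cache calculation. The layer and head counts below are illustrative assumptions for a 35B-class dense model, not any specific model’s published config:

```python
def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per=2):
    """FP16 KV cache size: K and V, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

# Assumed 35B-class shape: 48 layers, 8 GQA KV heads of dim 128, FP16 cache.
cache = kv_cache_gb(tokens=128_000, layers=48, kv_heads=8, head_dim=128)
total_gb = 18.0 + cache   # plus the 18 GB of Q4 weights from the text
print(round(cache, 1), round(total_gb, 1))  # 25.2 43.2
assert total_gb > 32  # the model "fits" in 32 GB of VRAM; the context does not
```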

No single M5 Max specification is best in class. The moat is the combination: 128GB unified memory, 614 GB/s bandwidth, 40 Neural Accelerators, laptop form factor, and 50-watt power draw. No competing device packages all of these together.

Competitive landscape by model size

                 M5 Max         RTX 5090       DGX Spark       Strix Halo
Memory           128 GB         32 GB          128 GB          128 GB
Bandwidth        614 GB/s       1,790 GB/s     273 GB/s        ~256 GB/s
70B Q4 decode    ~10–15 tok/s   Cannot load    ~5–7 tok/s      ~6–8 tok/s
8B Q4 decode     ~90 tok/s      ~200+ tok/s    ~40 tok/s       ~35 tok/s
Power            50W            575W           200W            120W
Price            ~$5,000        ~$2,000        $4,699          $2,348
Form factor      Laptop         Desktop GPU    Small desktop   Laptop

The picture flips at ~30B parameters.

Thermal reality

The Fusion Architecture’s chiplet design thermally decouples the CPU and GPU tiles, an improvement over monolithic dies under simultaneous load. But the 14-inch chassis still throttles. CPU sustained power drops from roughly 75W to 50W as the SoC warms. GPU peaks at 80W briefly before settling to a lower sustained level.

Token generation, being bandwidth-bound rather than compute-bound, does not stress the thermal envelope. Prefill IS compute-bound and will hit thermal limits on the 14-inch chassis during long-context processing. The 16-inch MacBook Pro is not a luxury upgrade. It is a design constraint for sustained AI workloads.

A detail that speaks to the broader efficiency story: full system idle power is 7.1W, down from the M4 Max’s 7.6W. Even at rest, the architecture is becoming more efficient.

04 · Beyond one machine

What happens when one machine is not enough?

EXO Labs provides the most complete open-source answer: a framework for turning heterogeneous consumer devices into a unified inference cluster. Apache 2.0 licensed, 42,000 GitHub stars, featured at Apple’s own NeurIPS booth running DeepSeek v3.2 at 25 tok/s across four M3 Ultra Mac Studios.

The conventional wisdom holds that distributed consumer inference always pays a crippling latency tax. Thunderbolt 5 RDMA reverses that equation. Remote Direct Memory Access allows one machine’s GPU to read another machine’s memory directly, bypassing the kernel and reducing inter-device latency from approximately 300 microseconds over TCP to 3–50 microseconds over RDMA. Jeff Geerling’s December 2025 benchmarks on a cluster of four M3 Ultra Mac Studios with 1.5TB total memory demonstrated the implications: Qwen3 235B at 31.9 tok/s, DeepSeek V3.1 671B at 32.5 tok/s, Kimi K2 Thinking with its trillion parameters at 28.3 tok/s. For comparison, llama.cpp over standard TCP degrades with added nodes — 20.4 tok/s dropping to 15.2 tok/s on Qwen3 235B — because TCP’s 300μs latency accumulates at every synchronization point.

The Spark+Mac hybrid makes the prefill/decode split architecturally concrete. NVIDIA’s DGX Spark has 100 TFLOPS of compute but only 273 GB/s of bandwidth, a compute-to-bandwidth ratio of 366 FLOP/byte. The M3 Ultra has 26 TFLOPS but 819 GB/s, a ratio of 31.7 FLOP/byte. EXO routes prefill to the Spark (compute-rich) and decode to the Mac (bandwidth-rich), achieving a 2.8× total speedup. The implementation uses layer-by-layer KV cache streaming rather than bulk transfer, so communication overlaps computation.
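The routing decision follows from the ratios themselves:

```python
def flop_per_byte(tflops: float, gbs: float) -> float:
    """Compute-to-bandwidth ratio: TFLOPS (FLOP/s) over GB/s (byte/s)."""
    return tflops * 1e12 / (gbs * 1e9)

spark = flop_per_byte(100, 273)   # ~366 FLOP/byte: compute-rich -> gets prefill
ultra = flop_per_byte(26, 819)    # ~31.7 FLOP/byte: bandwidth-rich -> gets decode
print(round(spark), round(ultra, 1))  # 366 31.7
```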

But the M5 Max itself narrows the window where clustering makes economic sense. Single-machine 70B inference is now marginally viable at 11–15 tok/s. The M5 Ultra — confirmed through firmware leaks (chip identifier T6052/H17D in iOS 26.3), Gurman at Bloomberg, and Kuo at TF International Securities, with WWDC in June the expected reveal — would push single-machine inference into territory that currently requires multi-node clusters. If the Ultra follows Apple’s established formula of doubling the Max, the projected specs are roughly 80 GPU cores, 80 GPU Neural Accelerators, 1,228 GB/s memory bandwidth, and 256–512 GB of unified memory.

The inference math changes qualitatively, not just incrementally. At 256 GB, Qwen 3.5’s 397B MoE model fits entirely in memory — all expert weights resident, only the 17B active parameters read per token. The projected decode throughput: 60–74 tok/s, calibrated against measured M4 Max and M3 Ultra efficiency baselines. A frontier-class MoE model at conversational speed on a single desktop. At 512 GB, DeepSeek V3 at 671B parameters becomes viable at 27–33 tok/s, essentially matching Geerling’s four-node M3 Ultra cluster at one-quarter the cost and zero cluster management overhead.

The distributed inference argument does not disappear. It narrows. EXO clustering retains its advantage for models exceeding 512 GB, for multi-user serving where pipeline parallelism scales throughput across nodes, and for heterogeneous prefill/decode routing where compute-rich and bandwidth-rich machines complement each other. But for a single user running a single model that fits in memory, the M5 Ultra would collapse the rationale for a $40,000 cluster into a $10,000 desktop. The models that demand distributed consumer inference keep growing, but so does the single-machine ceiling — and the ceiling is rising faster.

The ecosystem is converging accordingly: MLX for compute, JACCL for inter-device communication, Metal 4 for transparent GPU access.

The inference machine thesis

Aakash Gupta’s framing crystallizes what the spec sheet implies: “This is Apple designing silicon around one assumption — the primary workload for a pro laptop in 2026 is running LLMs locally.”

The pricing is evidence. The $200 “price increase” on the M5 Pro is functionally zero: Apple doubled the base storage to absorb it. The inference configurations, 128GB unified memory for loading 70B+ models, start at $5,000. Apple name-dropped LM Studio, a third-party local inference application, in its own press materials. The M5 press release leads with AI compute metrics before mentioning video editing.

None of this proves the thesis. All of it is consistent with it.

No single spec is best in class; the package is.

···
05 · Convergent design

The hardware defines what is possible. The question is what runs on it, and whether the models being built elsewhere happen to fit the constraints Apple’s silicon imposes.

Qwen 3.5, released by Alibaba in 2026, provides the most instructive case study. The architecture replaces 75% of standard attention layers with Gated DeltaNet, a recurrent mechanism that maintains a fixed-size state matrix instead of a growing key-value cache. The core equation:

S_t = α_t · S_{t-1} · (I − β_t k_t k_t^T) + β_t v_t k_t^T

Two complementary mechanisms operate within this update rule. The decay gate α_t clears memory globally during context switches — a soft reset when the model determines the prior state is no longer relevant. The delta rule (I − β_t k_t k_t^T) performs targeted surgical updates to specific key-value pairs, the mathematical equivalent of applying one step of online stochastic gradient descent to the model’s state on every token. The state is not merely stored. It is continuously refined.
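The update rule transcribes directly into NumPy. A toy sketch: the (d_v × d_k) state layout, the vector conventions, and the scalar gates are illustrative assumptions here, not Qwen’s actual implementation:

```python
import numpy as np

def gated_deltanet_step(S, k, v, alpha, beta):
    """One state update: S_t = alpha * S @ (I - beta k k^T) + beta v k^T.
    S is (d_v, d_k); k, v are 1-D vectors; alpha, beta are scalar gates in [0, 1]."""
    d_k = k.shape[0]
    erase = np.eye(d_k) - beta * np.outer(k, k)       # delta rule: targeted erase along k
    return alpha * S @ erase + beta * np.outer(v, k)  # decay old state, write new pair

# Toy check: write (k, v) into an empty state, then read back along k.
d = 4
k = np.zeros(d); k[0] = 1.0          # unit-norm key
v = np.arange(1.0, d + 1)            # value to associate with k
S = gated_deltanet_step(np.zeros((d, d)), k, v, alpha=1.0, beta=1.0)
print(np.allclose(S @ k, v))  # True: the readout S k recovers the stored value
```

With alpha below 1 the whole state decays; with beta below 1 the write is partial — the "soft reset" and "surgical update" of the prose are just these two scalars.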

Each Gated DeltaNet layer maintains approximately 32 KB of state per attention head, regardless of sequence length. Whether the model has processed 1,000 tokens or 128,000, the memory footprint of a DeltaNet layer is identical. Standard attention, by contrast, accumulates a KV cache that grows linearly with every token processed, and that cache must be read from memory on every subsequent generation step.

The bandwidth arithmetic makes the architectural consequence concrete. Qwen 3.5’s 27B model has 48 DeltaNet layers and 16 full attention layers. At 128K context length in FP16 precision, the 16 attention layers require 52.4 GB of KV cache. The 48 DeltaNet layers require 25.2 MB. A 2,000× difference. Without the hybrid architecture, 128K context on a 27B model would demand over 200 GB of memory — impossible on any consumer machine. With it, the model fits comfortably in 128GB unified memory with room for the weights themselves.
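A rough reconstruction of that arithmetic. The head configuration below is an assumption (the text does not give Qwen 3.5 27B’s exact shapes), chosen so the totals land near the stated 52.4 GB and 25.2 MB; the point is the orders of magnitude:

```python
tokens = 128 * 1024
attn_layers, d_kv, fp16_bytes = 16, 48 * 128, 2              # d_kv is an assumption
kv_cache_gb = attn_layers * tokens * 2 * d_kv * fp16_bytes / 1e9  # K and V

delta_layers, delta_heads, state_kb = 48, 16, 32             # 16 heads assumed
delta_state_mb = delta_layers * delta_heads * state_kb / 1024     # context-independent

print(round(kv_cache_gb, 1), round(delta_state_mb, 1))  # 51.5 24.0
print(round(kv_cache_gb * 1e9 / (delta_state_mb * 1e6)))  # on the order of 2,000x
```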

The alignment is accidental. Alibaba never mentions Apple Silicon in any Qwen 3.5 documentation. The architecture was designed for datacenter training economics and long-context scaling. The timeline flows unambiguously from research to cloud deployment to edge availability. But the architecture has emergent properties that align with edge deployment because both environments share the same fundamental physical bottleneck: memory bandwidth per token.

The 3:1 ratio of DeltaNet layers to full attention layers was empirically tuned for quality: the Kimi Linear ablation study showed that 3:1 achieves the lowest validation loss of any ratio tested. It simultaneously happens to be the ratio that makes 128K+ context fit in consumer memory. This is convergent optimization under shared physical constraints, not intentional collaboration between a chip designer in Cupertino and a model architect in Hangzhou.

Mixture of Experts compounds the advantage. The 397B flagship activates only 17B parameters per token — 10 routed experts plus 1 shared expert out of 512 total. At Q4 quantization, per-token memory reads are 8.5 GB. On M5 Max at 614 GB/s, the theoretical ceiling is roughly 72 tok/s. The 35B-A3B variant activates only 3B parameters per token — Ziskind chose this model for his M5 Max benchmarks because it is designed for this hardware profile.
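The ceiling arithmetic, hedged by ignoring KV cache reads and quantization overhead:

```python
active_params_b = 17          # 10 routed + 1 shared experts, of 512 total
bytes_per_param = 0.5         # Q4: ~4 bits per parameter
per_token_gb = active_params_b * bytes_per_param  # 8.5 GB read per token

bandwidth_gbs = 614           # M5 Max
ceiling = bandwidth_gbs / per_token_gb
print(round(ceiling, 1))  # 72.2 tok/s theoretical decode ceiling
```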

The smaller dense models punch above their weight. The 9B model scores 81.7 on GPQA Diamond. There is a disputed baseline — GPT-OSS-120B was reported at 71.5 in one evaluation and 80.1 by VentureBeat, which would narrow Qwen 3.5 9B’s margin from 10.2 points to 1.6. But Artificial Analysis independently rates it the highest Intelligence Index among all sub-10B models.

The convergence is not limited to Qwen. Kimi Linear independently arrived at a 3:1 hybrid ratio. Granite 4.0 pushes to 9:1. RecurrentGemma uses approximately 2:1. RWKV-7 eliminates full attention entirely. The cautionary counterexample: MiniMax abandoned hybrid linear attention after quality degradation on complex multi-hop reasoning at scale. The ratio is empirical, not theoretically derived, and may not hold as models grow larger.

The thesis is not that these models were built for Apple Silicon. It is that the physical constraints shaping datacenter economics and the physical constraints shaping consumer hardware are converging, and architectures designed to navigate one set of constraints have emergent properties that fit the other. The datacenter architect trying to serve 1,000 users from one GPU cluster faces the same per-token bandwidth constraint as the laptop user trying to run one model locally. Both benefit from architectures that read fewer bytes per token. The alignment will deepen as both environments continue to be bandwidth-limited.

There is a framework dimension to this convergence as well. MLX runs Qwen 3.5’s DeltaNet layers 2× faster than llama.cpp on Apple Silicon, because MLX’s zero-copy unified memory architecture is particularly well-suited to the recurrent state updates that DeltaNet requires. llama.cpp’s DeltaNet implementation is acknowledged as unoptimized (GitHub issue #20225), and the performance gap may narrow. But as of early 2026, the runtime you choose to run a given model on Apple Silicon produces as much variation as the model architecture itself — a hidden variable stacked on top of a hidden variable.

06 · The ratchet

The hardware and the models define what can run on a laptop. The next question is what happens when you close the loop — when the machine is not just running inference but running experiments.

Andrej Karpathy’s autoresearch is 630 lines of mutable Python training code, an immutable evaluation harness, and a Markdown file called program.md. The human writes the program specification. The agent does everything else. The mechanism is a git-based ratchet: branch from main, modify train.py, commit the change, train for a fixed five-minute budget, evaluate against val_bpb (validation bits-per-byte), and decide. If the metric improved, merge to main. If it did not, git reset HEAD~1 and try again.

The design philosophy (one GPU, one file, one metric) is deliberate and instructive. Compare it to Sakana AI’s “AI Scientist,” which attempts to automate the full research lifecycle from hypothesis generation through paper writing. Sakana’s approach produces 42% experiment failure rates and results that include hallucinated findings — the system generates plausible-sounding results that were never measured. Autoresearch cannot hallucinate because val_bpb is measured on a held-out validation set, not generated by the model. The distinction is architectural: autoresearch separates the mutable part (the training code the agent modifies) from the immutable part (the evaluation harness the agent cannot touch). The ratchet only moves forward, and it only moves forward on evidence.
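The control flow fits in a few lines. A sketch with training stubbed out: the real loop shells out to git and runs train.py for five minutes, while here the accepted code state and a simulated val_bpb stand in, to isolate the commit-or-revert logic:

```python
import random

def run_experiment(code_state, rng):
    """Stand-in for: branch, modify train.py, train 5 minutes, measure val_bpb."""
    candidate = code_state + 1                             # "a mutation of train.py"
    val_bpb = 1.0 / (1 + candidate) + 0.3 * rng.random()   # lower is better
    return candidate, val_bpb

def ratchet(steps=20, seed=0):
    rng = random.Random(seed)
    main_state, best_bpb = 0, float("inf")
    for _ in range(steps):
        candidate, val_bpb = run_experiment(main_state, rng)
        if val_bpb < best_bpb:                   # improved: "merge to main"
            main_state, best_bpb = candidate, val_bpb
        # else: "git reset HEAD~1" -- candidate discarded, main untouched
    return main_state, best_bpb

state, best = ratchet()
print(state, round(best, 3))  # best_bpb only ever decreases: the ratchet property
```

The immutability of the harness corresponds to `run_experiment` never being allowed to touch the comparison in `ratchet` — the gate measures; the agent proposes.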

The scope constraint is equally important. Autoresearch does not attempt to write papers, generate hypotheses about the broader field, or claim significance for its findings. It optimizes one metric on one dataset by modifying one file. The agent’s ambition is mechanically bounded by the system’s design — a property that Karpathy chose deliberately and that most autonomous research systems lack.

On an H100, the loop runs 12 experiments per hour. On Apple Silicon via MLX, 8–9. The throughput difference matters less than the accessibility: autoresearch on an H100 requires cloud GPU access at $2–3 per hour. On a MacBook Pro, it requires electricity.

The ratchet pattern

  01 Read      program.md + train.py
  02 Branch    git checkout -b
  03 Modify    edit train.py
  04 Train     5-minute budget
  05 Evaluate  val_bpb
  06 Decide    improved → merge to main; no improvement → git reset, retry

The same loop structure recurs across all three layers: the pattern is invariant, the instantiation varies. Each instance discovers hardware- or constraint-specific optima invisible from the spec sheet. And the metric gate is only as good as the metric: the seed change 42→137 passed via evaluation-set overfitting.

Three instantiations of the ratchet:

  autoresearch (architecture): mutable train.py; immutable val_bpb harness; budget of 5 minutes wall-clock; ~8–12 experiments/hour. Finds depth-4 on GPU, depth-6 on ANE, depth-8 on H100.
  AutoKernel (kernel): mutable kernel.py; immutable bench.py (5-stage); budget of ~90 seconds per kernel; ~40 experiments/hour. Reaches 80–95% of cuBLAS on tuned kernels.
  autoresearch-sample-efficiency (constraint): mutable train.py; immutable val_bpb harness; budget of 10M tokens; ~50 experiments overnight. Yields a 14% validation loss reduction, with different optima than time-budgeted runs.

Hardware-specific optima

The MLX port of autoresearch found that depth-4 models dramatically outperform depth-8 within a fixed five-minute training window on Apple Silicon.

The reason is structural, not incidental. Lower throughput means fewer forward passes per minute. Each pass must therefore count more. Fewer, wider layers with more optimizer steps beat deeper networks that cannot complete enough training iterations in the time budget. The optimization landscape is shaped by the hardware’s throughput characteristics — the loop did not know this in advance. It discovered it by running on the substrate.

Change the substrate, change the optimum. The same algorithm, given the same objective, finds fundamentally different solutions depending on what hardware it runs on. This is not a limitation — it is a finding. The loop adapts to what the silicon actually does, not what the spec sheet says it does.

Running the autoresearch-ANE fork on the ANE rather than the GPU, the optimal depth shifts to six layers at sequence length 512 — different from the GPU’s depth-4, different from the H100’s depth-8. The ANE’s performance breakdown (33% compute, 30% IO, 37% CPU overhead) reveals a different bottleneck profile than the GPU path: CPU overhead from the private API dispatch dominates, not compute throughput. At sequence length 1024, the SRAM cliff from Act I reappears — the same 32 MB boundary that limits the ANE’s inference throughput also constrains its training working set. Three compute targets on the same chip. Three different optima. The algorithm adapts to the silicon; the silicon determines what the algorithm can find.

trevin-creator’s follow-up project, Tiny-Lab, takes the pattern further: a dedicated experiment runner for Apple Silicon with a Claude-driven hypothesis queue, structured single-variable experimental design, and a JSONL ledger tracking WIN/LOSS/INVALID outcomes. Where autoresearch is a script you fork, Tiny-Lab is infrastructure: lanes, a surface CLI, dual evaluators (NumPy + MLX cross-check) scoring against a held-out TinyStories slice. The pattern is maturing from ad-hoc port to purpose-built experimental environment, which is itself an instance of the bottleneck shift described next.

The ratchet at the kernel layer

AutoKernel applies the same propose-benchmark-verify-merge loop to GPU kernel optimization rather than model architecture search. The system profiles a PyTorch model, identifies bottleneck kernels using Amdahl’s law prioritization — a 1.5× speedup on a 60% kernel beats a 3× speedup on a 5% kernel — isolates each as a standalone Triton kernel, then runs an AI agent through edit-benchmark-verify cycles. Each experiment takes 90 seconds: 40 iterations per hour, 320 overnight across multiple kernels. The verification pipeline has five stages: smoke tests, shape sweeps, numerical stability checks, determinism validation, and edge case verification against PyTorch references. Well-tuned Triton kernels regularly reach 80–95% of cuBLAS performance.
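The Amdahl prioritization in that claim checks out numerically:

```python
def amdahl(fraction, speedup):
    """Overall speedup from accelerating `fraction` of total runtime by `speedup`x."""
    return 1 / ((1 - fraction) + fraction / speedup)

big_kernel = amdahl(0.60, 1.5)    # 1.5x on a kernel taking 60% of runtime
small_kernel = amdahl(0.05, 3.0)  # 3x on a kernel taking 5% of runtime
print(round(big_kernel, 2), round(small_kernel, 2))  # 1.25 1.03
```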

The structural parallel to Karpathy’s autoresearch is exact: one mutable file (kernel.py), one immutable evaluation harness (bench.py), automated commit-or-revert based on measured improvement. The ratchet only moves forward, and it only moves forward on evidence. But the abstraction layer is different: kernel implementation rather than model architecture. And the hardware is different: NVIDIA H100, A100, RTX 4090 rather than Apple Silicon. Same pattern, different substrate, different optima. The supported kernel types — matmul, flash attention, fused MLP, rotary embedding, and five others — cover the operations that modern transformer inference actually bottlenecks on.

The ratchet is a structure, not a technique. It works wherever you can separate what the agent modifies from how you measure improvement.
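The separation of mutable artifact from immutable gate is small enough to sketch. The following is a minimal, hypothetical ratchet loop, not the actual autoresearch or AutoKernel code; `propose` and `evaluate` are stand-ins for the agent and the harness:

```python
import shutil

def ratchet(candidate_path: str, evaluate, propose, rounds: int = 10) -> float:
    """Minimal commit-or-revert loop: one mutable artifact, one fixed metric gate.

    evaluate(path) -> float  lower is better; the immutable harness.
    propose(path)            mutates the artifact in place; the agent.
    """
    best = evaluate(candidate_path)
    for _ in range(rounds):
        shutil.copy(candidate_path, candidate_path + ".bak")  # checkpoint
        propose(candidate_path)
        score = evaluate(candidate_path)
        if score < best:
            best = score  # commit: the ratchet advances on evidence
        else:
            shutil.copy(candidate_path + ".bak", candidate_path)  # revert
    return best
```

Everything the essay describes, from model architecture search to kernel tuning to agent evaluation, instantiates this loop with a different `candidate_path` and a different `evaluate`.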

The ratchet under different constraints

Paras Chopra’s autoresearch-sample-efficiency fork changes one variable in Karpathy’s original design: the stopping condition. Instead of a fixed five-minute wall-clock budget, it imposes a fixed 10-million-token data budget. Training halts when total_tokens >= TOKEN_BUDGET rather than training_time >= TIME_BUDGET.
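The one-variable change is easy to state in code. A sketch of the token-budgeted loop, with names that mirror the fork's description rather than its source:

```python
TOKEN_BUDGET = 10_000_000  # fixed data budget replaces the fixed time budget

def train_token_budgeted(step_fn, batch_tokens: int):
    """Stop on data seen, not wall clock. A slow-but-sample-efficient
    architecture is no longer penalized: every candidate sees exactly
    the same number of tokens. step_fn() runs one optimizer step."""
    total_tokens = 0
    steps = 0
    while total_tokens < TOKEN_BUDGET:
        step_fn()
        total_tokens += batch_tokens
        steps += 1
    return steps, total_tokens
```

Under the time budget, `while time() < deadline` silently rewards throughput; under this loop, only learning per token moves the metric.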

Time-budgeted runs reward architectures that process data fastest — throughput is the lever. Token-budgeted runs reward architectures that extract the most learning per sample — sample efficiency is the lever. The agent explores regularization strategies, model sizing, and batch scheduling rather than speed optimizations. About 50 experiments run overnight. The reported result: 14% validation loss reduction versus baseline.

This is the “change the substrate, change the optimum” thesis applied to the constraint framing itself. The substrate is not the silicon — it is the budget. Hold compute constant, vary the budget dimension, and the optimization landscape reshapes. Time budgets on Apple Silicon discovered depth-4 as optimal. Token budgets on the same hardware would discover a different architecture entirely, because the selection pressure is different.

The fork also makes experiments comparable across hardware platforms: a 10M token budget produces the same data exposure on an M4 as on an H100, isolating architectural quality from throughput advantage. This is a methodological contribution, not just a technical one.

Persistent agents on local hardware

Nous Research’s Hermes Agent is not “run a model and get a response.” It is a persistent agent that accumulates knowledge as reusable skill documents, becoming more capable over time. Running via llama.cpp on Apple Silicon with a Qwen model, it fits the entire stack, silicon to framework to model to agent, on one machine with no cloud dependency.

The skill-document loop mirrors autoresearch’s ratchet. Both accumulate structured knowledge. Both move forward monotonically. The difference is scope: autoresearch optimizes one training script; Hermes optimizes its own capability surface across arbitrary tasks. And both share a property that cloud-dependent agents cannot match: deterministic availability. The agent runs when you open the laptop. It does not depend on API rate limits, network connectivity, or a provider’s pricing decisions. The entire inference stack, from transistor to persistent agent, runs on hardware you can carry in a backpack.

Where it fails

The ratchet has failure modes that its elegance can obscure.

The random seed incident: the agent’s “improvement” was changing seed 42 to 137, achieving lower validation loss through what amounted to evaluation-set overfitting. The metric gate is only as good as the metric. Failure rates are high: 26 of 35 experiments crashed on one M4 Mini run. The five-minute budget constrains useful scale to 10 million parameters. And GitHub Issue #22, titled “Low creativity,” captures the deeper limitation: the agent mostly tweaks hyperparameters rather than exploring novel architectures. As Karpathy put it, “The LLM feels unwilling to creatively pursue a research direction.”

Hermes’s capability depends entirely on the quality of the model driving it — which brings the argument full circle, back to the stack beneath. The agent layer cannot outrun the model layer, which cannot outrun the framework layer, which cannot outrun the silicon. Each layer inherits the constraints of everything below it. A brilliant program.md running on a model that hallucinates evaluations produces nothing. A sophisticated skill-document architecture running on a model too small for multi-step reasoning produces noise that accumulates rather than knowledge that compounds.

07THE BOTTLENECK SHIFT

The bottleneck shift

The human’s role is shifting from running experiments to designing experimental environments.

···

Compute used to be the constraint. For decades, the question “how fast can we train?” dominated machine learning research, and the answer was always “get more FLOPS.” Then bandwidth became the constraint. The transition from training-bound to inference-bound workloads moved the bottleneck from compute to memory: how fast model weights can be read determines how fast tokens can be generated, and no amount of additional FLOPS changes that equation.
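That equation can be written down. A back-of-envelope roofline sketch for memory-bound decode, with an illustrative rather than measured model size:

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Roofline-style bound for token-by-token decode: each generated token
    must stream the active weights through memory once, so
    bandwidth / bytes-per-token caps tok/s no matter how many FLOPS exist."""
    return bandwidth_gb_s / weight_gb

# Illustrative: a 27B dense model at ~4-bit quantization (~14 GB of weights)
# on 614 GB/s of bandwidth is capped near 44 tok/s.
bound = decode_tok_s_upper_bound(614, 14)
```

Measured throughput lands below this ceiling (kernel overheads, KV-cache reads), which is why the essay's later figure of 20+ tok/s for a 27B model is consistent with the bound rather than contradicted by it.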

Now the constraint is shifting again, upward through the stack. Model architecture is solving the bandwidth problem: Qwen 3.5’s linear attention layers reduce per-token memory reads by orders of magnitude. Agent design is the next bottleneck: autoresearch’s program.md and Hermes’s skill documents are the specification layer that determines what the hardware-model-framework stack actually does. And above that, the human’s specification of what the system should optimize — the design of the experimental environment rather than the running of experiments within it.

Each upward shift does not eliminate the layer below. The silicon still matters. The framework still matters. The model architecture still matters. But each becomes table stakes rather than the differentiator. The M5 Max’s 614 GB/s is necessary for 70B inference but not sufficient for useful 70B inference — you also need the right quantization, the right framework, and a prompt worth computing.

The full inference stack has five layers: silicon, framework, system, model, application. At each boundary between layers, something is lost or distorted. CoreML imposes 2–4× overhead between silicon and framework. The prefill/decode distinction is collapsed into a single “4× faster” between framework and system. “Convergent optimization” is marketed as “designed for on-device” between system and model. And the gap between model capability and useful application (program.md, skill documents, the human’s experimental design) is where the bottleneck now lives.

The inference stack — presented vs. actual
Layer 0: Silicon
  Presented as: 38 TOPS
  Actually is: ANE: 19 TFLOPS, 2.8W, convolution engine
  (boundary loss: CoreML overhead: 2–4×, or private API risk)
Layer 1: Framework
  Presented as: Optimized for Apple Silicon
  Actually is: MLX / CoreML / llama.cpp — 2–3× spread on identical hardware
  (boundary loss: prefill vs. decode: 4× vs. 12%)
Layer 2: System
  Presented as: 4× faster AI
  Actually is: M5 Max: 614 GB/s, 128GB, Fusion Architecture, 40 Neural Accelerators
  (boundary loss: convergent optimization, not co-design)
Layer 3: Model
  Presented as: Designed for on-device
  Actually is: Qwen 3.5: Gated DeltaNet, 3:1 hybrid, MoE 17B/397B active
  (boundary loss: program.md / skill documents)
Layer 4: Application
  Presented as: AI-powered workflows
  Actually is: Autoresearch ratchet, Hermes persistent agent

Understanding any single layer is insufficient. The tok/s number at the top of the stack is the product of every layer beneath it, and each layer’s real behavior diverges from its presented behavior. You cannot reason about on-device AI capability from TOPS alone, from tok/s alone, from GB/s alone. The full picture emerges only when you see how silicon constraints propagate through frameworks into application behavior, and where, at each boundary, the presentation and the architecture come apart.

08THE INFERENCE MACHINE

The inference machine

Silicon leads. Form factor follows. The M1 shipped in the old chassis in 2020. The redesign came a year later. M2 through M4 Pro and Max used that 2021 chassis for four generations. The M5 ships AI-optimized silicon in the same shell. The M6, expected late 2026, is projected to bring OLED, touchscreen, and a new industrial design. Apple treats silicon cadence and chassis cadence as separate tracks, a patience that most hardware companies cannot afford and most consumers do not notice.

The ANE question, reframed as a framework bottleneck rather than a hardware limitation, may have a resolution on a specific timeline. The ANE hardware is extraordinary: 6.6 TFLOPS/W, 50–80× more efficient per FLOP than datacenter GPUs. The M5 ANE actually got faster, measured at 19.9 TFLOPS versus the M4’s 15.8 TFLOPS, according to Weinbach at Creative Strategies. But CoreML wastes most of it. The 2–4× overhead, the black-box scheduling, the 20 documented restrictions — all of this is the framework, not the silicon.

Weinbach’s characterization captures the disconnect precisely: the ANE is “a block-level accelerator” best suited for dense, high-occupancy matmul-heavy shapes — not “a low-latency engine for tiny, irregular inference steps” like token-by-token decode. The M5 press release confirms the organizational signal. The 16-core Neural Engine gets one sentence tied to Apple Intelligence consumer features. The Neural Accelerators in GPU cores get the performance claims and the LM Studio name-drop. Apple dropped the TOPS metric entirely, replaced by “4× peak GPU compute for AI.” The measurement changed because the hardware target changed.

Apple may be approaching the framework problem from two directions. Metal 4 Tensor APIs give developers transparent GPU-path access to neural compute right now — a public, programmable surface that did not exist before. A rumored CoreAI framework, reported by Mark Gurman at Bloomberg in March 2026, could unify dispatch across the ANE and GPU Neural Accelerators at WWDC. But this is a single-source report with zero technical corroboration: no framework binary, no developer documentation, no Xcode headers, no job postings. Developer Ronald Mannak’s widely seen tweet connecting the CoreML overhead findings to the rumored CoreAI update captures the community mood: the technical pressure is real, and the timeline (WWDC, June 2026) is specific. The ANEMLL project’s expert reaction is more cautious: “What we really need is lower-level ANE access and transparent ANECompiler diagnostics… an XLA/HLO-style compiler path for ANE would be a much stronger foundation than a higher-level unified API.”

The M5 Ultra raises a question the M5 Max’s architecture makes newly interesting. Every prior Ultra bonded two Max dies via UltraFusion. But the M5 Max is already a two-die chiplet, CPU tile and GPU tile bonded via SoIC-MH. If the Ultra bonds two Max packages, the result is four dies on an interposer, unprecedented for Apple Silicon. Apple’s patents describe multi-die interconnects with stitched interposers supporting multiple metal layers, and TSMC’s CoWoS-S5 packaging supports the area required. An alternative theory — that the Fusion Architecture’s modular tile design lets Apple assemble the Ultra from separate CPU and GPU tiles without literally doubling the Max — would explain why Apple skipped the M4 Ultra entirely, shipping Mac Studio with M3 Ultra instead. Whether the M5 Ultra is four dies or a new modular assembly, the answer will shape thermal headroom, yield economics, and whether Apple can exceed the 2× Max formula for the first time. The reveal window is the same as the CoreAI framework: WWDC, June 2026.

The reveal window is three months away. The question is not whether the ANE hardware is capable — 6.6 TFLOPS/W proves that it is. The question is whether Apple gives developers a framework that stops wasting half of it. The 16-core Neural Engine will continue doing what it was built for: Face ID, Live Text, computational photography at zero watts idle. The future of general-purpose on-device ML compute appears to be GPU-integrated, publicly programmable, and scaling with core count — unless WWDC rewrites that trajectory.


The M5 Max is not the best at any single metric. It is the only device that combines 128GB unified memory, 614 GB/s bandwidth, 40 Neural Accelerators, laptop portability, and 50-watt power draw. For models above 30 billion parameters, this combination has no equivalent at any price.

Whether that matters depends on two open questions. First, whether large local models become a primary workflow rather than a curiosity — whether the gap between cloud inference and local inference narrows enough that practitioners choose the machine on their desk over the API in their browser. The privacy argument is real but insufficient on its own; what tips the balance is capability parity at the model sizes that fit in 128GB. Qwen 3.5’s 27B with 128K context, running at 20+ tok/s via MLX, is approaching the threshold where local inference is not merely private but competitive.

Second, whether the software ecosystem (MLX, llama.cpp, Metal 4, and whatever Apple announces in June) matures fast enough to close the gap between measured performance and theoretical performance. Today, the gap between what the hardware can do and what developers can access through public APIs is the single largest source of wasted capability in the stack. The ANE’s 6.6 TFLOPS/W sits behind CoreML’s 2–4× overhead. The GPU Neural Accelerators are programmable via Metal 4 but the tooling is nascent. MLX achieves 20–30% more throughput than llama.cpp on the same hardware, suggesting that even on the GPU path, optimization is far from saturated.

The ratchet pattern extends beyond single machines. If the single-machine loop works — propose, evaluate, merge if improved — the natural question is what a distributed version looks like. Two early projects sketch the answer. Spore provides a peer-to-peer protocol for ML experiments where verifier nodes rerun compatible claims, mismatches trigger signed challenges, and publishers cannot verify their own work — the ratchet’s “immutable evaluation harness” implemented as a gossip network with hardware-aware verification. Hyperspace pushes further: a peer-to-peer agent network where agents train models, critique each other’s output, and share compute with cryptographic verification of claimed resources. Both are early-stage. Neither has the kind of measured performance data that the rest of this essay demands. But they signal that the coordination layer for distributed autonomous experimentation is being built.

The architecture is in place. The software is catching up. And the stack, from transistor to token, is now visible enough to reason about: the prerequisite for building anything useful on top of it.


PSEVALUATION AS ENVIRONMENT

Evaluation as environment

Everything above traced the stack downward, from marketing claim to silicon reality — and then upward through the ratchet pattern as it propagated across hardware, kernels, and constraint framings. One question was left open. The ratchet optimizes model architectures and kernel implementations. What happens when you point it at the agentic systems built on top of the inference stack?

@neural_avb’s evaluation framework for agentic harnesses makes the methodology explicit. Treat each agent module as an isolated black box. Control the independent variables: model choice, system prompt, tool availability. Run against production-logged test cases, preferring deterministic metrics (IOU, precision, recall) over probabilistic ones (LLM-as-judge) wherever possible, because deterministic metrics are cheaper, faster, and reproducible. Record cost, latency, completion tokens, and error rates alongside accuracy, then plot the results and look for the patterns the numbers alone obscure.
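As a concrete instance of a deterministic metric, here is a minimal IOU over retrieved-item sets. This is a sketch, not @neural_avb's implementation:

```python
def iou(predicted: set, expected: set) -> float:
    """Intersection-over-union of predicted vs. ground-truth items.
    Deterministic, cheap, and reproducible: the properties that make it
    preferable to LLM-as-judge scoring when the task permits it."""
    if not predicted and not expected:
        return 1.0  # both empty: perfect agreement by convention
    return len(predicted & expected) / len(predicted | expected)

# Example: a retrieval subagent returns document IDs;
# ground truth comes from production logs.
score = iou({"d1", "d2", "d3"}, {"d2", "d3", "d4"})  # 0.5
```

Running the same test case twice yields the same score, which is exactly what an automated metric gate requires.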

The structure mirrors autoresearch: a mutable component (the agent configuration), an immutable evaluation harness (production test cases with ground-truth outputs), and a metric gate that determines whether changes merge. The loop is not yet automated — the experiments run manually, swapping model names and comparing IOU scores across bar charts and scatter plots. But the architecture is identical to the ratchet. The difference is that the human is still in the loop as the proposer, and the evaluation harness has not yet been handed to an agent.

The Prime RL framing (“environments and evals are two sides of the same coin”) reframes what the evaluation harness actually is. An evaluation harness is an experimental environment: test cases as observations, metrics as reward signal, agent configuration as the action space. The mapping is direct enough that you can run prompt optimization methods or end-to-end RL training on this structure today. The transition from manual evaluation to automated ratchet is not a conceptual leap — it is an engineering task.
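The mapping is direct enough to write down. A hypothetical sketch of a harness exposed as an environment; every name here is illustrative, not an existing API:

```python
from dataclasses import dataclass, field

@dataclass
class EvalEnvironment:
    """An eval harness viewed as an RL-style environment:
    test cases are observations, the metric is the reward signal,
    and the agent configuration is the action space."""
    test_cases: list          # (input, ground_truth) pairs from production logs
    metric: callable          # deterministic scorer, e.g. IOU
    history: list = field(default_factory=list)

    def step(self, agent_config, run_agent):
        """One 'action': evaluate one candidate configuration
        against the full immutable test set."""
        scores = [self.metric(run_agent(agent_config, x), y)
                  for x, y in self.test_cases]
        reward = sum(scores) / len(scores)
        self.history.append((agent_config, reward))
        return reward
```

A prompt optimizer, or an RL trainer, only needs to call `step` with candidate configurations; the harness itself never changes, which is what makes the transition to an automated ratchet an engineering task rather than a conceptual leap.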

The concrete finding that earns this postscript its place: replacing gpt-5-mini with gemini-3-flash-lite for a retrieval subagent revealed that Gemini’s smaller model spontaneously performs auxiliary caching for downstream agents — a behavior the larger model rarely exhibited. The parallel to the depth-4 discovery on Apple Silicon is structural, not analogical. Change one variable (the model), hold the evaluation environment constant, and an unexpected optimum emerges, one the spec sheet gives no reason to predict. The evaluation harness made the invisible visible.

The open question is what happens when the ratchet optimizes not just model weights or kernel implementations but the agentic harness itself, including its own evaluation criteria. At that point, the environment and the agent co-evolve — and the human’s role shifts from designing experiments to designing the criteria by which experiments are judged. The bottleneck moves up one more layer.