Neural State Machine: a dispatch substrate for language models

Why this exists

A language model is fluent and a computer is exact, and today you mostly have to pick one. This project began as a refusal to pick.

Ask a model to reason and it is remarkable. Ask it for something that simply has to be right, a total, a date, a solved puzzle, and it answers just as fluently whether or not it is correct, and gives you a slightly different answer the next time. The bet behind the Neural State Machine is that you should not have to choose between the two. Let the model do what it is good at, judging what needs doing and what it cannot be trusted to nail, and hand the exact parts to something that runs the same way every time. What I wanted to find out was how far that division of labor goes, and whether the system could eventually improve itself, swapping in better tools for its own weak spots over time.

What we built: a neural state machine

What we built has a name: a neural state machine. The state-machine part is old and almost boring, software that is always in one clear state and moves between states on fixed rules (the software kind, not a chip). The new part is who decides the moves.

In a classical state machine a fixed program picks the transitions. Here a neural network picks them, in plain language: it generates text until it judges that the next part must be exact, raises a marker, and the machine steps into a different state where a deterministic driver runs and its answer is spliced back in. Then it returns to generating. Old skeleton, new mind choosing the moves. The rest of this page is that one loop, slowed down: the handoff, the driver that runs, the memory it carries, and finally the whole thing running live in your browser.

The handoff

The example everyone reaches for is arithmetic, and it is a trap. A modern model usually gets 7 × 43 right. The point was never 7 × 43; it is everything shaped like it but harder: a sixteen-digit multiply, a date that must be right, a puzzle where one wrong move poisons the rest. There the model does not fail loudly. It fails fluently, in a confident sentence, and you would never know.

The Neural State Machine works with that instead of against it. When the model needs an exact result it writes a marker, <op:arith>, and the runner pauses generation. A small, sandboxed WebAssembly program does the real calculation, and the result is spliced back into the stream before the model continues. That step is deterministic: same input, same answer, every time.
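A minimal text-level sketch of that pause-and-splice step. The source only shows the opening marker <op:arith>; the closing tag, the `a*b` payload syntax, and the names `MARKER`, `DRIVERS`, and `splice` are assumptions made for illustration, and a plain Python function stands in for the sandboxed WebAssembly program.

```python
import re
from typing import Callable, Dict

# Hypothetical wire format: <op:NAME>payload</op:NAME>.
# Only the opening marker appears in the source; the rest is assumed.
MARKER = re.compile(r"<op:(\w+)>(.*?)</op:\1>", re.DOTALL)


def arith(payload: bytes) -> bytes:
    """Stand-in for the arithmetic driver: exact integer multiply of 'a*b'."""
    left, right = payload.decode("ascii").split("*")
    return str(int(left) * int(right)).encode("ascii")


# Marker name -> driver. Every driver takes bytes and returns bytes.
DRIVERS: Dict[str, Callable[[bytes], bytes]] = {"arith": arith}


def splice(text: str) -> str:
    """Replace each marker span with the exact result from its driver."""

    def run(match: "re.Match[str]") -> str:
        name, payload = match.group(1), match.group(2)
        return DRIVERS[name](payload.encode("ascii")).decode("ascii")

    return MARKER.sub(run, text)


print(splice("7 times 43 is <op:arith>7*43</op:arith>."))
# prints: 7 times 43 is 301.
```

The same call handles the sixteen-digit case unchanged, because the driver works on exact integers rather than on the model's guess at the digits.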

What you see below is a trace: a recording of one real run, every token the model produced, in the order it produced them. The marker tokens are the hand-off points; at each one the driver fires and the exact output lands inline.


The driver

The model is a brilliant improviser. The trick is not teaching it arithmetic; it is giving it a way to stop improvising and call a real tool for the part that has to be exact.

A driver is a small WebAssembly module that follows one plain contract: bytes in, bytes out, and it imports nothing from the outside. It runs in a sandbox with no access to the model's weights, the clock, or the network, and the marker names which driver to call. Three are wired in today (arithmetic, a comparison check, and a Sudoku solver), but the point is the socket, not those three. In principle any exact function, whether a date calculation, a unit conversion, or a constraint solver, can become a driver: write it in a language that compiles to WebAssembly, give it that contract, and register it under a new marker. Each driver is deterministic: the same input bytes give the same output bytes, bit for bit, anywhere. The model's own output can drift across hardware; the driver's half never does.
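To make the contract concrete, here is a sketch of a registry and one new driver, written in Python rather than WebAssembly so it stays short. The `days` marker, the `YYYY-MM-DD,YYYY-MM-DD` payload, and the names `REGISTRY`, `register`, and `dispatch` are all hypothetical; the project's real registration API is not shown in this page.

```python
from datetime import date
from typing import Callable, Dict

# The contract from the text: bytes in, bytes out, nothing else.
Driver = Callable[[bytes], bytes]

REGISTRY: Dict[str, Driver] = {}


def register(marker: str, driver: Driver) -> None:
    """Bind a driver to a marker name; refuse to silently overwrite one."""
    if marker in REGISTRY:
        raise ValueError(f"marker already registered: {marker}")
    REGISTRY[marker] = driver


def days_between(payload: bytes) -> bytes:
    """Hypothetical date driver: 'start,end' in ISO format -> day count.

    A pure function of its input: it parses two dates and subtracts them,
    and never reads the clock or touches the network.
    """
    start, end = payload.decode("ascii").split(",")
    delta = date.fromisoformat(end) - date.fromisoformat(start)
    return str(delta.days).encode("ascii")


def dispatch(marker: str, payload: bytes) -> bytes:
    """Look up the driver named by the marker and run it on the payload."""
    return REGISTRY[marker](payload)


register("days", days_between)
print(dispatch("days", b"2024-02-01,2024-03-01"))
# prints: b'29'
```

Because the driver depends only on its input bytes, calling it twice with the same payload returns identical bytes, which is the property the paragraph above relies on.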

Crucially, the model isn't told what the answer should be. It only learns when to ask. The split is clean: open-ended reasoning stays with the model; anything that has to be exact goes to a driver.

The clean split. The model learns only when to ask; the marker is the seam between what stays learned and what must be exact.

These are the driver events from the run above, re-run live in your browser:

Why not fine-tune it?

Fine-tuning rewires the model's brain by training it on examples. Prompting leaves the brain alone and just hands it instructions. We leave the brain alone.

A system prompt teaches the model when to raise a marker. It never learns what the answer is; that stays with the driver. We prefer prompting for two reasons. It is model-agnostic: the same harness drops onto any capable model, where a fine-tune would be locked to one checkpoint and have to be redone for the next. And the exactness guarantee comes from the driver, not the model, so we don't have to retrain anything to get correct results; a strong instruction-following model is enough. Fine-tuning could still help the model raise cleaner markers and well-formed payloads; it just isn't required, and skipping it keeps a clean line: the model decides when to ask, the driver decides the value.

To be precise about what does and doesn't change: the weights are never retrained, and the model never learns an answer. We don't add transformer layers or alter the architecture. What we do add is plumbing around it, the code that catches a marker, runs the driver, splices the result, and compresses the carried-over memory. The one honest cost is that the base model has to be strong enough to follow the harness through a prompt, which is why the traces on this page were captured on an instruction-tuned Llama and not a 135M toy.

Why not a tool, or generated code?

If you have used function calling or a code interpreter, this will look familiar. The difference is what you have to trust.

Ask a model to write and run code and you are trusting code it just made up: it can be wrong, unsafe, or different on the next run. A function call is better, but the model still authors the arguments, a separate service runs them, and reproducibility is the tool's problem. A driver flips what is fixed and what is free. The program is fixed: a pre-registered, audited, sandboxed function, not something generated on the fly. What stays free is the model's job, deciding when to call and what to hand over. The model can still ask badly, but once it asks, the answer is exact and identical every time, and the program that produced it is small enough to read in one sitting.

And it is not the transformer doing the math. The standard objection to all of this is that a neural network is a slow, strange way to compute 7 × 43. Exactly, which is why it doesn't: the model does language, a few hundred bytes of WebAssembly do the arithmetic, and each side does the thing it is actually good at.

An aside: the memory it carries

One piece sits a little apart from the handoff story but is worth a look: how the model's memory stays small between passes.

Every token leaves a trace the model keeps in memory to stay coherent, the KV cache, and it grows with the conversation. The clever part is the compression, a scheme called TurboQuant. Instead of plainly rounding each number, it first rotates each block of values so their information spreads evenly, then snaps them to a small fixed set of reference points. That rotation is what buys roughly three to four bits per value instead of sixteen, about four to five times less memory under the layer-adaptive mix, at a small measured quality cost.
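The rotate-then-snap idea can be illustrated with a toy. This is not TurboQuant itself: it swaps in a randomized Hadamard rotation and a plain uniform grid as stand-ins, and every name in it (`hadamard`, `compress`, `decompress`, the example block) is an assumption. It shows only why rotating first helps when one value in a block is an outlier. On the memory side, the arithmetic is simple: at an average of, say, 3.5 bits per value, 16 / 3.5 ≈ 4.6 times smaller, ignoring the per-block scale.

```python
import math
import random


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    x = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in x]


def compress(block, signs, bits):
    """Rotate (random sign flips, then Hadamard), then snap to 2**bits levels."""
    rotated = hadamard([s * v for s, v in zip(signs, block)])
    scale = max(abs(v) for v in rotated) or 1.0
    top = 2 ** bits - 1
    codes = [round((v + scale) / (2 * scale) * top) for v in rotated]
    return codes, scale


def decompress(codes, scale, signs, bits):
    """Map codes back to reference points, then undo the rotation."""
    top = 2 ** bits - 1
    rotated = [c / top * 2 * scale - scale for c in codes]
    return [s * v for s, v in zip(signs, hadamard(rotated))]


rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]
block = [0.1, -0.2, 0.05, 8.0, 0.0, 0.3, -0.1, 0.2]  # one large outlier

codes, scale = compress(block, signs, bits=4)
restored = decompress(codes, scale, signs, bits=4)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored)))
print(codes, round(error, 3))
```

Without the rotation, the single 8.0 would set the grid spacing for the whole block and the small values would all collapse onto one or two levels; after it, the outlier's energy is shared across all eight coordinates.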

Watch it run, step by step

Now watch the machine actually step. This is the same loop from the top of the page, slowed to one move at a time.

Pick a trace and scrub it: you'll see the marker appear, the driver fire, and the answer land before the model keeps going. The runner moves through more states than the three-word version suggests: it collects the operands, holds a short window so the model doesn't stop at the spliced answer, and takes a repair turn if a payload comes out malformed. But the shape holds: defined states, deterministic transitions, the model choosing each move.
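One way to picture those states is as a transition table. The sketch below is a guess at the shape, not the runner's actual design: the state names, the event names, and the exact edges between them are all invented for illustration from the description above.

```python
from enum import Enum, auto


class State(Enum):
    GENERATING = auto()   # model is producing ordinary text
    COLLECTING = auto()   # marker seen, gathering the operands
    DISPATCHING = auto()  # payload is well formed, driver is running
    HOLDING = auto()      # short window after the splice
    REPAIRING = auto()    # payload was malformed, model gets a repair turn


# (current state, event) -> next state. Anything not listed is illegal.
TRANSITIONS = {
    (State.GENERATING, "marker"): State.COLLECTING,
    (State.COLLECTING, "payload_ok"): State.DISPATCHING,
    (State.COLLECTING, "payload_bad"): State.REPAIRING,
    (State.REPAIRING, "marker"): State.COLLECTING,
    (State.DISPATCHING, "result"): State.HOLDING,
    (State.HOLDING, "window_done"): State.GENERATING,
}


def step(state: State, event: str) -> State:
    """Apply one event; the table, not the model, decides what is legal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events, state: State = State.GENERATING):
    """Replay a list of events and return every state visited, in order."""
    path = [state]
    for event in events:
        state = step(state, event)
        path.append(state)
    return path


trace = ["marker", "payload_bad", "marker", "payload_ok", "result", "window_done"]
print([s.name for s in run(trace)])
```

The events come from the model's output and the driver's return, but the table is fixed, which is what keeps the transitions deterministic even though a neural network is choosing the moves.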


Solving a hard Sudoku

Arithmetic is the easy case. The real test is a problem where one wrong move ruins everything, and a hard Sudoku is exactly that.

The same mechanism drives <op:sudoku>: the model emits the puzzle, and a compiled solver, not a learned guess, fills the grid. Autoregressive models stumble here because one early wrong cell is unrecoverable; the driver just computes the answer, every cell correct, every run.


Fig. 5 · <op:sudoku>. Arto Inkala's "hardest" Sudoku, solved deterministically by the driver. Givens are bold; each computed cell drops in as the solver places it. Same puzzle, same solution, every run.

The catch is not that the model does not know the rules: it does. The catch is that solving Sudoku is search with backtracking, and an LLM generating one token at a time has nothing to back up to. Each digit it writes is committed. So even with the rules in hand, applying them across all 81 cells without ever undoing a guess fails: one quiet wrong choice at cell five poisons everything after. The driver does not know more; it can simply try, fail, undo, try again. That is the difference Sudoku exposes.
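The try, fail, undo, try again loop is ordinary backtracking search. Here is a minimal sketch in Python; it is not the project's compiled solver, just the textbook algorithm, and the example puzzle is a common easy one rather than the Inkala grid, so the plain search finishes quickly.

```python
from typing import List

Grid = List[List[int]]  # 9x9, with 0 marking an empty cell


def allowed(grid: Grid, r: int, c: int, d: int) -> bool:
    """True if digit d can go at (r, c) without breaking row, column, or box."""
    if d in grid[r]:
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[i][j] != d
               for i in range(br, br + 3)
               for j in range(bc, bc + 3))


def solve(grid: Grid) -> bool:
    """Fill the grid in place; return False if no completion exists."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if allowed(grid, r, c, d):
                        grid[r][c] = d   # try
                        if solve(grid):
                            return True
                        grid[r][c] = 0   # undo: the step a token stream lacks
                return False             # fail: caller must revise its guess
    return True                          # no empty cell left


PUZZLE = [
    "53..7....", "6..195...", ".98....6.",
    "8...6...3", "4..8.3..1", "7...2...6",
    ".6....28.", "...419..5", "....8..79",
]
grid = [[0 if ch == "." else int(ch) for ch in row] for row in PUZZLE]
solved = solve(grid)
for row in grid:
    print("".join(map(str, row)))
```

The line marked "undo" is the whole difference: a model that has already emitted a digit has no equivalent of setting the cell back to zero.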

What if the model tried to solve it alone? One plausible-but-wrong guess and the grid turns contradictory, and, having committed, it's stuck.

[Figure panels, left to right: "the model alone · guessing" and "with <op:sudoku> · the driver"]

Fig. 6 · why hand off. Left: the model fills cells one at a time until it writes a digit that breaks the puzzle (a second 8 in row 1), and, already committed, it can't take it back. Right: the driver computes the whole grid at once. No guessing, no dead ends.

And it does not stop at 9×9. The same driver scales straight up.

The grid below is solved for real, right here in your browser, not replayed from a capture: the <op:csp> driver runs as WebAssembly, fills a 9×9, 16×16, or 25×25, and the page re-checks every row, column, and box before it calls the answer verified. A 25×25 has 625 cells. The search space is the part that breaks a model writing one token at a time, and the part a solver settles in milliseconds.


Fig. 7 · <op:csp> live. The same universal driver, run for real in your browser: it fills 9×9, 16×16, and 25×25 grids, and the page re-checks every row, column, and box before it says "verified." A 25×25 has 625 cells and a search space no language model navigates by writing one token at a time.
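The re-check the page performs before saying "verified" is simple to state for any of the three sizes. A sketch of such a check, assuming the grid arrives as a list of rows of integers; the function names `verified` and `pattern_grid` are hypothetical, and `pattern_grid` exists only to produce a known-valid grid to test against.

```python
import math
from typing import List


def verified(grid: List[List[int]]) -> bool:
    """Check every row, column, and box of an n x n grid, n a perfect square."""
    n = len(grid)
    b = math.isqrt(n)  # box side: 3 for 9x9, 4 for 16x16, 5 for 25x25
    if n == 0 or b * b != n or any(len(row) != n for row in grid):
        return False
    want = set(range(1, n + 1))
    rows_ok = all(set(row) == want for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == want for c in range(n))
    boxes_ok = all(
        {grid[r][c] for r in range(br, br + b) for c in range(bc, bc + b)} == want
        for br in range(0, n, b)
        for bc in range(0, n, b)
    )
    return rows_ok and cols_ok and boxes_ok


def pattern_grid(b: int) -> List[List[int]]:
    """Build a valid filled grid of side b*b from a fixed shifting pattern."""
    n = b * b
    return [[(b * (r % b) + r // b + c) % n + 1 for c in range(n)]
            for r in range(n)]


for b in (3, 4, 5):
    g = pattern_grid(b)
    print(f"{b * b}x{b * b}: {sum(len(row) for row in g)} cells, "
          f"verified={verified(g)}")
```

Checking is cheap and independent of how the grid was filled, which is why the page can run it on the driver's output rather than taking the answer on trust.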

Run it live

Everything above replays captured traces. Here the loop runs end to end in your browser: a small model generates, the runner detects markers, and the WebAssembly drivers fire, all client-side.

Live mode

Runs SmolLM2-135M-Instruct entirely in your browser via transformers.js. The dispatch loop is the Rust runner mirrored at text level — same marker detection, same Wasm drivers (loaded from static/drivers/*.wasm), same outcome injection. Replay traces above were captured on Llama-3.1-8B; live mode trades model size for in-browser interactivity.

First load fetches the model from the Hugging Face CDN; the browser caches it for subsequent visits. Sudoku dispatch is gated out, because the 135M model won't reliably emit 81 valid grid cells.

Where this could go

The obvious next thought is to pull the driver inside the model. It is the right instinct and, done the usual way, the wrong move.

Fold computation into the weights and you hand back, in reverse, everything that made the driver trustworthy: the audit trail, the bit-for-bit determinism, the freedom to swap a tool without retraining the whole model. The version worth chasing keeps the executor outside and exact, and makes the choosing smarter: the system learns when to dispatch and which tool to reach for, and eventually rewrites its own drivers to cover its weak spots. Learn the question and the toolbox, not the answer. That is the long-term bet, and the hybrid counterpoint to fully-learned "neural computer" designs.

❧