2 min read · from Machine Learning

A frozen transformer learned that wombats produce cube-shaped droppings and still knows it after a cold reload [R]

A transformer with a separate, isolated memory buffer. Backbone frozen. 300 gradient steps on the memory weights only:

| Query | Prediction | p |
|:--|:--|:--|
| "wombats produce cube-shaped" | droppings | 0.9997 |
| "Kaniva gets hot in" | su (summer) | 0.9998 |
| "Lions Club president eats" | V (Vegemite) | 0.9990 |

Save, kill process, cold reload, query again. Same result. 20 unrelated facts encoded jointly: 20/20 correct, median p = 0.997. Two subjects encoded simultaneously with cross-contamination < 0.03.
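The freeze-and-reload protocol can be sketched in a few lines. This is a hypothetical stand-in, not the repo's code: `backbone`, `memory`, and the additive read path are all placeholder assumptions; the serialization round-trip stands in for the save/kill/cold-reload cycle.

```python
import io
import torch

# Hypothetical sketch: freeze a backbone, train only an isolated memory
# buffer for 300 steps, then round-trip the memory through serialization
# to mimic the cold-reload check. Names and shapes are illustrative.
backbone = torch.nn.Linear(16, 16)                # stand-in for pretrained weights
memory = torch.nn.Parameter(torch.zeros(16, 16))  # isolated memory buffer

for p in backbone.parameters():
    p.requires_grad_(False)                       # backbone stays frozen

opt = torch.optim.Adam([memory], lr=1e-2)         # only memory gets gradients
x = torch.randn(4, 16)
target = torch.randn(4, 16)
for _ in range(300):                              # 300 gradient steps, as in the post
    out = backbone(x) + x @ memory                # assumed additive memory read
    loss = torch.nn.functional.mse_loss(out, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Save, kill process, cold reload": serialize memory only, then restore it.
buf = io.BytesIO()
torch.save({"memory": memory.detach()}, buf)
buf.seek(0)
reloaded = torch.load(buf)["memory"]              # identical bits after reload
```

The point of the sketch is only that the optimizer's parameter list excludes the backbone entirely, so persistence reduces to serializing one small tensor.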

The mechanism: BDH (Kosowski et al., arXiv:2509.26507) computes a co-activation outer product at every token step and discards it. This repo accumulates it instead, with a learned content-addressing projection so the address reflects the full causal context, not just token identity.
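The accumulate-instead-of-discard idea can be sketched as an outer-product update. This is my reading of the post, not the repo's implementation: `W_addr`, `fast_W`, and the read-out are all assumed names and shapes.

```python
import torch

# Minimal sketch of the write rule as described: at each token, project the
# causal hidden state through a learned content-addressing map, form the
# outer product with the activation, and add it into a fast-weight matrix
# rather than discarding it.
d = 8
W_addr = torch.nn.Linear(d, d, bias=False)  # learned content-addressing projection
fast_W = torch.zeros(d, d)                  # accumulated fast weights

hidden_states = torch.randn(5, d)           # per-token hidden states (causal context)
for h in hidden_states:
    key = W_addr(h).detach()                # address reflects full causal context
    fast_W += torch.outer(key, h)           # accumulate the co-activation product

# read-out: query the accumulated fast weights with a new address
q = W_addr(torch.randn(d)).detach()
retrieved = q @ fast_W
```

Because the address comes from the projected hidden state rather than the raw token embedding, two facts about different subjects land on (mostly) different rows of `fast_W`, which is consistent with the low cross-contamination reported above.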

Sits in the fast weight programmer tradition (Schmidhuber 1991). Closest concurrent work: FwPKM (arXiv:2601.00671) and In-Place TTT (arXiv:2604.06169), which independently converge on a similar write rule.

15M params, 250M tokens, single consumer GPU, single seed. Encoding takes 300 steps, not one shot. Capacity beyond 20 facts is untested.

You can run this yourself; instructions are in the README.

Code: https://github.com/fleeb83/bdh-fast-weights (Apache 2.0)

submitted by /u/fleebrun83


Tagged with

#wombats
#transformer
#frozen
#BDH
#memory buffer
#query prediction
#co-activation
#content-addressing
#fast weight programmer
#gradient steps
#tokens
#params
#cross-contamination
#cold reload
#capacity
#write rule
#median p