Karpathy dropped a 200-line GPT, so I used the math to turn pandas DataFrames into searchable context windows and open sourced it (and automated my stats pipeline). [P]
TL;DR: I got tired of manually running Shapiro-Wilk tests and copy-pasting p-values at 2 AM. I built an open-source, async Python pipeline called StatForge that automates the statistical decision layer, writes APA methods, and lets you chat with your dataset using a microgpt-inspired retrieval system.
Hey everyone,
The hardest part of data analysis isn't the computation (we all have scipy and statsmodels). It's the plumbing—the sequence of choices between loading a CSV and having a defensible result.
I built StatForge to handle the plumbing.
How the pipeline works:
- Lazy Loading: Detects 15+ formats (CSV, Parquet, SPSS, SQLite) and lazily imports dependencies so you don't pay for bloat.
- Autonomous Assumption Checks: It doesn't just pass/fail normality. If a Shapiro-Wilk test returns a borderline p = 0.048, it flags it, runs both parametric and non-parametric tests, and compares the robustness of the results.
- Plugin Registry: Uses a `register` decorator pattern for easy custom model injection.
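To make the borderline-handling idea concrete, here's a minimal sketch of that decision layer. This is illustrative only: the function name `check_and_test` and the grey-zone bounds are my assumptions, not StatForge's actual API.

```python
# Sketch of a borderline-aware assumption check.
# `check_and_test` and the BORDERLINE zone are hypothetical, illustrative names.
import numpy as np
from scipy import stats

BORDERLINE = (0.01, 0.10)  # assumed "grey zone" around alpha = 0.05

def check_and_test(a, b, alpha=0.05):
    """Shapiro-Wilk each group; on a borderline normality p-value,
    run BOTH the parametric and non-parametric test and report both."""
    p_norm = min(stats.shapiro(a).pvalue, stats.shapiro(b).pvalue)
    parametric_ok = p_norm >= alpha
    borderline = BORDERLINE[0] <= p_norm <= BORDERLINE[1]
    results = {"shapiro_p": p_norm, "borderline": borderline}
    if parametric_ok or borderline:
        results["t_test_p"] = stats.ttest_ind(a, b).pvalue
    if (not parametric_ok) or borderline:
        results["mann_whitney_p"] = stats.mannwhitneyu(a, b).pvalue
    return results

rng = np.random.default_rng(0)
out = check_and_test(rng.normal(0, 1, 40), rng.normal(0.5, 1, 40))
```

The point is that a borderline normality result yields two test p-values side by side instead of a silent branch, so the caller can see whether the conclusion is robust to the choice of test.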
The microgpt Chat Mode: When Karpathy released his 200-line GPT, the way he loaded a corpus (docs: list[str]) changed how I looked at DataFrames. What if each row is a document? StatForge converts datasets into this format, scores rows against plain-English queries, pulls the top-k most relevant rows into a context window, and hits the Anthropic API (or a built-in rule engine). No vector DBs, no FAISS, just clean strings.
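The rows-as-documents idea can be sketched in a few lines. This is a toy version of the concept, not StatForge's implementation: the helper names and the token-overlap scorer are my own stand-ins.

```python
# Minimal sketch of rows-as-documents retrieval over a DataFrame.
# `rows_to_docs` / `top_k` are hypothetical names; scoring is plain
# token overlap, standing in for whatever scorer StatForge uses.
import pandas as pd

def rows_to_docs(df: pd.DataFrame) -> list[str]:
    # Each row becomes one "document": "col=value" pairs joined as a string.
    return [" ".join(f"{c}={v}" for c, v in row.items())
            for _, row in df.iterrows()]

def top_k(docs: list[str], query: str, k: int = 3) -> list[str]:
    # Score each row-document by how many query tokens it shares.
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"],
                   "temp": [4, 19, 31]})
docs = rows_to_docs(df)
context = "\n".join(top_k(docs, "city=Pune", k=1))
```

The joined `context` string is what gets dropped into the prompt before the API call: no embeddings, no index, just the k highest-scoring rows as plain text.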
You can run a full analysis with one command!
I wrote a deep-dive on the architecture and the philosophy behind it here: https://shekhawatsamvardhan.medium.com/andrej-karpathy-dropped-a-200-line-gpt-d153e9557463
Repo is here if you want to break it or contribute: https://github.com/samvardhan03/statforge
Would love to hear how you handle your own stats plumbing, or if there are specific edge cases the decision tree should catch!