[R] Is autoresearch really better than classic hyperparameter tuning?

We did experiments comparing Optuna & autoresearch.
Autoresearch converges faster, is more cost-efficient, and even generalizes better.

Experiments were done on NanoChat: we let Claude define Optuna’s search space to align the priors between methods. Both optimization methods were run three times. Autoresearch is far more sample-efficient on average
In 5 min training setting, LLM tokens cost as much as GPUs, but despite a 2× higher per-step cost, AutoResearch still comes out ahead across all cost budgets:
What’s more, the solution found by autoresearch generalizes better than Optuna’s. We gave the best solutions more training time; the absolute score gap widens, and the statistical significance becomes stronger:

An important contributor to autoresearch’s capability is that it searches directly in code space. In the early stages, autoresearch tunes knobs within Optuna’s 16-parameter search space. However, with more iterations, it starts to explore code changes