Synthetic Data for Neuroscience: Lowering the Activation Energy
The problem everyone recognises but nobody fixes
Every PI has had this experience: a student presents a striking result, and the first question is “did you test your pipeline on simulated data where you know the ground truth?” Silence. The student ran their analysis on real data, got a p-value, and never checked whether their code could recover a known effect — let alone whether their experimental design could distinguish their hypothesis from the alternatives.
This isn’t laziness. Generating useful synthetic data requires writing down a generative model, which requires formalising your causal assumptions, which means confronting ambiguities you’d rather leave unresolved until the data force the issue. The activation energy is too high, so people skip it. The consequences are familiar: undetected bugs that corrupt results (sometimes for months), analyses that cannot in principle answer the question being asked, and post hoc exploration dressed up as hypothesis testing — a major driver of the reproducibility crisis.
Synthetic data addresses all three problems. Pipeline verification exposes bugs before they touch real data: if the code cannot recover an effect you planted yourself, it is broken. Design analysis reveals whether an experiment can distinguish competing hypotheses before data collection begins. And preregistered generative models make the boundary between confirmatory and exploratory analysis explicit. The tool is powerful and underused for a single reason: formalising the generative model is hard.
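To make pipeline verification concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of the proposed tool: the linear generative model, the effect size of 0.5, and the `pipeline` function, which stands in for whatever analysis a lab actually runs.

```python
import numpy as np

def simulate(n, true_effect, rng):
    """Draw data from a known (assumed) model: response = true_effect * stimulus + noise."""
    stimulus = rng.normal(size=n)
    response = true_effect * stimulus + rng.normal(size=n)
    return stimulus, response

def pipeline(stimulus, response):
    """Hypothetical stand-in for the analysis under test: an ordinary least squares slope."""
    design = np.column_stack([np.ones_like(stimulus), stimulus])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
true_effect = 0.5  # ground truth, known because we planted it

# Run the pipeline on many synthetic datasets and compare with the planted effect.
estimates = np.array(
    [pipeline(*simulate(500, true_effect, rng)) for _ in range(200)]
)
bias = estimates.mean() - true_effect
print(f"mean estimate {estimates.mean():.3f}, bias {bias:+.3f}")
```

If the bias is far from zero, the problem is in the code or the model, not in the biology, and it was found before any real data were touched.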
The proposal
Build a tool — initially an LLM-based plugin, eventually a standalone platform — that helps scientists construct formal causal models (DAGs) of their systems through structured dialogue, and uses those DAGs to generate synthetic data and analysis pipelines.
The workflow:
The user uploads a grounding artifact — a dataset, a draft paper, a grant proposal. The tool reads the artifact and engages the user in a Socratic dialogue designed to elicit their causal assumptions and narrow the space of candidate directed acyclic graphs (DAGs). Through this dialogue, the tool surfaces implicit assumptions, flags inconsistencies, and — without lecturing — teaches causal inference concepts embedded in the user’s own variables and domain.
The dialogue has three exit states, all of which are informative:
Success: The user has committed to a tractable set of candidate DAGs that formalise their competing hypotheses.
Irreducible ambiguity: The DAG space cannot be narrowed further given the user’s current state of knowledge. The tool identifies the specific unresolved relationships and recommends what to think about before proceeding.
Inconsistency: The user’s stated commitments imply contradictory causal structures. The tool surfaces the contradiction in concrete terms.
The second and third states, irreducible ambiguity and inconsistency, are arguably where the tool provides the most value, because these are precisely the problems that remain invisible without formalisation.
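The three exit states can be illustrated with a toy sketch in Python. It assumes the user's commitments arrive as required and forbidden directed edges, enumerates every DAG over a small variable set by brute force, and classifies the outcome by how many candidates survive. The function names, the variable names, and the `tractable` threshold are all hypothetical, and treating "too many candidates remain" as irreducible ambiguity is a simplification of the dialogue described above.

```python
from itertools import combinations, product

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {v: 0 for v in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    frontier = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while frontier:
        v = frontier.pop()
        visited += 1
        for src, dst in edges:
            if src == v:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    frontier.append(dst)
    return visited == len(nodes)

def candidate_dags(nodes, required=(), forbidden=()):
    """Enumerate every DAG over `nodes` consistent with the edge commitments."""
    pairs = list(combinations(nodes, 2))
    required, forbidden = set(required), set(forbidden)
    survivors = []
    # Each pair of variables is unconnected, oriented a -> b, or oriented b -> a.
    for choice in product((None, 0, 1), repeat=len(pairs)):
        edges = set()
        for (a, b), c in zip(pairs, choice):
            if c == 0:
                edges.add((a, b))
            elif c == 1:
                edges.add((b, a))
        if required <= edges and not (edges & forbidden) and is_acyclic(nodes, edges):
            survivors.append(frozenset(edges))
    return survivors

def exit_state(dags, tractable=3):
    """Map the surviving candidate set onto one of the three exit states."""
    if not dags:
        return "inconsistency"
    if len(dags) <= tractable:
        return "success"
    return "irreducible ambiguity"

nodes = ("reward", "movement", "firing_rate")

# No commitments yet: all 25 DAGs on three variables remain.
print(exit_state(candidate_dags(nodes)))

# Reward and movement both drive firing rate and do not affect each other.
collider = candidate_dags(
    nodes,
    required=[("reward", "firing_rate"), ("movement", "firing_rate")],
    forbidden=[("reward", "movement"), ("movement", "reward")],
)
print(exit_state(collider))

# Commitments that form a cycle cannot be satisfied by any DAG.
cyclic = candidate_dags(
    nodes,
    required=[("reward", "firing_rate"), ("firing_rate", "movement"), ("movement", "reward")],
)
print(exit_state(cyclic))
```

Brute-force enumeration only works for a handful of variables, since the number of DAGs grows super-exponentially; a real tool would need a smarter representation of the candidate space.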
From a tractable DAG set, the tool can then parameterise generative models, produce synthetic datasets, and run discriminability analyses — essentially power analysis for causal inference rather than for effect detection. But the DAG construction step is independently valuable and is the natural starting point.
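A discriminability analysis of this kind might look like the following Python sketch. It assumes two candidate DAGs over hypothetical variables x, m and y: a pure chain (x -> m -> y) and the same chain with an extra direct edge x -> y. The two differ in one testable implication, namely whether x and y are independent given m. The linear-Gaussian parameterisation, the coefficients, and the Fisher z-test at a 1.96 threshold are illustrative choices, not the tool's method.

```python
import numpy as np

def simulate(n, direct, rng):
    """Sample from x -> m -> y, plus a direct x -> y edge of strength `direct`."""
    x = rng.normal(size=n)
    m = 0.8 * x + rng.normal(size=n)
    y = 0.8 * m + direct * x + rng.normal(size=n)
    return x, m, y

def partial_corr(x, y, z):
    """Correlation of x and y after regressing z (and an intercept) out of both."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def rejects_independence(x, m, y, crit=1.96):
    """Fisher z-test of 'x independent of y given m' with one conditioning variable."""
    r = partial_corr(x, y, m)
    z = np.arctanh(r) * np.sqrt(len(x) - 1 - 3)
    return abs(z) > crit

def rejection_rate(n, direct, n_sims, rng):
    """Fraction of simulated experiments in which the independence test rejects."""
    return float(np.mean(
        [rejects_independence(*simulate(n, direct, rng)) for _ in range(n_sims)]
    ))

rng = np.random.default_rng(1)
for n in (50, 200, 800):
    false_alarm = rejection_rate(n, direct=0.0, n_sims=300, rng=rng)  # chain is true
    detection = rejection_rate(n, direct=0.3, n_sims=300, rng=rng)    # direct edge is true
    print(f"n={n:4d}  false alarm {false_alarm:.2f}  detection {detection:.2f}")
```

The detection rate as a function of sample size is the causal analogue of a power curve: it says how much data the planned design needs before the two hypotheses can be told apart.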
Why now
Large language models make the Socratic dialogue component feasible in a way it was not two years ago. An LLM can read a neuroscience paper, understand that “we recorded from OFC during a risky choice task” implies a particular set of candidate variables and causal relationships, and ask targeted questions like “you’re conditioning on firing rate here, but both reward and movement influence firing rate in your model — do you intend to control for the induced association?” Static tools like DAGitty can display the implications of a graph you’ve already built. They cannot meet you in your own domain language or notice that your verbal description implies a structure you didn’t intend.
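The induced association in that example question is collider bias, and a few lines of simulation show it. The sketch below is in Python and assumes a deliberately simple linear model in which reward and movement are independent and both add to firing rate with unit weight; the variable names follow the example, and the coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed structure: reward -> firing_rate <- movement, with no link
# between reward and movement.
reward = rng.normal(size=n)
movement = rng.normal(size=n)
firing_rate = reward + movement + rng.normal(size=n)

def residual(v, z):
    """Part of v left over after regressing out z and an intercept."""
    design = np.column_stack([np.ones_like(z), z])
    return v - design @ np.linalg.lstsq(design, v, rcond=None)[0]

# Without conditioning, reward and movement are uncorrelated.
marginal = np.corrcoef(reward, movement)[0, 1]

# Conditioning on the collider (here, regressing out firing rate)
# induces a negative association between its two causes.
conditional = np.corrcoef(
    residual(reward, firing_rate), residual(movement, firing_rate)
)[0, 1]

print(f"marginal correlation    {marginal:+.3f}")
print(f"conditional correlation {conditional:+.3f}")
```

Under these assumed coefficients the conditional correlation is about -0.5 in expectation, even though the two variables were generated independently. An analysis that conditions on firing rate would report a relationship between reward and movement that exists only because of the conditioning.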
The long-term vision
Individual DAGs from individual labs are useful. A database of DAGs becomes transformative. If each user’s DAG is a formalised unit of causal knowledge — with a citation, a set of conditional independence commitments, and a record of the dialogue that produced it — then the database becomes a structured, computationally accessible summary of what the field collectively believes about causal relationships in neural systems.
The real value of such a database is not the DAGs themselves but the contradictions between them. Two labs publishing DAGs that imply different causal directions between the same variables have produced a formalised scientific disagreement — and the database can identify which experiment would resolve it.
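As a rough illustration of how such disagreements could be found mechanically, the Python sketch below stores each DAG as a record with a citation and a set of directed edges, then scans every pair of records for edges oriented in opposite directions. The `DagRecord` type, the lab labels, and the variable names are placeholders invented for the example, and the record omits the conditional independence commitments and dialogue history that a real entry would carry.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class DagRecord:
    """Hypothetical database entry: who claims the DAG, and its directed edges."""
    citation: str
    edges: frozenset  # of (cause, effect) tuples

def direction_conflicts(a, b):
    """Edges in record `a` that record `b` orients the opposite way."""
    return sorted(edge for edge in a.edges if (edge[1], edge[0]) in b.edges)

def all_conflicts(records):
    """Every formalised disagreement about causal direction in the database."""
    found = []
    for a, b in combinations(records, 2):
        for edge in direction_conflicts(a, b):
            found.append((a.citation, b.citation, edge))
    return found

# Placeholder records, not real publications.
database = [
    DagRecord("Lab A (hypothetical)", frozenset({
        ("OFC_activity", "choice"), ("reward", "OFC_activity"),
    })),
    DagRecord("Lab B (hypothetical)", frozenset({
        ("choice", "OFC_activity"), ("reward", "OFC_activity"),
    })),
]

conflicts = all_conflicts(database)
for first, second, (cause, effect) in conflicts:
    print(f"{first} claims {cause} -> {effect}; {second} claims the reverse")
```

This only catches direct reversals on identically named variables. Matching variables across labs, and working out which experiment would resolve a given conflict, are the hard parts that the sketch leaves out.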
This is, in essence, the causal inference analogue of what Lean and Mathlib provide for mathematics: not a system that does the science, but a formal language in which scientific claims about causal structure can be expressed unambiguously, checked for consistency, and composed with other verified results. The scientist still does the creative work. The formalism makes it impossible to quietly sweep a gap under the rug.
Next steps
Build a minimal version of the Socratic DAG tool and deploy it within a single lab as a proof of concept.
Identify collaborators with expertise in causal discovery and kernel methods to develop the formal DAG space reduction framework (potential PhD project).
Demonstrate utility on 2–3 concrete scientific problems to establish a working demo.
Pursue adoption across a small number of labs; allow the DAG database to accumulate organically from individual use.
If this sounds exciting and you would like to help build it, please get in touch!

Love your thinking Dr J. I think this has legs. Who would be ideal collaborators to begin bringing this to life? LLM data scientists? Other neuroscientists?