See also LLMs in education, Academia.
LLMs are creating anxiety in scientific research just as they are in education. On the one hand, they are tools that can help us conduct better research; on the other, they can undermine the training of new researchers or flood us with low-quality research output. These notes cover both the use of LLMs for research and their role in the training of new researchers.
To decide how LLMs should be used in research, we first need to agree on what research is for.
Hogg, D. W. (2026). Why do we do astrophysics? arXiv. https://arxiv.org/abs/2602.10181
Argues that astrophysics, as a scientific field with no instrumental value whatsoever, is for humans: “People are always the ends, not merely the means.” If astrophysics leads to no practical results (unlike, say, biomedicine), its purpose can only be in the people who do it and who enjoy learning about its output. If we accept this premise, then the unlimited use of LLMs to accelerate astrophysics research “leads to the death of astrophysics, the end of astrophysics at universities, and the end of astrophysics education. Astrophysics would no longer be by humans, and then it would no longer be for humans.”
Karamanis, M. (2026). The machines are fine. I’m worried about us. ergosphere.blog. https://ergosphere.blog/posts/the-machines-are-fine/
Building on Hogg’s paper, tells the parable of Alice and Bob, two brand-new PhD students. Both get projects and struggle through learning how to do research, but Bob outsources his struggle to an LLM; since “The failures are the curriculum” in research, Bob learns little despite apparently making the same progress. Thus:
The real threat is a slow, comfortable drift toward not understanding what you’re doing. Not a dramatic collapse. Not Skynet. Just a generation of researchers who can produce results but can’t produce understanding.
You might argue that LLMs will only get better: maybe they now require expert supervision to produce good work (see Schwartz, below), but soon they won’t.
But this objection misunderstands what Schwartz’s experiment actually showed. The models are already powerful enough to produce publishable results under competent supervision. That’s not the bottleneck. The bottleneck is the supervision. Stronger models won’t eliminate the need for a human who understands the physics; they’ll just broaden the range of problems that a supervised agent can tackle. The supervisor still needs to know what the answer should look like, still needs to know which checks to demand, still needs to have the instinct that something is off before they can articulate why.
And we don’t yet know how to train a future supervisor except by having a junior graduate student try and fail to do research.
(My addition: You might argue that we should figure out how to train supervisors directly—to teach students to recognize interesting problems, determine which tools apply, and judge whether an LLM’s approach is reasonable. This, to me, sounds like saying “Just teach critical thinking!”, which has been the response of academics to all kinds of education problems for decades. And since nobody has ever successfully figured out how to “just” teach critical thinking, if that’s your answer, you might as well give up now. See also Critical thinking.)
Peiris, H. V. (2026). Large language models are not the problem. Nature Astronomy. doi:10.1038/s41550-026-02837-2
If we’re so anxious about LLMs disrupting research, maybe that’s because we’ve set up the wrong incentive structures for research all along:
The incentive structures of our profession — publish or perish, citation metrics as proxy for impact, volume as proxy for productivity — have been producing incremental, poorly checked, and sometimes wrong papers for decades. Peer review was already buckling under the load. Code was already going unvalidated.
So if you’re worried about a deluge of AI slop, well, the slop was with us all along. Maybe we should fix those bad incentives and produce better work. And “If a practitioner’s contribution can genuinely be replicated by a statistical process with no understanding of the underlying physics, then the activity was not sufficiently scientific to begin with.” LLMs merely expose the problems that already exist in our work, rather than posing a new and unique problem.
Schwartz, M. (2026). Vibe physics: The AI grad student. Anthropic Research. https://www.anthropic.com/research/vibe-physics
A physics professor uses AI, including agentic tools, to write a paper in quantum chromodynamics developing new approximations (“resumming the Sudakov shoulder in the C-parameter”). After three days of work, he had a professional-looking paper… until he discovered “Claude had been adjusting parameters to make plots match rather than finding actual errors,” and “there was a serious error at the very beginning” that undermined every result that followed. Eventually, after a week of work, he arrived at an apparently correct paper. There’s a summary of “The long tail of errors” describing all the mistakes Claude made. He concludes that LLMs “cannot yet do original theoretical physics research autonomously”, but “can vastly accelerate the research done by experts” who carefully monitor them. Nonetheless he is an optimist, now doing all his research with LLMs and advocating that others do the same.
Bertran, M., Fogliato, R., & Wu, Z. S. (2026). Many AI analysts, one dataset: Navigating the agentic data science multiverse. arXiv. https://arxiv.org/abs/2602.18710
Explores using AI agents (two sizes of Claude 4.5 and two of Qwen3, with access to Python and the shell) to do data analyses, including the classic soccer refereeing bias data from the Many Analysts, One Data Set study. The agents were supervised by a Claude Sonnet 4.5 “auditor” to check their work. They tested five different prompts (“personas”) with varying levels of support or skepticism for the analytical hypothesis.
Across three datasets spanning distinct domains, AI analyst-produced analyses exhibit substantial dispersion in effect sizes, p-values, and conclusions. This dispersion can be traced to identifiable analytic choices in preprocessing, model specification, and inference that vary systematically across LLM and persona conditions. Critically, the outcomes are steerable: reassigning the analyst persona or LLM shifts the distribution of results even among methodologically sound runs.
(Note there was no human review of the LLM analysis approaches, so there’s no commentary on how good or bad their decisions were or whether they matched human quality.)
Thus “The central challenge is not that automated analyses are wrong but that they are abundant.” You can get loads of analyses for cheap, and they will not all agree; how much they agree with you will depend on how you prompt the agents. They spin this as a positive, as it provides a way to rapidly explore how conclusions are sensitive to modeling choices. Need a sensitivity analysis? Get twenty different versions of the analysis in an hour and see how much they differ.
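A minimal sketch of that multiverse-as-sensitivity-analysis idea, using hand-coded specification variants on synthetic data in place of the paper’s LLM-generated analyses (the dataset, variable names, and effect sizes below are invented purely for illustration):

```python
# Sketch of the "analysis multiverse" as a sensitivity analysis: fit the same
# question under several defensible specifications and look at the spread of
# effect estimates. Synthetic data; in the paper, the variants would instead
# come from many LLM analyst runs.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "skin_tone": rng.uniform(0, 1, n),   # hypothetical predictor of interest
    "position": rng.integers(0, 4, n),   # hypothetical covariate
    "league": rng.integers(0, 5, n),     # hypothetical covariate
    "games": rng.integers(10, 200, n),   # exposure
})
rate = np.exp(-4.0 + 0.2 * df["skin_tone"] + 0.05 * df["position"])
df["red_cards"] = rng.poisson(rate * df["games"])

# Each "analyst" makes different but defensible choices: which covariates to
# adjust for, and whether to exclude low-exposure players.
specs = {
    "unadjusted":             ("red_cards ~ skin_tone", 0),
    "adjust for position":    ("red_cards ~ skin_tone + C(position)", 0),
    "adjust for pos + league": ("red_cards ~ skin_tone + C(position) + C(league)", 0),
    "exclude <50 games":      ("red_cards ~ skin_tone + C(position)", 50),
}

rows = []
for name, (formula, min_games) in specs.items():
    d = df[df["games"] >= min_games]
    fit = smf.glm(formula, data=d, family=sm.families.Poisson(),
                  offset=np.log(d["games"])).fit()
    rows.append({"spec": name,
                 "log rate ratio": fit.params["skin_tone"],
                 "p-value": fit.pvalues["skin_tone"]})

multiverse = pd.DataFrame(rows)
print(multiverse.to_string(index=False))
# The dispersion across rows is the sensitivity analysis: if the conclusion
# flips depending on the specification, it is not robust to analyst choices.
```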