See also Pedagogy, LLM writing styles.
General overview:
Imundo, M. N., Watanabe, M., Potter, A. H., Gong, J., Arner, T., & McNamara, D. S. (2024). Expert thinking with generative chatbots. Journal of Applied Research in Memory and Cognition, 13(4), 465–484. doi:10.1037/mac0000199
A good overview of how experts think (in cognitive psychology terms), how experts could use LLMs to support them, and the considerations for novice use. Discusses how offloading low-level tasks to a tool can threaten expertise, since (a) experts become experts through deliberate practice of those low-level skills; (b) expertise involves examining the whole problem, including the low-level details; and (c) fields change over time, so engagement with the details is necessary to stay up-to-date. Gives theoretical arguments that automating tasks typically done by junior staff (e.g., new lawyers tediously reviewing contracts) may prevent them from becoming experts.
Specifically, how LLMs affect students learning to write. For effects on writing style, see LLM writing styles.
Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology, 56(2), 489–530. doi:10.1111/bjet.13544
An experimental study of ESL students completing a writing task in several experimental conditions, such as with ChatGPT support, with a human tutor, or with a writing checklist. The task focused on revision, so the students worked to revise their writing to meet a rubric. Their actions (edits, clicks, chat interactions, etc.) were collected. The AI-assisted students produced better writing, but possibly because they were copying its generated text to cater to the scoring rubric. The students using AI spent much less time evaluating their writing than those using a checklist (which forced them to evaluate each item), making the authors worry that using ChatGPT might “optimise performance at the expense of developing genuine human skills”. They call this “metacognitive laziness”: the students outsource the metacognitive skills (of evaluating their work and determining how to improve it) to a tool, and show no sign of learning those skills.
Mahapatra, S. (2024). Impact of ChatGPT on ESL students’ academic writing skills: A mixed methods intervention study. Smart Learning Environments, 11. doi:10.1186/s40561-024-00295-9
An experimental study using ChatGPT to help ESL students learn writing. Students were trained to use it to get feedback on content, organization, and style. Results showed higher scores (on a writing rubric) for the ChatGPT group in post-tests. Interviews with students showed they found the feedback helpful. However, the writing tasks were 150-word paragraphs, and the paper is vague about whether the post-tests were conducted with or without ChatGPT access, so it’s unclear whether the students learned to write without ChatGPT’s aid or whether graders simply prefer ChatGPT-mediated output, and whether the results generalize to longer forms of writing.
Oppenheimer, D. M., Cash, T. N., & Connell Pensky, A. E. (2025). You’ve got AI friend in me: LLMs as collaborative learning partners. doi:10.31219/osf.io/8q67u_v2
A similar experimental intervention in a large (154 students) intro course. Students used ChatGPT to help improve their argumentative essays, turned in their drafts and chat transcripts, and provided self-reflections on the revision process. They wrote five essays throughout the course, and performance on the final essay was evaluated before they got LLM feedback, to test whether they had improved. Found significant improvements in writing. But there was no control group, so it’s unclear how this improvement compares to the progress students would have made without AI assistance.
Savelka, J., Agarwal, A., An, M., Bogart, C., & Sakr, M. (2023). Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research (Vol. 1, pp. 78–92). doi:10.1145/3568813.3600142
Evaluates GPT-4 (not 4o or any of the later, more coding-focused models) on undergrad Python course exercises. (These are online courses for certifications, not advanced CS courses.) Finds GPT-4 can likely earn a passing grade. I suspect this would be true of newer LLMs and our own undergraduate statistical computing course.
McDanel, B., & Novak, E. (2025). Designing LLM-resistant programming assignments: Insights and strategies for CS educators. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 756–762). doi:10.1145/3641554.3701872
Tests GPT-4o and Claude 3.5 Sonnet on SIGCSE’s Nifty Assignments, which are basically papers presenting clever CS homework assignments and projects. Claude does particularly well, succeeding at most assignments, though a few assignments were consistently hard for the LLMs. Figure 4 summarizes what LLMs were bad at: anything involving visual output (despite being multimodal, LLMs are not very good at interpreting images), tasks requiring sequences of many steps, and, oddly, very detailed and specific assignment prompts: “Assignments that limit the solution space by giving detailed, clear, and explicit instructions are challenging for LLMs. Open-ended and greenfield projects are easier.”
Prather, J., Reeves, B. N., Denny, P., Becker, B. A., Leinonen, J., Luxton-Reilly, A., Powell, G., Finnie-Ansley, J., & Santos, E. A. (2023). “It’s weird that it knows what I want”: Usability and interactions with Copilot for novice programmers. ACM Transactions on Computer-Human Interaction, 31(1), 1–31. doi:10.1145/3617367
Prather, J., Reeves, B. N., Leinonen, J., MacNeil, S., Randrianasolo, A. S., Becker, B. A., Kimmel, B., Wright, J., & Briggs, B. (2024). The widening gap: The benefits and harms of generative AI for novice programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research (Vol. 1, pp. 469–486). doi:10.1145/3632620.3671116
Fascinating pair of think-aloud interview studies with novice programmers using GitHub Copilot and ChatGPT to do a simple programming task. In the first, they observe students “shepherding” Copilot toward solutions, but sometimes “drifting” when Copilot generated “large blocks of unhelpful code” that they had to read and ultimately reject (or accept and then debug).
In the second, half the subjects, who knew what they wanted and had a rough idea of strategy, did quite well. The other half, however, struggled greatly: they were derailed by constant interruptions from Copilot offering suggestions, sidetracked by suggestions that actually solved a different problem, got stuck in misunderstandings of the task and couldn’t get un-stuck, and so on. Most ultimately finished the task, but slowly. They conclude: “From the evidence presented above, it appears that most of these ten who struggled thought they understood more than they actually did. The patterns of behavior above describe how participants were often led along by GenAI such that each step was able to be rationalized as understanding, making it even more difficult for participants to assess their own learning.” They caution that studies in which novice programmers self-report finding AI useful may be misleading, because their own interviews show students being derailed by it and then claiming it was helpful, perhaps because the students lack the metacognitive skills to recognize that the result was wrong.
Thorgeirsson, S., Ewen, T., & Su, Z. (2025). What can computer science educators learn from the failures of top-down pedagogy? In Proceedings of the 56th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 1127–1133). doi:10.1145/3641554.3701873
Creates a dichotomy of teaching approaches: bottom-up approaches start with the basic core skills and work up to integrate them, while top-down approaches start with the overall goal and work downward. Phonics, in which children learn to read by practicing sounding out individual words and then progressing to whole sentences, is bottom-up; whole-language reading, where students start reading entire sentences early and guess how to read unfamiliar words from context, is top-down. Whole-language reading was trendy until recently, but it turns out that all the empirical evidence suggests it doesn’t work. The authors worry that “approaches that rely heavily on large language models (LLMs) or ‘prompt programming’ run the risk of being the computer science equivalent of whole language, focusing heavily on end results without understanding the underlying mechanics”. They review several studies, including some cited above, suggesting that top-down LLM-based approaches may leave students with weak metacognitive skills. But, they note, the other part of education is engagement and motivation: whole-language reading is much more fun for teachers, because they get to read stories instead of sounding out words; and LLM-based education might be more fun because of the fancy new technology. And you can’t pick a pedagogy without considering whether it will motivate students and teachers.
O’Brien, G. (2025). How scientists use large language models to program. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3706598.3713668
Survey and interview study of researchers at the University of Michigan who use LLMs (either chatbots or Copilot-style editor plugins) to write code for their research. Finds that a common use case is working with code or libraries the researcher doesn’t understand, such as when they work in a different language they’re unfamiliar with or work with code written by someone else. Most of the researchers don’t have a formal way to check if the generated code is correct, mainly just running it and seeing if the output looks right; during one interview, as the subject showed a chat log of their apparently successful use of ChatGPT, the interviewer noticed a bug they had missed through lack of testing and verification. The paper gives several other examples of missed errors in LLM code suggestions.
Several of the researchers also expressed misconceptions about ChatGPT, such as believing that it Googles the answer to their questions (at a time when ChatGPT did not have web access) or thinking it performs mathematical and logical calculations.
The paper doesn’t make the connection to critical thinking and metacognition, but the results suggest even expert users are not critical enough of AI output, and do not have effective strategies for testing and verifying their code.
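As an illustration of what even lightweight testing and verification could look like here (my own sketch, not from the paper): a single known-answer unit test, rather than eyeballing the output, is enough to catch a plausible bug in a hypothetical LLM-generated statistics helper.

```python
# My own illustrative sketch, not code from the paper: checking a hypothetical
# LLM-generated helper against a known answer instead of eyeballing its output.
import math

def sample_std(xs):
    """Hypothetical LLM-generated code: looks plausible, but divides by n
    instead of n - 1, so it actually computes the population standard deviation."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def test_sample_std():
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    # The sample standard deviation (n - 1 denominator) of this data is about 2.14,
    # so this assertion fails and exposes the bug; "the output looks right" would not.
    assert abs(sample_std(data) - 2.14) < 0.01, "wrong denominator in sample_std"

if __name__ == "__main__":
    test_sample_std()
```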
Besides how students think and learn using AI, it’s important to consider broader questions about how LLMs and AI will affect social skills, society, and work more generally. Also, the question of whether automation augments or replaces human skills is not new, so there are useful historical parallels.
Hou, I., Man, O., Hamilton, K., Muthusekaran, S., Johnykutty, J., Zadeh, L., & MacNeil, S. (2025). “All roads lead to ChatGPT”: How generative AI is eroding social interactions and student learning communities. In Proceedings of the 2025 Innovation and Technology in Computer Science Education Conference. https://arxiv.org/abs/2504.09779
Through interviews with CS undergrads, notes a tendency for students to seek help from ChatGPT rather than peers, which some interviewees suggest reduces their social interactions, makes them feel more isolated, and reduces opportunities for peers to be mentors or share experiences, ultimately reducing their sense of belonging in the field.
[To read] Thiele, L. P. (2025). Human Agency, Artificial Intelligence, and the Attention Economy: The Case for Digital Distancing. Springer. doi:10.1007/978-3-031-82086-1
I’ve only read the chapter on deskilling, which argues that historical parallels (e.g., the shift from artisan work to factory labor) suggest that AI will not simply augment our skills, making us more powerful; it will replace our skills, making us dependent much like the people in Wall-E.
Simkute, A., Tankelevitch, L., Kewenig, V., Scott, A. E., Sellen, A., & Rintel, S. (2025). Ironies of generative AI: Understanding and mitigating productivity loss in human-AI interaction. International Journal of Human–Computer Interaction, 41(5), 2898–2919. doi:10.1080/10447318.2024.2405782
A paper by human-computer interaction researchers who note that some of the problems in the studies above (for example, that using Copilot to help write code creates new problems, like dealing with its interruptions and incorrect suggestions) have been well known for years in human-computer interaction. Autopilots, industrial automation, and many other systems have run into similar problems. Reviews a number of common themes from past automation research, gives examples of how each has appeared both in past automation and in generative AI research, then discusses how the field of HCI offers approaches to better design generative AI tools to avoid these issues. Basically, “AI designers should remember that our field exists instead of repeating every mistake from scratch.”