See also Pedagogy, LLM writing styles.
General overview:
Imundo, M. N., Watanabe, M., Potter, A. H., Gong, J., Arner, T., & McNamara, D. S. (2024). Expert thinking with generative chatbots. Journal of Applied Research in Memory and Cognition, 13(4), 465–484. doi:10.1037/mac0000199
A good overview of how experts think (in cognitive psychology terms), how experts could use LLMs to support them, and the considerations for novice use. Discusses how offloading low-level tasks to a tool can threaten expertise, since (a) experts become experts through deliberate practice of those low-level skills; (b) expertise involves examining the whole problem, including the low-level details; and (c) fields change over time, so engagement with the details is necessary to stay up-to-date. Gives theoretical arguments that automating tasks typically done by junior staff (e.g., new lawyers tediously reviewing contracts) may prevent them from becoming experts.
Specifically, how LLMs affect students learning to write. For effects on writing style, see LLM writing styles.
Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology, 56(2), 489–530. doi:10.1111/bjet.13544
An experimental study of ESL students completing a writing task in several experimental conditions, such as with ChatGPT support, with a human tutor, or with a writing checklist. The task focused on revision, so the students worked to revise their writing to meet a rubric. Their actions (edits, clicks, chat interactions, etc.) were collected. The AI-assisted students produced better writing, but possibly because they were copying its generated text to cater to the scoring rubric. The students using AI spent much less time evaluating their writing than those using a checklist (which forced them to evaluate each item), making the authors worry that using ChatGPT might “optimise performance at the expense of developing genuine human skills”. They call this “metacognitive laziness”: the students outsource the metacognitive skills (of evaluating their work and determining how to improve it) to a tool, and show no sign of learning those skills.
Mahapatra, S. (2024). Impact of ChatGPT on ESL students’ academic writing skills: A mixed methods intervention study. Smart Learning Environments, 11. doi:10.1186/s40561-024-00295-9
An experimental study using ChatGPT to help ESL students learn writing. Students were trained to use it to get feedback on content, organization, and style. Results showed higher scores (on a writing rubric) for the ChatGPT group in post-tests. Interviews with students showed they found the feedback helpful. However, the writing tasks were 150-word paragraphs, and the paper is vague about whether the post-tests were conducted with or without ChatGPT access, so it’s unclear whether the students learned to write without ChatGPT’s aid or whether graders simply prefer ChatGPT-mediated output, and whether the results generalize to longer forms of writing.
Oppenheimer, D. M., Cash, T. N., & Connell Pensky, A. E. (2025). You’ve got AI friend in me: LLMs as collaborative learning partners. doi:10.31219/osf.io/8q67u_v2
A similar experimental intervention in a large (154 students) intro course. Students used ChatGPT to help improve their argumentative essays, turned in their drafts and chat transcripts, and provided self-reflections on the revision process. They wrote five essays throughout the course, and performance on the final essay was evaluated before they got LLM feedback, to test whether they had improved. Found significant improvements in writing. But there was no control group, so it’s unclear how this improvement compares to the progress students would have made without AI assistance.
Savelka, J., Agarwal, A., An, M., Bogart, C., & Sakr, M. (2023). Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. In Proceedings of the 2023 ACM Conference on International Computing Education Research (Vol. 1, pp. 78–92). doi:10.1145/3568813.3600142
Evaluates GPT-4 (not 4o or any of the later, more coding-focused models) on undergrad Python course exercises. (These are online courses for certifications, not advanced CS courses.) Finds GPT-4 can likely earn a passing grade. I suspect this would be true of newer LLMs and our own undergraduate statistical computing course.
McDanel, B., & Novak, E. (2025). Designing LLM-resistant programming assignments: Insights and strategies for CS educators. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 756–762). doi:10.1145/3641554.3701872
Tests GPT-4o and Claude 3.5 Sonnet on SIGCSE’s Nifty Assignments, which are basically papers presenting clever CS homework assignments and projects. Claude does particularly well, succeeding at most assignments, though a few assignments were consistently hard for the LLMs. Figure 4 summarizes what LLMs were bad at: anything involving visual output (despite being multimodal, LLMs are not very good at interpreting images), tasks requiring sequences of many steps, and, oddly, very detailed and specific assignment prompts: “Assignments that limit the solution space by giving detailed, clear, and explicit instructions are challenging for LLMs. Open-ended and greenfield projects are easier.”
Prather, J., Reeves, B. N., Denny, P., Becker, B. A., Leinonen, J., Luxton-Reilly, A., Powell, G., Finnie-Ansley, J., & Santos, E. A. (2023). “It’s weird that it knows what I want”: Usability and interactions with Copilot for novice programmers. ACM Transactions on Computer-Human Interaction, 31(1), 1–31. doi:10.1145/3617367
Prather, J., Reeves, B. N., Leinonen, J., MacNeil, S., Randrianasolo, A. S., Becker, B. A., Kimmel, B., Wright, J., & Briggs, B. (2024). The widening gap: The benefits and harms of generative AI for novice programmers. In Proceedings of the 2024 ACM Conference on International Computing Education Research (Vol. 1, pp. 469–486). doi:10.1145/3632620.3671116
Fascinating pair of think-aloud interview studies with novice programmers using GitHub Copilot and ChatGPT to do a simple programming task. In the first, they observe students “shepherding” Copilot toward solutions, but sometimes “drifting” when Copilot generated “large blocks of unhelpful code” that they had to read and ultimately reject (or accept and then debug).
In the second, half the subjects, who knew what they wanted and had a rough idea of strategy, did quite well. The other half, however, struggled greatly: they were derailed by constant interruptions from Copilot offering suggestions, sidetracked by suggestions that actually solved a different problem, got stuck in misunderstandings of the task and couldn’t get un-stuck, and so on. Most ultimately finished the task, but slowly. They conclude: “From the evidence presented above, it appears that most of these ten who struggled thought they understood more than they actually did. The patterns of behavior above describe how participants were often led along by GenAI such that each step was able to be rationalized as understanding, making it even more difficult for participants to assess their own learning.” They caution that studies in which novice programmers self-report finding AI useful may be misleading, because their own interviews show students being derailed by it and then claiming it was helpful, perhaps because the students lack the metacognitive skills to recognize that the result was wrong.
Thorgeirsson, S., Ewen, T., & Su, Z. (2025). What can computer science educators learn from the failures of top-down pedagogy? In Proceedings of the 56th ACM Technical Symposium on Computer Science Education (Vol. 1, pp. 1127–1133). doi:10.1145/3641554.3701873
Creates a dichotomy of teaching approaches: bottom-up approaches start with the basic core skills and work up to integrate them, while top-down approaches start with the overall goal and work downward. Phonics, in which children learn to read by practicing sounding out individual words and then progressing to whole sentences, is bottom-up; whole-language reading, where students start reading entire sentences early and guess how to read unfamiliar words from context, is top-down. Whole-language reading was trendy until recently, but it turns out that all the empirical evidence suggests it doesn’t work. The authors worry that “approaches that rely heavily on large language models (LLMs) or ‘prompt programming’ run the risk of being the computer science equivalent of whole language, focusing heavily on end results without understanding the underlying mechanics”. They review several studies, including some cited above, suggesting that top-down LLM-based approaches may leave students with weak metacognitive skills. But, they note, the other part of education is engagement and motivation: whole-language reading is much more fun for teachers, because they get to read stories instead of sounding out words; and LLM-based education might be more fun because of the fancy new technology. And you can’t pick a pedagogy without considering whether it will motivate students and teachers.
O’Brien, G. (2025). How scientists use large language models to program. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. doi:10.1145/3706598.3713668
Survey and interview study of researchers at the University of Michigan who use LLMs (either chatbots or Copilot-style editor plugins) to write code for their research. Finds that a common use case is working with code or libraries the researcher doesn’t understand, such as when they work in a different language they’re unfamiliar with or work with code written by someone else. Most of the researchers don’t have a formal way to check if the generated code is correct, mainly just running it and seeing if the output looks right; during one interview, as the subject showed a chat log of their apparently successful use of ChatGPT, the interviewer noticed a bug they had missed through lack of testing and verification. The paper gives several other examples of missed errors in LLM code suggestions.
Several of the researchers also expressed misconceptions about ChatGPT, such as believing that it Googles the answer to their questions (at a time when ChatGPT did not have web access) or thinking it performs mathematical and logical calculations.
The paper doesn’t make the connection to critical thinking and metacognition, but the results suggest even expert users are not critical enough of AI output, and do not have effective strategies for testing and verifying their code.
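As an illustration of what even lightweight testing and verification could look like here (my own sketch, not from the paper): a single known-answer unit test, rather than eyeballing the output, is enough to catch a plausible bug in a hypothetical LLM-generated statistics helper.

```python
# My own illustrative sketch, not code from the paper: checking a hypothetical
# LLM-generated helper against a known answer instead of eyeballing its output.
import math

def sample_std(xs):
    """Hypothetical LLM-generated code: looks plausible, but divides by n
    instead of n - 1, so it actually computes the population standard deviation."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def test_sample_std():
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    # The sample standard deviation (n - 1 denominator) of this data is about 2.14,
    # so this assertion fails and exposes the bug; "the output looks right" would not.
    assert abs(sample_std(data) - 2.14) < 0.01, "wrong denominator in sample_std"

if __name__ == "__main__":
    test_sample_std()
```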
Besides how students think and learn using AI, it’s important to consider broader questions about how LLMs and AI will affect social skills, society, and work more generally. Also, the question of whether automation augments or replaces human skills is not new, so there are useful historical parallels.
Hou, I., Man, O., Hamilton, K., Muthusekaran, S., Johnykutty, J., Zadeh, L., & MacNeil, S. (2025). “All roads lead to ChatGPT”: How generative AI is eroding social interactions and student learning communities. In Proceedings of the 2025 Innovation and Technology in Computer Science Education Conference. https://arxiv.org/abs/2504.09779
Through interviews with CS undergrads, notes a tendency for students to seek help from ChatGPT rather than peers, which some interviewees suggest reduces their social interactions, makes them feel more isolated, and reduces opportunities for peers to be mentors or share experiences, ultimately reducing their sense of belonging in the field.
[To read] Thiele, L. P. (2025). Human Agency, Artificial Intelligence, and the Attention Economy: The Case for Digital Distancing. Springer. doi:10.1007/978-3-031-82086-1
I’ve only read the chapter on deskilling, which argues that historical parallels (e.g., the shift from artisan work to factory labor) suggest that AI will not simply augment our skills, making us more powerful; it will replace our skills, making us dependent much like the people in Wall-E.
Simkute, A., Tankelevitch, L., Kewenig, V., Scott, A. E., Sellen, A., & Rintel, S. (2025). Ironies of generative AI: Understanding and mitigating productivity loss in human-AI interaction. International Journal of Human–Computer Interaction, 41(5), 2898–2919. doi:10.1080/10447318.2024.2405782
A paper by human-computer interaction researchers who note that some of the problems in the studies above (for example, that using Copilot to help write code creates new problems, like dealing with its interruptions and incorrect suggestions) have been well known for years in human-computer interaction. Autopilots, industrial automation, and many other systems have run into similar problems. Reviews a number of common themes from past automation research, gives examples of how each has appeared both in past automation and in generative AI research, then discusses how the field of HCI offers approaches to better design generative AI tools to avoid these issues. Basically, “AI designers should remember that our field exists instead of repeating every mistake from scratch.”