The Big Picture
As ChatGPT becomes increasingly accessible, teachers are exploring its potential for creating lesson plans. But how reliable is this AI-generated content for foreign language instruction? This study put ChatGPT 4.0 to the test, generating 50 lesson plans across five increasingly specific prompts to measure quality, consistency, and potential pitfalls.
The results reveal a nuanced picture: while ChatGPT can produce generally aligned lesson plans, simply adding more detail to a prompt doesn't guarantee better output. More concerning, the AI's training on historical texts introduced outdated teaching methods into modern lesson plans — methods that research abandoned decades ago.
Three Key Findings
More Detail ≠ Better Output
Adding context and specificity to prompts did not produce a linear increase in lesson plan quality. Including a lesson plan template actually decreased scores in some areas, while adding a scoring rubric checklist yielded the highest results.
High Variability Is Inherent
The same prompt produced dramatically different outputs. Scores from identical prompts varied by five or more points out of 25, meaning some outputs met fewer than 75% of the criteria while others met over 90% — purely by chance.
Historical Biases Surface
ChatGPT repeatedly generated lesson plans reflecting the audio-lingual method — a behaviorist approach from the 1970s involving scripted dialogues and rote repetition — practices that modern research has long moved beyond.
How the Study Worked
The researchers designed five prompts that progressively added layers of specificity, each building on the previous one. Each prompt was entered into ChatGPT 10 times in separate chats, yielding 50 lesson plans total. Every output was scored against a 25-point rubric aligned with the edTPA — the performance-based assessment required for teacher licensure in many U.S. states.
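The replication-and-scoring protocol can be sketched in code. This is a simplified illustration, not the authors' actual pipeline: the criterion names and the keyword check are placeholders standing in for the edTPA-aligned rubric, and the mock outputs stand in for real ChatGPT generations.

```python
# Hypothetical sketch of the study's protocol: each prompt is run 10 times,
# and every output is scored against a 25-point rubric (one point per
# criterion met). Criterion names and the substring check are illustrative.
from statistics import mean

RUBRIC = [f"criterion_{i}" for i in range(1, 26)]  # 25 rubric items

def score(lesson_plan: str, rubric=RUBRIC) -> int:
    """Award one point per rubric criterion the plan addresses."""
    return sum(criterion in lesson_plan for criterion in rubric)

def evaluate_prompt(outputs: list[str]) -> dict:
    """Aggregate the scores produced by repeated runs of one prompt."""
    scores = [score(o) for o in outputs]
    return {"mean": mean(scores), "min": min(scores),
            "max": max(scores), "range": max(scores) - min(scores)}

# Mock outputs standing in for 10 ChatGPT generations of the same prompt:
outputs = [" ".join(RUBRIC[:n]) for n in (18, 22, 17, 23, 20, 19, 24, 16, 21, 20)]
print(evaluate_prompt(outputs))  # mean 20.0, but a min-max spread of 8 points
```

The `range` field is the quantity behind the variability finding: identical prompts produced spreads of five or more points.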
Basic Prompt
A general lesson plan request specifying grade level, language, proficiency level, class size, and three lesson objectives about Costa Rican restaurants.
+ Proficiency Guidelines
Added a directive to define the novice level of proficiency based on the ACTFL Proficiency Guidelines.
+ Lesson Plan Template
Added a specific lesson plan format (the template used in pre-service teacher training) for ChatGPT to follow.
+ World-Readiness Standards
Added the condition that the lesson plan should address ACTFL World-Readiness Standards for language learning.
+ Scoring Rubric Checklist
Added a detailed checklist of all 25 components the lesson plan should include — the most specific prompt in the study.
Quality Scores by Prompt
Each lesson plan was scored on 25 criteria. The relationship between prompt specificity and quality was not linear — the template in P.3 actually reduced scores, while the rubric checklist in P.5 produced the highest average.
Key Insight: When the lesson plan template was introduced in P.3, scores for the warm-up and teacher input portions dropped sharply. The template's "hook" terminology was interpreted by ChatGPT as a teacher-centered activity, producing outputs that lacked student interaction — despite the template explicitly instructing otherwise.
The Variability Problem
One of the most striking findings was the wide range of scores produced from identical prompts. This variability underscores that AI outputs are inherently non-deterministic — even when using the same prompt, teachers may get dramatically different quality levels from one attempt to the next.
Where ChatGPT Excels — and Falls Short
Not all lesson plan components were treated equally. Some elements appeared in nearly every output regardless of prompt, while others were almost entirely absent without explicit instruction. This heatmap shows how frequently key rubric categories were met across each prompt group.
The culture gap: Cultural connections — helping students relate their own perspectives to those of the target culture — appeared in only 5 of the 50 lesson plans for the first four prompts. Even when the rubric was provided in P.5, this component appeared only 50% of the time. This mirrors a longstanding challenge in language education where culture is treated as secondary to linguistic skills.
The Historical Bias Problem
Perhaps the most significant finding was that ChatGPT's outputs frequently reflected outdated pedagogical methods. Because the model was trained on vast collections of text spanning decades, it absorbed associations with teaching practices that are no longer considered best practice.
What ChatGPT Produced
Lesson plans featuring scripted dialogues, rote repetition, and flashcard-based pronunciation drills — hallmarks of the audio-lingual method popular in the 1970s. In some iterations, students were directed to rehearse and repeat pre-written role-play scripts.
What Research Recommends
Modern communicative approaches emphasize authentic interaction, meaningful context, and student-generated language. The focus has shifted from memorization to real communication in culturally grounded situations.
This finding has broad implications beyond language education. Any field where practice has evolved significantly over time — medicine, science education, counseling — may encounter similar biases when using AI trained on historical texts to generate current instructional content.
What This Means for Teachers
Include Scoring Criteria
Adding a detailed checklist of what the lesson plan should contain (P.5) yielded the highest and most consistent scores. Tell the AI exactly what components to include.
Always Review for Bias
AI literacy is essential. Critically evaluate generated content for outdated methods, especially scripted dialogues, rote memorization, and activities lacking cultural connection.
Don't Assume More = Better
Simply adding a lesson plan template reduced quality in some areas. Specialized terminology in templates may be misinterpreted by the AI if not fully explained.
Generate Multiple Outputs
Given the inherent variability, one output is never sufficient. Generate several versions and select the best elements from each, then refine through follow-up prompting.
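The "generate multiple outputs" advice amounts to best-of-k selection, which can be sketched as below. The `generate` callable is a stand-in for an actual ChatGPT request, and the checklist items are illustrative examples, not the study's rubric; a teacher would still review the winning draft by hand and merge elements from the others.

```python
# Best-of-k selection sketch: request several drafts of the same prompt and
# keep the one covering the most checklist items. `generate` is a stand-in
# for a real ChatGPT call; checklist items are illustrative placeholders.
CHECKLIST = ["objectives", "warm-up", "teacher input", "guided practice",
             "student interaction", "cultural connection", "assessment"]

def checklist_score(plan: str) -> int:
    """Count how many checklist items the draft mentions."""
    return sum(item in plan.lower() for item in CHECKLIST)

def best_of_k(generate, prompt: str, k: int = 5) -> str:
    """Generate k drafts and return the one covering the most items."""
    drafts = [generate(prompt) for _ in range(k)]
    return max(drafts, key=checklist_score)

# Mock generator cycling through drafts of varying completeness:
drafts = iter(["Objectives and warm-up only.",
               "Objectives, warm-up, teacher input, guided practice.",
               "Objectives, warm-up, teacher input, guided practice, "
               "student interaction, cultural connection, assessment."])
best = best_of_k(lambda p: next(drafts), "Novice Spanish lesson plan", k=3)
print(checklist_score(best))  # → 7: the most complete draft wins
```

Pairing this with an explicit checklist in the prompt itself mirrors the study's highest-scoring condition (P.5).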
Dornburg, A. & Davin, K. J. (2025). ChatGPT in foreign language lesson plan creation: Trends, variability, and historical biases. ReCALL, 37(3), 332–347.