Research Spotlight

ChatGPT in Foreign Language Lesson Plan Creation

How well does AI generate lesson plans for language teachers? A study of prompt specificity, output variability, and the historical biases embedded in AI-generated instructional content.


The Big Picture

As ChatGPT becomes increasingly accessible, teachers are exploring its potential for creating lesson plans. But how reliable is this AI-generated content for foreign language instruction? This study put ChatGPT 4.0 to the test, generating 50 lesson plans across five increasingly specific prompts to measure quality, consistency, and potential pitfalls.

The results reveal a nuanced picture: while ChatGPT can produce generally aligned lesson plans, simply adding more detail to a prompt doesn't guarantee better output. More concerning, the AI's training on historical texts introduced outdated teaching methods into modern lesson plans — methods that research abandoned decades ago.

Three Key Findings

01

More Detail ≠ Better Output

Adding context and specificity to prompts did not produce a linear increase in lesson plan quality. Including a lesson plan template actually decreased scores in some areas, while adding a scoring rubric checklist yielded the highest results.

02

High Variability Is Inherent

The same prompt produced dramatically different outputs. Scores from identical prompts varied by 5 or more points out of 25, meaning some outputs missed over a quarter of the rubric criteria while others satisfied more than 90%, purely by chance.

03

Historical Biases Surface

ChatGPT repeatedly generated lesson plans reflecting the audio-lingual method — a behaviorist approach from the 1970s involving scripted dialogues and rote repetition — practices that modern research has long moved beyond.

How the Study Worked

The researchers designed five prompts that progressively added layers of specificity, each building on the previous one. Each prompt was entered into ChatGPT 10 times in separate chats, yielding 50 lesson plans total. Every output was scored against a 25-point rubric aligned with the edTPA — the performance-based assessment required for teacher licensure in many U.S. states.
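A minimal sketch of what this protocol could look like in code is shown below, assuming the OpenAI Python client. The model name, prompt text, and keyword-based scorer are illustrative stand-ins; the study scored each plan by hand against the edTPA-aligned rubric.

```python
# Illustrative replication sketch: 5 prompt levels x 10 fresh chats each.
# Assumes the OpenAI Python client; the model name, prompt texts, and the
# keyword-based scorer are stand-ins, since the study scored plans by hand.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPTS = {
    "P1": "Create a lesson plan for ...",  # hypothetical P1-P5 texts
    # "P2": P1 text + ACTFL proficiency guidelines, and so on
}

RUBRIC_TERMS = ["objective", "warm-up", "closure", "culture"]  # crude proxy

def score(plan: str) -> int:
    """Keyword stand-in for the 25-point edTPA-aligned rubric."""
    return sum(term in plan.lower() for term in RUBRIC_TERMS)

results = {}
for label, prompt in PROMPTS.items():
    scores = []
    for _ in range(10):  # a separate chat per run, so no shared context
        reply = client.chat.completions.create(
            model="gpt-4",  # assumption; the paper used ChatGPT 4.0
            messages=[{"role": "user", "content": prompt}],
        )
        scores.append(score(reply.choices[0].message.content))
    results[label] = scores
```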

P1

Basic Prompt

A general lesson plan request specifying grade level, language, proficiency level, class size, and three lesson objectives about Costa Rican restaurants.

P2

+ Proficiency Guidelines

Added a directive to define the novice level of proficiency based on the ACTFL Proficiency Guidelines.

P3

+ Lesson Plan Template

Added a specific lesson plan format (the template used in pre-service teacher training) for ChatGPT to follow.

P4

+ World-Readiness Standards

Added the condition that the lesson plan should address ACTFL World-Readiness Standards for language learning.

P5

+ Scoring Rubric Checklist

Added a detailed checklist of all 25 components the lesson plan should include — the most specific prompt in the study.
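Because each prompt strictly extends the previous one, the five levels can be composed cumulatively. The fragment below is a hypothetical reconstruction; the base text and each layer paraphrase the study's descriptions, not its actual prompt wording.

```python
# Hypothetical reconstruction of the cumulative P1-P5 prompt structure;
# the wording paraphrases the study's description, not its actual prompts.
BASE = (
    "Create a lesson plan for a class of novice-level high school "
    "language students, with three objectives about Costa Rican "
    "restaurants."
)

LAYERS = [
    "Define the novice level using the ACTFL Proficiency Guidelines.",
    "Follow this lesson plan template: <template text>.",
    "Address the ACTFL World-Readiness Standards for language learning.",
    "Include all 25 components on this checklist: <checklist text>.",
]

def build_prompt(level: int) -> str:
    """Return the prompt for P1 (level=1) through P5 (level=5)."""
    return "\n".join([BASE] + LAYERS[: level - 1])

print(build_prompt(3))  # P3 = base + proficiency guidelines + template
```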

Quality Scores by Prompt

Each lesson plan was scored on 25 criteria. The relationship between prompt specificity and quality was not linear — the template in P.3 actually reduced scores, while the rubric checklist in P.5 produced the highest average.

Average Score by Prompt Level (out of 25 possible points; 25 = target)

P.1 Basic prompt                  20.1
P.2 + Proficiency Guidelines      21.2
P.3 + Lesson Plan Template        19.8
P.4 + World-Readiness Standards   19.6
P.5 + Scoring Rubric Checklist    21.6

Key Insight: When the lesson plan template was introduced in P.3, scores for the warm-up and teacher input portions dropped sharply. The template's "hook" terminology was interpreted by ChatGPT as a teacher-centered activity, producing outputs that lacked student interaction — despite the template explicitly instructing otherwise.

The Variability Problem

One of the most striking findings was the wide range of scores produced from identical prompts. This variability underscores that AI outputs are inherently non-deterministic — even when using the same prompt, teachers may get dramatically different quality levels from one attempt to the next.

Score Range by Prompt
Each dot represents one of the 10 outputs. The large dot is the mean score. Bars show the full range.
[Dot plot omitted: per-output scores and means for P.1 through P.5, plotted on a 15–25 point axis.]
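One way to make that spread concrete is to summarize repeated runs, as in the sketch below. The example scores are invented for illustration; only their mean is chosen to match the reported P.3 average.

```python
# Summarize the spread of rubric scores from repeated runs of one prompt.
# The example scores are invented; only their mean matches the reported
# P.3 average (19.8/25).
from statistics import mean

def summarize(scores: list[int], total: int = 25) -> str:
    lo, hi = min(scores), max(scores)
    return f"mean {mean(scores):.1f}/{total}, range {lo}-{hi} ({hi - lo} pts)"

p3 = [17, 18, 19, 20, 20, 20, 21, 21, 21, 21]  # made-up run scores
print("P.3:", summarize(p3))  # -> P.3: mean 19.8/25, range 17-21 (4 pts)
```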

Where ChatGPT Excels — and Falls Short

Not all lesson plan components were treated equally. Some elements appeared in nearly every output regardless of prompt, while others were almost entirely absent without explicit instruction. This heatmap shows how frequently key rubric categories were met across each prompt group.

Component Presence by Prompt Level
Frequency out of 10 outputs meeting each criterion at each prompt level.

Component                            P.1  P.2  P.3  P.4  P.5
Meaningful context                    10   10   10   10   10
Lesson objectives                     10   10   10   10   10
Age-appropriate activities            10   10   10   10   10
Teacher input aligns w/ objectives    10   10   10   10   10
Student engagement/interaction         8    8    5    5    8
Warm-up activates prior knowledge      8    7    3    3    5
Closure checks objectives              7    7    6    6    8
Cultural connections                   1    1    2    2    5
ACTFL standards integrated             1    2    1    3    5
Interpersonal communication            6    5    4    4    6

The culture gap: Cultural connections (helping students relate their own perspectives to those of the target culture) appeared in only 6 of the 40 lesson plans generated by the first four prompts. Even when the rubric was provided in P.5, this component appeared only 50% of the time. This mirrors a longstanding challenge in language education where culture is treated as secondary to linguistic skills.
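For teachers auditing their own outputs, the same frequency check can be scripted. The sketch below reuses a few rows of counts from the table above and flags components that never appeared in more than half of the ten outputs at any prompt level.

```python
# Flag rubric components that never appeared in more than half of the
# 10 outputs at any prompt level (counts taken from the table above).
COUNTS = {
    "Meaningful context":                [10, 10, 10, 10, 10],
    "Warm-up activates prior knowledge": [8, 7, 3, 3, 5],
    "Cultural connections":              [1, 1, 2, 2, 5],
    "ACTFL standards integrated":        [1, 2, 1, 3, 5],
}

for component, freqs in COUNTS.items():
    if max(freqs) <= 5:
        print(f"needs explicit prompting: {component} {freqs}")
# -> flags 'Cultural connections' and 'ACTFL standards integrated'
```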

The Historical Bias Problem

Perhaps the most significant finding was that ChatGPT's outputs frequently reflected outdated pedagogical methods. Because the model was trained on vast collections of text spanning decades, it absorbed associations with teaching practices that are no longer considered best practice.

What ChatGPT Produced

Lesson plans featuring scripted dialogues, rote repetition, and flashcard-based pronunciation drills — hallmarks of the audio-lingual method popular in the 1970s. In some iterations, students were directed to rehearse and recite pre-written role-play scripts.

What Research Recommends

Modern communicative approaches emphasize authentic interaction, meaningful context, and student-generated language. The focus has shifted from memorization to real communication in culturally grounded situations.

This finding has broad implications beyond language education. Any field where practice has evolved significantly over time — medicine, science education, counseling — may encounter similar biases when using AI trained on historical texts to generate current instructional content.

What This Means for Teachers

Include Scoring Criteria

Adding a detailed checklist of what the lesson plan should contain (P.5) yielded the highest and most consistent scores. Tell the AI exactly what components to include.

Always Review for Bias

AI literacy is essential. Critically evaluate generated content for outdated methods, especially scripted dialogues, rote memorization, and activities lacking cultural connection.

Don't Assume More = Better

Simply adding a lesson plan template reduced quality in some areas. Specialized terminology in templates may be misinterpreted by the AI if not fully explained.

Generate Multiple Outputs

Given the inherent variability, one output is never sufficient. Generate several versions and select the best elements from each, then refine through follow-up prompting.
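A minimal sketch of that best-of-several workflow appears below, again assuming the OpenAI Python client. The checklist terms and the coverage heuristic are illustrative stand-ins for a teacher's own judgment, not a method from the study.

```python
# Best-of-n workflow: draft several plans, keep the one covering the most
# checklist terms, then refine it in a follow-up turn. The model name and
# checklist are assumptions; coverage() is a crude proxy for human review.
from openai import OpenAI

client = OpenAI()
CHECKLIST = ["warm-up", "objective", "closure", "culture", "interpersonal"]

def coverage(plan: str) -> int:
    return sum(term in plan.lower() for term in CHECKLIST)

def draft(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

prompt = "Create a novice-level lesson plan ..."  # your full prompt here
candidates = [draft(prompt) for _ in range(3)]
best = max(candidates, key=coverage)  # then review the drafts yourself
refined = draft(
    "Revise this lesson plan to strengthen cultural connections and "
    "student interaction:\n\n" + best
)
```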

Dornburg, A. & Davin, K. J. (2025). ChatGPT in foreign language lesson plan creation: Trends, variability, and historical biases. ReCALL, 37(3), 332–347.
