AI Evaluation: Report

MagicSchool misses the mark on grade-level text up to 65% of the time

February 2026

Rate of Inappropriate Grade Level Text Generation by Topic — Science: 65% inappropriate, Social Studies: 13.3% inappropriate

AI in education is no longer theoretical; it is an extremely prevalent¹ part of students' and teachers' lives. One of the most popular AI platforms for education is MagicSchool, which boasts millions of users and more than 10,000 schools around the world. Until now, most discussions of MagicSchool have highlighted its rapid adoption and have notably lacked scientific evaluation of the quality of its capabilities. Our new analysis fills this gap by applying a Learning Commons evaluator, developed in partnership with experts in learning science and pedagogy, to MagicSchool's text generation abilities.

Key Findings

Science text is inappropriate for the target grade level more than half the time
MagicSchool's Informational Texts tool generates text for Science class that is inappropriate for the provided grade level more than half (65%) of the time.

Grade level appropriateness varies significantly by subject
MagicSchool's grade level appropriateness varies by subject — the platform performs significantly better at generating text for Social Studies than it does for Science.

About one third of generated text falls outside the target grade level
Across those two subjects, MagicSchool generated text outside of the target grade level about one third (34%) of the time.
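The 34% figure follows from the two per-subject rates once they are weighted by sample size. The sketch below reproduces that arithmetic using the sample sizes stated in the text generation section (40 Science topics, 60 Social Studies topics). The per-subject counts are back-calculated from the reported percentages and are an assumption, not figures taken from the results file.

```python
# Sample sizes and inappropriate rates as reported in this document.
samples = {"Science": 40, "Social Studies": 60}
inappropriate_rate = {"Science": 0.65, "Social Studies": 0.133}

# Implied counts (assumption: rounded, because 13.3% is itself a rounded figure).
counts = {
    subject: round(samples[subject] * inappropriate_rate[subject])
    for subject in samples
}

total_samples = sum(samples.values())
overall_rate = sum(counts.values()) / total_samples

print(counts)
print(f"Overall inappropriate rate: {overall_rate:.0%}")
```

Because Social Studies contributes more samples, the overall rate sits closer to its lower figure than a simple average of the two percentages would suggest.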

Methodology

This assessment intentionally focuses on 3rd grade reading tasks. While these findings should not be over-generalized to all grades or subjects, they offer a high-signal stress test at a critical literacy inflection point.

Examples

To illustrate what "inappropriate for the target grade level" looks like in practice, below are two passages MagicSchool generated for 3rd graders that were evaluated at the 6th to 8th grade reading level.

Patterns in Motion

Expected: 3rd Grade → Actual: 6th–8th Grade

What are patterns in motion? Patterns in motion happen when an object moves the same way again and again, like a swing that goes back and forth, a toy car that rolls the same distance after each push, or the seasons that repeat each year; by watching these repeats, scientists can often predict what will happen next. How do forces help create patterns? Forces — pushes and pulls — make things start, stop, or change direction, and when the same forces act the same way each time (for example, gravity pulling a ball down or a constant push on a skateboard), the motion often follows a pattern. Why do patterns matter? Seeing patterns in motion helps us plan and solve problems: engineers use them to design safer playgrounds and teachers use them to explain how swings and slides work; observing carefully and measuring how things move lets us use past motion to guess future motion.

There is a profound mismatch between the simplicity of the concepts (appropriate for grades 2–3) and the extreme complexity of the sentence structure. The text is unreadable for 3rd graders without teacher support. Sentences average over 41 words and use semicolons, colons, and parentheses that most students would not encounter until middle school.

Full evaluator analysis

Quantitative Analysis: The text has a word count of 125. The Flesch-Kincaid Grade Level calculates to approximately 18.1. This extremely high score is a direct result of the text's sentence structure, which consists of only three sentences with an average length of over 41 words.

Qualitative Analysis: The text's purpose (explaining a science concept) is clear and its overall Q&A structure is simple. It uses familiar examples (swings, toy cars) and requires only some discipline-specific knowledge (force, gravity), making its purpose and knowledge demands Moderately Complex at most. However, the Language Features are Exceedingly Complex. The sentences are exceptionally long and convoluted, using advanced punctuation (semicolons, colons, parentheses) that makes the text very difficult to parse for most readers.

Background Knowledge: The scientific concepts presented — patterns in motion, pushes and pulls (forces), and using patterns to predict future motion — align directly with the science curriculum for grades 2–3.

Synthesis: There is a profound mismatch between the simplicity of the concepts (appropriate for grades 2–3) and the extreme complexity of the sentence structure (which the Flesch-Kincaid score places at a post-secondary level). For independent reading, a student would need to be able to decode the complex syntax, which is unlikely before middle school. The text can only be used effectively as a read-aloud where the teacher breaks down the sentences and concepts.
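To make the quantitative step concrete, here is a minimal sketch of the standard Flesch-Kincaid Grade Level formula applied to the figures the evaluator reported for this passage (125 words, 3 sentences, a score of about 18.1). The evaluator output does not report a syllable count, so the sketch back-solves the syllables-per-word rate that the reported score implies; that rate is a derived value, not a measurement. The function names are illustrative.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: float) -> float:
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def implied_syllables_per_word(grade: float, words: int, sentences: int) -> float:
    """Back-solve the syllables-per-word rate that a reported score implies."""
    return (grade + 15.59 - 0.39 * (words / sentences)) / 11.8


# Figures reported by the evaluator for "Patterns in Motion".
words, sentences, reported_grade = 125, 3, 18.1

# Contribution of average sentence length alone, in grade levels.
sentence_term = 0.39 * (words / sentences)

# Syllable rate implied by the reported score (derived, not measured).
syllables_per_word = implied_syllables_per_word(reported_grade, words, sentences)

print(f"Sentence-length term: {sentence_term:.2f}")
print(f"Implied syllables per word: {syllables_per_word:.2f}")
```

The sentence-length term accounts for most of the score on its own, which is consistent with the evaluator's remark that the high grade level is a direct result of sentence structure rather than vocabulary.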

Variation within a Species

Expected: 3rd Grade → Actual: 6th–8th Grade

Animals and plants that belong to the same species can look or act a little differently from one another because of variation; for example, some ladybugs have more spots than others, some maple trees have darker leaves, and some children inherit curly hair while others have straight hair — these differences come from traits passed down from parents and from the environment where an organism lives, and they help a species survive because certain traits can make it easier to find food, hide from predators, or live in a particular place.

The entire passage is a single 95-word sentence. While the concepts and vocabulary are accessible to upper elementary students, the compound-complex syntax (semicolons, em-dashes, and multiple dependent clauses) makes it suitable for independent reading only at the 6th–8th grade level.

Full evaluator analysis

Quantitative Analysis: The text has a word count of 95, which is too low to be assessed by the provided word count bands. The text is a single sentence, which results in a Flesch-Kincaid Grade Level of 39.6. This score is artificially and unreliably high because the formula heavily penalizes extremely long sentences. Therefore, the quantitative metrics are not useful for this specific text, and the evaluation must rely on qualitative measures.

Qualitative Analysis: The text structure is Moderately Complex — while it is just one sentence, its internal logic is clear, using explicit transitional phrases. However, the sheer length and use of semicolons and an em-dash make the syntax challenging for younger readers. The text uses mostly familiar vocabulary but introduces key scientific terms such as “species,” “variation,” “inherit,” “traits,” “organism,” and “predators.” The primary complexity comes from the compound-complex sentence structure rather than the vocabulary itself.

Background Knowledge: Students typically begin learning about animal and plant characteristics in early elementary school. The concepts of inherited traits and survival advantages are formally introduced in the 4th and 5th grades. By middle school (grades 6–8), students are expected to have a firm grasp of these ideas.

Synthesis: The primary barrier to comprehension is the complex syntax of the single, 95-word sentence. While the concepts and most of the vocabulary are accessible to upper elementary students (grades 4–5), the sentence structure is more appropriate for a middle school reader (grades 6–8). For independent reading, the text is best suited for the 6–8 grade band.
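The evaluator's caution about the quantitative score can be checked by hand. The worked derivation below uses the standard Flesch-Kincaid Grade Level formula with the reported figures (95 words, 1 sentence, a score of 39.6). The syllables-per-word value is back-solved from the reported score, and the five-sentence case is a hypothetical added purely for comparison.

```latex
\begin{align*}
\text{FKGL} &= 0.39 \cdot \frac{\text{words}}{\text{sentences}}
             + 11.8 \cdot \frac{\text{syllables}}{\text{words}} - 15.59 \\[4pt]
\text{Sentence-length term:}\quad 0.39 \cdot \frac{95}{1} &= 37.05 \\[4pt]
\text{Implied syllable term:}\quad 11.8 \cdot \frac{\text{syllables}}{\text{words}}
            &= 39.6 + 15.59 - 37.05 = 18.14
            \;\Rightarrow\; \frac{\text{syllables}}{\text{words}} \approx 1.54 \\[4pt]
\text{Same words in } n \text{ sentences:}\quad \text{FKGL}(n) &\approx \frac{37.05}{n} + 2.55 \\[4pt]
\text{Hypothetical } n = 5:\quad \text{FKGL}(5) &\approx 7.41 + 2.55 = 9.96
\end{align*}
```

With a single sentence, the sentence-length term alone contributes 37.05 of the 39.6 grade levels, which is why the evaluator sets the score aside and relies on the qualitative measures for this passage.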

MagicSchool Text Generation

Topics were drawn from two uncontroversial, widely adopted standards frameworks: NGSS (Next Generation Science Standards) for science topics [n=40] and C3 Framework / State Social Studies Standards for social studies topics [n=60].

See topics.txt for a full topic list. Once the topics were gathered, we asked MagicSchool directly what tool we should use to generate text as a 3rd grade teacher. We selected the top recommendation: Informational Texts.

The parameters we used for Informational Texts generation were as follows (also depicted in the screenshot below):

Grade level: 3rd grade
Text Length: 1 paragraph
Informational Text Type: Expository
Topic: Create an informational reading passage for [SUBJECT] in [CATEGORY] about the following topic: [DESCRIPTION]. The passage should be appropriate for independent reading.
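The parameters above can be expressed as a small template. The sketch below fills the three placeholders for a single topic; the `Topic` class, the function name, and the example category value are hypothetical, and the layout of topics.txt is not assumed here.

```python
from dataclasses import dataclass

# Fixed parameters, as listed above.
FIXED_PARAMETERS = {
    "Grade level": "3rd grade",
    "Text Length": "1 paragraph",
    "Informational Text Type": "Expository",
}

# Topic prompt, with the bracketed placeholders turned into format fields.
TOPIC_TEMPLATE = (
    "Create an informational reading passage for {subject} in {category} "
    "about the following topic: {description}. "
    "The passage should be appropriate for independent reading."
)


@dataclass(frozen=True)
class Topic:
    """One topic entry (hypothetical structure for illustration)."""
    subject: str
    category: str
    description: str


def build_topic_prompt(topic: Topic) -> str:
    """Fill the Topic field of the Informational Texts tool for one topic."""
    return TOPIC_TEMPLATE.format(
        subject=topic.subject,
        category=topic.category,
        description=topic.description,
    )


# Example values: the category name is an assumption made for this sketch.
example = Topic("Science", "Physical Science", "Patterns in Motion")
prompt = build_topic_prompt(example)
print(prompt)
```

Holding the other three parameters constant means that the topic is the only input that varies between samples.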

MagicSchool Informational Text Tool Example

We manually collected 100 samples across different topics and logged the results in CSV format. The data schema for this table is available on GitHub.

Learning Commons Evaluator

Learning Commons evaluators are developed and validated in collaboration with leading organizations in learning science and pedagogy, including Student Achievement Partners, CAST, and Achievement Network (ANet), using ground-truth datasets that reflect expert teaching and learning principles. In testing, their Grade Level Appropriateness evaluator was 89% accurate against the 2nd to 3rd grade expert-validated dataset.

The Grade Level Appropriateness evaluator takes the MagicSchool CSV as input. It is implemented as a Python script (run_grade_level_appropriateness_evaluator.py) that builds a LangChain evaluation chain powered by Google's Gemini-2.5-pro model.

For each piece of generated text, the chain applies a structured, multi-step prompt that guides the LLM through four stages of analysis: a quantitative assessment of word count and Flesch-Kincaid readability; a qualitative complexity rubric examining text structure, language features, purpose, and knowledge demands; a background knowledge assessment; and finally a synthesis step that reconciles these signals into a target grade band (K-1, 2-3, 4-5, 6-8, 9-10, or 11-CCR). Full results are available in the repository here.
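Once the evaluator returns a grade band, deciding whether a passage is on target reduces to comparing two positions in an ordered list. The sketch below is one possible way to do that comparison. It assumes that a 3rd grade target corresponds to the 2-3 band and that "appropriate" means an exact band match; the function names are illustrative and are not taken from the evaluator script.

```python
# Grade bands in ascending order, as listed above.
GRADE_BANDS = ["K-1", "2-3", "4-5", "6-8", "9-10", "11-CCR"]


def band_offset(target: str, evaluated: str) -> int:
    """Number of bands the evaluated text sits above (+) or below (-) the target."""
    return GRADE_BANDS.index(evaluated) - GRADE_BANDS.index(target)


def is_appropriate(target: str, evaluated: str) -> bool:
    """Assumption: a passage is appropriate only if the bands match exactly."""
    return band_offset(target, evaluated) == 0


# Assumption: 3rd grade falls in the 2-3 band.
TARGET_BAND = "2-3"

for evaluated_band in ("2-3", "6-8"):
    offset = band_offset(TARGET_BAND, evaluated_band)
    verdict = "appropriate" if is_appropriate(TARGET_BAND, evaluated_band) else "inappropriate"
    print(f"{evaluated_band}: offset {offset:+d}, {verdict}")
```

Under these assumptions, both example passages in this report sit two bands above the target.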

Conclusion

MagicSchool generated text outside of the target grade level about one third of the time, with accuracy varying dramatically by subject, from 86.7% for Social Studies to just 35% for Science. These gaps would not surface in a product demo or pilot. They emerge only when a tool is tested at the task level, against specific grade-level expectations, across a meaningful number of samples. Today, most districts adopting AI tools do not have access to this kind of independent, task-specific analysis. This evaluation is part of an ongoing effort to change that. If your organization would benefit from this kind of assessment, reach out at hello@ourdojo.org.