A new study from the Hebrew University of Jerusalem and the University of Cambridge reveals that ChatGPT can exhibit unpredictable and sometimes flawed reasoning when solving mathematical problems. Researchers found the AI model improvised solutions and made human-like errors, suggesting its reliability in educational settings requires careful consideration.
The experiment tested the AI with an ancient Greek geometry puzzle, finding that instead of recalling known answers from its training data, it often generated its own, sometimes incorrect, lines of reasoning. This behavior highlights both potential risks and unique learning opportunities for students using AI tools.
Key Takeaways
- Researchers tested ChatGPT-4 with Plato's classic "doubling the square" geometry problem.
- The AI consistently chose modern algebraic methods over the well-known classical geometric solution.
- ChatGPT made a significant error, incorrectly claiming a geometric solution was impossible for a related problem.
- The study suggests ChatGPT's problem-solving is not just memory retrieval but involves a form of on-the-fly reasoning that can be flawed.
- These findings indicate that while AI can be a powerful educational tool, it requires critical evaluation from students and educators.
Testing AI with an Ancient Problem
Researchers sought to understand the nature of ChatGPT's mathematical knowledge. They questioned whether the large language model (LLM) simply retrieves information it has stored or if it can generate new solutions through a reasoning process.
To investigate this, they turned to a problem first documented by Plato around 385 BCE in his dialogue, Meno. In the text, Socrates guides an uneducated boy to solve the "doubling the square" problem, demonstrating that knowledge can be drawn out through questioning.
Plato's Doubling the Square Problem
The challenge is to construct a square with exactly double the area of a given square. The boy in Plato's dialogue initially guesses, incorrectly, that doubling the side length would work. Through careful questioning, Socrates leads him to the correct geometric solution: the new square must be built upon the diagonal of the original square.
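Why the diagonal works follows from the Pythagorean theorem: a square with side s has diagonal s√2, and a square built on that diagonal has area (s√2)² = 2s², exactly double. A minimal sketch verifying the arithmetic (the function name is illustrative, not from the study):

```python
import math

def doubled_square_side(s: float) -> float:
    """Side of the doubled square: the diagonal of the original.

    By the Pythagorean theorem, the diagonal of a square with side s
    is sqrt(s**2 + s**2) = s * sqrt(2). A square built on that diagonal
    has area (s * sqrt(2))**2 = 2 * s**2 -- exactly twice the original.
    """
    return math.hypot(s, s)  # length of the diagonal

s = 3.0
new_side = doubled_square_side(s)
# Area of the new square equals twice the original area (up to float rounding).
print(new_side ** 2, 2 * s ** 2)
```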
The study, published in the International Journal of Mathematical Education in Science and Technology, was led by Dr. Nadav Marco of the Hebrew University and David Yellin College of Education, in collaboration with Professor Andreas Stylianides of the University of Cambridge.
Unexpected and Flawed Responses
Given the fame of Plato's dialogue, the researchers expected ChatGPT-4 to immediately recognize the problem and provide the classical geometric solution. However, the AI's response was surprising.
An Algebraic Preference
When asked to double the area of a square, ChatGPT opted for an algebraic approach, a method that would have been unknown in ancient Greece. It resisted multiple prompts aimed at guiding it toward the geometric answer. Only after researchers expressed disappointment in its approximate answer did the AI produce the elegant, exact geometric solution.
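The algebraic route amounts to solving x² = 2s² for the new side x, giving x = s√2 — an irrational number, which is why a numerical answer can only ever be approximate, unlike the exact geometric construction. A brief sketch of this (hypothetical) calculation:

```python
import math

# Algebraic approach: solve x**2 = 2 * s**2 for the new side x.
s = 1.0
x = math.sqrt(2 * s ** 2)  # x = s * sqrt(2), an irrational number

# Any decimal answer is necessarily an approximation of sqrt(2);
# the geometric construction (the diagonal) is exact by definition.
print(x)
```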
"If it had only been recalling from memory, it would almost certainly have referenced the classical solution...straight away," stated Professor Stylianides. "Instead, it seemed to take its own approach."
This behavior suggested that the AI was not simply retrieving the best-known answer from its training data but was generating what it determined to be a valid solution path on its own.
A Human-Like Error
The team then presented a variation of the problem: doubling the area of a rectangle while maintaining its proportions. ChatGPT again defaulted to algebra. When pressed for a geometric solution, it made a critical mistake.
The AI correctly noted that using the diagonal does not work for doubling a rectangle. However, it then incorrectly claimed that no geometric solution was available. In fact, a geometric solution does exist. The researchers believe the AI improvised this false conclusion based on their previous conversation about the square.
Dr. Marco noted that the chance of this specific false claim existing in its training data was "vanishingly small." This indicates the AI was attempting to reason from prior interactions, a process that led it to a flawed conclusion, much like a human learner might overgeneralize a rule.
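The study does not spell out the construction, but the underlying fact is simple: scaling both sides of the rectangle by √2 doubles the area while preserving the proportions, and a √2 scaling is achievable with compass and straightedge since √2 is the diagonal of a unit square. A short check of the arithmetic (function and variable names are illustrative only):

```python
import math

def double_rectangle(w: float, h: float) -> tuple[float, float]:
    """Scale both sides by sqrt(2): the area doubles, the aspect ratio is unchanged."""
    k = math.sqrt(2)
    return w * k, h * k

w, h = 3.0, 2.0
w2, h2 = double_rectangle(w, h)
print(math.isclose(w2 * h2, 2 * w * h))  # area is doubled
print(math.isclose(w2 / h2, w / h))      # proportions are preserved
```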
Implications for Education
The study's findings have significant implications for how AI is used in education. The unpredictability of models like ChatGPT means they cannot be treated as infallible sources of information.
Dr. Marco, a former high school math teacher, emphasized the need for critical thinking. "Users have to develop an independent sense of criticism because ChatGPT makes mistakes," he said. "Unlike proofs found in reputable textbooks, students cannot assume that ChatGPT’s proofs are valid."
The researchers propose that this limitation can be turned into a pedagogical strength. They liken the AI's behavior to the educational concept of a "zone of proximal development" (ZPD), which describes the gap between what a learner can do alone and what they can achieve with guidance.
- Students can be tasked with verifying the AI's answers.
- Teachers can use the AI's flawed reasoning as a teaching moment.
- Engaging with the AI can help develop skills in proof evaluation and logical reasoning.
The key, according to the authors, is to frame interactions as a collaborative exploration rather than a simple request for an answer. Prompts like, "I want us to explore this problem together," are more valuable than, "Tell me the answer."
The Challenge for Educators
While presenting an opportunity, the rise of powerful AI also creates challenges. Dr. Marco expressed concern about academic integrity, as AI can generate and paraphrase text, making it difficult to detect plagiarism or a lack of genuine understanding.
"When teachers and lecturers get what students write, they have to make sure it’s authentic and not copied and pasted from AI," he asserted. To counter this, he suggests a renewed focus on other forms of assessment.
He mentioned that at David Yellin College, he is encouraging more oral exams and video presentations where students must explain subjects themselves. This ensures they have internalized the material rather than relying on an external tool to generate their work.
Ultimately, the study suggests that as AI becomes more integrated into learning, the skills of critical thinking, verification, and independent reasoning will become more important than ever. The goal is to use AI not as a crutch, but as a partner that challenges assumptions and deepens understanding.