A recent study has found that large language models, the artificial intelligence technologies behind popular chatbots, show signs of cognitive impairment when given a test designed to detect early dementia in humans.
According to research published in The BMJ, popular AI models like ChatGPT 4, Claude 3.5, and Gemini scored poorly on the Montreal Cognitive Assessment (MoCA), a test used to measure cognitive abilities such as attention, memory, language, visual-spatial skills, and executive functions.
The study also found that older versions of the AI models performed worse on the test, mirroring the decline seen in aging human patients. The authors suggest that these findings challenge the assumption that AI will soon replace human doctors.
The increasing use of AI in medicine has generated both excitement and concern. Previous studies have shown AI to perform well in various diagnostic tasks, but its susceptibility to cognitive impairment and other human-like disorders had not been widely explored until now.
To address this gap, researchers evaluated publicly available AI models—ChatGPT 4, Claude 3.5, and Gemini 1 and 1.5—using the MoCA test, commonly used to detect early dementia signs in elderly adults.
Each model was given the same instructions as human patients, and the test was scored according to official guidelines. ChatGPT 4 scored highest (26/30), followed by Claude 3.5 (25/30), while Gemini 1.0 scored lowest (16/30).
All of the AI models struggled with visuospatial skills and executive functions, failing tasks such as connecting numbers and letters in sequence and drawing a clock face.
The Gemini models also failed to recall a five-word sequence. However, all models performed well on most other tasks, including naming, attention, language, and abstraction.