GPT-5 Falls Short in Human Attention Test

A long-standing psychology test has highlighted a significant weakness in advanced artificial intelligence (AI) systems, suggesting that the way they sustain attention may differ from that of humans. Researchers, led by Suketu Patel, studied how large language models (LLMs), including GPT-5, perform on a well-known cognitive test called the Stroop task.

The Stroop task involves showing participants words that name colours, such as “red” or “blue,” displayed in different ink colours. Participants must identify the ink colour while ignoring the word’s meaning, which causes a mental conflict. Humans typically become slower at responding when the ink colour does not match the word, a phenomenon known as the Stroop effect. However, even during lengthy tasks, people generally maintain high accuracy and focus.

To determine how AI models cope with similar challenges, the researchers tested several leading LLMs. Initially, these models performed well with short word lists, with GPT-4o achieving 91% accuracy. However, the situation changed dramatically with longer lists. With ten words, GPT-4o’s accuracy dropped to 57%, and with forty words, it plummeted to just 15%. Claude 3.5 Sonnet also saw a decline in performance as list length increased, dropping to 24% accuracy with forty words.
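
To make the scoring concrete, here is a minimal Python sketch of a word list of the kind described above and an accuracy calculation. It is an illustration only, not the researchers' code: the colour set, the function names (`make_stroop_list`, `accuracy`, `name_inks`, `read_words`) and the per-word scoring rule are all assumptions, since the article does not say how the study built its lists or counted correct answers.

```python
import random

# Hypothetical colour set; the article does not list the colours used.
COLOURS = ["red", "blue", "green", "yellow"]


def make_stroop_list(n_words, rng):
    """Build n_words mismatched trials as (word, ink) pairs."""
    trials = []
    for _ in range(n_words):
        word = rng.choice(COLOURS)
        # Pick an ink colour that differs from the word's meaning.
        ink = rng.choice([c for c in COLOURS if c != word])
        trials.append((word, ink))
    return trials


def accuracy(trials, responses):
    """Fraction of trials where the response names the ink colour.

    Assumes per-word scoring; the study may have scored differently.
    """
    if not trials or len(trials) != len(responses):
        raise ValueError("need one response per trial, and at least one trial")
    correct = sum(
        1 for (_, ink), answer in zip(trials, responses)
        if answer.strip().lower() == ink
    )
    return correct / len(trials)


def name_inks(trials):
    """An ideal participant: always reports the ink colour."""
    return [ink for _, ink in trials]


def read_words(trials):
    """The automatic response: reads the word and ignores the ink."""
    return [word for word, _ in trials]


if __name__ == "__main__":
    rng = random.Random(0)
    for length in (5, 10, 40):
        trials = make_stroop_list(length, rng)
        print(length, accuracy(trials, name_inks(trials)),
              accuracy(trials, read_words(trials)))
```

Because every trial in this sketch is a mismatch, a responder that always names the ink scores 1.0 and one that simply reads the word scores 0.0, whatever the list length. A model's real answers would fall somewhere between these two extremes.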

These results indicate a crucial difference between human and AI cognition. While AI excels at recognising words, it struggles to suppress automatic responses and maintain focus over time. This suggests that the attention mechanisms of AI systems fundamentally differ from those in the human brain, revealing important limitations as AI becomes more integrated into daily life.

Vocabulary List:

psychology /saɪˈkɑlədʒi/ (noun)

study of the mind and how people behave

cognitive /ˈkɑɡnətɪv/ (adjective)

relating to thinking, learning, and remembering

accuracy /ˈækjərəsi/ (noun)

how correct or exact something is

plummeted /ˈplʌmɪtɪd/ (verb)

fell very quickly to a much lower level

suppress /səˈprɛs/ (verb)

stop or hold back something from happening

limitations /ˌlɪməˈteɪʃənz/ (noun)

things that make something less effective

Vocabulary Learning

How much do you know?

What is the Stroop task primarily used to study?

Memory recall

Word recognition

Color recognition and mental conflict

Language translation

What was the accuracy of GPT-4o when tested with ten words?

91%

57%

24%

15%

Which model achieved a 91% accuracy on the Stroop task with short word lists?

Claude 3.5 Sonnet

GPT-4o

Suketu Patel's model

GPT-5

How does human performance typically compare when the ink colour matches the word versus when it does not?

Faster and more accurate

Slower and less accurate

No change in speed

Faster but less accurate

What happens to GPT-4o's accuracy with forty words?

It remains stable

It drops dramatically

It improves

It fluctuates

Who led the research study on AI systems and the Stroop task?

Claude 3.5 Sonnet

Researchers at MIT

Suketu Patel

GPT-5 team

Humans tend to maintain high accuracy and focus even with lengthy Stroop tasks.

True False

AI models exhibited consistent performance regardless of list length in the Stroop task.

True False

The Stroop task involves participants identifying the meaning of word color.

True False

AI systems perform better on longer lists than on shorter lists according to the study.

True False

The study suggests that AI and human cognition mechanisms are fundamentally similar.

True False

Claude 3.5 Sonnet achieved a 24% accuracy with forty words in the Stroop task.

True False

The Stroop task produces a phenomenon known as the Stroop effect: a mental conflict arises when the ink colour does not match the word. This conflict slows human responses, and the study's results indicate a gap between AI and human cognition.

In the study, GPT-4o's accuracy dropped to 15% when tested with forty words and demonstrated a significant gap in performance compared to its results with short lists.

Researchers led by Suketu Patel examined how large language models (LLMs) cope with challenges similar to those of the Stroop task, revealing a gap in focus and in suppressing automatic responses.

Humans typically respond more slowly when the ink colour of the Stroop words does not match the colour that the words name.

The cognitive test revealed important limitations in AI systems, indicating that their attention mechanisms differ from those found in the human brain.

While AI excels at recognising words, the study highlights a gap in its ability to maintain focus during lengthy tasks compared to humans.
