Large Language Models Put to the Test
On June 15 at 7:00 PM, a research team led by Suketu Patel evaluated large language models including GPT, Claude, and Gemini using the Stroop test. In this classic cognitive challenge, participants see a color word printed in a mismatched ink, for example the word "red" printed in blue, and must name the ink color while ignoring the word's meaning. The study found that model accuracy varied sharply with the length of the word list.
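To make the setup concrete, here is a minimal Python sketch of a text-based Stroop list and its scoring. The article does not say how the study encoded ink colors for the models, so the (word, ink) pair format, the four-color palette, and the function names make_stroop_list and accuracy are all illustrative assumptions, not the researchers' protocol.

```python
import random

# Hypothetical palette; the study's actual color set is not given in the article.
COLORS = ["red", "blue", "green", "yellow"]


def make_stroop_list(n_items, seed=0):
    """Build n_items conflicting (word, ink) pairs: the ink never matches the word."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        word = rng.choice(COLORS)
        ink = rng.choice([c for c in COLORS if c != word])
        items.append((word, ink))
    return items


def accuracy(items, answers):
    """Fraction of items whose answer names the ink color rather than the word."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        1 for (_, ink), ans in zip(items, answers) if ans.strip().lower() == ink
    )
    return correct / len(items)


items = make_stroop_list(5)
ink_answers = [ink for _, ink in items]     # always names the ink: the correct response
word_answers = [word for word, _ in items]  # reads the word instead: the Stroop error
print(accuracy(items, ink_answers))   # 1.0
print(accuracy(items, word_answers))  # 0.0
```

Calling make_stroop_list with lengths such as 5, 10, and 40 would set up a list-length comparison of the kind the study describes, assuming a model's answers are collected in the same order as the items.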
Key Findings from the Research
When presented with short lists of just five words, GPT-4o achieved an impressive 91% accuracy rate. However, performance plummeted to 57% with ten words and dropped further to 15% when the list expanded to forty words. Researchers observed that "the models simply seemed to lose the thread," highlighting their difficulty with the task. Meanwhile, Claude 3.5 Sonnet performed steadily on shorter lists but managed only 24% accuracy on the forty-word set.
The study concluded that when congruent words (where color and meaning match) were mixed with conflicting ones, accuracy on the conflicting items fell to nearly zero. As the researchers noted, "the initial instruction somehow got lost along the way." They also emphasized that "the ability to maintain focus on a specific goal amid competing information, especially over long sequences, is fundamentally different in language models compared to humans."
These results reveal that while language models excel at certain tasks, they handle conflicting information poorly, particularly as word lists grow longer. The research also points to avenues for improving AI systems, especially in how they process information and make decisions under complex conditions.
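The mixed-list finding is easier to read when accuracy is reported per condition instead of as one overall number. The sketch below assumes the same hypothetical (word, ink) pair format and uses an illustrative helper name, accuracy_by_condition; it is not taken from the study. It shows that a responder who drifts into reading the words would score perfectly on congruent items and zero on conflicting ones, which is one pattern consistent with the reported result, not a confirmed explanation of it.

```python
def accuracy_by_condition(items, answers):
    """Return accuracy separately for congruent and incongruent items.

    A condition with no items maps to None instead of raising on division by zero.
    """
    totals = {"congruent": 0, "incongruent": 0}
    hits = {"congruent": 0, "incongruent": 0}
    for (word, ink), ans in zip(items, answers):
        condition = "congruent" if word == ink else "incongruent"
        totals[condition] += 1
        if ans.strip().lower() == ink:
            hits[condition] += 1
    return {c: (hits[c] / totals[c] if totals[c] else None) for c in totals}


# Mixed list: two congruent items and two conflicting items (made-up data).
items = [("red", "red"), ("red", "blue"), ("green", "green"), ("blue", "yellow")]

# A responder that has lost the instruction and simply reads each word.
answers = [word for word, _ in items]

print(accuracy_by_condition(items, answers))
# {'congruent': 1.0, 'incongruent': 0.0}
```

Overall accuracy for this made-up list would be 50%, which hides the complete failure on the conflicting items; splitting by condition exposes it.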
These findings mirror previous observations in the AI field, where language models often struggle with brevity and focus. For instance, a recent study highlighted that AI writers tend to rely heavily on a limited vocabulary, with a significant portion of their output centered around just a few words. This pattern raises questions about the adaptability and creativity of AI in more complex writing scenarios.