AI language models can distinguish between events that are common, improbable, impossible, or nonsensical, according to a new study from Brown University released on April 22. The research will be presented at the International Conference on Learning Representations in Rio de Janeiro.
The findings matter because they suggest that AI language models are developing an internal representation of how the world works, one that aligns with human reasoning about plausibility and causality. That could inform the development of more reliable and trustworthy AI tools.
The study examined several open-source AI language models by analyzing their responses to sentences describing various scenarios. Examples included commonplace situations like “Someone cooled a drink with ice,” improbable ones such as “Someone cooled a drink with snow,” impossible events like “Someone cooled a drink with fire,” and nonsensical statements such as “Someone cooled a drink with yesterday.” Researchers then used mechanistic interpretability—a method likened to neuroscience for machines—to analyze the mathematical states generated inside each model when processing these sentences.
“This work reveals some evidence that language models have encoded something like the causal constraints of the real world,” said Michael Lepori, a Ph.D. candidate at Brown who led the research. “Beyond just encoding these constraints, they do so in a way that is predictive of human judgments of these categories.”
By comparing differences in internal representations across multiple AI systems—including OpenAI’s GPT-2, Meta’s Llama 3.2, and Google’s Gemma 2—the researchers found distinct patterns strongly correlated with each plausibility category. These patterns allowed the models to differentiate even closely related categories (such as improbable versus impossible) with about 85% accuracy.
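As a rough illustration of how a difference in internal representations can separate two categories, here is a minimal Python sketch. It is not the study's code. The vectors are random synthetic stand-ins for model hidden states, and the difference-of-means direction, the midpoint threshold, and every name in it (`difference_direction`, `classify`, the 64-dimensional size) are assumptions made for the example.

```python
import numpy as np


def difference_direction(states_a, states_b):
    """Unit vector pointing from the mean of class B toward the mean of class A."""
    direction = states_a.mean(axis=0) - states_b.mean(axis=0)
    return direction / np.linalg.norm(direction)


def classify(states, direction, threshold):
    """Return True (class A) where the projection onto the direction exceeds the threshold."""
    return states @ direction > threshold


# Synthetic stand-ins for hidden states (assumption: NOT real model activations).
rng = np.random.default_rng(0)
dim = 64
offset = np.zeros(dim)
offset[0] = 2.0  # the two classes differ along a single hidden dimension
improbable = rng.normal(size=(200, dim)) + offset
impossible = rng.normal(size=(200, dim)) - offset

# Fit the direction and threshold on the first half, evaluate on the second half.
train_a, test_a = improbable[:100], improbable[100:]
train_b, test_b = impossible[:100], impossible[100:]

direction = difference_direction(train_a, train_b)
threshold = 0.5 * ((train_a @ direction).mean() + (train_b @ direction).mean())

correct_a = classify(test_a, direction, threshold).mean()        # should be labeled True
correct_b = 1.0 - classify(test_b, direction, threshold).mean()  # should be labeled False
accuracy = 0.5 * (correct_a + correct_b)
print(f"held-out accuracy: {accuracy:.2f}")
```

The accuracy printed here reflects only how far apart the synthetic clusters were placed. It says nothing about the roughly 85% figure reported for the real models.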
Lepori said: “Mechanistic interpretability can be appropriately characterized as something like neuroscience for AI systems… You could kind of think about it as understanding what is encoded in the ‘brain state’ of the machine.” He added that when people disagreed over whether a statement was impossible or merely unlikely, as with “Someone cleaned the floor with a hat,” the models mirrored that disagreement, assigning probabilities close to the split in human judgments.
According to Lepori, “What we show is that the models actually capture that human uncertainty pretty well… In cases where, say, 50% of people said a statement was impossible and 50% said it was improbable, the models were assigning roughly 50% probability as well.”
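One simple way to quantify that kind of agreement is to compare, statement by statement, the share of people who chose "impossible" with the probability a model assigns to that label. The Python sketch below does this with made-up numbers. The values, the function name `calibration_gap`, and the choice of mean absolute difference and Pearson correlation as measures are all assumptions for illustration, not figures or methods from the study.

```python
import numpy as np


def calibration_gap(human_share, model_prob):
    """Mean absolute difference between human vote shares and model probabilities."""
    human_share = np.asarray(human_share, dtype=float)
    model_prob = np.asarray(model_prob, dtype=float)
    return float(np.mean(np.abs(human_share - model_prob)))


# Hypothetical values: fraction of raters calling each statement "impossible",
# and the probability a model assigns to "impossible" for the same statement.
human_share = [0.9, 0.5, 0.1, 0.7]
model_prob = [0.85, 0.5, 0.2, 0.6]

gap = calibration_gap(human_share, model_prob)
correlation = float(np.corrcoef(human_share, model_prob)[0, 1])
print(f"mean absolute gap: {gap:.4f}, correlation: {correlation:.2f}")
```

A small gap together with a high correlation would indicate that the model's probabilities track the split in human opinion, including on statements where people are evenly divided.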
These results suggest that modern AI language models can develop an understanding that reflects human thought processes even at relatively modest scales (over two billion parameters). The researchers believe further mechanistic interpretability studies may help create smarter and more trustworthy artificial intelligence.




