Recent headlines have claimed that an AI chatbot has officially passed the Turing test, marking what some see as a major milestone in artificial intelligence. These reports are based on a preprint study conducted by researchers Cameron Jones and Benjamin Bergen at the University of California, San Diego. Their study found that OpenAI’s GPT-4.5 was judged to be human more than 70% of the time during a controlled experiment—suggesting it has reached a new level of conversational realism.
The experiment, which has not yet undergone peer review, tested three large language models (LLMs): GPT-4o, LLaMa-3.1-405B, and GPT-4.5, alongside ELIZA, a 1960s rule-based chatbot included as a baseline. A total of 284 participants took part, alternating between the roles of interrogator and witness. Each interrogator held simultaneous five-minute text conversations with two witnesses, one human and one AI, via a split-screen interface. At the end of each session, the interrogator was asked to decide which witness was the human.
GPT-4.5 was judged to be the human in 73% of its trials. LLaMa-3.1-405B also performed well, being mistaken for a human in 56% of trials. By contrast, the baselines, ELIZA and GPT-4o, fooled interrogators only 23% and 21% of the time, respectively. Because every trial paired one AI with one real human, 50% is the chance level; a rate above that means the AI was picked as the human more often than the actual human witness.
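As a rough illustration of how such figures are read, the minimal sketch below tallies a model's "win rate" from simplified trial records, where each record notes which model served as the AI witness and whether the interrogator judged that AI to be the human. The data structure and the example trials are hypothetical, not taken from the study; the code only shows the arithmetic behind a rate like 73% and its comparison to the 50% chance baseline.

```python
from collections import defaultdict

# Hypothetical trial records: (AI model shown, interrogator judged the AI to be the human)
trials = [
    ("GPT-4.5", True), ("GPT-4.5", True), ("GPT-4.5", False),
    ("LLaMa-3.1-405B", True), ("LLaMa-3.1-405B", False),
    ("ELIZA", False), ("GPT-4o", False),
]

wins = defaultdict(int)    # trials in which the AI was picked as the human
counts = defaultdict(int)  # total trials per model

for model, judged_human in trials:
    counts[model] += 1
    wins[model] += judged_human

for model in counts:
    rate = wins[model] / counts[model]
    # 0.5 is the chance baseline: a rate reliably above it means the AI was
    # chosen as the human more often than the real human witness was.
    print(f"{model}: {rate:.0%} (chance baseline = 50%)")
```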
The Turing test, proposed by Alan Turing in his 1950 paper "Computing Machinery and Intelligence," is designed to evaluate whether a machine can exhibit behavior indistinguishable from that of a human. In its most recognized form, a human judge converses with both a machine and another human without knowing which is which. If the judge cannot reliably tell the machine from the human, the machine is said to have passed the test.
Turing introduced the test to move away from the vague question “Can machines think?” Instead, he proposed an “imitation game” as a practical way to evaluate a machine’s performance in human-like interaction.
Despite its cultural significance, many researchers question whether the Turing test is a valid measure of machine intelligence. Several key criticisms recur:
- Behavior vs. Intelligence: Passing the test may demonstrate human-like behavior, but not actual understanding or thought.
- Human vs. Machine Minds: Critics argue that equating the brain to a machine oversimplifies human cognition.
- Different Cognitive Processes: Machines may reach answers in ways fundamentally different from humans, making comparison flawed.
- Narrow Scope: The test evaluates only one type of behavior—conversation—which may be too limited to assess true intelligence.
While GPT-4.5’s performance is a remarkable achievement in AI development, it doesn’t settle the question of whether machines can “think” or be considered intelligent in a human sense. The model’s ability to pass the Turing test reflects its capacity for imitation, not necessarily comprehension or reasoning.
As AI continues to evolve, researchers are exploring more comprehensive frameworks to assess machine intelligence, moving beyond simple conversational benchmarks to include reasoning, learning, adaptability, and ethical decision-making.
By Impact Lab