AI's Challenge: Humanity’s Last Exam
Two major players in artificial intelligence from San Francisco are inviting the public to create questions that can thoroughly assess the abilities of large language models (LLMs), such as Google's Gemini and OpenAI's o1. Scale AI, a company that focuses on organizing vast datasets used for training LLMs, has collaborated with the Center for AI Safety (CAIS) to introduce the initiative called Humanity’s Last Exam.
The competition offers prizes of US$5,000 (£3,800) each for the people who submit the top 50 questions selected for the test. The aim of Scale and CAIS is to evaluate how close we are to developing "expert-level AI systems" with what they describe as the "largest, broadest coalition of experts in history."
But why pursue this? Many leading LLMs have demonstrated impressive results by excelling in standard tests related to intelligence, mathematics, and law. However, it's challenging to determine how significant these achievements really are. In many instances, the models may already know the answers because they have been trained on a vast collection of data that includes a large portion of available internet content.
Data is fundamental to the shift from conventional computing to AI: from "telling" machines what to do to "showing" them through good training datasets. Machines that learn this way also need to be tested. Developers typically test models on data that wasn't used during training, referred to as "test datasets."
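To make the idea concrete, here is a minimal sketch of holding back a test set and then checking whether any test question has leaked into the training text. The function names, the toy questions, and the verbatim-match check are all assumptions made for illustration; real contamination checks are considerably more involved.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold back a fraction the model never trains on."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def find_leaked(test_set, training_text):
    """Return test questions that appear verbatim in the training text.

    A crude, hypothetical check: a match suggests the model could have
    memorized the answer instead of working it out.
    """
    corpus = training_text.lower()
    return [question for question, _ in test_set if question.lower() in corpus]

if __name__ == "__main__":
    qa_pairs = [(f"Toy question {i}?", f"answer {i}") for i in range(10)]
    train, test = split_train_test(qa_pairs)
    scraped_text = "a forum thread quoting: toy question 3? and nothing else"
    print(len(train), "training examples,", len(test), "held back for testing")
    print("leaked test questions:", find_leaked(test, scraped_text))
```

If a held-back question turns up in the scraped text, a high score on it says little about what the model can actually work out for itself.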
If LLMs currently lack the ability to pre-learn answers to established tests, they almost certainly will in the near future. According to a prediction by AI analytics site Epoch, by 2028, AI systems will have essentially absorbed all written human knowledge. A crucial concern will be how we continue to evaluate these AIs after reaching that milestone.
Of course, the internet is continuously growing, with millions of new pieces of content added every day. Might this influx of information solve the problem?
Perhaps, but it introduces another complex issue known as "model collapse." As the online space becomes saturated with AI-generated content that feeds back into future training datasets, this could lead to increasingly poor performance from AI systems. To combat this, many developers are gathering data from interactions between AIs and people to continually refresh their training and testing data.
Some experts assert that AIs should also be "embodied": interacting with the real world and learning from direct experiences, similar to human learning. This concept may seem unrealistic until we consider that companies like Tesla have been implementing it successfully through their vehicles. Another opportunity exists with wearable technology, like Meta's smart glasses by Ray-Ban, which incorporate cameras and microphones to gather a wealth of human-centric audio and video data.
Narrow Tests of Intelligence
Even if technological advancements yield enough future training data, the challenge remains: how should we define and measure intelligence—especially when it comes to artificial general intelligence (AGI), the level at which an AI can match or exceed human cognitive capacity?
Traditional human IQ tests have been controversial for not effectively capturing the diverse aspects of intelligence, including language skills, mathematical ability, empathy, and spatial awareness.
Tests for AI face similar limitations. Numerous established tests evaluate tasks such as summarizing and understanding information, drawing logical conclusions, recognizing human physical gestures, and interpreting visual data.
Many tests are retired because AI systems score so highly on them, yet they are so task-specific that they are very narrow measures of intelligence. For example, the chess AI Stockfish is rated far above Magnus Carlsen, the highest-rated human player ever, on the Elo rating system. Yet Stockfish cannot perform other tasks, such as understanding language. Being exceptional at chess clearly does not equate to possessing general intelligence.
As AI systems display increasingly intelligent behavior, the challenge lies in establishing fresh benchmarks to gauge their development. A notable contribution to this conversation has come from François Chollet, a French engineer at Google. He believes true intelligence lies in an AI's capacity to adapt and apply what it has learned to new, unfamiliar situations. In 2019, he proposed the "abstraction and reasoning corpus" (ARC), a set of puzzles structured as simple visual grids designed to test an AI's ability to infer and apply abstract rules.
Unlike earlier benchmarks that assess visual recognition through training an AI with countless labeled images, ARC provides limited prior examples. The AI must deduce the puzzle logic on its own, rather than memorizing potential answers.
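A rough sketch shows what this looks like in practice. ARC tasks are distributed as a handful of worked input/output grid pairs plus a test input, with each grid a small array of integers standing for colors. The toy task below, the three candidate rules, and the solver are invented for illustration; real ARC puzzles draw on a far wider range of rules than any fixed list could cover.

```python
def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]

def transpose(grid):
    """Swap rows and columns."""
    return [list(column) for column in zip(*grid)]

# Hypothetical rule set; a real solver cannot rely on such a short list.
CANDIDATE_RULES = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}

# A made-up task in the ARC layout: worked examples, then a test input.
TOY_TASK = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0], [0, 6, 0, 0]]}],
}

def infer_rule(task):
    """Return the first candidate rule consistent with every worked example."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(pair["input"]) == pair["output"] for pair in task["train"]):
            return name, rule
    return None, None

if __name__ == "__main__":
    name, rule = infer_rule(TOY_TASK)
    print("inferred rule:", name)
    print("prediction:", rule(TOY_TASK["test"][0]["input"]))
```

With only two worked examples, there is nothing to memorize: the solver has to find a rule that explains both and then apply it to a grid it has never seen.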
Although ARC tasks are not especially difficult for humans to solve, there is a US$600,000 prize for the first AI system to score 85%. Currently, we are far from that mark. Two prominent LLMs, OpenAI's o1-preview and Anthropic's Claude 3.5 Sonnet, have both registered a score of only 21% on the ARC public leaderboard, known as ARC-AGI-Pub.
Another recent attempt using OpenAI's GPT-4o scored 50%, though somewhat controversially, because the approach generated thousands of candidate solutions and then chose the one that performed best on the test. Even this result is well short of the 85% needed to trigger the prize, and of human scores of over 90%.
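The general shape of a generate-and-select strategy can be sketched in a few lines. In this toy version a random sampler stands in for the language model, each candidate is a short chain of grid operations, and candidates are ranked by how many worked examples they reproduce. All of this is an assumption for illustration and does not describe how the actual GPT-4o entry was built.

```python
import random

def identity(grid):
    return [row[:] for row in grid]

def flip_h(grid):
    return [row[::-1] for row in grid]

def flip_v(grid):
    return [row[:] for row in grid[::-1]]

def transpose(grid):
    return [list(column) for column in zip(*grid)]

PRIMITIVES = [identity, flip_h, flip_v, transpose]

def sample_candidate(rng, length=2):
    """Stand-in for a model proposing a solution: a random chain of operations."""
    return [rng.choice(PRIMITIVES) for _ in range(length)]

def run(candidate, grid):
    """Apply each operation in the chain, in order."""
    for step in candidate:
        grid = step(grid)
    return grid

def score(candidate, pairs):
    """Fraction of worked examples the candidate reproduces exactly."""
    return sum(run(candidate, x) == y for x, y in pairs) / len(pairs)

def sample_and_select(pairs, n_samples=200, seed=0):
    """Generate many candidates and keep the one that scores highest."""
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n_samples)]
    return max(candidates, key=lambda c: score(c, pairs))

# Made-up worked examples; the hidden rule is a quarter-turn clockwise.
DEMO_PAIRS = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]

if __name__ == "__main__":
    best = sample_and_select(DEMO_PAIRS)
    print("best candidate:", [step.__name__ for step in best])
    print("prediction:", run(best, [[5, 6], [7, 8], [9, 0]]))
```

The sketch also hints at why critics object: most of the work is done by sheer volume of guesses and a filter, not by any single act of reasoning.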
While ARC stands out as one of the most credible efforts to assess genuine intelligence in AI, the initiative by Scale and CAIS reflects an ongoing quest for effective alternatives. Notably, the questions that win prizes in this contest will remain unavailable on the internet, ensuring that AI models do not gain access to the exam content in advance.
It's crucial for us to understand when machines are nearing human-level reasoning, along with the associated ethical, safety, and moral dilemmas this convergence creates. Once we reach that point, we will be faced with an even greater existential question: how do we assess superintelligence? This formidable challenge remains on the horizon, and addressing it will require concerted effort from the entire AI community.