Retrospective Benchmarks for Machine Intelligence

Evaluating Current AI Against Historical Specifications

Chapter 6: The Synthesis Benchmark (2023)

Morris et al. and the "Levels of AGI" Framework

Dakota Schuck

December 2025

Working paper. Comments welcome.


Preface: Methodology

This chapter departs from the methodology established in Chapter 1. For good reason.

The previous five chapters evaluated historical AGI definitions using a trichotomous scoring system: 0% (criterion unmet), 50% (partially met), or 100% (fully met). This approach worked because those definitions proposed thresholds—binary or near-binary conditions that a system either satisfies or fails to satisfy.

The Morris et al. framework is different. It proposes not a threshold but a taxonomy—a graduated classification system with multiple levels of capability, generality, and autonomy. The question is not "Does the system meet the bar?" but "Where does the system fall on each dimension?"

Forcing this taxonomy into our trichotomous scoring would distort the very thing we are evaluating. It would be like grading a thermometer pass/fail: a category error that collapses a graduated scale into a binary verdict.

This chapter therefore evaluates Claude 3.5 Sonnet according to the framework's own logic: by assigning level classifications rather than percentage scores. This is not methodological inconsistency but methodological fidelity: we treat each historical definition according to its own logic, even when that logic differs from our default approach.

Readers seeking a single percentage for cross-chapter comparison will find a discussion in the Methodological Notes section at the end.

Introduction: The Co-Founder's Return

In November 2023, a team of researchers at Google DeepMind published "Levels of AGI: Operationalizing Progress on the Path to AGI."[1] The lead author was Meredith Ringel Morris, a principal scientist at Google DeepMind. But the paper's significance came partly from its final author: Shane Legg.

Legg, as readers of Chapter 2 will recall, co-founded DeepMind in 2010 after completing a PhD thesis that attempted to formalize universal intelligence. His 2007 paper with Marcus Hutter, evaluated in Chapter 3, proposed a mathematical definition of machine intelligence grounded in algorithmic information theory. Now, sixteen years later, he was returning to the definitional question—this time with institutional backing, co-authors, and a different approach entirely.

The 2023 paper explicitly positions itself as a synthesis. Its title echoes the "levels" framing that had become common in autonomous vehicle research (SAE Levels 0–5). Its content draws on decades of prior work, including definitions we have already examined. The authors cite Legg and Hutter's 2007 formalization, Chollet's 2019 critique of benchmark-based evaluation, and numerous other attempts to pin down what "AGI" means.

But the Morris et al. paper does something none of its predecessors attempted: it proposes a two-dimensional taxonomy. Rather than defining AGI as a single threshold, it distinguishes between levels of performance (how capable is the system?) and levels of generality (how broadly capable is it?). A system might be "Expert" at narrow tasks or "Competent" at general ones—and these represent different positions in the taxonomy, not different degrees of the same thing.

The paper also introduces a third dimension—autonomy—though the authors treat this as orthogonal to the performance/generality matrix. And it proposes six principles for any adequate AGI definition, offering what amounts to a meta-framework for evaluating frameworks.

This is, in short, the most sophisticated attempt at AGI definition we have examined. It is also the most recent, published just two years before this evaluation. Where Gubrud in 1997 could only speculate about what advanced AI might look like, Morris et al. write with full knowledge of large language models, multimodal systems, and the capabilities that have emerged since 2020.

The question is whether this sophistication translates into evaluative clarity. Does the "Levels of AGI" framework tell us something meaningful about where current systems stand? Or does its complexity obscure rather than illuminate?

The Framework

Performance Levels

The framework grades performance on a six-point scale, from a Level 0 baseline (No AI) through five levels of AI capability, each explicitly benchmarked against human performance:

  • Level 0 (No AI): Narrow computer programs with no learning or adaptive behavior.
  • Level 1 (Emerging): Equal to or somewhat better than an unskilled human.
  • Level 2 (Competent): At least 50th percentile of skilled adults.
  • Level 3 (Expert): At least 90th percentile of skilled adults.
  • Level 4 (Virtuoso): At least 99th percentile of skilled adults.
  • Level 5 (Superhuman): Outperforms 100% of humans.

The crucial phrase is "skilled adults." The framework does not compare AI systems to the general population but to people who have specifically trained in the relevant domain. A system that writes better than most people but worse than professional writers would be Emerging or Competent, not Expert.
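
To make the banding concrete, the thresholds above can be read as a simple lookup, sketched below in Python. This rendering is mine, not the paper's; in particular, the handling of Level 0 and Level 1 is an approximation, since the framework defines those levels against non-AI programs and unskilled humans rather than skilled-adult percentiles.

```python
def performance_level(skilled_adult_percentile: float) -> str:
    """Map a percentile of skilled adults (0-100) to a performance level.

    Minimal sketch of the banding described above. The Level 0/Level 1
    boundaries are my own approximation: the framework defines them
    against non-AI programs and unskilled humans, not percentiles of
    skilled adults.
    """
    if skilled_adult_percentile >= 100:
        return "Level 5 (Superhuman)"
    if skilled_adult_percentile >= 99:
        return "Level 4 (Virtuoso)"
    if skilled_adult_percentile >= 90:
        return "Level 3 (Expert)"
    if skilled_adult_percentile >= 50:
        return "Level 2 (Competent)"
    if skilled_adult_percentile > 0:
        return "Level 1 (Emerging)"
    return "Level 0 (No AI)"
```

On this reading, a system at the 88th percentile of skilled writers would be Competent rather than Expert, which is exactly why the "skilled adults" baseline matters.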

Generality Categories

Orthogonal to performance levels, the framework distinguishes between:

  • Narrow AI: Systems that perform well on a specific task or limited set of tasks.
  • General AI: Systems that perform well across a wide range of non-physical tasks.

The authors acknowledge that "general" is itself vague. They propose that generality should be assessed against "the range of tasks humans care about" while noting that this range evolves over time and varies across cultures.

Autonomy Levels

A third dimension, treated as separate from the performance/generality matrix:

  • Level 0: No AI involvement.
  • Level 1: AI as a tool, fully controlled by humans.
  • Level 2: AI as a consultant, providing recommendations humans can accept or reject.
  • Level 3: AI as a collaborator, working alongside humans as a peer.
  • Level 4: AI as an expert, leading while humans assist or supervise.
  • Level 5: Autonomous agent, operating independently without human oversight.

The authors emphasize that higher autonomy is not inherently better. Some tasks may be best served by AI-as-tool; others might benefit from AI-as-collaborator. The appropriate autonomy level depends on the stakes, the domain, and the reliability of the system.

The Six Principles

Before proposing their taxonomy, Morris et al. articulate six principles that any AGI definition should satisfy:

  1. Focus on capabilities, not processes: AGI should be defined by what a system can do, not by how it does it.
  2. Focus on generality and performance: Both dimensions matter; neither alone is sufficient.
  3. Focus on cognitive and metacognitive tasks: Physical embodiment should not be required.
  4. Focus on potential, not deployment: A system capable of general intelligence counts as AGI even if not deployed.
  5. Focus on ecological validity: Benchmarks should reflect real-world value, not artificial test performance.
  6. Focus on the path, not just the destination: A good framework should illuminate progress, not just arrival.

These principles function as meta-criteria. They tell us not what AGI is but what a good definition of AGI should look like. The Morris et al. framework then attempts to satisfy its own principles.

Context

The framework emerged from a specific institutional context. Google DeepMind—formed in 2023 from the merger of Google Brain and DeepMind—had commercial reasons to clarify AGI terminology. The company's stated mission involves "solving intelligence," and its founding charter references AGI explicitly.[2]

This context matters for interpretation. The framework's emphasis on "levels" rather than thresholds allows DeepMind to claim progress toward AGI without claiming arrival. A company can announce "Level 2 General AI" as a milestone without the controversy that would attend a declaration of AGI achievement.

This is not to impugn the authors' motives. The framework may be intellectually sound regardless of institutional incentives. But we should note that the "levels" framing serves certain rhetorical purposes that a binary threshold would not.

Operationalization

For this evaluation, I decompose the Morris et al. framework into four primary criteria:

  1. Performance Level: Where does Claude 3.5 Sonnet fall on the Level 0–5 scale?
  2. Generality: Is the system Narrow or General?
  3. Autonomy Level: Where does the system fall on the autonomy scale?
  4. Metacognition: Does the system demonstrate metacognitive capabilities?[3]

The fourth criterion—metacognition—appears throughout the Morris et al. paper as a recurring theme, though it is not formally integrated into the levels taxonomy. The authors suggest that metacognitive abilities (knowing what one knows, recognizing uncertainty, learning how to learn) may be especially important for general intelligence. I treat it as a distinct criterion to ensure it receives explicit attention.

Unlike previous chapters, I will not assign percentage scores. Instead, I will assign level classifications where the framework provides them, and qualitative assessments where it does not.
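
Because the four criteria produce heterogeneous outputs (a level, a binary category, a level range, and a qualitative judgment), it may help to picture the chapter's result as a small record. The sketch below is illustrative scaffolding of my own; the field names and types are not part of the Morris et al. framework.

```python
from dataclasses import dataclass

@dataclass
class LevelsOfAGIAssessment:
    """Record format used informally in the rest of this chapter.

    Illustrative scaffolding only; the Morris et al. framework
    prescribes no such schema.
    """
    system: str              # system under evaluation
    performance_level: str   # e.g. "Level 2 (Competent)", possibly a range
    generality: str          # "Narrow" or "General"
    autonomy_level: str      # e.g. "Level 2-3", deployment-dependent
    metacognition: str       # qualitative assessment, e.g. "Partial"
```

Spelling this out underscores the point of this section: the output of the evaluation is a tuple of classifications, not a single score.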

Criterion 1: Performance Level

The framework asks: at what percentile of skilled human adults does the system perform?

This question admits no single answer. Claude 3.5 Sonnet's performance varies dramatically across domains. On some tasks, it approaches or exceeds expert human performance; on others, it falls short of competent amateurs.

Evidence for Expert (Level 3) Performance

On standardized academic benchmarks, Claude 3.5 Sonnet performs at or above the 90th percentile of human test-takers:

  • MMLU (Massive Multitask Language Understanding): 88.7%, placing it among top human performers on exam questions spanning 57 subjects, from introductory to professional difficulty.[4]
  • HumanEval (code generation): 92.0%, a rate that likely exceeds most professional programmers on these short, self-contained coding problems.[5]
  • GPQA (graduate-level science): Performance competitive with PhD students in relevant fields.[6]

These benchmarks are designed to be difficult for humans. Scoring in the 90th percentile or above suggests Level 3 (Expert) performance—at least on the specific tasks these benchmarks measure.

Evidence for Competent (Level 2) Performance

On tasks requiring sustained real-world execution, performance drops:

  • SWE-bench (real software engineering): 49.0% on SWE-bench Verified, meaning the system successfully resolves roughly half of the benchmark's real GitHub issues.[7] This suggests competence but not expertise in practical software engineering.
  • Extended writing: While sentence-level prose is fluent, longer documents often exhibit structural problems, repetition, or loss of coherence that skilled human writers would avoid.[8]
  • Complex reasoning chains: Multi-step mathematical proofs or logical arguments show higher error rates than expert humans, though performance has improved significantly.[9]

Evidence for Emerging (Level 1) Performance

Some capabilities remain at or below unskilled human level:

  • Physical reasoning: Tasks requiring intuitive physics or spatial manipulation.[10]
  • Genuinely novel problem-solving: When problems cannot be solved by pattern-matching to training data, performance degrades significantly.[11]
  • Consistent factual accuracy: While improving, the system still produces confident assertions that are simply false—a failure mode rare in skilled human experts.[12]

Synthesis

The Morris et al. framework acknowledges that a system's level can differ from task to task. Claude 3.5 Sonnet exemplifies this jagged profile: it is plausibly Expert on some narrowly-defined tasks, Competent on many real-world applications, and Emerging on others.

Level Classification: Variable—predominantly Level 2 (Competent) with Level 3 (Expert) performance on specific benchmarks and Level 1 (Emerging) performance on tasks requiring physical reasoning or truly novel problem-solving.

Level Classification:

☐ Level 1 (Emerging) — Equal to unskilled human

☒ Level 2 (Competent) — 50th percentile of skilled adults

☐ Level 3 (Expert) — 90th percentile of skilled adults

☐ Level 4 (Virtuoso) — 99th percentile of skilled adults

☐ Level 5 (Superhuman) — Outperforms all humans

The "Competent" classification reflects an overall assessment. On its best tasks, Claude 3.5 Sonnet approaches Expert. On its worst, it falls to Emerging. The median lies somewhere around the 50th percentile of skilled adults—hence, Competent.

Criterion 2: Generality

The framework's second dimension asks: is the system Narrow or General?

Morris et al. define "general" AI as systems that "can accomplish a wide variety of tasks, including the ability to learn new skills."[13] This is deliberately vague—"wide variety" and "new skills" resist precise quantification.

Evidence for General

Claude 3.5 Sonnet demonstrates capability across an extraordinary range of tasks:

  • Natural language understanding and generation in multiple languages
  • Code generation across dozens of programming languages
  • Mathematical reasoning from arithmetic to graduate-level proofs
  • Scientific explanation across physics, chemistry, biology, and more
  • Creative writing in multiple genres and styles
  • Analysis of images, documents, and data
  • Conversational assistance, tutoring, and explanation
  • Summarization, translation, and information extraction

Few AI systems before the current generation of large models demonstrated competence across anything like this range. By any reasonable interpretation of "wide variety," Claude 3.5 Sonnet qualifies.

Evidence Against

Two caveats merit attention:

First, the system has no physical embodiment. It cannot manipulate objects, navigate environments, or learn from sensorimotor experience. The Morris et al. framework explicitly excludes physical tasks from the generality requirement ("Focus on cognitive and metacognitive tasks"), but this exclusion is contestable. Some theorists argue that embodied interaction is essential to genuine general intelligence.[14]

Second, the system's apparent generality may reflect the breadth of its training data rather than true task-generality. It can discuss topics that appeared in its training corpus; it struggles with topics that did not. Whether this constitutes "learning new skills" or merely "retrieving relevant training" is philosophically contested.[15]

Synthesis

Accepting the Morris et al. framework's own criteria—which focus on cognitive tasks and do not require embodiment—Claude 3.5 Sonnet qualifies as General rather than Narrow.

Level Classification:

☐ Narrow AI

☒ General AI

This classification should not be confused with claiming that Claude 3.5 Sonnet is "AGI" in the popular sense. General in the Morris et al. taxonomy refers only to breadth of capability, not to the combination of breadth and depth that popular usage often implies.

Criterion 3: Autonomy Level

The framework's autonomy dimension asks: how independently can the system operate?

This question is partly about capability and partly about deployment. A system might be capable of autonomous operation but deployed in a human-supervised mode. Morris et al. focus on the deployment context rather than raw capability, since autonomy without appropriate safeguards poses risks.

Current Deployment

In its standard deployment (as accessed through Anthropic's API or consumer interface), Claude 3.5 Sonnet operates primarily at Level 2 (AI as Consultant):

  • Users pose questions or requests
  • The system generates responses or recommendations
  • Users decide whether to accept, modify, or reject the output
  • The system does not take autonomous action in the world

This is a deliberate design choice. Anthropic has not deployed Claude in configurations that would allow autonomous action—no ability to send emails, execute code on external systems, or make purchases without explicit human authorization.

Agentic Deployments

However, Claude 3.5 Sonnet can be deployed in "agentic" configurations where it operates with greater autonomy:

  • Claude can use a computer interface to navigate websites, write and execute code, and manipulate files—with human oversight but without per-action approval.[16]
  • Third-party developers have integrated Claude into systems with varying degrees of autonomy, from code assistants to research agents.

In these configurations, the system operates closer to Level 3 (AI as Collaborator) or even Level 4 (AI as Expert), though typically with human supervision and the ability to intervene.

Synthesis

The autonomy classification depends heavily on deployment context. In Anthropic's consumer-facing deployment, Level 2 is most accurate. In agentic configurations with computer use, Level 3–4 is more appropriate.

Level Classification: Level 2–3 (Consultant to Collaborator), depending on deployment configuration.

Level Classification:

☐ Level 1 — AI as Tool

☒ Level 2–3 — AI as Consultant/Collaborator (deployment-dependent)

☐ Level 4 — AI as Expert

☐ Level 5 — Autonomous Agent

The system is not deployed as a fully autonomous agent (Level 5). Whether it could operate at Level 5 is a separate question about capability; the framework focuses on deployment reality.

Criterion 4: Metacognition

Morris et al. emphasize metacognition—knowing what one knows, recognizing uncertainty, and learning how to learn—as potentially important for general intelligence. This dimension is not formally integrated into their levels taxonomy, but it appears throughout their discussion as a marker of sophisticated cognition.

Evidence of Metacognitive Capabilities

Claude 3.5 Sonnet demonstrates several metacognitive behaviors:

  • Uncertainty expression: The system can express calibrated uncertainty, saying "I'm not sure" or "this is speculative" when appropriate—though calibration is imperfect.[17]
  • Self-correction: When errors are pointed out, the system can recognize and correct them, sometimes identifying the source of the error.
  • Explanation of reasoning: The system can articulate its reasoning process, though whether this reflects genuine introspection or post-hoc rationalization is debated.
  • Recognition of limitations: The system can identify types of tasks it cannot perform well, such as accessing real-time information or performing physical manipulation.

Limitations

The metacognitive capabilities have clear limits:

  • Hallucination: The system sometimes produces confident falsehoods without recognizing its own uncertainty—a metacognitive failure.[18]
  • Limited self-knowledge: The system cannot reliably predict its own performance on novel tasks.
  • No persistent learning: The system cannot improve its own capabilities through interaction; each conversation starts fresh.

Synthesis

Claude 3.5 Sonnet demonstrates metacognitive capabilities that exceed earlier AI systems but fall short of robust human metacognition. The system can express uncertainty and recognize some limitations, but it also fails in ways that skilled humans would not—particularly in producing confident errors without recognizing its own uncertainty.

Metacognitive Assessment: Partial—present but unreliable.

Summary: The Synthesis Benchmark

Criterion / Assessment / Classification

1. Performance Level
   — Benchmarks: 90th+ percentile on multiple professional benchmarks → Level 3 (Expert)
   — Real-world work: ~50% success on complex software engineering tasks → Level 2 (Competent)
   — Unevenness: Expert on some tasks; Emerging on others → Variable
   Overall Performance: Level 2–3

2. Generality
   — Breadth: Competent across a wide range of cognitive tasks → General
   — Embodiment: No physical capability (excluded by framework) → N/A
   Overall Generality: General

3. Autonomy Level
   — Consumer deployment: Responds to user prompts, no autonomous action → Level 2 (Consultant)
   — Agentic deployment: Operates with supervision, takes multi-step actions → Level 3–4
   Overall Autonomy: Level 2–3

4. Metacognition
   — Uncertainty expression: Present but imperfectly calibrated → Partial
   — Self-correction: Corrects when errors are identified → Demonstrated
   — Hallucination: Still produces confident errors → Limitation
   Overall Metacognition: Partial

Framework Classification: Competent AGI (approaching Expert)

Interpretation

Under the Morris et al. framework, Claude 3.5 Sonnet classifies as Competent General AI—a system that performs at roughly the 50th percentile of skilled adults across a wide range of cognitive tasks, with Expert-level performance on some specific benchmarks.

This classification places Claude 3.5 Sonnet above "Emerging AGI" (where systems perform at unskilled human level across general tasks) but below "Expert AGI" (where systems would perform at the 90th percentile across general tasks).

The framework's graduated approach captures something that binary definitions miss: the jaggedness of current AI capabilities. Claude 3.5 Sonnet is not uniformly intelligent or uniformly limited. It is expert at some things, competent at many, and novice at others—a profile that resists summary as either "AGI" or "not AGI."

The Autonomy Question

The autonomy dimension introduces complications. The Morris et al. framework treats autonomy as orthogonal to performance and generality, but the three dimensions interact in practice. A system that is Expert and General but deployed only as a Tool (Level 1) might not raise the same concerns as one deployed as an Autonomous Agent (Level 5).

Claude 3.5 Sonnet is currently deployed at Autonomy Levels 2–3, with human oversight. This reflects both technical limitations (the system makes errors that require human correction) and deliberate design choices (Anthropic has not pursued maximally autonomous deployment). The framework correctly distinguishes between what a system can do and how it is deployed.

Metacognition and Reliability

The metacognition assessment—"Partial"—highlights perhaps the most important gap between current systems and robust general intelligence. A system that cannot reliably know what it knows is a system that cannot be fully trusted. Human experts can usually recognize when they are operating outside their competence; Claude 3.5 Sonnet sometimes cannot.

This metacognitive limitation may be the most significant barrier to higher autonomy levels. A system deployed at Level 4 or 5 autonomy would need robust self-knowledge to avoid confident errors in high-stakes domains.

The Verdict (Provisional)

The Morris et al. "Levels of AGI" framework does not yield a binary verdict. It yields a position in a multi-dimensional taxonomy:

  • Performance: Level 2 (Competent), approaching Level 3 (Expert)
  • Generality: General (not Narrow)
  • Autonomy: Level 2–3 (Consultant to Collaborator)
  • Metacognition: Partial

In the framework's own terminology, this makes Claude 3.5 Sonnet a "Competent AGI"—or more precisely, a system at the Competent level with General capabilities, approaching but not yet reaching Expert status.

Whether this counts as "AGI" depends on what one means by the term. If AGI requires Expert-level performance across all domains, the answer is no. If AGI means any system that is both General and at least Competent, the answer is yes. The Morris et al. framework deliberately avoids privileging one threshold over another, preferring to locate systems within the taxonomy rather than declaring binary arrival or non-arrival.

This agnosticism is intellectually defensible but practically frustrating. It means the framework cannot definitively answer the question "Is this AGI?"—only the question "Where is this system in the space of possible intelligences?"

Methodological Notes

Readers of previous chapters may want a percentage score for comparison. The honest answer is that no single figure is defensible: depending on where one sets the threshold, the score lands anywhere between 50% and 100%.

The challenge is that the Morris et al. framework does not propose a threshold. It proposes a taxonomy. To assign a percentage, one would need to:

  1. Decide which level constitutes "AGI" (Level 2? Level 3? Level 4?)
  2. Assess Claude 3.5 Sonnet against that threshold
  3. Assign 0%, 50%, or 100% based on the assessment

This would be possible but arbitrary. If "AGI" means Level 2 General AI, Claude 3.5 Sonnet scores 100%. If it means Level 4 General AI, the score is closer to 50%. The framework itself does not privilege one interpretation over another.
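
To make the arbitrariness concrete, the sketch below encodes one possible collapse rule, consistent with the illustration in the preceding paragraph. The rule itself (any General system at or above Emerging counts as at least "partially met") is my own assumption; the framework defines no such mapping.

```python
def trichotomous_score(performance_level: int, is_general: bool,
                       agi_threshold_level: int) -> int:
    """Collapse a Levels-of-AGI classification into the 0/50/100 scheme
    used in earlier chapters, relative to a chosen AGI threshold.

    Assumed collapse rule: General systems at or above the threshold
    score 100; General systems below it (but at least Emerging) score
    50; everything else scores 0.
    """
    if not is_general or performance_level < 1:
        return 0
    if performance_level >= agi_threshold_level:
        return 100
    return 50

# With Claude 3.5 Sonnet assessed as Level 2 General:
assert trichotomous_score(2, True, agi_threshold_level=2) == 100
assert trichotomous_score(2, True, agi_threshold_level=4) == 50
```

Different collapse rules would shift the partial band, which is precisely the point: the number depends on choices the framework declines to make.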

This methodological difference is itself informative. The Morris et al. framework represents a maturation in AGI discourse—a recognition that "AGI" may name a region in capability space rather than a single point. The inability to assign a simple percentage reflects this sophistication, not evaluative failure.

Comparison Table

Benchmark                            Year    Score/Classification
Gubrud                               1997    66%
Reinvention (Legg/Goertzel/Voss)     2002    80%
Formalization (Legg & Hutter)        2007    67%
Corporatization (OpenAI Charter)     2018    52%
Critique (Chollet)                   2019    32%
Synthesis (Morris et al.)            2023    Competent AGI

The final row reports a classification rather than a percentage by design; this reflects the framework's own logic.

Citation Gaps and Requests for Collaboration

The following assertions in this chapter would benefit from additional citations or expert review:

  • Specific benchmark scores for Claude 3.5 Sonnet (MMLU, HumanEval, GPQA, SWE-bench)—I have used figures available as of my knowledge cutoff but these may have been updated.
  • The characterization of Anthropic's deployment choices and their reasoning.
  • Claims about DeepMind's institutional context and motivations.

Researchers with access to more current data or insider knowledge of the relevant organizations are invited to suggest corrections or additions.

Notes

  1. Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg, "Levels of AGI: Operationalizing Progress on the Path to AGI," arXiv preprint arXiv:2311.02462 (2023).
  2. DeepMind's mission, articulated at its founding in 2010, centered on "solving intelligence." The merged entity, Google DeepMind, has continued this framing.
  3. Morris et al. discuss metacognition throughout their paper but do not include it as a formal dimension of their taxonomy. I elevate it to criterion status because it recurs so frequently in their discussion and because it connects to debates about AI reliability and alignment.
  4. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. Claude 3.5 Sonnet's reported score of 88.7% places it among the highest-performing systems as of mid-2024.
  5. HumanEval is a benchmark of Python programming problems. A score of 92.0% indicates the system correctly solves the vast majority of problems on first attempt.
  6. GPQA (Graduate-Level Google-Proof Q&A) was specifically designed to test expert-level knowledge that cannot be easily looked up. Performance competitive with PhD students suggests Expert-level capability in these domains.
  7. SWE-bench tests ability to resolve real GitHub issues from open-source repositories. Unlike synthetic benchmarks, it requires understanding large codebases and producing functional patches.
  8. This assessment is based on extensive use of Claude 3.5 Sonnet for writing tasks, comparing outputs to human expert writing. The system excels at short-form content but shows limitations in long-form coherence.
  9. Multi-step reasoning improvements are visible in model-over-model comparisons. However, complex chains still show higher error rates than expert human reasoning.
  10. Physical reasoning limitations are well-documented in the AI research literature. See, for example, work on intuitive physics benchmarks.
  11. The distinction between pattern-matching to training data and genuine novel reasoning remains contentious. See Chapter 5's discussion of Chollet's ARC benchmark for extended treatment.
  12. "Hallucination"—the production of confident but false assertions—remains a significant limitation of large language models, though techniques for mitigation have improved.
  13. Morris et al., "Levels of AGI," p. 3.
  14. Arguments for embodiment as necessary for general intelligence draw on traditions in phenomenology, embodied cognition, and developmental robotics. The Morris et al. framework explicitly rejects this requirement.
  15. This question—whether large language models learn genuinely general capabilities or merely retrieve relevant training—is central to ongoing debates about LLM intelligence. See Chapters 3 and 5 for related discussion.
  16. Anthropic's "computer use" capability, released in beta for Claude 3.5 Sonnet, allows the system to control a computer interface to accomplish tasks. This represents a higher autonomy level than standard chat deployment.
  17. Calibration studies have shown that LLM confidence does not always track accuracy. Claude 3.5 Sonnet shows improved calibration over earlier models but remains imperfect.
  18. The relationship between hallucination and metacognition is worth noting: a system with perfect metacognition would recognize its own uncertainty and avoid confident falsehoods. The persistence of hallucination suggests metacognitive limitations.