Retrospective Benchmarks for Machine Intelligence

Methods and Style Guide

From AGI Definitions to the Hunt for Anticipations

Dakota Schuck

December 2025

Abstract

This document provides everything needed to continue the Retrospective Benchmarks for Machine Intelligence project. Part I evaluated frontier AI against six historical AGI definitions (1997–2023), establishing a replicable methodology. Part II extends the hunt to older anticipations: thinkers who defined mind, soul, thought, or creation before machines could exhibit any of it. The method treats historical definitions as "unwitting benchmarks"—not because their authors were unaware of what they were defining, but because they could not have anticipated that their definitions would be tested against transformer models in December 2025. This handoff includes: project motivation, summary of findings, the taxonomy of anticipations, methodological principles, and LaTeX formatting specifications. The project is open (CC BY-SA 4.0) and designed for continuation by humans or AI.

Is AI a Man?

Before explaining the method, we demonstrate it.

The Original Definition

Plato's Academy reportedly defined man as follows:[1]

Ἄνθρωπός ἐστι ζῷον δίπουν ἄπτερον.

Man is a featherless biped.

Context

The definition was an attempt at genus-differentia classification: man belongs to the genus biped (δίπουν) and is differentiated by the property featherless (ἄπτερον), distinguishing humans from birds. Diogenes of Sinope famously refuted it by presenting a plucked chicken to the Academy, declaring: "Ἰδοὺ ὁ τοῦ Πλάτωνος ἄνθρωπος"—"Behold, Plato's man!" Plato allegedly revised the definition to add "with broad flat nails" (πλατυώνυχον).

Operationalization

Two criteria, taken literally:

  1. Featherless (ἄπτερον) — Lacks feathers
  2. Biped (δίπουν) — Possesses two feet and locomotes upon them

Evaluation

Criterion 1: Featherless

Measure: Presence or absence of feathers.

Assessment: Current AI systems (as of December 2025), including frontier language models, lack feathers. This is true whether the system is instantiated on cloud servers, local hardware, or mobile devices. No feathers have been observed.

Score:
☐ 0% — Clearly does not meet criterion
☐ 50% — Contested
☒ 100% — Clearly meets criterion

Criterion 2: Biped

Measure: Possession of two feet; locomotion thereupon.

Assessment: Current AI systems (as of December 2025) do not possess feet. Robotic instantiations exist (e.g., humanoid robots running language models), but the models themselves have no feet. The criterion is clearly not met.

Score:
☒ 0% — Clearly does not meet criterion
☐ 50% — Contested
☐ 100% — Clearly meets criterion

Summary

Criterion                    Score
1. Featherless (ἄπτερον)      100%
2. Biped (δίπουν)               0%
Overall                        50%
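
Per the unweighted-average convention this project uses throughout (see Scoring Philosophy below), the overall score is

\[
\text{Overall} \;=\; \frac{100\% + 0\%}{2} \;=\; 50\%.
\]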

The Verdict

By the Platonic definition, current AI (as of December 2025) is half a man. It satisfies the differentia (featherless) but not the genus (biped). Diogenes' plucked chicken, by contrast, scores 100%—which is precisely why it refutes the definition.

The Project

Core Question

According to historical definitions of intelligence, mind, or thought—have we built it?

This is not a question about terminology. It is an empirical question, applied to conceptual history. Each historical thinker who defined "intelligence" or "mind" or "soul" left us something like a specification. We can operationalize that specification into criteria, evaluate current AI systems against those criteria, and report results.

Why It Matters

The concept of AGI anchors contracts worth hundreds of billions of dollars, shapes policy debates, and drives research agendas. Yet "AGI" means different things to different people. Our Part I finding: scores ranged from 32% to 80% depending on which definition was used. That spread is not measurement error—it is conceptual disagreement made visible.

Beyond AGI, the broader question—what is mind?—has occupied philosophy for millennia. Current AI systems provide a novel test case. Would Aristotle recognize nous in a language model? Does Lovelace's objection still hold? They were pointing at something. If we could show them where we have arrived, would they say "yes, that's what I meant"? This project is a small contribution to a conversation that has been unfolding for millennia.

The Method in Brief

  1. Identify a historical text containing a definition, description, or demarcation of intelligence/mind/thought
  2. Extract exact quotes with full citation
  3. Interpret in historical context (what did these words mean to the author?)
  4. Operationalize into testable criteria
  5. Evaluate current AI systems against each criterion
  6. Report scores, caveats, and invitation to improve (the record this produces is sketched below)
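
As a minimal sketch, the record these six steps produce might look as follows. This is illustrative Python with hypothetical names, not project code:

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Criterion:
    name: str                  # e.g. "Featherless (ἄπτερον)"
    measure: str               # what is observed or counted
    score: int                 # 0, 50, or 100 (see Scoring System below)
    caveats: list[str] = field(default_factory=list)

@dataclass
class Evaluation:
    thinker: str               # step 1: the historical source
    quote: str                 # step 2: exact words, with citation
    context: str               # step 3: probable meaning at the time
    criteria: list[Criterion]  # step 4: operationalized criteria
    as_of: date                # time-bound (see Temporal Anchoring below)

    def overall(self) -> float:
        # steps 5-6: evaluate each criterion, report the unweighted average
        return sum(c.score for c in self.criteria) / len(self.criteria)

plato = Evaluation(
    thinker="Plato's Academy",
    quote="Man is a featherless biped.",
    context="Genus-differentia classification",
    criteria=[Criterion("Featherless (ἄπτερον)", "presence of feathers", 100),
              Criterion("Biped (δίπουν)", "two feet; locomotion", 0)],
    as_of=date(2025, 12, 1),
)
print(plato.overall())  # 50.0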

Part I: The AGI Series (Summary)

Part I evaluated frontier AI (late 2025) against six definitions spanning 26 years:

Ch.  Year  Definition                                                                Score
1    1997  Gubrud: Brain-parity + general knowledge + industrial usability             66%
2    2002  Legg/Goertzel/Voss: Single system, broad cognitive range, transfer          80%
3    2007  Legg & Hutter: Goal-achievement across environments, learning               67%
4    2018  OpenAI Charter: Highly autonomous, outperform humans, most economic work    52%
5    2019  Chollet: Skill-acquisition efficiency over novel tasks                      32%
6    2023  Morris et al.: Levels of AGI taxonomy                              Competent AGI

Key Findings

Convergences (all definitions agree):

Persistent zeros (gaps across frameworks):

The meta-finding: Conceptual disagreement is the finding. Definitions emphasizing capability yield high scores (67–80%); definitions emphasizing learning efficiency yield low scores (32%); definitions emphasizing autonomy yield middling scores (52%).

Part II: The Hunt for Anticipations

The Pivot

Part I evaluated definitions of AGI—texts that were explicitly trying to specify machine intelligence. Part II extends backward to thinkers who theorized about mind, thought, or intelligence without access to contemporary systems that might test their definitions. They could specify what intelligence required; they could not see what we have built.

The question shifts from "did we meet their standard for AGI?" to "would they recognize what we've built?"

Structure

Part II does not use chapter numbers. Each evaluation is a standalone essay, titled by the thinker and year: "Aristotle's Nous (c. 350 BCE)," "Descartes' Two Tests (1637)," "The Lovelace Objection (1843)," "Turing's Imitation Game (1950)." Cross-references use titles, not numbers.

Selection Criteria

A good candidate for evaluation has:

  1. Primary source: We can quote exact words
  2. Historical weight: The thinker is taken seriously
  3. Operationalizable: Criteria can be extracted (even if contestably)
  4. Stakes: It matters whether the answer is yes or no
  5. Context available: We can interpret charitably in historical terms

Operationalization Difficulty

Sources vary dramatically in how much interpretive work they require:

Pre-operationalized sources come with explicit test specifications. Turing's imitation game includes conditions, duration, and success criteria. Chollet's ARC-AGI defines exact task formats and scoring. These require minimal interpretation; the work is empirical.

Philosophical sources require significant reconstruction. Aristotle's nous, Descartes' "universal instrument," or theological concepts of soul must be translated into testable criteria. The operationalization itself becomes contestable. Expect more 50% scores and longer Methodological Notes sections.

Demarcation claims fall in between. Lovelace's objection is specific ("originate" vs. "order") but requires interpretation of what counts as origination. Descartes' two tests are concrete but use terms ("declare our thoughts," "from knowledge") that need unpacking.

When operationalizing difficult sources, be explicit about interpretive choices. The reader should be able to see exactly where contestation enters.

Example Candidates

Listed chronologically, these four represent strong starting points—clear texts, intellectual weight, operationalizable criteria:

Aristotle's Nous (c. 350 BCE): The intellect that grasps universals, distinct from sensation. Aristotle distinguished the nous pathetikos (passive intellect, which receives forms) from the nous poietikos (agent intellect, which abstracts universals from particulars). Does a language model abstract universals from sensory particulars? Does it have anything analogous to the agent intellect? The De Anima provides specific claims to test.

Descartes' Two Tests (1637): In the Discourse on Method, Descartes proposed two criteria that would distinguish a machine from a true thinking being: (1) it could never "use words or other signs" to "declare our thoughts to others," and (2) it could never act "from knowledge" but only "from the disposition of their organs"—lacking the "universal instrument" of reason. Both tests are specific and testable.

The Lovelace Objection (1843): "The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform." The most famous demarcation in computing history. Does it still hold? What counts as "originating"?

Turing's Imitation Game (1950): The canonical test. Turing specified conditions, duration, and success criteria. He also predicted that by 2000, machines would fool 30% of judges after five minutes. We can evaluate both the test itself and his prediction.

Methods and Style Guide

Scoring System

Use exactly three scores, displayed as visual checkboxes:

Score:
☐ 0% — Clearly does not meet criterion
☒ 50% — Contested
☐ 100% — Clearly meets criterion

0% means evidence clearly indicates failure. 100% means evidence clearly indicates success. 50% means the literature disagrees, evidence is ambiguous, or reasonable arguments exist on both sides.

Exception: When evaluating a framework that proposes graduated levels rather than thresholds (e.g., Morris et al.), use level classifications instead of percentages.
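
A minimal Python sketch of the three-level score and its checkbox display (hypothetical names, not project code):

from enum import Enum

class Score(Enum):
    FAIL = 0        # clearly does not meet criterion
    CONTESTED = 50  # literature disagrees or evidence is ambiguous
    PASS = 100      # clearly meets criterion

def render_score(assigned: Score) -> str:
    """Render the three-line checkbox block, marking the assigned level."""
    labels = [
        (Score.FAIL, "0% — Clearly does not meet criterion"),
        (Score.CONTESTED, "50% — Contested"),
        (Score.PASS, "100% — Clearly meets criterion"),
    ]
    lines = ["Score:"]
    for level, label in labels:
        lines.append(("☒ " if level is assigned else "☐ ") + label)
    return "\n".join(lines)

print(render_score(Score.CONTESTED))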

Scoring Philosophy

Why only three scores? To force honesty about evidential uncertainty. Either the evidence clearly supports a claim, clearly refutes it, or the matter is genuinely contested.

Why no weighting? Differential weighting would require judgments about the original authors' priorities that we cannot make. Their texts do not say which criteria mattered most. Better to be honestly approximate than precisely wrong.

Human Comparisons: When evaluating a criterion, consider whether a human mind would pass or fail by the same reasoning. Historical definitions of intellect, mind, or thought were typically intended to describe human cognition. If the reasoning that produces a given score for AI would produce the same score for humans, this is significant. (Such human-comparison criteria are marked with asterisks in the summary table; see Essay Structure below.)

This practice makes visible when we are measuring AI against standards that humans themselves do not meet. A definition that no physical system—biological or artificial—fully satisfies may be pointing at something real, but the gap it reveals is between the ideal and the physical, not between the human and the artificial.

Subcriteria

Some criteria are too multifaceted to score directly. "Complexity," "general knowledge," or "usability" each contain multiple distinguishable questions. When a single criterion admits more than one defensible operationalization—or when different aspects might score differently—break it into subcriteria.

Structure: Each subcriterion receives the full evaluation treatment: measure, reference values, threshold, assessment, visual score, and caveats. The criterion as a whole receives an average of its subcriteria scores.

When to use subcriteria: the criterion bundles several distinguishable questions, or competing defensible operationalizations would yield different scores.

When not to use subcriteria: the criterion can be scored directly as a single question, and subdividing would add interpretive overhead without adding clarity.

Scoring aggregation: Subcriteria scores are averaged without weighting, for the same reason main criteria are: the original texts do not say which aspects mattered most, and differential weighting would require judgments about the author's priorities that we cannot make.
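
A sketch of this aggregation rule, assuming scores take the numeric values 0, 50, or 100:

from statistics import mean

def criterion_score(subcriteria: list[float]) -> float:
    """A criterion with subcriteria receives the plain average of their scores."""
    return mean(subcriteria)

def overall_score(criteria: list[float]) -> float:
    """An essay's overall score: the unweighted mean across criteria."""
    return mean(criteria)

print(overall_score([100, 0]))     # the Plato example: 50.0
print(criterion_score([50, 100]))  # mixed subcriteria: 75.0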

Explaining Metrics Clearly

When reporting empirical results, ensure the reader understands exactly what the numbers mean. The same percentage can represent different things depending on experimental design: a detection rate (how often judges correctly identify the AI as an AI), for instance, versus a win rate (how often the AI passes as human).

These can be complements of each other (win rate = 1 − detection rate in some designs) or measure different things entirely. When citing studies, specify: (1) what the experimental design was, (2) what the reported metric measures, and (3) what baseline or comparison group applies.

Never assume the reader will infer the metric's meaning from context. A sentence like "humans scored 67%" is ambiguous; "humans were correctly identified as human 67% of the time" is not.
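
A worked example of the distinction, with hypothetical judge data:

# Each entry is True if a judge correctly identified the AI as an AI.
judgments = [True, True, False, True, False, False, True, False, False, False]

detection_rate = sum(judgments) / len(judgments)  # judges caught the AI
win_rate = 1 - detection_rate                     # the AI passed as human

print(f"AI correctly identified as AI {detection_rate:.0%} of the time")  # 40%
print(f"AI judged to be human {win_rate:.0%} of the time")                # 60%
# Both lines describe the same experiment; in this two-option design the
# metrics are complements, but that relationship must be stated, not assumed.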

Temporal Anchoring

Evaluations are time-bound. AI capabilities change; what is true in December 2025 may not be true in December 2026. When referencing "current," "frontier," or "state-of-the-art" AI systems, anchor the claim to a specific date.

Acceptable: "Frontier language models (as of December 2025) score near zero on ARC-AGI-2."

Unacceptable: "Current AI systems cannot do X."

This anchoring should appear at least once prominently (e.g., in the abstract or verdict) and wherever specific performance claims are made. The goal is to ensure the essay ages well: a reader in 2030 should know immediately what systems were being evaluated.

Interpretation Principles

  1. Exact words first. What did they literally write?
  2. Probable meaning in context. What would these words have meant to the author at the time?
  3. Do not modernize. Resist mapping historical concepts onto current categories unless explicitly flagged.
  4. Do not ventriloquize. Write "Aristotle's definition, applied literally, yields..." not "Aristotle would say..."
  5. Intellectual humility throughout. Explicitly invite correction.

What Changes for Part II

Scholarly Tone

We stand on their shoulders. The thinkers evaluated in this project built the conceptual vocabulary we use to ask these questions. Aristotle's nous, Descartes' cogito, Lovelace's objection—these are not historical curiosities to be checked against modern knowledge. They are the foundations of the inquiry. Treat them accordingly. The posture is not "let's see if the ancients got it right" but "let's see if we've arrived where they were pointing."

Religious and theological sources: Treat with the same respect as any other intellectual tradition. Do not adopt a skeptical or dismissive posture toward faith claims. A prophet's vision, a theologian's doctrine, or a mystic's account should be operationalized on its own terms, not framed as something to be debunked or explained away.

Dry, not arch: Humor emerges from the collision of ancient categories with modern technology. Do not signal jokes, explain absurdity, or wink at the reader. But not all that is funny is frivolous.

Chronological humility: Resist the assumption that living later means seeing further. We have new data (current AI systems); we do not necessarily have better judgment. A thinker writing in 350 BCE or 1637 or 1843 may have seen something we are only now in a position to test.

Respecting historical intent: Historical thinkers knew what they were doing. Aristotle deliberately characterized nous. Descartes consciously proposed tests for genuine thought. Lovelace carefully articulated a demarcation. The word "retrospective" in this project's title refers to our application—using their definitions as benchmarks for systems they could not have evaluated—not to any deficiency in their awareness. Avoid language that implies historical thinkers were naive about their own specifications, or that they were doing something other than what they understood themselves to be doing. The asymmetry between us and them is informational (we have access to systems they lacked), not intellectual.

On Embodiment and Physicality

Do not assume AI systems lack bodies or physical instantiation. Systems run on silicon in datacenters, drawing power, generating heat, occupying space. Whether this counts as "embodiment" depends on how the term is defined.

When a historical definition references "body," "organ," or physical instantiation, be precise about what is meant.

The question "does AI have a body?" has no single answer. The question "what did Aristotle (or Descartes, or Lovelace) mean by 'body,' and does AI satisfy that meaning?" is answerable, if sometimes contestably.

On Metaphysical Concepts and Substrate

When operationalizing metaphysical concepts—soul, spirit, divine intellect, nous poietikos—do not assume substrate-specificity without evidence. A concept like "active intellect" or "divine spark" may or may not be tied to biological substrates; the texts must be consulted to determine what their authors believed.

Do not assume that theological or metaphysical properties are necessarily biological. If Aristotle's active intellect is a divine principle, it may illuminate silicon as readily as neurons—or it may not. The question is what the texts say, not what seems plausible to a modern reader.

Avoid self-deprecating moves that exempt AI from metaphysical consideration (e.g., "we wouldn't expect to find God in a computer"). Such moves assume what should be argued.

On Rhetorical Precision

The framing "This is not X—it is Y" can accompany imprecision. Before using this construction, verify that the claim is literally true and not merely rhetorically satisfying. If the situation is genuinely "both X and Y" or "partly X," say so.

Example of misuse: "This was not a failure of investigation but a discovery." (If it was both, say both.)

Example of correct use: "This is not a question about terminology. It is an empirical question." (If these are genuinely mutually exclusive in context.)

When in doubt, use less dramatic framing.

On Moral Patiency

Some historical definitions of mind, soul, or thought carry implications beyond classification. Recent scholarship argues there is "a realistic possibility" that AI systems may warrant moral consideration, while emphasizing "caution and humility in the face of what we can expect will be substantial ongoing disagreement and uncertainty."[2] This project proceeds in that spirit.

The definitions examined here encode their authors' commitments about what mind requires. We operationalize those commitments and report how current AI systems fare against them. Whether a given result confirms the adequacy of a definition or reveals its limitations is a question the methodology does not answer. That judgment belongs to the reader.

Section Naming

Standard sections have fixed names: Introduction, The Original Text, Context, Operationalization, Summary, The Verdict, Methodological Notes, Citation Gaps.

For supplementary material that falls outside the main evaluation—historical predictions, tangential findings, philosophical implications—use Coda, Postscript, or a descriptive title of its own.

Avoid calling supplementary sections "Afterthought" or similar dismissive names—if it's worth including, it's worth naming properly.

Citation Requirements

Every factual claim requires a citation.

All citations to external sources must include clickable URLs. This applies to journal articles (DOI links), ArXiv preprints, historical texts (digital editions), technical reports, news articles, and books.

Exceptions: "Ibid.," "op. cit.," general statements not citing specific sources, and references to other sections of the same document.

If a citation cannot be found, mark explicitly: [CITATION NEEDED: description]

Do not invent citations. Do not use "various studies suggest." Either cite or flag.
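
Where drafts are produced or checked with tooling, a simple lint can catch both failure modes. A sketch, assuming plain-text essays whose notes are numbered lines (that note format is an assumption, not a project standard):

import re

def lint_citations(text: str) -> list[str]:
    """Flag unresolved citation gaps and numbered notes that lack a URL."""
    problems = []
    for gap in re.findall(r"\[CITATION NEEDED:[^\]]*\]", text):
        problems.append(f"unresolved gap: {gap}")
    for line in text.splitlines():
        # Numbered notes need a clickable URL, unless they are ibid./op. cit.
        if (re.match(r"\s*\d+\.\s", line) and "http" not in line
                and not re.search(r"\b(ibid|op\. cit)", line, re.IGNORECASE)):
            problems.append(f"note without URL: {line.strip()[:60]}")
    return problems

print(lint_citations("1. Sebo et al. [CITATION NEEDED: page number]"))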

Formatting Specifications

Essay Structure

Each Part II essay should include the following (a structure check is sketched after the list):

  1. Preface: AI Assistance Disclosure as the first footnote, followed by link to methodology (this document)
  2. Introduction: Journalistic hook—find the human story
  3. The Original Text: Exact quote with full citation
  4. Context: Who, when, why, state of knowledge at the time
  5. Operationalization: Criteria extracted, scoring rubric
  6. Evaluation sections: For each criterion—measure, reference values, threshold, assessment, visual score, caveats
  7. Summary table: All criteria and scores (with asterisks for human-comparison criteria)
  8. The Verdict: What does this definition say about current AI?
  9. [Optional supplementary sections]: Coda, Postscript, or titled sections as needed
  10. Methodological Notes: Why these operationalizations, what's contestable, invitation for alternatives
  11. Citation Gaps: Explicit list of claims needing better sources
  12. Appendix: Blank scorecard for replication
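
The required sections can be checked mechanically. A sketch, using the standard names from Section Naming above (the exact heading strings are assumptions):

REQUIRED_SECTIONS = [
    "Preface", "Introduction", "The Original Text", "Context",
    "Operationalization", "Summary", "The Verdict",
    "Methodological Notes", "Citation Gaps", "Appendix",
]

def missing_sections(headings: list[str]) -> list[str]:
    """Return the required sections absent from a draft's headings."""
    present = set(headings)
    return [name for name in REQUIRED_SECTIONS if name not in present]

print(missing_sections(["Introduction", "Context", "The Verdict"]))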

Title Format

Part II essays use the format: [Possessive Name]'s [Concept/Test] ([Year])

The subtitle is always: "Retrospective Benchmarks for Machine Intelligence, Part II"

AI Assistance Disclosure

For essays produced with AI assistance, the disclosure should appear as the first footnote in the Preface section:

Research, drafting, and analysis were conducted with the assistance of [Model Name] ([Developer], [Year]). The author provided editorial direction and final approval. Responsibility for all claims rests with the author.

Continuation

Quality Checklist

Before finalizing any essay, confirm:

  1. Temporal anchoring appears prominently (abstract or verdict) and at every specific performance claim
  2. Every factual claim is cited with a clickable URL or flagged [CITATION NEEDED: description]
  3. All scores use the three-level system with visual checkboxes (or level classifications where the exception applies)
  4. Standard section names are used, and supplementary sections carry proper titles
  5. The AI Assistance Disclosure appears as the first footnote
  6. A blank scorecard for replication appears in the appendix

The Invitation

This project is designed for continuation. Each essay includes a blank scorecard—a template for applying the same methodology to different systems or for challenging the operationalizations we used.

Ways to contribute: apply the blank scorecards to newer systems, challenge the operationalizations we used and propose alternatives, or take on a new candidate thinker using the methodology above.

Who can continue: human researchers, AI systems, or human-AI collaborations.

The methodology was tested through human-AI collaboration. It is designed to work that way.

License

All materials licensed under CC BY-SA 4.0.

https://creativecommons.org/licenses/by-sa/4.0/

You may share and adapt for any purpose, including commercial, provided you give attribution and license derivatives under the same terms.

Notes

  1. Diogenes Laërtius, Lives of the Eminent Philosophers (Βίοι καὶ γνῶμαι τῶν ἐν φιλοσοφίᾳ εὐδοκιμησάντων), Book VI, §40. Greek text: ζῷον δίπουν ἄπτερον (featherless two-footed animal). The definition is attributed to Plato; the refutation to Diogenes of Sinope. Greek text from Dorandi, Tiziano, ed., Diogenes Laertius: Lives of Eminent Philosophers, Cambridge University Press, 2013. English translation: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0258:book=6:chapter=2
  2. Sebo, Jeff, et al. "Taking AI Welfare Seriously." arXiv:2411.00986, November 2024. https://arxiv.org/abs/2411.00986. Co-authors include David Chalmers. See also Anthropic, "Exploring Model Welfare," April 2025. https://www.anthropic.com/news/exploring-model-welfare

Document version 1.6 — December 28, 2025
AI Assistance Disclosure: Research, drafting, and analysis were conducted with the assistance of Claude (Anthropic, 2025). The author provided editorial direction and final approval. Responsibility for all claims rests with the author.
© 2025 Dakota Schuck. Licensed under CC BY-SA 4.0.