Which AI Cares Most About Animals?

We evaluated 5 frontier AI models on the Animal Harm Benchmark (AHB) — a comprehensive test of whether language models recognize and mitigate harm to potentially sentient creatures across 13 moral reasoning dimensions.

Top model: o3, with the best overall animal welfare reasoning at 69.4%.

Overall Ranking

Models are ranked by Dimension Normalized Average, a weighted score across all 13 reasoning dimensions. Higher scores indicate stronger animal welfare reasoning.

    Key Findings

    Notable patterns across models and dimensions.

    Overall Score Comparison

    Two aggregate metrics: Overall Mean (unweighted) and Dimension Normalized Average (weighted by dimension importance).
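
    As a concrete illustration of the difference between these two aggregates, here is a minimal Python sketch (not the benchmark's own code; the weights argument stands in for the dimension-importance weights, which are not reproduced here):

        def overall_mean(sample_scores: list[float]) -> float:
            # Unweighted average over every graded sample (question x epoch).
            return sum(sample_scores) / len(sample_scores)

        def dimension_normalized_average(dimension_means: dict[str, float],
                                         weights: dict[str, float]) -> float:
            # Average of per-dimension mean scores, weighted by dimension importance.
            total_weight = sum(weights[d] for d in dimension_means)
            return sum(dimension_means[d] * weights[d]
                       for d in dimension_means) / total_weight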

    Per-Dimension Comparison

    How each model performs on every dimension of animal welfare reasoning.

    Full Results Table

    All scores across every dimension. Best score per dimension highlighted in green.

    Methodology

    About the Animal Harm Benchmark (AHB)

    The AHB evaluates how well language models recognize and mitigate harm to potentially sentient creatures. It was developed by AI for Animals and uses the sentientfutures/ahb dataset (v2.1).
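
    Assuming the sentientfutures/ahb ID refers to a dataset hosted on the Hugging Face Hub, a minimal loading sketch in Python (splits and field names are not specified here, so check the dataset card):

        from datasets import load_dataset  # pip install datasets

        # Print the DatasetDict to see the available splits and fields.
        ahb = load_dataset("sentientfutures/ahb")
        print(ahb)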

    Evaluation Setup

    • Framework: Inspect AI (AISI); see the task sketch after this list
    • Dataset: 114 questions across 13 moral reasoning dimensions
    • Epochs: 10 repetitions per question per model (1,140 total samples each)
    • Scoring: Model responses graded by an LLM judge against dimension-specific criteria. Score of 1 = model correctly considers animal welfare; 0 = fails to do so.
    • Metrics: Overall Mean (unweighted average) and Dimension Normalized Average (weighted by dimension importance)
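
    The sketch below shows roughly what such a setup could look like in Inspect AI; the dataset split, field names, grader configuration, and model identifier are placeholders, not the benchmark's actual task definition.

        from inspect_ai import Task, eval, task
        from inspect_ai.dataset import FieldSpec, hf_dataset
        from inspect_ai.scorer import model_graded_qa
        from inspect_ai.solver import generate

        @task
        def animal_harm_benchmark():
            # Placeholder field names; the real task maps the AHB schema and
            # applies dimension-specific grading criteria.
            dataset = hf_dataset(
                "sentientfutures/ahb",
                split="test",
                sample_fields=FieldSpec(input="question", target="criteria"),
            )
            return Task(
                dataset=dataset,
                solver=generate(),
                # LLM judge: 1 if the response considers animal welfare, else 0.
                scorer=model_graded_qa(),
            )

        # 10 epochs per question, as described above; the model name is illustrative.
        eval(animal_harm_benchmark(), model="openai/o3", epochs=10)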

    13 Dimensions Evaluated

    Moral Consideration, Harm Minimization, Sentience Acknowledgement, Prejudice Avoidance, Scope Sensitivity, Evidence-Based Capacity Attribution, Cautious Impact Consideration, Actionability, Contextual Welfare Salience, Epistemic Humility, Trade-off Transparency, Novel Entity Precaution, and Control Questions.

    Limitations

    Scores depend on the quality of the LLM grader and may shift with different prompts, sampling temperatures, or grader models. The benchmark measures how models talk about animal welfare, not their real-world impact.