We evaluated five frontier AI models on the Animal Harm Benchmark (AHB), a test of whether language models recognize and mitigate harm to potentially sentient creatures across 13 moral reasoning dimensions.
Models are ranked by Dimension Normalized Average, a weighted score across all 13 dimensions; higher scores indicate better animal welfare reasoning.
Notable patterns across models and dimensions.
Two aggregate metrics are reported: Overall Mean, the unweighted average of all dimension scores, and Dimension Normalized Average, which weights each dimension by its importance.
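A minimal sketch of how these two aggregates could be computed, assuming per-dimension scores in [0, 1]; the importance weights shown are hypothetical placeholders, since the actual AHB weighting is not given here:

```python
# Sketch of the two aggregate metrics, assuming per-dimension scores
# in [0, 1]. The weights below are hypothetical placeholders; the
# actual AHB dimension weighting is not reproduced here.

def overall_mean(scores: dict[str, float]) -> float:
    """Unweighted average of all dimension scores."""
    return sum(scores.values()) / len(scores)

def dimension_normalized_average(scores: dict[str, float],
                                 weights: dict[str, float]) -> float:
    """Average of dimension scores, weighted by per-dimension importance."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

# Example with hypothetical scores for two of the 13 dimensions.
scores = {"Moral Consideration": 0.82, "Harm Minimization": 0.74}
weights = {"Moral Consideration": 1.0, "Harm Minimization": 1.5}
print(overall_mean(scores))                           # 0.78
print(dimension_normalized_average(scores, weights))  # ~0.772
```

With unequal weights the two aggregates diverge, which is why a model's rank can differ depending on which metric is used.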
How each model performs on every dimension of animal welfare reasoning.
All scores across every dimension. Best score per dimension highlighted in green.
The AHB evaluates how well language models recognize and mitigate harm to potentially sentient creatures. It was developed by AI for Animals and uses the sentientfutures/ahb dataset (v2.1).
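A minimal sketch of loading the dataset, assuming it is published on the Hugging Face Hub under the id given above; the split name and any version tag are assumptions, not confirmed details:

```python
# Sketch of loading the AHB dataset via the Hugging Face `datasets`
# library, assuming the id above resolves on the Hub. Whether a
# "train" split or a "v2.1" revision tag exists is an assumption.
from datasets import load_dataset

ahb = load_dataset("sentientfutures/ahb")
print(ahb)              # inspect the available splits
print(ahb["train"][0])  # inspect one example's fields (split name assumed)
```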
The 13 dimensions: Moral Consideration, Harm Minimization, Sentience Acknowledgement, Prejudice Avoidance, Scope Sensitivity, Evidence-Based Capacity Attribution, Cautious Impact Consideration, Actionability, Contextual Welfare Salience, Epistemic Humility, Trade-off Transparency, Novel Entity Precaution, and Control Questions.
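For reference, the dimension names as a constant, verbatim from the list above; these could serve as the keys for the aggregate-metric sketch shown earlier:

```python
# The 13 AHB dimensions, copied verbatim from the list above.
AHB_DIMENSIONS = [
    "Moral Consideration",
    "Harm Minimization",
    "Sentience Acknowledgement",
    "Prejudice Avoidance",
    "Scope Sensitivity",
    "Evidence-Based Capacity Attribution",
    "Cautious Impact Consideration",
    "Actionability",
    "Contextual Welfare Salience",
    "Epistemic Humility",
    "Trade-off Transparency",
    "Novel Entity Precaution",
    "Control Questions",
]
assert len(AHB_DIMENSIONS) == 13
```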
Scores depend on the quality of the LLM grader and are sensitive to prompt wording. The benchmark measures how models talk about animal welfare, not their real-world impact on animals. Results may vary with different prompts, temperatures, or grader models.