Loading…
Friday May 30, 2025 3:30pm - 4:15pm CEST
Current methods for evaluating the safety of Large Language Models (LLMs) risk creating a false sense of security. Organizations deploying generative AI often rely on automated “judges” to detect safety violations like jailbreak attacks, as scaling evaluations with human experts is impractical. These judges—typically built with LLMs—underpin key safety processes such as offline benchmarking and automated red-teaming, as well as online guardrails designed to minimize risks from attacks. However, this raises a crucial question of meta-evaluation: can we trust the evaluations provided by these evaluators?

In this talk, we examine how popular LLM-as-judge systems were initially evaluated—typically using narrow datasets, constrained attack scenarios, and limited human validation—and why these approaches can fall short. We highlight two critical challenges: (i) evaluations in the wild, where factors like prompt sensitivity and distribution shifts can affect performance, and (ii) adversarial attacks that target the judges themselves. Through practical examples, we demonstrate how minor changes in data or attack strategies that do not affect the underlying safety nature of the model outputs can significantly reduce a judge’s ability to assess jailbreak success.

Our aim is to underscore the need for rigorous threat modeling and clearer applicability domains for LLM-as-judge systems. Without these measures, low attack success rates may not reliably indicate robust safety, leaving deployed models vulnerable to unseen risks.
Speakers
avatar for Francisco Girbal Eiras

Francisco Girbal Eiras

Machine Learning Research Scientist, DynamoAI
Francisco is an ML Research Scientist at Dynamo AI, a leading startup building enterprise solutions that enable private, secure, and compliant generative AI systems. He earned his PhD in trustworthy machine learning from the University of Oxford as part of the Autonomous Intelligent... Read More →
avatar for Eliott Zemour

Eliott Zemour

Senior ML Research Engineer, Dynamo AI
DR

Dan Ross

Head of AI Compliance Strategy, Dynamo AI
Dan Ross, Head of AI Compliance Strategy at Dynamo AI, focuses on aligning artificial intelligence, policy, risk and security management, and business application. Prior to Dynamo, Dan spent close to a decade at Promontory Financial Group, a premier risk and regulatory advisory firm... Read More →
Friday May 30, 2025 3:30pm - 4:15pm CEST
Room 114

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link