As large language models (LLMs) become increasingly central to AI systems, reliable evaluation of their outputs has emerged as a critical challenge. Recent advances demonstrate that LLMs themselves can serve as effective evaluators—or “LLM-as-a-Judge”—particularly when trained to reason systematically. This talk examines how reinforcement learning (RL) can enhance reasoning depth in evaluator models.
The session will begin with an analysis of current LLM-based evaluation approaches and their limitations, such as susceptibility to judgment biases. The presentation will cover a progression of work, including Self-Taught Evaluators and EvalPlanner, before introducing the J1 framework. J1 applies unified reinforcement learning to train judgment models using verifiable rewards that incentivize chain-of-thought reasoning.
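For readers unfamiliar with verifiable rewards in this setting, the sketch below shows one minimal form such a reward could take: the judge's chain-of-thought ends in a parseable verdict, and reward is granted only when that verdict matches a known preference label. The output format, tag convention, and function names are illustrative assumptions, not the exact J1 recipe.

```python
# Illustrative sketch only: a verifiable reward for pairwise judgment training,
# assuming the judge emits chain-of-thought text followed by a final verdict
# tag such as "[[A]]" or "[[B]]". Format and names are hypothetical.
import re

def pairwise_judgment_reward(judge_output: str, preferred: str) -> float:
    """Return 1.0 if the judge's final verdict matches the known preferred
    response (the verifiable label), else 0.0."""
    match = re.search(r"\[\[(A|B)\]\]", judge_output)
    if match is None:
        return 0.0  # no parseable verdict: no reward
    return 1.0 if match.group(1) == preferred else 0.0

# Example: reward is granted only when the verdict agrees with the label.
output = "Response A is more factually grounded... Verdict: [[A]]"
print(pairwise_judgment_reward(output, preferred="A"))  # 1.0
```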
At 8B and 70B parameter scales, the J1 framework achieves state-of-the-art results across multiple benchmarks, outperforming larger models such as DeepSeek-R1 and o1-mini. The talk will explore key ablation studies, including pairwise versus pointwise training, online versus offline learning, and the impact of reward design, prompt structure, and reasoning length on evaluation quality. It will conclude with future directions, including how LLM-as-a-Judge can be better used for reward modelling in post-training.
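As a rough illustration of the pairwise-versus-pointwise distinction examined in the ablations, the sketch below contrasts the two prompt formats: one asks the judge to pick a winner between two responses, the other to score a single response on an absolute scale. The prompt wording is a hypothetical example, not material from the talk or the J1 paper.

```python
# Illustrative sketch: pairwise vs. pointwise evaluation formats.
# A pairwise judge compares two responses and picks a winner; a pointwise
# judge scores a single response on an absolute scale.

def pairwise_prompt(question: str, response_a: str, response_b: str) -> str:
    """Build a prompt asking the judge to choose between two responses."""
    return (
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Think step by step, then output your verdict as [[A]] or [[B]]."
    )

def pointwise_prompt(question: str, response: str) -> str:
    """Build a prompt asking the judge to score a single response."""
    return (
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Think step by step, then output a score from 1 to 10 as [[score]]."
    )
```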