As large language models (LLMs) become increasingly central to AI systems, reliable evaluation of their outputs has emerged as a critical challenge. Recent advances demonstrate that LLMs themselves can serve as effective evaluators—or “LLM-as-a-Judge”—particularly when trained to reason systematically. This talk examines how reinforcement learning (RL) can enhance reasoning depth in evaluator models.
The session will begin with an analysis of current LLM-based evaluation approaches and their limitations, such as susceptibility to judgment biases. The presentation will cover a progression of work, including Self-Taught Evaluators and EvalPlanner, before introducing the J1 framework. J1 applies unified reinforcement learning to train judgment models using verifiable rewards that incentivize chain-of-thought reasoning.
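For readers unfamiliar with verifiable rewards in this setting, the sketch below shows one minimal form such a reward could take: the judge's chain-of-thought ends in a parseable verdict, and reward is granted only when that verdict matches a known preference label. The output format, tag convention, and function names are illustrative assumptions, not the exact J1 recipe.

```python
# Illustrative sketch only: a verifiable reward for pairwise judgment training,
# assuming the judge emits chain-of-thought text followed by a final verdict
# tag such as "[[A]]" or "[[B]]". Format and names are hypothetical.
import re

def pairwise_judgment_reward(judge_output: str, preferred: str) -> float:
    """Return 1.0 if the judge's final verdict matches the known preferred
    response (the verifiable label), else 0.0."""
    match = re.search(r"\[\[(A|B)\]\]", judge_output)
    if match is None:
        return 0.0  # no parseable verdict: no reward
    return 1.0 if match.group(1) == preferred else 0.0

# Example: reward is granted only when the verdict agrees with the label.
output = "Response A is more factually grounded... Verdict: [[A]]"
print(pairwise_judgment_reward(output, preferred="A"))  # 1.0
```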
At 8B and 70B parameter scales, the J1 framework achieves state-of-the-art results across multiple benchmarks, outperforming larger models such as DeepSeek-R1 and o1-mini. The talk will explore key ablation studies, including pairwise versus pointwise training, online versus offline learning, and the impact of reward design, prompt structure, and reasoning length on evaluation quality. It will conclude with future directions, including how LLM-as-a-Judge can be better used for reward modelling in post-training.
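As a rough illustration of the pairwise-versus-pointwise distinction examined in the ablations, the sketch below contrasts the two prompt formats: one asks the judge to pick a winner between two responses, the other to score a single response on an absolute scale. The prompt wording is a hypothetical example, not material from the talk or the J1 paper.

```python
# Illustrative sketch: pairwise vs. pointwise evaluation formats.
# A pairwise judge compares two responses and picks a winner; a pointwise
# judge scores a single response on an absolute scale.

def pairwise_prompt(question: str, response_a: str, response_b: str) -> str:
    """Build a prompt asking the judge to choose between two responses."""
    return (
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Think step by step, then output your verdict as [[A]] or [[B]]."
    )

def pointwise_prompt(question: str, response: str) -> str:
    """Build a prompt asking the judge to score a single response."""
    return (
        f"Question: {question}\n"
        f"Response: {response}\n"
        "Think step by step, then output a score from 1 to 10 as [[score]]."
    )
```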