Safe Reinforcement Learning from Human Feedback

Reinforcement learning (RL) has become a powerful paradigm for solving complex decision-making problems. However, it raises numerous safety concerns in real-world decision making, such as unsafe exploration and unrealistic reward functions. Because RL agents are typically evaluated in terms of reward, it is easy to overlook that designing AI agents capable of pursuing arbitrary objectives can be hazardous: such systems are intrinsically unpredictable and may cause negative, irreversible outcomes for humans. Since humans understand these dangers, involving humans in the agent's learning process is a promising way to improve AI safety by keeping agents aligned with human values [1].

Dr. Du’s early research [2] shows that human preferences can serve as an effective replacement for reward signals. A recent attempt [1] also adopted human preferences in place of reward signals to guide the training of agents in safety-critical environments; however, although the agents query humans with a certain probability, how agents can actively query humans and adapt their knowledge to the task and query is not considered.
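To illustrate the general idea of learning from preferences rather than rewards, the following is a minimal, self-contained sketch of a Bradley–Terry-style preference model over trajectory segments, which is the common setup in preference-based RL. It is not the algorithm of [1] or [2]; the linear reward, the toy data, and the finite-difference optimiser are all illustrative assumptions.

```python
import numpy as np

def segment_return(w, seg):
    """Predicted return of a trajectory segment under a linear reward w·s."""
    return sum(float(w @ s) for s in seg)

def preference_loss(w, pairs, labels):
    """Average negative log-likelihood of the human preference labels
    under a Bradley-Terry model: P(a preferred) = sigmoid(R(a) - R(b))."""
    loss = 0.0
    for (seg_a, seg_b), y in zip(pairs, labels):
        diff = np.clip(segment_return(w, seg_b) - segment_return(w, seg_a), -30, 30)
        p_a = 1.0 / (1.0 + np.exp(diff))          # P(segment a is preferred)
        p = p_a if y == 0 else 1.0 - p_a
        loss -= np.log(max(p, 1e-12))
    return loss / len(pairs)

# Toy data: a hidden "true" reward prefers states with a large first feature
# and a small second feature; preference labels are generated from it.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -0.5])
pairs, labels = [], []
for _ in range(100):
    seg_a = [rng.normal(size=2) for _ in range(5)]
    seg_b = [rng.normal(size=2) for _ in range(5)]
    pairs.append((seg_a, seg_b))
    labels.append(0 if segment_return(true_w, seg_a) > segment_return(true_w, seg_b) else 1)

# Fit the reward weights by finite-difference gradient descent (for illustration;
# a practical implementation would use a neural reward model and autodiff).
w = np.zeros(2)
for _ in range(200):
    grad = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = 1e-4
        grad[i] = (preference_loss(w + e, pairs, labels) -
                   preference_loss(w - e, pairs, labels)) / 2e-4
    w -= 0.5 * grad
```

The recovered weights point in the same direction as the hidden reward, showing how pairwise comparisons alone can substitute for an explicit reward function.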

This project considers how to build safe RL agents that leverage human feedback, and aims to address two challenges: 1) how to enable agents to actively and efficiently query humans, thereby minimising the disturbance to humans; 2) how to improve the algorithms’ robustness in dealing with large state spaces and even unseen tasks. The target of this project is to realise human-value-aligned safe RL in a way that is scalable (in terms of task scale) and efficient (in terms of human involvement).

To address these challenges, this research will leverage the principles of the Abstract Interpretation framework [3], a theory that prescribes how to obtain sound, computable, and precise finite approximations of potentially infinite sets of behaviours. Based on state abstraction, we aim to enable agents to build a knowledge base of (un)safe behaviours, and thereby construct a scheme for deciding when to actively query humans. Given the sequential nature of decision making, this project will also consider temporal abstractions of behaviours and feedback to improve consistency in safety control. Furthermore, through effective abstractions, we aim to make neural-network-based agents invariant to task-irrelevant details, and thus generalisable to new downstream tasks.
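A toy sketch of how an abstraction-driven query scheme could work: states are mapped to a finite set of abstract cells, human safety labels are cached per cell, and the human is only queried when the agent enters a cell it has never seen. The grid abstraction, the `SafetyKnowledgeBase` class, and the oracle below are hypothetical illustrations, not the project's actual design.

```python
import numpy as np

def abstract(state, cell=0.5):
    """Map a continuous state to a finite abstract cell (a coarse grid).
    Any sound abstraction could be substituted here."""
    return tuple(np.floor(np.asarray(state) / cell).astype(int))

class SafetyKnowledgeBase:
    """Caches human safety labels per abstract state; queries only on novelty."""
    def __init__(self):
        self.labels = {}   # abstract state -> "safe" / "unsafe"
        self.queries = 0

    def is_safe(self, state, ask_human):
        a = abstract(state)
        if a not in self.labels:        # unknown abstract region: actively query
            self.labels[a] = ask_human(state)
            self.queries += 1
        return self.labels[a] == "safe"

# Toy human oracle: states with a negative first coordinate are unsafe.
oracle = lambda s: "unsafe" if s[0] < 0 else "safe"

kb = SafetyKnowledgeBase()
rng = np.random.default_rng(1)
states = [rng.uniform(-1, 1, size=2) for _ in range(100)]
verdicts = [kb.is_safe(s, oracle) for s in states]
```

Because labels are shared by all concrete states in a cell, the number of human queries is bounded by the number of abstract cells (at most 16 here) rather than the number of states visited, which is the efficiency the project targets.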

[1] Ilias Kazantzidis, Tim Norman, Yali Du, Christopher Freeman. How to train your agent: Active learning from human preferences and justifications in safety-critical environments. AAMAS 2022.
[2] Runze Liu, Fengshuo Bai, Yali Du, Yaodong Yang. Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning. NeurIPS 2022.
[3] Patrick Cousot, Radhia Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. POPL 1977.

Yali Du