Planning for Exploration in Reinforcement Learning

Exploration in Reinforcement Learning (RL) consists of taking actions that the agent currently considers suboptimal. The agent continually adjusts its estimates of the value of actions and, if a new action proves to lead to a higher expected value, the agent changes its behaviour, making the new action part of its “policy”. Exploration is necessary for learning, since only by continually evaluating actions is it possible to establish whether they would be better than the current behaviour. However, trying all actions indefinitely is physically and computationally infeasible in tasks of practical interest.
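For concreteness, the sketch below illustrates this interplay between exploration and value estimation with tabular Q-learning and epsilon-greedy action selection. It is a minimal, generic illustration, assuming a discrete state and action space; the function names and parameter values are ours, not part of the project.

    import random
    from collections import defaultdict

    # Minimal tabular Q-learning with epsilon-greedy exploration (illustrative only).
    # With probability epsilon the agent tries an action it currently considers
    # suboptimal; otherwise it exploits its current value estimates.

    def epsilon_greedy(q, state, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions)                     # explore: possibly suboptimal action
        return max(actions, key=lambda a: q[(state, a)])      # exploit: current best estimate

    def q_learning_step(q, state, action, reward, next_state, actions,
                        alpha=0.5, gamma=0.99):
        # Move the value estimate of the action just taken towards the observed
        # reward plus the discounted value of the best action in the next state.
        best_next = max(q[(next_state, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    q = defaultdict(float)  # value estimates, initialised at zero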

Several families of approaches have been developed to steer exploration towards promising actions by leveraging advice from humans, knowledge from models of the environment, or knowledge from previously solved tasks. Nonetheless, random exploration still plays a prominent role in practice. While randomness has interesting links to creativity, it is generally inefficient. In this project we will study how to use, and improve over time, models of the environment to form hypotheses about which actions are most promising, so as to drive exploration.

Most of the initial work on combining planning and RL relied on the assumption that the model would eventually become correct. In complex tasks this cannot be guaranteed, as models are subject to abstraction and approximation. Therefore, we will study appropriate synergies of reasoning and model-free learning, to let the behaviour improve beyond the inaccuracy of the model while still being driven by the knowledge the model contains. My previous work in this research direction used symbolic planning to constrain the actions considered during exploration (Leonetti et al., 2016), and motion planning to shape the distribution of continuous actions to explore (Bejjani et al., 2021). In both these cases, however, some form of random exploration is still employed. Another recent, relevant, and interesting research line considered learning from humans, to create symbolic models that capture the shortcuts that evolution has built into human behaviour (Hasan et al., 2020). One of the central challenges in using planning for exploration is determining which model representations and planning algorithms are compact and efficient enough, while enabling generalisation and possibly hierarchical reasoning. A key question is which aspects of the environment must be modelled, and which can be left to model-free learning.
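As a rough, hypothetical sketch of the first idea above, the snippet below restricts epsilon-greedy exploration to the actions a model deems admissible in the current state, while occasionally ignoring the model so that learning can correct its inaccuracies. The model_admissible interface and the model_trust parameter are illustrative assumptions, not the method of Leonetti et al. (2016).

    import random

    # Exploration constrained by a model's action suggestions (sketch only).
    # `model_admissible(state)` stands for any planner or model that proposes a
    # subset of actions believed to be promising; it is an assumed interface.

    def constrained_epsilon_greedy(q, state, all_actions, model_admissible,
                                   epsilon=0.1, model_trust=0.9):
        candidates = model_admissible(state) or all_actions
        # With a small probability, ignore the model entirely so the learner can
        # still improve beyond the model's inaccuracies.
        if random.random() > model_trust:
            candidates = all_actions
        if random.random() < epsilon:
            return random.choice(candidates)                  # explore within the constrained set
        return max(candidates, key=lambda a: q[(state, a)])   # exploit current estimates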

The aforementioned work will be the starting point for research on efficient model adaptation and its use for planning in RL, with the goal of achieving planning agents that continuously improve and solve a range of increasingly complex problems. We will start from existing benchmarks, such as the environments developed within OpenAI Gym, and later consider real-world robot learning, with tasks inspired by RoboCup@Home and the European Robotics League.
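For reference, a typical interaction loop with an OpenAI Gym benchmark environment looks roughly as follows. This uses the classic (pre-0.26) Gym API; newer Gym and Gymnasium releases return additional values from reset() and step(). The random action merely stands in for whichever learning agent is used.

    import gym

    # Run one episode of a standard Gym benchmark with random actions (classic API).
    env = gym.make("CartPole-v1")
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()          # placeholder for the agent's action choice
        obs, reward, done, info = env.step(action)  # environment transition and reward
        total_reward += reward
    env.close()
    print(f"Episode return: {total_reward}")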

  • Matteo Leonetti, Luca Iocchi, and Peter Stone. “A synthesis of automated planning and reinforcement learning for efficient, robust decision-making.” Artificial Intelligence 241 (2016).
  • Wissam Bejjani, Matteo Leonetti, and Mehmet R. Dogar. “Learning image-based Receding Horizon Planning for manipulation in clutter.” Robotics and Autonomous Systems 138 (2021).
  • Mohamed Hasan, Matthew Warburton, Wisdom C. Agboh, Mehmet R. Dogar, Matteo Leonetti, He Wang, Faisal Mushtaq, Mark Mon-Williams, and Anthony G. Cohn. “Human-like planning for reaching in cluttered environments.” IEEE International Conference on Robotics and Automation (ICRA) (2020).

Project ID

STAI-CDT-2022-KCL-4