In recent years, large language models (LLMs) have come to dominate the field of natural language processing (NLP), achieving state-of-the-art results. Yet while these models are extremely proficient at many NLP tasks, their ability to perform complex reasoning, a fundamental aspect of intelligence, remains relatively underdeveloped: they often struggle with such tasks, as well as with understanding the underlying structure of the problems they are solving.
At the same time, prioritising improvements in AI capabilities without giving due attention to safety entails substantial risks: it can lead to the deployment of systems that are highly capable but potentially harmful, possess unknown capabilities, or behave unpredictably and unreliably. In NLP, the rapid proliferation of language models makes potential misuse difficult to control. Risks include the generation of harmful content such as hate speech and fake news; while methods such as Reinforcement Learning from Human Feedback (RLHF) can help to mitigate these issues, the resulting models remain open to misuse. Adversarial attacks pose another risk: LLMs can be manipulated into producing unintended content by injecting adversarial instructions into the model's input, and even models designed with safety measures are susceptible to such attacks. Arguably, all of the above are consequences of the inherently opaque functioning of LLMs, as the black-box nature of neural networks makes their internal reasoning hard to understand and evaluate.
Given these safety concerns, this research will focus on 1) analysing the internal workings of language models, and 2) identifying novel methods to improve the robustness and safety of language models on reasoning tasks. Pursuing these goals jointly, within a single framework, should lead to language models that are not only more capable but also more trustworthy.
Here we will primarily focus on exploiting mechanistic interpretability techniques to broaden our understanding of the internal workings of language models and to identify ways to enhance their robustness on reasoning tasks. Exploring the internal workings of models will likely surface new challenges that shape further directions of this research, but several key research questions are:
• Can we improve the robustness of LLMs on reasoning tasks via causal interventions and attention pattern analysis? (A minimal sketch of such an intervention follows this list.)
• Can we uncover core principles about the inner workings of LLMs that are valid regardless of the size of the model?
• How do model size and architecture impact the internal knowledge representations and decision-making processes used when solving reasoning tasks?
• How do robustness and reasoning capabilities impact safety-related metrics such as truthfulness and reliability?
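To make the first question more concrete, the following is a minimal sketch of a causal intervention (activation patching) combined with attention pattern inspection, written against the Hugging Face Transformers API. The model (GPT-2), prompts, layer index, and target token are illustrative assumptions rather than the experimental setup of this work.

```python
# Minimal sketch: activation patching (a simple causal intervention) plus
# attention pattern inspection on GPT-2. Model, prompts, layer, and target
# token are illustrative assumptions, not this proposal's experimental setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

# Two prompts of equal token length that differ in one key fact.
clean = tokenizer("The capital of France is", return_tensors="pt")
corrupt = tokenizer("The capital of Italy is", return_tensors="pt")
LAYER = 6  # hypothetical layer to intervene on

# 1) Run the clean prompt and cache the residual-stream output of LAYER.
cached = {}
def cache_hook(module, inputs, output):
    cached["clean"] = output[0].detach()

handle = model.transformer.h[LAYER].register_forward_hook(cache_hook)
with torch.no_grad():
    clean_out = model(**clean)
handle.remove()

# 2) Baseline: the corrupted prompt without any intervention.
with torch.no_grad():
    corrupt_out = model(**corrupt)

# 3) Re-run the corrupted prompt, patching in the clean activation at LAYER.
def patch_hook(module, inputs, output):
    return (cached["clean"],) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(**corrupt)
handle.remove()

# 4) The causal effect of the patch is the shift in the logit of " Paris".
paris_id = tokenizer(" Paris")["input_ids"][0]
effect = patched_out.logits[0, -1, paris_id] - corrupt_out.logits[0, -1, paris_id]
print(f"Logit shift for ' Paris' after patching layer {LAYER}: {effect.item():.3f}")

# 5) Attention pattern analysis: how each head at LAYER attends from the
#    final position of the clean prompt (shape: batch, heads, query, key).
attn = clean_out.attentions[LAYER]
print("Attention from final token, per head:", attn[0, :, -1, :])
```

Interventions of this kind make it possible to attribute a model's prediction to specific layers and attention heads, which is the level of granularity at which the research questions above are posed.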
Ultimately, this work aims to provide key insights into the development of robust, reliable, and safe language models that can effectively tackle complex reasoning tasks while minimising potential risks.