Towards Robust Reasoning in Language Models via Mechanistic Interpretability

In recent years, large language models (LLMs) have dominated the field of natural language processing (NLP), achieving state-of-the-art results. Yet while these models are highly proficient at many NLP tasks, their ability to perform complex reasoning, a fundamental aspect of intelligence, remains comparatively underdeveloped. LLMs often struggle with tasks that require multi-step reasoning or an understanding of the underlying structure of the problems they are solving.

However, prioritising improvements in AI capabilities without giving due attention to safety entails substantial risks: it can lead to AI systems that are highly capable but potentially harmful. One such risk is the deployment of models that possess unknown capabilities, or that behave unpredictably and unreliably. In the field of NLP, the rapid proliferation of language models makes potential misuse difficult to control; risks include the generation of harmful content such as hate speech and fake news. While methods such as Reinforcement Learning from Human Feedback (RLHF) may help to mitigate these issues, they do not eliminate the potential for misuse. Adversarial attacks pose another risk: LLMs can be manipulated into producing unintended content by injecting adversarial instructions into the model's input, and even models designed with safety measures are susceptible to such attacks. Arguably, all of the above are consequences of the inherently opaque functioning of LLMs, as the black-box nature of neural networks makes their internal reasoning hard to understand and evaluate.

Given these safety concerns, this research will focus on 1) analysing the internal workings of language models, and 2) identifying novel methods to improve the robustness and safety of language models on reasoning tasks. Pursuing both goals jointly, within the same framework, will lead to language models that are not only more capable but also more trustworthy.

Here we will primarily focus on exploiting mechanistic interpretability techniques to broaden our understanding of the internal workings of language models, and on identifying ways to enhance their robustness on reasoning tasks. Exploring these internal workings will likely surface new challenges that shape further directions of this research, but several key research questions are:

• Can we improve the robustness of LLMs on reasoning tasks via causal interventions and attention pattern analysis? (A minimal sketch of such an intervention is given after this list.)

• Can we uncover core principles about the inner workings of LLMs that are valid regardless of the size of the model?

• How do model size and architecture impact the internal knowledge representations and decision-making processes used when solving reasoning tasks?

• How do robustness and reasoning capabilities impact safety-related metrics such as truthfulness and reliability?
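To make the first question more concrete, the following is a minimal sketch of one form of causal intervention, activation patching, applied to GPT-2 via the Hugging Face transformers library. The model choice, layer index, prompts, and hook names are illustrative assumptions made for this sketch only, not commitments of the proposal.

# Minimal sketch of a causal intervention (activation patching) on GPT-2.
# Assumes the Hugging Face `transformers` library; the model, layer index,
# and prompts below are illustrative choices, not part of the proposal.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Two prompts of equal token length: a "clean" one and a "corrupted" one.
clean_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
corrupt_ids = tokenizer("The capital of Italy is", return_tensors="pt").input_ids

LAYER = 6   # transformer block to intervene on (illustrative choice)
cache = {}

def save_hook(module, inputs, output):
    # Record the block's output hidden states during the clean run.
    cache["clean"] = output[0].detach()

def patch_hook(module, inputs, output):
    # Replace the corrupted run's hidden states with the cached clean ones.
    return (cache["clean"],) + output[1:]

# 1) Clean run: cache activations at the chosen layer.
handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(clean_ids)
handle.remove()

# 2) Corrupted run with the clean activations patched back in.
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupt_ids).logits
handle.remove()

# If patching restores the clean-run prediction (" Paris"), the layer's
# activations are causally implicated in producing that answer.
paris_id = tokenizer(" Paris").input_ids[0]
print("patched logit for ' Paris':", patched_logits[0, -1, paris_id].item())

The same hook mechanism can be used to record rather than overwrite activations, and attention patterns can be inspected directly by passing output_attentions=True to the forward call, which covers the second technique mentioned in that question.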

Ultimately, this work aims to provide key insights into the development of robust, reliable, and safe language models that can effectively tackle complex reasoning tasks while minimising potential risks.


Project ID

STAI-CDT-2024-KCL-15