Extracting interpretable symbolic representations from neural networks using information theory and causal abstraction

Neurosymbolic systems seek to combine the strengths of two major classes of AI algorithms: neural networks, able to recognise patterns in unstructured data, and logic-based systems, capable of powerful reasoning. One of the bottlenecks of current neurosymbolic systems is the difficulty of creating an accurate mapping between the noisy, multivariate activity of neural representations and the discrete symbols needed for logic-based reasoning. Although some proposals exist, they often require domain knowledge, which limits their applicability in general-purpose learning algorithms.
 
In an attempt to solve this problem, the main goal of this project is to provide a principled means to extract discrete representations from neural networks that can subsequently be used for symbolic reasoning. Specifically, the method will be based on mathematical tools from two fields: multivariate information theory, used to describe the statistics of large sets of neurons; and causal abstraction, used to describe causal relationships between ‘macroscopic’ patterns of neural activations. Together, these tools will enable us to build symbolic representations that are maximally informative of the causal relationships within the network, and can therefore be used both to interpret the decisions of the network and to interface with symbolic reasoning systems. Finally, the proposed approach will be evaluated against existing alternatives in terms of usefulness for downstream logic-based AI algorithms.
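To make the general idea concrete, below is a minimal, purely illustrative sketch (not the project's actual method): it coarse-grains continuous activations from two layers of a network into candidate discrete alphabets via clustering, and scores each alphabet by the mutual information between upstream and downstream symbols, a crude proxy for "the symbols capture the layer-to-layer dependency". The synthetic data, the clustering choice, and the helper `discretise` are all hypothetical assumptions for illustration.

```python
# Illustrative sketch only: cluster activations into discrete symbols and score
# candidate alphabet sizes by symbol-level mutual information. The real project
# would replace this with multivariate information measures and causal-abstraction
# consistency checks rather than plain clustering + pairwise MI.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Stand-in for recorded activations: layer A drives layer B through a noisy map.
acts_a = rng.normal(size=(2000, 16))                                    # upstream layer
weights = rng.normal(size=(16, 8))
acts_b = np.tanh(acts_a @ weights) + 0.1 * rng.normal(size=(2000, 8))   # downstream layer

def discretise(acts, n_symbols):
    """Map each activation vector to one of `n_symbols` discrete labels."""
    return KMeans(n_clusters=n_symbols, n_init=10, random_state=0).fit_predict(acts)

# Score candidate alphabet sizes by the mutual information (in nats) between
# the discretised upstream and downstream representations.
for k in (2, 4, 8, 16):
    sym_a = discretise(acts_a, k)
    sym_b = discretise(acts_b, k)
    print(f"k={k:2d}  I(S_A; S_B) = {mutual_info_score(sym_a, sym_b):.3f} nats")
```

In this toy setting, larger alphabets retain more of the layer-to-layer dependency but at the cost of less compact (and less interpretable) symbols; the project's information-theoretic and causal-abstraction machinery is intended to manage exactly this trade-off in a principled way.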

Beckers et al. (2019). Approximate Causal Abstraction. UAI, https://arxiv.org/abs/1906.11583  
 
Chalupka et al. (2016). Multi-Level Cause-Effect Systems. AISTATS, https://arxiv.org/abs/1512.07942  

Project ID

STAI-CDT-2023-IC-4

Supervisor

Pedro Mediano, https://pmediano.gitlab.io

Category

Logic, Norms, Reasoning