Our modern lives are increasingly governed by ubiquitous AI systems and an abundance of digital data. More and more products and services provide us with better tools and recommendations for our professional, personal, and entertainment activities. Given the clear impact that AI systems have on our decision-making processes, we ask ourselves ever more frequently: how much can we trust a system’s recommendation? Where does the data on which the system based its decision come from? How did these data travel, and under which transformations, from their source to the user?
To answer these questions, various systems and models for data provenance representation, capture, and analysis have been proposed [1,2,3]. These models provide structured graph representations and ontologies to express fine-grained semantic relationships between the data entities, activities, and agents involved in data workflows. Data provenance tracking systems can then use these models to record large-scale provenance traces and to make trust assessments.
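As an illustration of the kind of structure these models capture, the following is a minimal sketch in Python of an entity–activity–agent graph with typed relations, loosely following the W3C PROV data model. All identifiers (raw_data, cleaning_job, and so on) are hypothetical, and the class is a schematic stand-in for a real provenance store:

```python
from dataclasses import dataclass, field

# A schematic PROV-style provenance graph: nodes are entities (data),
# activities (processes), and agents (people or systems); edges are
# typed relations such as used, wasGeneratedBy, wasAssociatedWith,
# and wasDerivedFrom.

@dataclass
class ProvGraph:
    nodes: dict = field(default_factory=dict)   # node id -> node type
    edges: list = field(default_factory=list)   # (subject, relation, object)

    def add(self, node_id: str, node_type: str) -> None:
        self.nodes[node_id] = node_type

    def relate(self, subj: str, relation: str, obj: str) -> None:
        self.edges.append((subj, relation, obj))

g = ProvGraph()
g.add("raw_data", "entity")
g.add("clean_data", "entity")
g.add("cleaning_job", "activity")
g.add("data_engineer", "agent")

g.relate("cleaning_job", "used", "raw_data")
g.relate("clean_data", "wasGeneratedBy", "cleaning_job")
g.relate("clean_data", "wasDerivedFrom", "raw_data")
g.relate("cleaning_job", "wasAssociatedWith", "data_engineer")

# A trust assessment can then traverse these edges, e.g. to find
# every source entity that an output was derived from:
sources = [o for s, r, o in g.edges
           if s == "clean_data" and r == "wasDerivedFrom"]
print(sources)  # ['raw_data']
```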
However, producing provenance traces at such a fine-grained level is often beyond an AI system, for several reasons: (i) the system may have merely learned statistical patterns in a data-driven way, and is thus incapable of using provenance to explain the trustworthiness of its output; (ii) the system may lack any means of recording provenance at all; and (iii) the system may not scale to recording provenance in real time, since the volume and verbosity of provenance data typically carry a high performance cost. How can we design modern AI systems that address these limitations and are able both to use provenance models to explain their behaviour and to construct such models at scale?
This PhD project combines provenance representation [1,2,3] with machine-learning architectures that have recently been deployed successfully in fields such as representation learning [6] and natural language processing [5], adapting them to learn models that represent and predict provenance-based explanations in a safe and trustworthy manner. In particular, it proposes methods to inject provenance ontological knowledge, in the form of description logics, into these models; to learn provenance instance data from large amounts of past provenance workflows [3]; and to represent provenance information through knowledge graph embeddings [4] that are suitable for GPU processing. These methods can further be applied in the project to generate safe and trusted explanations in general machine learning models, to better analyse and understand provenance workflows, to predict and complete partial provenance statements in real-time systems, and to reconstruct missing provenance information for building safer, more trustworthy AI systems.
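The project does not fix a particular embedding model, but to make the knowledge-graph-embedding idea concrete, the following is a minimal PyTorch sketch of a TransE-style scoring function over provenance triples (head entity, relation, tail entity). TransE is named here as one common choice, not as the project's committed method, and all dimensions, vocabulary sizes, and ids below are illustrative:

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """TransE-style embeddings for provenance triples: a triple
    (h, r, t) is scored by how well head + relation approximates
    tail in vector space. The embedding tables are plain tensors,
    so scoring batches of triples maps directly onto the GPU."""

    def __init__(self, n_entities: int, n_relations: int, dim: int = 100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def score(self, h, r, t):
        # Lower distance = more plausible triple.
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

# Illustrative vocabulary: entities, activities, and agents share the
# entity id space; PROV relations (used, wasGeneratedBy, ...) are
# relation ids.
model = TransE(n_entities=1000, n_relations=8)
h = torch.tensor([0]); r = torch.tensor([1]); t = torch.tensor([2])
print(model.score(h, r, t))  # plausibility score for one triple
```

Under this view, training would minimise a margin loss that pushes observed provenance triples to score better than corrupted negatives, which turns predicting and completing partial provenance statements into a ranking task over candidate triples.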