Learning and deploying safe and trustworthy models of data provenance

Our modern lives are increasingly governed by ubiquitous AI systems and an abundance of digital data. More and more products and services provide us with better tools and recommendations for our professional, personal, and entertainment activities. Given the clear impact that AI systems have on our decision-making processes, we ask ourselves more frequently: how much can we trust a system’s recommendation? Where does the data on which the system based its decision come from? How did these data travel, and under which transformations, from their source to the user?

To answer these questions, various systems and models for data provenance representation, capture, and analysis have been proposed [1,2,3]. These models provide structured graph representations and ontologies to express fine-grained semantic relationships between data entities, activities, and agents involved in data workflows. Data provenance tracking systems can then use these models to record large-scale data provenance traces, and make trust assessments.
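As a rough illustration of what such a graph representation looks like, the sketch below stores a small provenance trace as (subject, relation, object) triples and walks it backwards to find the original sources of a recommendation. The relation names (`wasGeneratedBy`, `used`, `wasAssociatedWith`) are W3C PROV terms [2]; the entity, activity, and agent identifiers, as well as the helper functions `build_index` and `trace_sources`, are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical provenance trace as (subject, relation, object) triples.
# Relation names follow W3C PROV; all identifiers are made up for illustration.
TRIPLES = [
    ("recommendation", "wasGeneratedBy", "ranking_run"),
    ("ranking_run", "used", "user_profile"),
    ("ranking_run", "used", "cleaned_ratings"),
    ("ranking_run", "wasAssociatedWith", "recommender_service"),
    ("cleaned_ratings", "wasGeneratedBy", "cleaning_run"),
    ("cleaning_run", "used", "raw_ratings"),
    ("cleaning_run", "wasAssociatedWith", "data_engineer"),
]


def build_index(triples):
    """Index objects by (subject, relation) for quick lookups."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def trace_sources(entity, index):
    """Return the entities that `entity` ultimately depends on and that
    have no recorded generating activity (that is, the original sources)."""
    sources, seen, stack = set(), set(), [entity]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        activities = index.get((current, "wasGeneratedBy"), [])
        if not activities:
            # Nothing recorded as having produced this entity: treat it as a source.
            sources.add(current)
            continue
        for activity in activities:
            # Follow every input the generating activity used.
            stack.extend(index.get((activity, "used"), []))
    return sources


index = build_index(TRIPLES)
print(sorted(trace_sources("recommendation", index)))
```

In this toy trace, the recommendation is traced back to `raw_ratings` and `user_profile`, which is the kind of question ("where does the data come from?") that a provenance model is meant to answer.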

However, having an AI system provide provenance traces at such a fine-grained level often comes with limitations: (i) the system may have only learned statistical patterns in a data-driven way, and so be incapable of using provenance to explain the trustworthiness of its output; (ii) the system may not have any means of recording provenance at all; and (iii) the system may not scale to recording provenance in real time, as this typically carries a high performance cost due to the volume and verbosity of the traces. How can we design modern AI systems that address these limitations, and that are able to use provenance models to explain their behaviour and construct them at scale?

This PhD project combines provenance representation [1,2,3] with machine-learning architectures that have recently been successfully deployed in fields like representation learning [6] and natural language processing [5], adapting and using them to learn models for representing and predicting provenance-based explanations in a safe and trustworthy manner. In particular, it proposes methods to inject provenance ontological knowledge, in the form of description logics, into these models; to learn provenance instance data from large amounts of past provenance workflows [3]; and to represent provenance information through knowledge graph embeddings [4] that are suitable for GPU processing. Within the project, these methods can be further applied to generate safe and trusted explanations for general machine learning models, to better analyse and understand provenance workflows, to predict and complete partial provenance statements in real-time systems, and to reconstruct missing provenance information for building safer, more trustworthy AI systems.
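To make the idea of completing a partial provenance statement with embeddings more concrete, here is a minimal sketch using the translation-based scoring rule of [6], in which a triple (head, relation, tail) is considered plausible when head + relation lies close to tail. This is only one possible choice of embedding model and is not necessarily the one the project will adopt. The two-dimensional vectors below are hand-picked for illustration rather than learned, and the identifiers and the helpers `transe_distance` and `rank_tails` are hypothetical.

```python
import math

# Toy 2-D embeddings, hand-picked for illustration (NOT learned from data).
ENTITY = {
    "cleaned_ratings": (1.0, 0.0),
    "cleaning_run": (1.0, 1.0),
    "ranking_run": (3.0, 1.0),
    "raw_ratings": (0.0, 2.0),
}
RELATION = {
    "wasGeneratedBy": (0.0, 1.0),
    "used": (-1.0, 1.0),
}


def transe_distance(head, relation, tail):
    """Euclidean norm of (head + relation - tail); lower means more plausible."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail for (head, relation, ?)."""
    candidates = [entity for entity in ENTITY if entity != head]
    return sorted(candidates, key=lambda tail: transe_distance(head, relation, tail))


# Complete the partial statement (cleaned_ratings, wasGeneratedBy, ?).
print(rank_tails("cleaned_ratings", "wasGeneratedBy"))
```

With these toy vectors, `cleaning_run` ranks first as the missing activity. In a real setting the embeddings would be trained on recorded provenance traces, and the ranking would typically be computed in batches on a GPU.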

[1] Groth, P., Jiang, S., Miles, S., Munroe, S., Tan, V., Tsasakou, S. and Moreau, L., 2006. An architecture for provenance systems.
[2] Missier, P., Belhajjame, K. and Cheney, J., 2013, March. The W3C PROV family of specifications for modelling provenance metadata. In Proceedings of the 16th International Conference on Extending Database Technology (pp. 773-776).
[3] Kuhn, T., Meroño-Peñuela, A., Malic, A., Poelen, J.H., Hurlbert, A.H., Ortiz, E.C., Furlong, L.I., Queralt-Rosinach, N., Chichester, C., Banda, J.M. and Willighagen, E., 2018, October. Nanopublications: a growing resource of provenance-centric scientific linked data. In 2018 IEEE 14th International Conference on e-Science (e-Science) (pp. 83-92). IEEE.
[4] Lin, Y., Liu, Z., Sun, M., Liu, Y. and Zhu, X., 2015, February. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence.
[5] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[6] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J. and Yakhnenko, O., 2013. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26.

Project ID

STAI-CDT-2023-KCL-21

Supervisor

Albert Meroño Peñuela (https://www.albertmeronyo.org)

Luc Moreau

Category

AI Provenance, Logic