Ensuring Trustworthy AI through Verification and Validation in ML Implementations: Compilers and Libraries

Trust in machine learning is a pressing concern that has brought multiple communities together to tackle it. With the increasing use of tools such as ChatGPT and the identification of fairness issues, ensuring the reliability of machine learning is paramount to its continued development. In this project, we will focus on the low-level implementation of machine learning, an area that has been largely ignored by the community but that has a significant impact on the reliability of major libraries and languages such as TensorFlow, Keras, PyTorch, Python, and R.

The project’s main idea is to test machine learning implementations at each level of abstraction, from the top-level language down to the low-level libraries. To do so, the student will start from a version of the “Gödel Test”: a method that parametrises input generators for programs and controls those parameters to create testing strategies. Among these, we will apply multiple test-suite generation strategies, such as focused testing (i.e. targeting newly added software components, which are common in traditional machine learning libraries) and vulnerability unmasking. The student will design a system based on search strategies that guides the generators towards exercising all possible branches of the machine learning code. To this end, we will extend the testing framework of the MLighter tool, a holistic tool for evaluating the security, reliability and performance of machine learning, to deal with these specific problems.
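As an illustration only, the Python sketch below shows the flavour of a Gödel-Test-style parametrised generator combined with a simple search over its parameters. The generator, its parameters and the coverage score are hypothetical placeholders; in the project, the search would instead be guided by real branch coverage of the ML code under test, collected through MLighter’s testing framework.

```python
# Minimal sketch (hypothetical names and objective): a test-input generator whose
# behaviour is controlled by a parameter vector, plus a simple random search that
# tunes those parameters towards a testing objective. The "coverage" score below
# is a toy stand-in for real branch coverage of an ML routine.
import random

def generate_input(params, size=10):
    """Generate one test input (a list of floats) under the given generator parameters.

    params["extreme_prob"]: probability of emitting an extreme value (inf, -inf, nan, 0)
    params["scale"]:        magnitude of ordinary random values
    """
    extremes = [float("inf"), float("-inf"), float("nan"), 0.0]
    return [
        random.choice(extremes) if random.random() < params["extreme_prob"]
        else random.gauss(0.0, params["scale"])
        for _ in range(size)
    ]

def coverage_score(test_suite):
    """Placeholder fitness: rewards suites that hit more distinct input 'shapes'.

    In the project this would be replaced by actual branch coverage of the
    ML library under test, obtained via coverage instrumentation."""
    shapes = set()
    for inputs in test_suite:
        shapes.add((
            any(x != x for x in inputs),                                   # contains NaN
            any(x in (float("inf"), float("-inf")) for x in inputs),       # contains infinities
            max((abs(x) for x in inputs if x == x), default=0.0) > 1e6,    # very large magnitudes
        ))
    return len(shapes)

def search_generator_params(iterations=50, suite_size=20):
    """Random search over generator parameters, keeping the best-scoring configuration."""
    best_params, best_score = None, -1
    for _ in range(iterations):
        params = {"extreme_prob": random.random(), "scale": 10 ** random.uniform(-3, 6)}
        suite = [generate_input(params) for _ in range(suite_size)]
        score = coverage_score(suite)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

if __name__ == "__main__":
    params, score = search_generator_params()
    print("best parameters:", params, "score:", score)
```

The key design point is the separation between the generator (which encodes what kinds of inputs are possible) and the parameters (which encode how often each kind is produced), so that different testing strategies become different points in the parameter space that a search algorithm can explore.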

Part 1: ML libraries testing. ML libraries’ creators are often trained developers or professionals with qualifications in mathematics, statistics and AI, yet with little training in programming languages. While much of the correctness of ML libraries depends on these skills, a lack of deep understanding of software engineering and programming languages can lead to buggy libraries through unspecified and undefined behaviours (UB), since these ML algorithms are written in C, C++ and Fortran and plugged into Python and R as external libraries. As a result, code that is not UB-free can make the ML library behave in wrong and unpredictable ways, leading to silent errors that may not manifest until production and provoking a catastrophic maintenance effort in the machine learning pipeline. By using differential testing with multiple compilers and code analysers, we will be able to unmask these errors while also handling flaky machine learning models (i.e. models whose input/output behaviour is probabilistic rather than deterministic), which we will model using information theory and entropy.
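To make the two ingredients of this part concrete, the simplified Python sketch below shows (1) differential testing of two builds of the same numeric routine, emulated here by two summation orders, and (2) a Shannon-entropy measure of output flakiness over repeated runs. The functions and the toy “builds” are hypothetical; real experiments would compare actual library builds produced by different compilers and analysers.

```python
import math
from collections import Counter

def differential_check(impl_a, impl_b, inputs, tolerance=1e-9):
    """Flag inputs on which two builds of the same routine disagree beyond a tolerance."""
    disagreements = []
    for x in inputs:
        a, b = impl_a(x), impl_b(x)
        agree = a == b or abs(a - b) <= tolerance or (a != a and b != b)  # NaN vs NaN counts as agreement
        if not agree:
            disagreements.append((x, a, b))
    return disagreements

def output_entropy(run_outputs, precision=6):
    """Shannon entropy (in bits) of the outputs observed over repeated runs.

    0 bits means fully deterministic behaviour; higher values mean flakier behaviour."""
    counts = Counter(round(o, precision) for o in run_outputs)
    total = len(run_outputs)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Toy stand-ins for two builds of the same routine: they differ only in the order
# of floating-point accumulation, a classic source of divergence after optimisation.
def build_a(values):
    return sum(values)                       # left-to-right summation

def build_b(values):
    return sum(sorted(values, key=abs))      # accumulate small magnitudes first

inputs = [[1e16, -1e16, 1.0], [0.1] * 10, [1e-8, 1e8, -1e8]]
print("disagreements:", differential_check(build_a, build_b, inputs))

# Flakiness as entropy: identical repeated runs give 0 bits; runs with
# (simulated) non-deterministic jitter give a positive value.
stable_runs = [build_a(inputs[1])] * 8
flaky_runs = [build_a(inputs[1]) + (1e-3 if i % 2 else 0.0) for i in range(8)]
print("entropy (stable):", output_entropy(stable_runs))
print("entropy (flaky):", output_entropy(flaky_runs))
```

The entropy view is what lets differential testing cope with probabilistic models: instead of demanding bit-identical outputs, one compares the output distributions of the builds and flags divergences that exceed the expected stochastic variation.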
To the best of our knowledge, only a few works exist on Python compiler fuzzing (and none on R compiler fuzzing) or on fuzzing ML libraries (and these focus mainly on a small subset of the libraries). None of these works proposes a holistic way of addressing the reliability of machine learning libraries together with the compilers that generate their executable binaries, given that a failure can lie in any of the following parts (or a combination of them): (1) the Python or R compiler, (2) the ML library written in a language with an optimising compiler, such as C, and (3) the optimising compiler itself. We discuss (1) and (3) in Part 2.

Part 2: the lowest level of testing: testing the compilers. While much depends on the programmer’s knowledge of ML algorithm design, floating-point arithmetic and hardware implementations, failures caused by a compiler that silently produces incorrect code have a broad impact on the software and are therefore more severe. Yet, the current level of support for testing the compilers commonly used with machine learning libraries is poor. We will extend the testing framework of MLighter to focus on these languages, i.e. Python and R, and their respective libraries in C, C++ and Fortran, while considering multiple architectures (such as ARM and Intel x86).
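For illustration, a minimal differential compiler-testing loop might look like the sketch below, assuming gcc and clang are available on the test machine. In the project, the test programs would be generated or mutated automatically (in the spirit of GrayC-style fuzzing) and the comparison extended to the Python and R toolchains and to ARM as well as x86 binaries.

```python
# Minimal sketch of differential compiler testing: compile the same C source with
# several compiler/optimisation combinations and compare the binaries' outputs.
# Assumes gcc and clang are installed; divergence would indicate a miscompilation
# (or undefined behaviour in the source, which this toy kernel avoids).
import itertools
import os
import subprocess
import tempfile

C_SOURCE = r"""
#include <stdio.h>
int main(void) {
    /* small numeric kernel; a miscompilation would show up as output divergence */
    double acc = 0.0;
    for (int i = 1; i <= 1000; i++) acc += 1.0 / (double)i;
    printf("%.17g\n", acc);
    return 0;
}
"""

def compile_and_run(compiler, opt, src_path, workdir):
    """Compile src_path with the given compiler and optimisation level, then run it."""
    exe = os.path.join(workdir, f"prog_{compiler}_{opt.lstrip('-')}")
    subprocess.run([compiler, opt, src_path, "-o", exe], check=True)
    return subprocess.run([exe], capture_output=True, text=True, check=True).stdout

def differential_compile_test():
    with tempfile.TemporaryDirectory() as workdir:
        src_path = os.path.join(workdir, "kernel.c")
        with open(src_path, "w") as f:
            f.write(C_SOURCE)
        configs = list(itertools.product(["gcc", "clang"], ["-O0", "-O2", "-O3"]))
        outputs = {cfg: compile_and_run(cfg[0], cfg[1], src_path, workdir) for cfg in configs}
        if len(set(outputs.values())) > 1:
            print("divergence detected:", outputs)
        else:
            print("all configurations agree:", next(iter(outputs.values())).strip())

if __name__ == "__main__":
    differential_compile_test()
```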

– MLighter is an ongoing project with a webpage: http://mlighter.freedevelop.org and a publication: Menendez, Hector D. (2022). Measuring Machine Learning Robustness in front of Static and Dynamic Adversaries. In Proceedings of the IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI).

– Mutation testing for core compiler functionality: Even-Mendoza, Karine, Sharma, Arindam, Donaldson, Alastair, and Cadar, Cristian (under review). GrayC: Greybox Fuzzing of Compilers and Analysers for C. https://zenodo.org/record/7643912#.ZD08L5PMK3J

– ML libraries testing: Wei, Anjiang, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. “Free lunch for testing: Fuzzing deep-learning libraries from open source.” In Proceedings of the 44th International Conference on Software Engineering, pp. 995-1007. 2022.

– Mutation testing, also applicable to the Python compiler: Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. In Proceedings of the 38th Conference on Programming Language Design and Implementation, Vol. 52. 95–110.

– CompCert: Leroy, X. (2021). The CompCert C verified compiler: Documentation and user’s manual. Inria.

– Gödel Test: Poulding, S., & Feldt, R. (2014, July). Generating structured test data with specific properties using nested Monte-Carlo search. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation (pp. 1279-1286).

– Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan Hao, and Lu Zhang. 2020. A Survey of Compiler Testing. ACM Comput. Surv. 53, 1, Article 4 (January 2021), 36 pages. https://doi.org/10.1145/3363562

Project ID

STAI-CDT-2023-KCL-30