Trust in machine learning is a pressing concern that has brought together multiple communities to tackle it. With the increasing use of tools such as ChatGPT and the identification of fairness issues, ensuring the reliability of machine learning is paramount to its continued development. In this project, we will focus on the low-level implementation of machine learning, an area that has been largely ignored by the community yet has a significant impact on the reliability of major libraries and languages such as TensorFlow, Keras, PyTorch, Python, and R.
The project’s main idea is to test machine learning implementations at each level of abstraction, from the top-level language down to the low-level libraries. For that, the student will start from a version of the “Gödel Test”: a method that parametrises input generators for programs and controls those parameters to create testing strategies. Among the testing strategies, we will apply multiple test-suite generation strategies, such as focused testing (i.e. testing new software components that are common in traditional machine learning libraries) and vulnerability unmasking. The student will design a system based on search strategies that guide the generated inputs to exercise all possible branches of the machine learning code. For that, we will extend the testing framework of the MLighter tool, a holistic tool for evaluating the security, reliability and performance of machine learning, to deal with these specific problems.
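As a first illustration, the sketch below shows what a Gödel-Test-style parametrised generator could look like: a small parameter vector controls a random input generator, and a search loop perturbs those parameters to maximise a coverage signal. The names used here (make_generator, coverage_score, the parameter vector layout) are illustrative assumptions rather than MLighter’s actual API, and the coverage score is a stand-in for real branch-coverage feedback (e.g. from coverage.py or gcov).

```python
import random

def make_generator(params):
    """Return a generator of float lists controlled by the parameter vector `params`."""
    size, lo, hi, p_special = params
    def gen():
        values = []
        for _ in range(int(size)):
            if random.random() < p_special:
                # Inject boundary values that often expose UB or NaN-handling bugs.
                values.append(random.choice([float("nan"), float("inf"), -float("inf"), 0.0]))
            else:
                values.append(random.uniform(lo, hi))
        return values
    return gen

def coverage_score(inputs):
    """Placeholder for real branch-coverage feedback from the library under test."""
    return len({(v != v, v == 0.0, abs(v) == float("inf")) for v in inputs})

def search(iterations=100):
    params = [10, -1.0, 1.0, 0.1]            # initial parameter vector: size, lo, hi, p_special
    best = (0, params)
    for _ in range(iterations):
        candidate = [p * random.uniform(0.5, 1.5) for p in params]  # mutate the parameters
        candidate[3] = min(max(candidate[3], 0.0), 1.0)             # keep the probability valid
        score = coverage_score(make_generator(candidate)())
        if score > best[0]:                   # keep the parameters that improved coverage
            best = (score, candidate)
            params = candidate
    return best

if __name__ == "__main__":
    print(search())
```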
Part 1: ML libraries testing. Often, the creators of ML libraries are trained developers or professionals with qualifications in mathematics, statistics and AI, yet with little training in programming languages. While much of the correctness of ML libraries depends on these skills, the lack of a deep understanding of software engineering and programming languages can lead to buggy libraries, because these ML algorithms are written in C, C++ and Fortran, plugged into Python and R as external libraries, and prone to unspecified and undefined behaviours (UB). As a result, code that is not UB-free can exhibit wrong, unpredictable behaviour in the ML library, leading to silent errors that may not manifest until production and provoking a catastrophic maintenance effort in the machine learning pipeline. By using differential testing with multiple compilers and code analysers, we will be able to unmask these errors while also considering flaky machine learning models (i.e. models that do not provide a deterministic input/output behaviour but a probabilistic one), which we will model using information theory and entropy.
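To make this workflow concrete, the following is a minimal sketch of differential testing across toolchains, assuming gcc and clang are available on the PATH: the same small C kernel is built under different compilers and optimisation levels and the outputs are compared, with any disagreement pointing at UB in the kernel or at a miscompilation. The entropy helper at the end is one illustrative way to quantify the flakiness of a probabilistic model from repeated observations; none of this reflects MLighter’s existing interface.

```python
import math
import subprocess
import tempfile
from collections import Counter
from pathlib import Path

KERNEL = r"""
#include <stdio.h>
int main(void) {
    float acc = 0.0f;
    for (int i = 1; i <= 1000000; i++) acc += 1.0f / i;   /* order-sensitive FP sum */
    printf("%.9g\n", acc);
    return 0;
}
"""

def build_and_run(compiler, opt, src):
    """Compile `src` with one compiler/optimisation pair and return the program output."""
    exe = src.with_suffix("." + compiler + opt.replace("-", ""))
    subprocess.run([compiler, opt, str(src), "-o", str(exe)], check=True)
    return subprocess.run([str(exe)], capture_output=True, text=True).stdout.strip()

def differential_test():
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "kernel.c"
        src.write_text(KERNEL)
        results = {(c, o): build_and_run(c, o, src)
                   for c in ("gcc", "clang") for o in ("-O0", "-O3")}
    if len(set(results.values())) > 1:
        print("Disagreement across toolchains:", results)
    else:
        print("All toolchains agree:", next(iter(results.values())))

def flakiness_entropy(observations):
    """Shannon entropy (bits) of repeated model outputs: 0 means fully deterministic."""
    counts = Counter(observations)
    total = len(observations)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

if __name__ == "__main__":
    differential_test()
    print("entropy:", flakiness_entropy(["cat", "cat", "dog", "cat"]))
```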
To the best of our knowledge, only a few works exist on Python compiler fuzzing (and none on R compiler fuzzing) or on ML library fuzzing (mainly focusing on a small subset of these libraries). None of these works suggests a holistic way of dealing with the reliability of machine learning libraries together with the compilers generating their executable binaries, given that a failure can lie in any of the following parts (or a combination of them): (1) the Python or R compiler, (2) the ML library written in a compiled language like C, and (3) the optimising compiler (e.g. for C). We discuss (1) and (3) in Part 2.
Part 2: The lowest level of testing: testing the compilers. While much depends on the programmer’s knowledge of ML algorithm design, floating-point arithmetic and hardware implementations, failures caused by the compiler when it silently produces incorrect code have a broad impact on the software and are therefore more severe. Yet, the current level of support for testing the compilers commonly used with machine learning libraries is poor. We will extend the testing framework of MLighter to focus on these languages, i.e. Python and R, and their respective libraries in C, C++ and Fortran, while considering multiple architectures (such as ARM and Intel x86).
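The same differential idea applies one level down, to the language implementations themselves. The sketch below, assuming at least two Python implementations are installed (e.g. CPython’s python3 and pypy3), runs the same generated program under each and compares the outputs; an analogous harness could drive Rscript or cross-compiled C/C++/Fortran binaries for ARM versus x86 (e.g. under emulation).

```python
import shutil
import subprocess

PROGRAM = r"""
import math
total = 0.0
for i in range(1, 100001):
    total += math.sin(i) / i
print(round(total, 12))
"""

def run_under(interpreter, program):
    """Execute the generated program with one interpreter and capture its output."""
    return subprocess.run([interpreter, "-c", program],
                          capture_output=True, text=True).stdout.strip()

def main():
    # Only use the implementations actually installed on this machine.
    interpreters = [i for i in ("python3", "pypy3") if shutil.which(i)]
    results = {i: run_under(i, PROGRAM) for i in interpreters}
    if len(set(results.values())) > 1:
        print("Interpreters disagree:", results)
    else:
        print("Agreement across", list(results), ":", next(iter(results.values()), ""))

if __name__ == "__main__":
    main()
```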