Trusted Test Suites for Safe Agent-Based Simulations

Agent-based models (ABMs) are an AI technique for improving our understanding of complex real-world interactions and the “emergent behaviours” that arise from them. ABMs are used to develop and test theories, or to explore how interventions might change behaviour. For example, we are working on a model of staff and patient interaction in emergency medicine, exploring how interventions affect efficiency and safety. With the Francis Crick Institute, we study how cells coordinate and manage the growth of blood vessels.

To create trust in the results of ABM simulations, assurances are needed about their correctness. This requires a systematic approach to the validation and verification of ABMs that can be clearly documented as part of an overall fitness-for-purpose argument: an argument explaining why the ABM is a sufficiently accurate representation of reality, and why its results can be trusted as a meaningful basis for expert conclusions and real-world interventions. Many (though not all) types of ABM validation and verification can be thought of as a form of testing of the ABM system. From software engineering, we know that automated test suites are a key foundation for quality assurance: they provide a strong safeguard against regressions, the accidental introduction of a problem in one part of a complex piece of software through changes made in an apparently unrelated part.

Testing ABMs is fundamentally different from testing other software systems: 

  1. While some aspects of ABM testing, similar to unit-testing in software engineering, aim to establish the correct implementation of various functionalities in the ABM, the majority of ABM validation and verification tasks are concerned with establishing a fitness-for-purpose argument about the relationship between the ABM and the reality it models. This requires the ability to trace domain hypotheses directly into test cases, their execution as simulation experiments, and the results from these experiments, together with automated analysis of the results in relation to the original hypothesis [1]. 
  1. Simulations are stochastic processes; different runs will produce different results. The simulation must produce meaningful results not just for one run, but across multiple runs in a statistically significant manner. Understanding how many runs to execute and which parameters to vary across these runs to obtain statistical significance is non-trivial. 
  1. Establishing what constitutes a successful test run is itself non-trivial: unlike typical software unit tests, we are looking for complex, often temporal, properties to hold over traces of the states and state changes of large sets of interacting agents. The source information to be evaluated is contained in textual logs from simulation runs, often at a level of granularity different from that required for test evaluation. 
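The second and third challenges can be illustrated with a minimal sketch: a toy stochastic simulation is run many times, a temporal property ("eventually, the mean agent state stays positive") is evaluated over each run's trace, and the fraction of satisfying runs is reported. All names here (`run_simulation`, `eventually_always_positive`, and the toy random-walk model itself) are hypothetical illustrations, not part of SPARTAN, MC2MABS, or our ABMs.

```python
import random

def run_simulation(n_agents=20, steps=50, seed=None):
    """Toy stochastic ABM stand-in: each agent performs a biased random walk.
    Returns a trace: one list of agent positions per time step."""
    rng = random.Random(seed)
    positions = [0] * n_agents
    trace = []
    for _ in range(steps):
        positions = [p + (1 if rng.random() < 0.6 else -1) for p in positions]
        trace.append(list(positions))
    return trace

def eventually_always_positive(trace, from_step=30):
    """Temporal property over one trace: from `from_step` onwards, the mean
    agent position remains positive (an 'eventually always' style check)."""
    return all(sum(state) / len(state) > 0 for state in trace[from_step:])

def proportion_satisfying(prop, runs=100):
    """Evaluate the property across many stochastic runs (different seeds)
    and report the fraction of runs in which it holds."""
    hits = sum(prop(run_simulation(seed=i)) for i in range(runs))
    return hits / runs

rate = proportion_satisfying(eventually_always_positive)
print(f"Property held in {rate:.0%} of runs")
```

A real test suite would replace the simple fraction with a proper statistical treatment (e.g. a hypothesis test over the pass rate), which is precisely the non-trivial analysis the second challenge refers to.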

Previous work on the SPARTAN [] and MC2MABS [] tools has begun to address some of the more technical challenges. However, the tie-in with an overall fitness-for-purpose argument that can be inspected and challenged by domain experts (who may not be experts in ABMs or, indeed, in programming) remains an open research gap. Providing support for the precise specification of such fitness-for-purpose arguments, in a traceable form that can be automatically translated into tests (simulation experiments), is critical to advancing the field of agent-based modelling. 

The aim of this PhD project is to develop a domain-specific modelling approach that allows domain experts to express hypotheses and properties in a language close to the problem domain. These specifications will be automatically translated into an automated test suite, building on SPARTAN and MC2MABS, without exposing end users to the implementation details of the ABM or to the specifics of how simulation runs are encoded in simulation-engine log files. This should then be embedded in a larger model of a fitness-for-purpose argument, following the initial vision presented in [1]. The project contributes to safe AI by making ABMs more reliable. At the same time, it increases trust in simulation results because the tests executed can be inspected, understood, and manipulated by domain experts rather than only by technical experts. 
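To give a flavour of the idea, the sketch below shows how a domain-level hypothesis might be captured declaratively and mechanically translated into an experiment plan. This is purely illustrative: the `Hypothesis` structure, the field names, and the emergency-medicine parameters are invented for this example and do not represent the project's actual language design.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    description: str   # domain-level statement, readable by non-programmers
    parameter: str     # simulation parameter to vary
    values: list       # parameter values to compare
    metric: str        # quantity measured over simulation traces
    expectation: str   # expected direction of change, e.g. "decreases"

h = Hypothesis(
    description="Adding a triage nurse reduces mean patient waiting time",
    parameter="triage_nurses",
    values=[1, 2],
    metric="mean_waiting_time",
    expectation="decreases",
)

def to_experiment(h):
    """Translate the declarative hypothesis into an experiment plan:
    one batch of replicated runs per parameter value, plus the
    comparison the analysis step should perform on the results."""
    return {
        "runs": [{h.parameter: v, "replicates": 100} for v in h.values],
        "analysis": {"metric": h.metric, "expected": h.expectation},
    }

plan = to_experiment(h)
print(plan)
```

The point of such a translation is traceability: each generated run batch and analysis step links back to the domain expert's original statement, so the resulting test suite can be inspected and challenged without reading simulation code or log files.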

[1] Steffen Zschaler and Fiona Polack: A Family of Languages for Trustworthy Agent-Based Simulation. 13th International Conference on Software Language Engineering, 2020. 
