Automatic Testing and Fixing Learning-based Conversational Agents with Knowledge Graphs

Background: Learning-based conversational agents can generate responses that violate basic logical rules and common sense, which degrades user experience and leads to mistrust and frustration. Building accurate, smart, and trustworthy conversational agents therefore requires adequate evaluation. Existing evaluation methods rely mainly on goal completion rate, customer satisfaction, or automatic similarity comparison between agent responses and human responses. These methods are labour-intensive, slow, and do not scale. Little work exists on testing learning-based conversational agents automatically and systematically.

Proposal: This project aims to automatically test and improve learning-based conversational agents using knowledge graphs. We will develop knowledge graph-based test input generation techniques for conversational agents, and design test coverage criteria based on graph coverage criteria. Test oracles will be derived from the knowledge graphs and from metamorphic testing techniques. The automatically generated test inputs and oracles will also be used to augment training data or fine-tune the model, improving the performance of conversational agents.

In addition, the oracle derivation approach can support real-time testing and fixing of agent conversations. Unlike the offline testing described above, real-time testing compares the agent's response to each user input against a test oracle derived from the knowledge graph. When the oracle is violated, we use it to guide response editing, producing more correct and trustworthy responses that obey logical rules and common sense.

WP1: Knowledge graph-based test input generation for learning-based conversational agents. This package focuses on testing conversational agents offline. It covers knowledge graph-based coverage criteria, graph node mutation for test input generation, and automatic test oracles derived through metamorphic relations.
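
The ingredients of WP1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the project's design: the toy knowledge graph, the question templates, the reading of "node mutation" as swapping a subject node for a sibling with the same relation, edge coverage as the coverage criterion, and paraphrase consistency as the metamorphic relation.

```python
from typing import Callable, List, NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical toy knowledge graph; a real graph would be far larger.
KG = [
    Triple("Paris", "capital_of", "France"),
    Triple("Berlin", "capital_of", "Germany"),
    Triple("Seine", "flows_through", "Paris"),
]

# Hypothetical templates: one question and one paraphrase per relation.
TEMPLATES = {
    "capital_of": ["Which country is {s} the capital of?",
                   "{s} is the capital city of which country?"],
    "flows_through": ["Which city does the {s} flow through?",
                      "The {s} runs through which city?"],
}


class KGTestCase(NamedTuple):
    triple: Triple
    question: str
    paraphrase: str
    expected: str  # oracle taken from the knowledge graph


def make_test(t: Triple) -> KGTestCase:
    q, p = TEMPLATES[t.relation]
    return KGTestCase(t, q.format(s=t.subject), p.format(s=t.subject), t.obj)


def generate_tests(kg: List[Triple]) -> List[KGTestCase]:
    """One test input per edge (triple) of the graph."""
    return [make_test(t) for t in kg]


def mutate_node(seed: KGTestCase, kg: List[Triple]) -> List[KGTestCase]:
    """Node mutation (assumed form): swap the subject for a sibling node
    with the same relation and re-derive the expected answer from the graph."""
    return [make_test(t) for t in kg
            if t.relation == seed.triple.relation
            and t.subject != seed.triple.subject]


def edge_coverage(tests: List[KGTestCase], kg: List[Triple]) -> float:
    """Fraction of graph edges exercised by at least one test."""
    covered = {tc.triple for tc in tests}
    return len(covered & set(kg)) / len(kg)


def run_test(agent: Callable[[str], str], tc: KGTestCase) -> Tuple[bool, bool]:
    """Return (kg_ok, mr_ok): the graph oracle holds for the question, and the
    paraphrase yields a consistent verdict (a simplified metamorphic relation)."""
    kg_ok = tc.expected.lower() in agent(tc.question).lower()
    paraphrase_ok = tc.expected.lower() in agent(tc.paraphrase).lower()
    return kg_ok, kg_ok == paraphrase_ok
```

Here `agent` stands for any callable that maps a user utterance to a response string. Substring matching is a deliberately crude oracle check; a practical implementation would need entity linking or semantic comparison.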

WP2: Training data augmentation for more logical and trustworthy conversational agents. This package uses the test inputs generated in WP1, together with their derived test oracles, either to augment the training data or to fine-tune the learning model, with the goal of improving the performance of conversational agents.
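
One way to picture the WP2 pipeline is to turn each failing test into a corrective training example. The sketch below assumes tests are (question, expected answer) pairs, that a failure means the expected answer is absent from the response, and that the fine-tuning data uses hypothetical `prompt` and `completion` fields serialised as JSON Lines. None of these choices comes from the proposal itself.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple


def build_augmentation_set(
    agent: Callable[[str], str],
    tests: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Collect a training example for every test the agent fails.

    Each test is a (question, expected_answer) pair whose expected answer
    comes from the knowledge-graph oracle.
    """
    examples = []
    for question, expected in tests:
        response = agent(question)
        # Assumed failure criterion: the oracle answer is missing.
        if expected.lower() not in response.lower():
            examples.append({"prompt": question, "completion": expected})
    return examples


def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialise examples as JSON Lines, one example per line."""
    return "\n".join(json.dumps(e) for e in examples)
```

Keeping only failing cases focuses the augmented data on behaviour the agent currently gets wrong; whether passing cases should also be included is a design question the sketch leaves open.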

WP3: Testing and fixing chatbot conversations on the fly. This package targets the testing and fixing of agent conversations with real users. Instead of generating test inputs to test conversational agents systematically, it checks whether the chatbot under test responds logically to real user inputs. To support real-time testing, test oracles will be generated automatically through knowledge graph retrieval. Once an illogical response is detected, it will be fixed under the guidance of the test oracles. The repaired response then replaces the original faulty response before it is delivered to end users.
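
The check-then-repair loop of WP3 might look roughly like the following sketch. It assumes a hypothetical dictionary-based knowledge graph, retrieval by naive substring matching on the user input, and repair by substituting the correct entity for a conflicting one. The proposal does not specify any of these mechanisms, so the code only shows the control flow.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical knowledge graph: (subject, relation) -> object.
KG = {
    ("Paris", "capital_of"): "France",
    ("Berlin", "capital_of"): "Germany",
}

Oracle = Tuple[str, str, str]  # (subject, relation, expected object)


def retrieve_oracle(user_input: str) -> Optional[Oracle]:
    """Naive retrieval: the first fact whose subject appears in the input."""
    for (subj, rel), obj in KG.items():
        if subj.lower() in user_input.lower():
            return subj, rel, obj
    return None


def conflicting_entities(response: str, oracle: Oracle) -> set:
    """Objects of the same relation that the response mentions in place of
    the expected one (an assumed, simplified notion of oracle violation)."""
    _, rel, expected = oracle
    candidates = {o for (_, r), o in KG.items() if r == rel}
    return {c for c in candidates
            if c != expected and c.lower() in response.lower()}


def repair(response: str, oracle: Oracle) -> str:
    """Oracle-guided edit: replace each conflicting entity with the expected one."""
    expected = oracle[2]
    for wrong in conflicting_entities(response, oracle):
        response = re.sub(re.escape(wrong), expected, response,
                          flags=re.IGNORECASE)
    return response


def guarded_reply(agent: Callable[[str], str], user_input: str) -> str:
    """Check the agent's response against the oracle and repair it if needed."""
    response = agent(user_input)
    oracle = retrieve_oracle(user_input)
    if oracle is None:  # nothing to check against
        return response
    return repair(response, oracle)
```

For example, if a stub agent answers "Paris is the capital of Germany." to "What is Paris the capital of?", `guarded_reply` returns "Paris is the capital of France.". Responses to inputs that mention no known entity pass through unchanged.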

Maroengsit, Wari, Thanarath Piyakulpinyo, Korawat Phonyiam, Suporn Pongnumkul, Pimwadee Chaovalit, and Thanaruk Theeramunkong. “A Survey on Evaluation Methods for Chatbots.” In Proceedings of the 2019 7th International Conference on Information and Education Technology, pp. 111-119. 2019.

Liu, Haochen, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. “Does Gender Matter? Towards Fairness in Dialogue Systems.” In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4403-4416. 2020.

Radziwill, Nicole, and Morgan Benton. “Evaluating Quality of Chatbots and Intelligent Conversational Agents.” Software Quality Professional 19, no. 3 (2017).

Dr Jie Zhang

Dr Mohammad Mousavi


Norms, Verification