Helping AI Write Better Code: New Research from Shuyin Ouyang Tackles a Key Challenge in Trustworthy AI

5th June 2026 | News, Student News


As artificial intelligence becomes an everyday tool for writing computer code, a key challenge remains: how can we trust AI systems to reuse existing code and tools accurately, without making critical mistakes?

New research from PhD researcher Shuyin Ouyang and his co-authors is addressing this issue by evaluating how large language models (LLMs) generate and reuse data science code, a specialised form of coding that relies heavily on reusing existing software libraries.

Building a new benchmark for data science coding

In many areas of programming, developers write large sections of code from scratch. Data science, however, works differently. Much like academic researchers citing previous studies or lawyers referencing previous cases, data scientists often rely on collections of existing code and use them to analyse data, build models, and produce visual results.

Shuyin explains this using an analogy: “It’s like writing a research paper. You don’t rewrite all the previous research, you reference the titles of the papers.” The problem is that LLMs don’t always ‘reference’ or reuse these collections of existing code properly. While LLMs have become highly capable at general coding, they often struggle with the fine details of data science workflows. For example, they may call the wrong function, invent code that does not exist, or apply incorrect settings. Errors of this kind are part of what researchers call ‘hallucinations’.
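
To make these three kinds of error concrete, here is a small illustrative sketch in Python. It is not taken from the research or from the benchmark: it uses Python's built-in `statistics` module as a stand-in for a data science library, and the task, the data, and the variable names are assumptions chosen purely for illustration.

```python
import statistics

# Illustrative data and task (assumptions, not from the research):
# "compute the sample standard deviation and the inclusive quartiles".
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Correct reuse of the library for the stated task.
correct_stdev = statistics.stdev(data)
correct_quartiles = statistics.quantiles(data, n=4, method="inclusive")

# 1. Wrong function: pstdev is a real function, but it computes the
#    population standard deviation, not the sample one that was asked for.
wrong_function = statistics.pstdev(data)

# 2. Invented code: standard_deviation looks plausible but does not exist
#    in the statistics module, so the call fails.
try:
    statistics.standard_deviation(data)
    invented_call_failed = False
except AttributeError:
    invented_call_failed = True

# 3. Incorrect settings: the right function, but left on its default
#    method ("exclusive") instead of the requested "inclusive".
wrong_settings = statistics.quantiles(data, n=4)

print("sample stdev:", correct_stdev, "| wrong function gives:", wrong_function)
print("invented call failed:", invented_call_failed)
print("inclusive quartiles:", correct_quartiles, "| wrong settings give:", wrong_settings)
```

Only the invented call fails loudly. The wrong function and the wrong setting both run without complaint and return plausible-looking numbers, which is part of why such mistakes can be hard to spot.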

To address this, Shuyin and his co-authors have developed a new benchmark, ‘DSCodeBench’, designed to test how well LLMs code using existing data science libraries. The benchmark acts as a structured test, allowing researchers to measure how accurately AI systems perform this kind of reference-heavy coding.
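
As a rough sketch of what a ‘structured test’ of this kind can look like, the Python example below checks several candidate solutions against a trusted reference and reports the share that pass. This is an assumption-laden illustration, not the actual design of DSCodeBench: the helper names `passes` and `pass_rate`, the candidates, and the test inputs are all hypothetical, and the built-in `statistics` module again stands in for a data science library.

```python
import statistics
from typing import Callable, Sequence

Solution = Callable[[Sequence[float]], float]


def passes(candidate: Solution, reference: Solution,
           inputs: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """Return True if the candidate matches the reference on every input."""
    for values in inputs:
        try:
            got = candidate(values)
        except Exception:
            # A crash (for example, calling a function that does not exist)
            # counts as a failure.
            return False
        if abs(got - reference(values)) > tol:
            return False
    return True


def pass_rate(candidates: Sequence[Solution], reference: Solution,
              inputs: Sequence[Sequence[float]]) -> float:
    """Fraction of candidate solutions that pass all test inputs."""
    results = [passes(c, reference, inputs) for c in candidates]
    return sum(results) / len(results)


# Hypothetical task: "compute the sample standard deviation".
reference = statistics.stdev
test_inputs = [[2, 4, 4, 4, 5, 5, 7, 9], [1.0, 2.0, 3.0], [10, 10, 11]]

# Three hypothetical model outputs: one correct, one using the wrong
# function, and one calling a function that does not exist.
candidates = [
    lambda v: statistics.stdev(v),
    lambda v: statistics.pstdev(v),
    lambda v: statistics.standard_deviation(v),
]

rate = pass_rate(candidates, reference, test_inputs)
print(f"pass rate: {rate:.2f}")
```

Comparing outputs against a reference on several inputs, rather than only checking that the code runs, is what lets a test like this catch the quiet errors as well as the crashes.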

Keeping pace with fast-moving AI

One of the most striking findings for Shuyin has been just how quickly LLMs are improving. During the nine months it took to develop the benchmark, Shuyin and his co-authors had to revise it multiple times as newer LLMs rapidly surpassed earlier performance levels.

“What worked as a difficult test at the beginning became too easy for the LLMs just a few months later,” Shuyin explains. This meant that Shuyin and his co-authors had to keep updating the benchmark in step with the LLMs to make sure it remained relevant.

Despite substantial improvements in LLM performance, the benchmark showed something surprising: some weaknesses persisted across different generations of LLMs. For Shuyin, this shows where future research and improvements are needed, and it is an opportunity for both him and the wider research community.

From code generation to the whole software development pipeline

Shuyin’s PhD research goes beyond code generation alone. His broader research explores how AI might eventually support the entire software development process, from idea generation through code generation and repair to long-term maintenance.

Today, many benchmarks focus only on whether AI can write snippets of code. Shuyin argues that the next step is evaluating whether models can understand user requirements, plan solutions, and integrate software components responsibly.

Using LLMs across the entire software development lifecycle could significantly reduce the time spent on repetitive and time-consuming coding tasks, allowing developers to focus more on creativity and exploring new research areas. This also relates in a broader sense to ‘vibe-coding’, where people without deep programming knowledge describe what they want to an AI system which then generates the applications or tools for them. However, as Shuyin emphasises, this approach still requires careful evaluation and strong human oversight to ensure it remains reliable and trustworthy.

Why this research matters

At its core, Shuyin’s work contributes to making AI more trustworthy. As more people, including students, researchers, and non-experts, rely on AI-generated code, mistakes become harder to detect and potentially more harmful.

By improving how we evaluate AI coding tools, this research helps ensure models do exactly what users ask, without hidden errors, unexpected behaviour, or unsafe shortcuts.

Notably, Shuyin and his co-authors also applied their research in practice, with the benchmark itself developed using AI tools alongside thorough human verification. This hybrid approach mirrors the future Shuyin advocates for, where AI is a powerful assistant, but humans are firmly “in the loop”.

After the research was presented at the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), it prompted discussion with researchers from technology companies who are working on related challenges in developing and evaluating large language models.

For Shuyin, this response confirms that he and his co-authors are working in the right direction, given the importance of the research for both academia and industry. “It’s very motivating and I feel very proud of what we’re doing,” he says. “It shows this problem matters, not just in theory, but in real-world applications.”

Safe and Trusted AI research community

For Shuyin, being part of the UKRI Centre for Doctoral Training in Safe and Trusted AI (STAI CDT) has provided him with a supportive research environment that has shaped both the direction and development of his work. Through seminars, conferences, and regular interaction with peers working on different applications of large language models, for example in robotics, he has been able to share ideas, gain new perspectives, and reflect critically on his own research. The cohort-based structure has helped him avoid working in isolation, encouraged cross-disciplinary thinking, and supported the ongoing refinement of his research as the field continues to develop rapidly.

Shuyin is passionate about being part of and contributing to the wider safe and trusted AI research community. We are proud of his research and look forward to sharing his future insights.