Ziyan Wang

Ph.D. Candidate · Cooperative AI Lab · King's College London

Email: ziyan.wang[at]kcl[dot]ac[dot]uk

Research overview

I am a fourth-year Ph.D. candidate at the Cooperative AI Lab, King's College London, supervised by Dr Yali Du and Prof. Sanjay Modgil. My work studies how learning agents can coordinate, communicate, and act safely in complex environments.

My central question is how to distill executable policies from human knowledge. Human knowledge appears as thinking patterns, direct instruction, books, and collective behavior; my work asks how learning agents can turn these media into robust decision-making policies.

In reinforcement learning and multi-agent reinforcement learning (MARL), I study policy learning from books (PLFB), human feedback (M3HF), causal credit assignment (MACCA, GRD), and constrained decision-making (MACPO, SMALL).

In language-agent systems, I study how instructions, social interaction, and shared memory shape agent behavior, including instruction relabeling, strategic discussion, mixed-motive generalization, marketplace safety, and context management.

I am currently an Oxford IDAI Fellow working with Dr Adel Bibi and Prof. Philip Torr, and a research intern in the Future AI Group at Microsoft Research Cambridge, working with Dr Kirill P. Kalinin. I have also visited Carnegie Mellon University with Prof. Fei Fang and worked with Microsoft Research's AI Frontier Group in Redmond.

Research direction

Distilling Policy from Human Knowledge

My research goal is to distill policies from human knowledge. Knowledge can be implicit in reasoning patterns, shaped through direct instruction, preserved in books, and amplified through collective behavior; my papers study how agents can learn from these media to coordinate, adapt, and act safely.

Four media of human knowledge: thinking patterns, direct instruction, books, and collective behavior.

News

May 2026	Started a research internship at Microsoft Research Cambridge, focusing on multi-agent LLM communication and coordination.
Apr 2026	Memento is now online, exploring how LLMs can manage their own context.
Feb 2026	Started the Oxford IDAI Fellowship at the University of Oxford, working with Dr Adel Bibi and Prof. Philip Torr.
Nov 2025	SMALL has been accepted to AAAI AIA 2026! See you in Singapore!
Sep 2025	Started a research internship at Microsoft Research in Redmond, focusing on LLM reasoning.
Sep 2025	One paper has been accepted to NeurIPS 2025!

Experience & Visits

Research Internship, Future AI Group

Microsoft Research Cambridge, Cambridge, UK · May 2026 - present

Working with Dr Kirill P. Kalinin on multi-agent LLM communication, coordination, and collaborative agent behavior.

Oxford IDAI Fellowship

University of Oxford, Oxford, UK · Feb. 2026 - present

Working with Dr Adel Bibi and Prof. Philip Torr on real-time multi-agent LLM anomaly detection and monitoring.

Research Internship, AI Frontier Group

Microsoft Research, Redmond, US · Sep. 2025 - Dec. 2025

Worked with Vaishnavi Shrivastava and Prof. Dimitris Papailiopoulos on LLM pre-training and reasoning.

Visiting Ph.D. Student

Carnegie Mellon University, Pittsburgh, US · Feb. 2025 - Jun. 2025

Visited Prof. Fei Fang's group, working on multi-agent learning and AI for social impact.

Selected Publications

* equal contribution, ✉ corresponding author

AAAI'26

Safe Multi-agent Reinforcement Learning with Natural Language Constraints

Ziyan Wang, Meng Fang, Tristan Tomilin, Fei Fang and Yali Du

Alignment Track of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI) 2026

The role of natural language constraints in Safe Multi-agent Reinforcement Learning (MARL) is crucial, yet often overlooked. While Safe MARL has vast potential, especially in fields like robotics and autonomous vehicles, its full potential is limited by the need to define constraints in pre-designed mathematical terms, which requires extensive domain expertise and reinforcement learning knowledge, hindering its broader adoption. To address this limitation and make Safe MARL more accessible and adaptable, we propose a novel approach named Safe Multi-agent Reinforcement Learning with Natural Language constraints (SMALL). Our method leverages fine-tuned language models to interpret and process free-form textual constraints, converting them into semantic embeddings that capture the essence of prohibited states and behaviours. These embeddings are then integrated into the multi-agent policy learning process, enabling agents to learn policies that minimize constraint violations while optimizing rewards. To evaluate the effectiveness of SMALL, we introduce LaMaSafe, a multi-task benchmark designed to assess the performance of multiple agents in adhering to natural language constraints. Empirical evaluations across various environments demonstrate that SMALL achieves comparable rewards and significantly fewer constraint violations, highlighting its effectiveness in understanding and enforcing natural language constraints.

@inproceedings{wang2024small,
  title={Safe Multi-agent Reinforcement Learning with Natural Language Constraints},
  author={Wang, Ziyan and Fang, Meng and Tomilin, Tristan and Fang, Fei and Du, Yali},
  booktitle={Alignment Track of the 40th Annual AAAI Conference on Artificial Intelligence},
  year={2026}
}

ICML'25

M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality

Ziyan Wang, Zhicheng Zhang, Fei Fang and Yali Du

Forty-Second International Conference on Machine Learning (ICML) 2025

Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality (M3HF), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, M3HF leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately and update reward functions through predefined templates and adaptive weight by using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that M3HF significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.

@inproceedings{pmlr-v267-wang25el,
  title={M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality},
  author={Wang, Ziyan and Zhang, Zhicheng and Fang, Fei and Du, Yali},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  pages={65429--65448},
  year={2025},
  volume={267},
  series={Proceedings of Machine Learning Research},
  publisher={PMLR},
  url={https://proceedings.mlr.press/v267/wang25el.html}
}

TMLR

MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang and Biwei Huang

Transactions on Machine Learning Research (TMLR) 2025

Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interactions with an environment are prohibited. In this paper, we propose a new framework, namely Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to seamlessly integrate with various offline MARL methods. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.

@article{wang2023macca,
  title={MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment},
  author={Wang, Ziyan and Du, Yali and Zhang, Yudi and Fang, Meng and Huang, Biwei},
  journal={Transactions on Machine Learning Research},
  year={2025}
}

Oral

NeurIPS'24

Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting

Xiong-Hui Chen*, Ziyan Wang*, Yali Du, Shengyi Jiang, Meng Fang, Yang Yu and Jun Wang

The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS) 2024

When humans need to learn a new skill, we can acquire knowledge through written books, including textbooks, tutorials, and comments from previous learners. However, current research for decision-making, like reinforcement learning (RL), has primarily required numerous real interactions with the target environment to learn a skill, while failing to utilize the existing knowledge already summarized in the text. The success of Large Language Models (LLMs) sheds light on utilizing such knowledge behind the books. In this paper, we discuss a new policy learning problem called Policy Learning from Books, which aims to leverage rich resources such as books and tutorials to derive a policy network. Inspired by how humans learn from books, we solve the problem via a three-stage framework: understanding, rehearsing, and introspecting (URI). In particular, it first rehearses decision-making trajectories based on the derived knowledge after understanding the books, then introspects in the imaginary dataset to distill a policy network. To validate the practicality of this methodology, we train a football-playing policy via URI and test it in the Google Football game. The agent can beat the built-in AI with a 37% winning rate without interaction with the environment during training, while using GPT as the agent can only achieve a 6% winning rate.

@inproceedings{chen2024plfb,
  title={Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting},
  author={Chen, Xiong-Hui and Wang, Ziyan and Du, Yali and Jiang, Shengyi and Fang, Meng and Yu, Yang and Wang, Jun},
  booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

NeurIPS'24

Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf

Xuanfa Jin*, Ziyan Wang*, Yali Du, Meng Fang, Haifeng Zhang and Jun Wang

The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS) 2024

Communication is a fundamental aspect of human society, facilitating the exchange of information and beliefs among people. Despite the advancements in large language models (LLMs), recent agents built with these often neglect the control over discussion tactics, which are essential in communication games. As a variant of the famous communication game Werewolf, One Night Ultimate Werewolf (ONUW) requires sophisticated discussion tactics due to the potential role changes that increase the uncertainty and complexity of the game. In this work, we find Perfect Bayesian Equilibria (PBEs) in the ONUW game, illustrating the significance of using discussion tactics. Furthermore, we propose a novel RL-instructed language agent framework, where a policy is employed to determine appropriate discussion tactics to adopt. Our experiment results on the ONUW game demonstrate the effectiveness and generalization ability of our proposed framework.

@inproceedings{jin2024werewolf,
  title={Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf},
  author={Jin, Xuanfa and Wang, Ziyan and Du, Yali and Fang, Meng and Zhang, Haifeng and Wang, Jun},
  booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

NeurIPS'23

ChessGPT: Bridging Policy Learning and Language Modeling

Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du and Jun Wang

The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS) 2023

When solving decision-making tasks, humans typically depend on information from two key sources: (1) Historical policy data, which provides interaction replay from the environment, and (2) Analytical insights in natural language form, exposing the invaluable thought process or strategic considerations. Despite this, the majority of preceding research focuses on only one source: they either use historical replay exclusively to directly learn policy or value functions, or engage in language model training utilizing mere language corpus. In this paper, we argue that a powerful autonomous agent should cover both sources. Thus, we propose ChessGPT, a GPT model bridging policy learning and language modeling by integrating data from these two sources in Chess games. Specifically, we build a large-scale game and language dataset related to chess. Leveraging the dataset, we showcase two model examples ChessCLIP and ChessGPT, integrating policy learning and language modeling. Finally, we propose a full evaluation framework for evaluating language model's chess ability. Experimental results validate our model and dataset's effectiveness. We open source our code, model, and dataset at https://github.com/waterhorse1/ChessGPT.

@inproceedings{feng2023chessgpt,
  title={ChessGPT: Bridging Policy Learning and Language Modeling},
  author={Feng, Xidong and Luo, Yicheng and Wang, Ziyan and Tang, Hongrui and Yang, Mengyue and Shao, Kun and Mguni, David and Du, Yali and Wang, Jun},
  booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS)},
  volume={36},
  year={2023}
}

Honors & Teaching

Honors: Oxford IDAI Fellowship, NeurIPS 2024 Scholar Award, NeurIPS 2024 Oral Presentation
Teaching: Oxford Machine Learning Summer School, Oxford MLx Fundamentals Summer School, and Optimisation Methods at King’s College London

Professional Services

Conference reviewer for ICML 2023/24/25/26, NeurIPS 2023/24/25/26, ICLR 2024/25/26, AISTATS 2025/26, and AAMAS 2025/26
Journal reviewer for IEEE Robotics and Automation Letters, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Artificial Intelligence
