Ziyan Wang's Homepage

2025

M³HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality ICML'25

Ziyan Wang, Zhicheng Zhang, Fei Fang, and Yali Du

In Forty-Second International Conference on Machine Learning (ICML), 2025

Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality (M3HF), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, M3HF leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately and update reward functions through predefined templates and adaptive weight by using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that M3HF significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.
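
The adaptive weighting mentioned in the abstract can be pictured with a small sketch. This is only an illustration, not the paper's update rule: the function names, the decay factor, the step size, and the renormalization are all assumptions made for the example. It shows one plausible way to decay older feedback-derived reward terms, add a new one, and nudge a weight according to the change in return.

```python
def update_weights(weights, decay=0.9, new_weight=1.0):
    """Decay existing feedback weights, append one for the new phase, renormalize.

    Hypothetical helper: the decay factor and normalization are assumptions.
    """
    decayed = [w * decay for w in weights] + [new_weight]
    total = sum(decayed)
    return [w / total for w in decayed]


def adjust_for_performance(weights, index, delta_return, step=0.1):
    """Raise weight `index` if the return improved, otherwise lower it; renormalize."""
    adjusted = list(weights)
    change = step if delta_return > 0 else -step
    adjusted[index] = max(0.0, adjusted[index] + change)
    total = sum(adjusted)
    return [w / total for w in adjusted] if total > 0 else adjusted


def combined_reward(component_rewards, weights):
    """Weighted sum of the per-feedback reward terms."""
    return sum(w * r for w, r in zip(weights, component_rewards))
```

In this sketch, feedback that is followed by a drop in return loses influence over later phases, which is one simple way to tolerate feedback of mixed quality.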

@inproceedings{wang2025m3hf,
  title={M³HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality},
  author={Wang, Ziyan and Zhang, Zhicheng and Fang, Fei and Du, Yali},
  booktitle={Forty-Second International Conference on Machine Learning},
  year={2025}
}

MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment TMLR

Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, and Biwei Huang

In Transactions on Machine Learning Research (TMLR), 2025

Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interactions with an environment are prohibited. In this paper, we propose a new framework, namely Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to seamlessly integrate with various offline MARL methods. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.

@inproceedings{wang2023macca,
  title={MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment},
  author={Wang, Ziyan and Du, Yali and Zhang, Yudi and Fang, Meng and Huang, Biwei},
  booktitle={arXiv preprint arXiv:2312.03644},
  year={2024}
}

2024

Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting Oral NeurIPS'24

Xiong-Hui Chen*, Ziyan Wang*, Yali Du, Shengyi Jiang, Meng Fang, Yang Yu, and Jun Wang

In The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

When humans need to learn a new skill, we can acquire knowledge through written books, including textbooks, tutorials, and comments from previous learners. However, current research for decision-making, like reinforcement learning (RL), has primarily required numerous real interactions with the target environment to learn a skill, while failing to utilize the existing knowledge already summarized in the text. The success of Large Language Models (LLMs) sheds light on utilizing such knowledge behind the books. In this paper, we discuss a new policy learning problem called Policy Learning from Books, which aims to leverage rich resources such as books and tutorials to derive a policy network. Inspired by how humans learn from books, we solve the problem via a three-stage framework: understanding, rehearsing, and introspecting (URI). In particular, it first rehearses decision-making trajectories based on the derived knowledge after understanding the books, then introspects in the imaginary dataset to distill a policy network. To validate the practicality of this methodology, we train a football-playing policy via URI and test it in the Google Football game. The agent can beat the built-in AI with a 37% winning rate without interaction with the environment during training, while using GPT as the agent can only achieve a 6% winning rate.

@inproceedings{chen2024plfb,
  title={Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting},
  author={Chen, Xiong-Hui and Wang, Ziyan and Du, Yali and Jiang, Shengyi and Fang, Meng and Yu, Yang and Wang, Jun},
  booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf NeurIPS'24

Xuanfa Jin*, Ziyan Wang*, Yali Du, Meng Fang, Haifeng Zhang, and Jun Wang

In The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

Communication is a fundamental aspect of human society, facilitating the exchange of information and beliefs among people. Despite the advancements in large language models (LLMs), recent agents built with these often neglect the control over discussion tactics, which are essential in communication games. As a variant of the famous communication game Werewolf, One Night Ultimate Werewolf (ONUW) requires sophisticated discussion tactics due to the potential role changes that increase the uncertainty and complexity of the game. In this work, we find Perfect Bayesian Equilibria (PBEs) in the ONUW game, illustrating the significance of using discussion tactics. Furthermore, we propose a novel RL-instructed language agent framework, where a policy is employed to determine appropriate discussion tactics to adopt. Our experiment results on the ONUW game demonstrate the effectiveness and generalization ability of our proposed framework.

@inproceedings{jin2024werewolf,
  title={Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf},
  author={Jin, Xuanfa and Wang, Ziyan and Du, Yali and Fang, Meng and Zhang, Haifeng and Wang, Jun},
  booktitle={The Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

Safe Multi-agent Reinforcement Learning with Natural Language Constraints ICLR'24 GenAI4DM Workshop

Ziyan Wang, Meng Fang, Tristan Tomilin, Fei Fang, and Yali Du

In ICLR 2024 Workshop on Generative Models for Decision Making (ICLR GenAI4DM), 2024

The role of natural language constraints in Safe Multi-agent Reinforcement Learning (MARL) is crucial, yet often overlooked. While Safe MARL has vast potential, especially in fields like robotics and autonomous vehicles, its full potential is limited by the need to define constraints in pre-designed mathematical terms, which requires extensive domain expertise and reinforcement learning knowledge, hindering its broader adoption. To address this limitation and make Safe MARL more accessible and adaptable, we propose a novel approach named Safe Multi-agent Reinforcement Learning with Natural Language constraints (SMALL). Our method leverages fine-tuned language models to interpret and process free-form textual constraints, converting them into semantic embeddings that capture the essence of prohibited states and behaviours. These embeddings are then integrated into the multi-agent policy learning process, enabling agents to learn policies that minimize constraint violations while optimizing rewards. To evaluate the effectiveness of SMALL, we introduce the LaMaSafe, a multi-task benchmark designed to assess the performance of multiple agents in adhering to natural language constraints. Empirical evaluations across various environments demonstrate that SMALL achieves comparable rewards and significantly fewer constraint violations, highlighting its effectiveness in understanding and enforcing natural language constraints.

@inproceedings{wang2024small,
  title={Safe Multi-agent Reinforcement Learning with Natural Language Constraints},
  author={Wang, Ziyan and Fang, Meng and Tomilin, Tristan and Fang, Fei and Du, Yali},
  booktitle={ICLR 2024 Workshop on Generative Models for Decision Making (ICLR GenAI4DM)},
  year={2024}
}

Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models AAMAS'24

Xingzhou Lou, Junge Zhang, Ziyan Wang, Kaiqi Huang, and Yali Du

In The 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2024

Safe reinforcement learning (RL) agents accomplish given tasks while adhering to specific constraints. Employing constraints expressed via easily-understandable human language offers considerable potential for real-world applications due to its accessibility and non-reliance on domain expertise. Previous safe RL methods with natural language constraints typically adopt a recurrent neural network, which leads to limited capabilities when dealing with various forms of human language input. Furthermore, these methods often require a ground-truth cost function, necessitating domain expertise for the conversion of language constraints into a well-defined cost function that determines constraint violation. To address these issues, we propose to use pre-trained language models (LM) to facilitate RL agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. Through the use of pre-trained LMs and the elimination of the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. Experiments on grid-world navigation and robot control show that the proposed method can achieve strong performance while adhering to given constraints. The usage of pre-trained LMs allows our method to comprehend complicated constraints and learn safe policies without the need for ground-truth cost at any stage of training or evaluation. Extensive ablation studies are conducted to demonstrate the efficacy of each part of our method.

@article{Lou2024SafeRL,
  title={Safe Reinforcement Learning with Free-form Natural Language Constraints and Pre-Trained Language Models},
  author={Lou, Xingzhou and Zhang, Junge and Wang, Ziyan and Huang, Kaiqi and Du, Yali},
  journal={The 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS)},
  year={2024},
  volume={abs/2401.07553},
  url={https://api.semanticscholar.org/CorpusID:266999399}
}

2023

ChessGPT: Bridging Policy Learning and Language Modeling NeurIPS'23

Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang

In The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

When solving decision-making tasks, humans typically depend on information from two key sources: (1) historical policy data, which provides interaction replay from the environment, and (2) analytical insights in natural language form, exposing the invaluable thought process or strategic considerations. Despite this, the majority of preceding research focuses on only one source: it either uses historical replay exclusively to directly learn policy or value functions, or trains language models on a language corpus alone. In this paper, we argue that a powerful autonomous agent should cover both sources. Thus, we propose ChessGPT, a GPT model bridging policy learning and language modeling by integrating data from these two sources in chess games. Specifically, we build a large-scale game and language dataset related to chess. Leveraging the dataset, we showcase two model examples, ChessCLIP and ChessGPT, integrating policy learning and language modeling. Finally, we propose a full evaluation framework for evaluating a language model's chess ability. Experimental results validate the effectiveness of our model and dataset. We open source our code, model, and dataset at https://github.com/waterhorse1/ChessGPT.

@inproceedings{feng2024chessgpt,
  title={Chessgpt: Bridging policy learning and language modeling},
  author={Feng, Xidong and Luo, Yicheng and Wang, Ziyan and Tang, Hongrui and Yang, Mengyue and Shao, Kun and Mguni, David and Du, Yali and Wang, Jun},
  booktitle={The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS)},
  volume={36},
  year={2024}
}

Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach NeurIPS'23

Yudi Zhang, Yali Du, Biwei Huang, Ziyan Wang, Jun Wang, Meng Fang, and Mykola Pechenizkiy

In The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models. Experimental results show that our method outperforms state-of-the-art methods and the provided visualization further demonstrates the interpretability of our method. The project page is located at https://reedzyd.github.io/GenerativeReturnDecomposition/.

@inproceedings{NEURIPS2023_402e1210,
  author = {Zhang, Yudi and Du, Yali and Huang, Biwei and Wang, Ziyan and Wang, Jun and Fang, Meng and Pechenizkiy, Mykola},
  booktitle = {The Thirty-Seventh Annual Conference on Neural Information Processing Systems (NeurIPS)},
  editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
  pages = {20208--20229},
  publisher = {Curran Associates, Inc.},
  title = {Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach},
  url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/402e12102d6ec3ea3df40ce1b23d423a-Paper-Conference.pdf},
  volume = {36},
  year = {2023}
}

2022

Sauté RL: Almost Surely Safe Reinforcement Learning Using State Augmentation Spotlight ICML'22

Aivar Sootla, Alexander Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David H Mguni, Jun Wang, and Haitham Ammar

In International Conference on Machine Learning (ICML), 2022

Satisfying safety constraints almost surely (or with probability one) can be critical for the deployment of Reinforcement Learning (RL) in real-life applications. For example, plane landing and take-off should ideally occur with probability one. We address the problem by introducing Safety Augmented (Saute) Markov Decision Processes (MDPs), where the safety constraints are eliminated by augmenting them into the state-space and reshaping the objective. We show that Saute MDP satisfies the Bellman equation and moves us closer to solving Safe RL with constraints satisfied almost surely. We argue that Saute MDP allows viewing the Safe RL problem from a different perspective enabling new features. For instance, our approach has a plug-and-play nature, i.e., any RL algorithm can be "Sauteed". Additionally, state augmentation allows for policy generalization across safety constraints. We finally show that Saute RL algorithms can outperform their state-of-the-art counterparts when constraint satisfaction is of high importance.
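
The state-augmentation idea can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it tracks a normalized remaining safety budget `z` as an extra state variable and replaces the reward with a fixed penalty once the budget is exhausted. The function name `saute_step`, the default penalty value, and the choice to test the budget after the cost is charged are assumptions made for the example.

```python
def saute_step(z, reward, cost, budget, gamma=1.0, unsafe_reward=-100.0):
    """One step of a safety-budget augmentation (illustrative sketch).

    z             -- remaining safety budget, normalized so it starts at 1.0
    reward, cost  -- task reward and safety cost returned by the environment
    budget        -- total cost allowed per episode (assumed > 0)
    gamma         -- discount applied to the safety budget (assumed in (0, 1])
    unsafe_reward -- penalty used once the budget is exhausted (assumption)
    """
    # Charge the normalized cost against the budget and undo the discount.
    z_next = (z - cost / budget) / gamma
    # Reshape the objective: keep the task reward only while the budget lasts.
    shaped_reward = reward if z_next >= 0.0 else unsafe_reward
    return z_next, shaped_reward


# Usage: the agent observes (state, z) instead of state alone.
z = 1.0
z, r = saute_step(z, reward=2.0, cost=0.5, budget=1.0)
```

Because the budget lives in the observation, any standard RL algorithm can be trained on the augmented environment without a separate constraint-handling mechanism, which is the plug-and-play property the abstract refers to.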

@inproceedings{sootla2022saute,
  title={Sauté {RL}: Almost Surely Safe Reinforcement Learning Using State Augmentation},
  author={Sootla, Aivar and Cowen-Rivers, Alexander and Jafferjee, Taher and Wang, Ziyan and Mguni, David H and Wang, Jun and Ammar, Haitham},
  booktitle={International Conference on Machine Learning},
  pages={20423--20443},
  year={2022},
  organization={PMLR}
}

2021

Multi-Agent Constrained Policy Optimisation Preprint

Shangding Gu, Jakub Kuba, Muning Wen, Ruiqing Chen, Ziyan Wang, Zheng Tian, Jun Wang, Alois Knoll, and Yaodong Yang

In arXiv preprint, 2021

Developing reinforcement learning algorithms that satisfy safety constraints is becoming increasingly important in real-world applications. In multi-agent reinforcement learning (MARL) settings, policy optimisation with safety awareness is particularly challenging because each individual agent has to not only meet its own safety constraints, but also consider those of others so that their joint behaviour can be guaranteed safe. Despite its importance, the problem of safe multi-agent learning has not been rigorously studied; very few solutions have been proposed, nor a sharable testing environment or benchmarks. To fill these gaps, in this work, we formulate the safe MARL problem as a constrained Markov game and solve it with policy optimisation methods. Our solutions -- Multi-Agent Constrained Policy Optimisation (MACPO) and MAPPO-Lagrangian -- leverage the theories from both constrained policy optimisation and multi-agent trust region learning. Crucially, our methods enjoy theoretical guarantees of both monotonic improvement in reward and satisfaction of safety constraints at every iteration. To examine the effectiveness of our methods, we develop the benchmark suite of Safe Multi-Agent MuJoCo that involves a variety of MARL baselines. Experimental results justify that MACPO/MAPPO-Lagrangian can consistently satisfy safety constraints, meanwhile achieving comparable performance to strong baselines.
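
For readers unfamiliar with Lagrangian methods, the generic mechanism behind approaches of the MAPPO-Lagrangian kind can be sketched as follows. This is a standard projected dual-ascent update rather than the paper's exact algorithm. The function names, learning rate, and the scalar advantage inputs are assumptions made for the example.

```python
def update_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """Projected dual ascent on the Lagrange multiplier.

    The multiplier grows when the measured cost exceeds the limit, shrinks
    otherwise, and is clipped at zero so it stays a valid multiplier.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def lagrangian_advantage(reward_adv, cost_adv, lam):
    """Advantage used in the policy update: reward advantage minus weighted cost advantage."""
    return reward_adv - lam * cost_adv
```

A larger multiplier makes constraint violations more expensive in the policy objective, so the policy is pushed back toward the safe region whenever the cost limit is exceeded.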

@inproceedings{Gu2021MultiAgentCP,
  title={Multi-Agent Constrained Policy Optimisation},
  author={Gu, Shangding and Kuba, Jakub and Wen, Muning and Chen, Ruiqing and Wang, Ziyan and Tian, Zheng and Wang, Jun and Knoll, Alois and Yang, Yaodong},
  booktitle={arXiv preprint},
  year={2021},
  url={https://api.semanticscholar.org/CorpusID:238407788}
}
