Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification
Abstract
The idea of conservatism has led to significant progress in offline reinforcement learning (RL), where an agent learns from a pre-collected dataset. However, it remains an open question how to resolve offline RL in the more practical multi-agent setting, as many real-world scenarios involve interaction among multiple agents. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly, when conservatism-based algorithms are applied to the multi-agent setting, the performance degrades significantly with an increasing number of agents. Towards mitigating the degradation, we identify a key issue: the landscape of the value function can be non-concave, and policy gradient improvements are prone to local optima. Multiple agents exacerbate the problem, since a suboptimal policy by any agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), to tackle this critical challenge via an effective combination of first-order policy gradient and zeroth-order optimization methods for the actor to better optimize the conservative value function. Despite its simplicity, OMAR significantly outperforms strong baselines, with state-of-the-art performance on multi-agent continuous control benchmarks.
1 Introduction
Offline reinforcement learning (RL) has shown great potential in advancing the deployment of RL in real-world tasks where interaction with the environment is prohibitive, costly, or risky [53]. Since an agent has to learn from a given pre-collected dataset in offline RL, regular online RL algorithms such as DDPG [29] and TD3 [12] struggle due to extrapolation error [28].
There has been recent progress in tackling the problem based on conservatism. Behavior regularization [56, 26], e.g., TD3 with Behavior Cloning (TD3+BC) [11], compels the learning policy to stay close to the manifold of the dataset. Yet, its performance highly depends on the quality of the dataset. Another line of research incorporates conservatism into the value function via critic regularization [37, 25], e.g., Conservative Q-Learning (CQL) [27], which learns a conservative estimate of the value function to directly address the extrapolation error.
However, many practical scenarios involve multiple agents, e.g., multi-robot control [4] and autonomous driving [41, 45]. Therefore, offline multi-agent reinforcement learning (MARL) [57, 19] is crucial for solving real-world tasks. Observing the recent success of Independent PPO [8] and Multi-Agent PPO [58], both of which are based on the PPO [49] algorithm, we find that online RL algorithms can be transferred to multi-agent scenarios through either decentralized training or a centralized value function without bells and whistles. Hence, we naturally expect that offline RL algorithms would also transfer easily to multi-agent tasks.
Surprisingly, we observe that the performance of the state-of-the-art conservatism-based CQL [27] algorithm in offline RL degrades dramatically with an increasing number of agents, as shown in Figure 1(c) in our experiments. Towards mitigating the degradation, we identify a critical issue in CQL: solely regularizing the critic is insufficient for multiple agents to learn good policies for coordination in the offline setting. The primary cause is that first-order policy gradient methods are prone to local optima [36, 14, 46], saddle points [54, 52], and noisy gradient estimates [51]. As a result, the actor cannot leverage the global information in the critic well, which can lead to uncoordinated suboptimal learning behavior. The issue is further exacerbated in multi-agent settings due to the exponentially sized joint action space [57], and because the setting requires each of the agents to learn a good policy for a successful joint policy. Consider a basketball game between two competing teams of five players each. As the ball is passed among teammates, every player must perform their role well for the team to win. If any one agent on the team fails to learn a good policy, it can fail to cooperate with the other agents for coordinated behavior and lose the ball.
In this paper, we propose a surprisingly simple yet effective method for offline multi-agent continuous control, Offline MARL with Actor Rectification (OMAR), to better leverage the conservative value function via an effective combination of first-order policy gradient and zeroth-order optimization methods. Towards this goal, we add a regularizer to the actor loss, which encourages the actor to mimic actions from a zeroth-order optimizer that maximizes Q-values, so that we can combine the best of both first-order policy gradients and zeroth-order optimization. The sampling mechanism is motivated by evolution strategies [51, 5, 35], which have recently emerged as another paradigm for solving sequential decision making tasks [47]. Specifically, the zeroth-order optimization component maintains an iteratively updated and refined Gaussian distribution to find better actions based on Q-values. We then rectify the policy towards these actions to better leverage the conservative value function. We conduct extensive experiments in standard continuous control multi-agent particle environments and a complex multi-agent locomotion task to demonstrate its effectiveness. On all the benchmark tasks, OMAR outperforms multi-agent versions of offline RL algorithms, including CQL [27] and TD3+BC [11], as well as a recent offline MARL algorithm, MAICQ [57], and achieves state-of-the-art performance.
The main contributions of this work can be summarized as follows. We propose the OMAR algorithm, which effectively leverages both first-order and zeroth-order optimization for solving offline MARL tasks. In addition, we theoretically prove that OMAR leads to safe policy improvement. Finally, extensive experimental results demonstrate the effectiveness of OMAR, which significantly outperforms strong baseline methods and achieves state-of-the-art performance on datasets of different qualities in both decentralized and centralized learning paradigms.
2 Background
We consider the framework of partially observable Markov games (POMGs) [31, 16], which extends Markov decision processes to the multi-agent setting. A POMG with N agents is defined by a set of global states S, a set of actions A_i, and a set of observations O_i for each agent i. At each timestep, each agent i receives an observation o_i and chooses an action a_i based on its policy π_i. The environment transitions to the next state according to the state transition function T. Each agent receives a reward r_i based on the reward function and a private observation o_i. The initial state distribution is denoted by ρ. The goal is to find a set of optimal policies π = {π_1, …, π_N}, where each agent aims to maximize its own discounted return Σ_t γ^t r_i^t, with γ denoting the discount factor. In the offline setting, agents learn from a fixed dataset D generated by a behavior policy π_β, without interaction with the environment.
2.1 Multi-Agent Actor-Critic
Centralized critic.
Lowe et al. [32] propose Multi-Agent Deep Deterministic Policy Gradients (MADDPG) under the centralized training with decentralized execution (CTDE) paradigm by extending the DDPG algorithm [29] to the multi-agent setting. In CTDE, agents are trained in a centralized way, with access to extra global information during training, while they must learn decentralized policies so that they can act based only on local observations during execution. In MADDPG, for an agent i, the centralized critic Q_i is parameterized by θ_i. It takes the global state and the joint action as inputs, and aims to minimize the temporal difference error (Q_i(s, a_1, …, a_N) − y_i)², where y_i = r_i + γ Q_i′(s′, a_1′, …, a_N′), each a_k′ is drawn from the target policy π_k′, and Q_i′ and π_k′ denote target networks. To reduce the overestimation problem in MADDPG, MATD3 [1] estimates the target value using double estimators based on TD3 [12], where y_i = r_i + γ min_{j=1,2} Q_{i,j}′(s′, a_1′, …, a_N′). Agents learn decentralized policies π_i parameterized by φ_i, which take only local observations as inputs and are trained by multi-agent policy gradients ∇_{φ_i} J(π_i) = E[∇_{φ_i} π_i(o_i) ∇_{a_i} Q_i(s, a_1, …, a_N)|_{a_i = π_i(o_i)}], where a_i is predicted from the agent's own policy while the other agents' actions are sampled from the replay buffer.
Decentralized critic.
Although using centralized critics is widely adopted in multi-agent actor-critic methods, it introduces scalability issues due to the exponentially sized joint action space w.r.t. the number of agents [17]. On the other hand, independent learning approaches train decentralized critics that take only the local observation and action as inputs. It is shown in de Witt et al. [8], Lyu et al. [34] that decentralized value functions can result in more robust performance and be beneficial in practice compared with centralized critic approaches. de Witt et al. [8] propose Independent Proximal Policy Optimization (IPPO) based on PPO [49], and show that it can match or even outperform CTDE approaches in challenging discrete control benchmark tasks [48]. We can also obtain an Independent TD3 (ITD3) algorithm based on decentralized critics, which is trained to minimize the temporal difference error (Q_i(o_i, a_i) − y_i)², where y_i = r_i + γ min_{j=1,2} Q_{i,j}′(o_i′, π_i′(o_i′)).
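As an illustration, the clipped double-estimator target above can be sketched as follows; the function and argument names are our own, and in practice the critics and policy are neural networks rather than plain callables.

```python
def itd3_target(r, o_next, gamma, pi_target, q1_target, q2_target):
    """TD target for a decentralized (independent) TD3 critic.

    Illustrative sketch: `pi_target` maps a local observation to an action;
    `q1_target`/`q2_target` map (observation, action) to value estimates.
    """
    a_next = pi_target(o_next)
    # Taking the minimum over the two target critics curbs overestimation.
    return r + gamma * min(q1_target(o_next, a_next), q2_target(o_next, a_next))
```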
2.2 Conservative Q-Learning
Conservative Q-Learning (CQL) [27] adds a regularizer to the critic loss to address the extrapolation error and learns lower-bounded Q-values. It penalizes Q-values of state-action pairs sampled from a uniform distribution or a policy while encouraging Q-values for state-action pairs in the dataset to be large. Specifically, when built upon decentralized critic methods in MARL, the critic loss is defined as in Eq. (1), where α denotes the regularization coefficient and π̂_{β_i} is the empirical behavior policy of agent i.
L(Q_i) = α E_{o_i∼D}[ log Σ_{a_i} exp(Q_i(o_i, a_i)) − E_{a_i∼π̂_{β_i}(·|o_i)}[ Q_i(o_i, a_i) ] ] + E_{(o_i, a_i, r_i, o_i′)∼D}[ ( Q_i(o_i, a_i) − y_i )² ]   (1)
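A minimal single-observation sketch of the penalty term, assuming Q-values over a finite set of sampled actions (names are illustrative; the TD error term of Eq. (1) is omitted):

```python
import numpy as np

def cql_penalty(q_sampled, q_dataset, alpha):
    """CQL-style regularizer for one observation of agent i.

    q_sampled: Q-values of actions sampled from a uniform/current policy, shape (K,)
    q_dataset: Q-value of the action actually stored in the dataset (scalar)
    The log-sum-exp pushes down Q-values of out-of-distribution actions,
    while subtracting q_dataset pushes up Q-values of in-dataset actions.
    """
    # Numerically stabilized log-sum-exp over the sampled actions.
    m = q_sampled.max()
    lse = m + np.log(np.exp(q_sampled - m).sum())
    return alpha * (lse - q_dataset)
```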
3 Proposed Method
In this section, we first provide a motivating example where previous methods, such as CQL [27] and TD3+BC [11], can be inefficient in the multi-agent setting. Then, we propose a method called Offline Multi-Agent Reinforcement Learning with Actor Rectification (OMAR), where we effectively combine first-order policy gradients and zeroth-order optimization methods for the actor to better optimize the conservative value function.
3.1 The Motivating Example
We design a Spread environment, as shown in Figure 1(a), which involves n agents and n landmarks with a 1-dimensional action space to demonstrate the problem and reveal interesting findings. In the Spread task, agents need to learn how to cooperate to cover all landmarks and avoid colliding with each other or arriving at the same landmark by coordinating their actions. The experimental setup is the same as in Section 4.1.1.
Figure 1(b) demonstrates the performance of the multi-agent versions of TD3+BC [11], CQL [27], and OMAR, all based on ITD3, on the medium-replay dataset from the two-agent Spread environment. As MATD3+BC is based on behavior regularization, which compels the learned policy to stay close to the behavior policy, its performance largely depends on the quality of the dataset. Moreover, regularizing policies towards the dataset can be detrimental in multi-agent settings due to decentralized training and the resulting partial observations. MACQL, which pushes down Q-values of state-action pairs sampled from a random or the current policy while pushing up Q-values of state-action pairs in the dataset, outperforms MATD3+BC.
Figure 1(c) demonstrates the performance improvement percentage of MACQL over the behavior policy with an increasing number of agents, ranging from one to five. From Figure 1(c), we observe that its performance degrades dramatically as there are more agents.¹

¹We also investigate the performance of MACQL in a non-cooperative version of the Spread task, which does not require coordination, in Appendix B.3; there, performance does not degrade with an increasing number of agents.
Towards mitigating the performance degradation, we identify a key issue in MACQL: solely regularizing the critic is insufficient for multiple agents to learn good policies for coordination. In Figure 1(d), we visualize the Q-function landscape of MACQL during training for an agent at a timestep, with the red circle corresponding to the action predicted by the actor. The green triangle represents the action predicted by the actor after the training step, where the policy gets stuck in a bad local optimum. First-order policy gradient methods are prone to local optima [6, 3], so the agent can fail to leverage the conservative value function globally, leading to suboptimal, uncoordinated learning behavior. Note that the problem is further exacerbated in the offline multi-agent setting due to the exponentially sized joint action space w.r.t. the number of agents [57]. In addition, solving the task usually requires each of the agents to learn a good policy for coordination, and a suboptimal policy by any agent can result in uncoordinated global failure.
Tables 1 and 2 show the performance of MACQL when increasing the learning rate or the number of updates for the actor. The results illustrate that, to solve this challenging problem, we need a better solution than blindly tuning hyperparameters. In the next section, we introduce how we tackle this problem by combining zeroth-order optimization with current RL algorithms.
Table 1: Performance of MACQL with different learning rates for the actor.
Table 2: Performance of MACQL with different numbers of actor updates.
3.2 Offline Multi-Agent Reinforcement Learning with Actor Rectification
As identified above, policy gradient improvements are prone to local optima given a bad value function landscape. It is important to note that this presents a particularly critical challenge in the multi-agent setting, which is sensitive to suboptimal actions. Zeroth-order optimization methods, e.g., evolution strategies [44, 51, 5, 47, 35], offer an alternative for policy optimization and are also robust to local optima [44].
We propose Offline Multi-Agent Reinforcement Learning with Actor Rectification (OMAR), which incorporates sampled actions based on Q-values to rectify the actor so that it can escape from bad local optima. For simplicity of presentation, we describe our method based on the decentralized training paradigm introduced in Section 2.1. Note that it can also be applied to centralized critics, as shown in Section 4.1.4. Specifically, we add a regularizer to the policy objective:
J(π_i) = E_{o_i∼D}[ (1 − λ) Q_i(o_i, π_i(o_i)) − λ ( π_i(o_i) − â_i )² ]   (2)
where â_i is the action provided by the zeroth-order optimizer and λ denotes the regularization coefficient. Note that TD3+BC [11] uses the action seen in the dataset for â_i. This distinction between optimized and seen actions enables OMAR to perform well even when the dataset quality is mediocre or low.
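Concretely, a per-sample version of this regularized actor objective (to be minimized) might look as follows; we assume a convex combination weighted by λ, and all names here are illustrative:

```python
import numpy as np

def omar_actor_loss(q_value, action, rectified_action, lam):
    """Per-sample actor loss combining the policy gradient signal with a
    term that pulls the policy's action towards the rectified action
    proposed by the zeroth-order optimizer.

    lam in [0, 1] trades off exploiting the critic (lam = 0) against
    purely imitating the rectified action (lam = 1).
    """
    rectification_term = np.sum((action - rectified_action) ** 2)
    # Negate the Q-value because we minimize the loss.
    return -(1.0 - lam) * q_value + lam * rectification_term
```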
We borrow intuition for sampling actions from recent evolution strategy (ES) algorithms, which point to a promising avenue for using zeroth-order methods for policy optimization. For example, the cross-entropy method (CEM) [44], a popular ES algorithm, has shown great potential in RL [30], especially by sampling in the parameter space of the actor [42]. However, CEM does not scale well to tasks with high-dimensional spaces [38]. We therefore propose to sample actions in a softer way motivated by Williams et al. [55], Lowrey et al. [33]. Specifically, we sample actions according to an iteratively refined Gaussian distribution N(μ_i, σ_i). At each iteration, we draw K candidate actions from N(μ_i, σ_i) and evaluate their Q-values. The mean and standard deviation of the sampling distribution are updated and refined by Eq. (3), which produces a softer update and leverages more samples in the update [38]. The OMAR algorithm is shown in Algorithm 1.
μ_i ← ( Σ_{k=1}^{K} exp(τ Q_i(o_i, a_i^k)) a_i^k ) / ( Σ_{k=1}^{K} exp(τ Q_i(o_i, a_i^k)) ),   σ_i ← ( ( Σ_{k=1}^{K} exp(τ Q_i(o_i, a_i^k)) ( a_i^k − μ_i )² ) / ( Σ_{k=1}^{K} exp(τ Q_i(o_i, a_i^k)) ) )^{1/2}   (3)
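The soft, iteratively refined sampler described above can be sketched as a small NumPy routine; the iteration count, sample size, temperature τ, and all names here are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def rectify_action(q_fn, act_dim, iters=3, n_samples=32, tau=1.0, seed=0):
    """Iteratively refined Gaussian sampler in the spirit of Eq. (3).

    Unlike top-k CEM, every sampled action contributes to the update,
    weighted by exp(tau * Q), which gives the 'softer' update in the text.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(iters):
        actions = rng.normal(mu, sigma, size=(n_samples, act_dim))
        q = np.array([q_fn(a) for a in actions])
        w = np.exp(tau * (q - q.max()))      # stabilized exponential weights
        w /= w.sum()
        mu = (w[:, None] * actions).sum(axis=0)
        var = (w[:, None] * (actions - mu) ** 2).sum(axis=0)
        sigma = np.sqrt(var) + 1e-6          # keep the distribution non-degenerate
    return mu
```

For a concave Q-landscape the mean converges towards the maximizer; in OMAR the returned action plays the role of â_i in Eq. (2).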
Besides the algorithmic design, we also prove that OMAR gives a safe policy improvement guarantee. Let J(π, M̂) denote the discounted return of a policy π in the empirical MDP M̂ induced by the transitions in the dataset D. In Theorem 1, we give a lower bound on the difference between the policy performance of OMAR and that of the empirical behavior policy π̂_β in the empirical MDP M̂. The proof can be found in Appendix A.
Theorem 1.
Let π̂ be the policy obtained by optimizing Eq. (2). Then, J(π̂, M̂) − J(π̂_β, M̂) is bounded from below.
As shown in Theorem 1, the difference between the second and third terms on the right-hand side is the difference between two expected distances. The former corresponds to the gap between the optimal action and the action from our zeroth-order optimizer, while the latter corresponds to the gap between the action from the behavior policy and the optimized action. Since both terms can be bounded, OMAR gives a safe policy improvement guarantee over π̂_β.
3.2.1 Discussion of the Effect of OMAR in the Spread Environment
Can OMAR Address the Identified Problem?
We investigate whether OMAR can address the identified problem and analyze its effect in the Spread environment introduced in Section 3.1. In Figure 1(d), the blue square corresponds to the action from the actor updated using OMAR according to Eq. (2). In contrast to the policy update in MACQL, OMAR can better leverage the global information in the critic and help the actor escape from bad local optima. Figure 1(b) further validates that OMAR significantly improves MACQL in terms of both performance and efficiency. Figure 2 shows the performance improvement percentage of OMAR over MACQL with varying numbers of agents, where OMAR always outperforms MACQL. We also notice that the performance improvement of OMAR over MACQL is much more significant in the multi-agent setting of the Spread task than in the single-agent setting, which echoes the discussion above that the problem becomes more critical in scenarios with more agents, where each of the agents must learn a good policy to cooperate in solving the task.
Is OMAR Effective in Online/Offline, Multi-Agent/Single-Agent Settings?
We investigate the effectiveness of OMAR in the following four settings in the Spread environment shown in Figure 1(a): i) the online multi-agent setting, ii) the online single-agent setting, iii) the offline single-agent setting, and iv) the offline multi-agent setting.
For the online setting, we build our method upon the MATD3 algorithm with our proposed policy objective in Eq. (2), and evaluate the performance improvement percentage of our method over MATD3. The results for the online setting are shown in the right part of Figure 3, where the x-axis corresponds to the performance improvement percentage and the y-axis corresponds to the number of agents, indicating whether it is the single-agent or multi-agent setting. For the offline setting, we reproduce the results from Figure 2, which show the performance improvement percentage of OMAR over MACQL, in the left part of Figure 3 for a better understanding of the effectiveness of our method in different settings.
As shown in Figure 3, our method is generally applicable in all the settings. However, the performance improvement is much more significant in the offline setting (left part) than in the online case (right part), because the agents cannot explore and interact with the environment. Intuitively, in the online setting, if the actor has not well exploited the global information in the value function, it can still interact with the environment to collect better experiences that improve the estimation of the value function and provide better guidance for the policy. However, no exploration or interaction with the environment for new data collection is allowed in the offline setting. Thus, it is much harder for an agent to escape from a bad local optimum, and difficult for the actor to best leverage the global information in the critic. This is even more challenging in MARL because of the exponentially sized joint action space and the need for a coordinated joint policy. As expected, we also find that the performance gain is more significant in the offline multi-agent domain, which requires each of the agents to learn a good policy for a successful joint policy; otherwise, it can lead to an uncoordinated global failure.
4 Experiments
In this section, we conduct a series of experiments to study the following key questions: i) How does OMAR compare against state-of-the-art offline RL and offline MARL methods? ii) What is the effect of critical hyperparameters and the sampling scheme? iii) Does the method help in both centralized and decentralized training paradigms? iv) Can OMAR scale to more complex continuous multi-agent locomotion tasks and the discrete control StarCraft II micromanagement benchmark?
4.1 Multi-Agent Particle Environments
4.1.1 Experimental Setup
We first conduct a series of experiments in the widely adopted multi-agent particle tasks [32], as shown in Figure 7 in Appendix B.1. The cooperative navigation task includes agents and an equal number of landmarks, where agents are rewarded based on the distance to the landmarks and penalized for colliding with each other. Thus, it is important for agents to cooperate to cover all landmarks without collision. In predator-prey, predators aim to catch the prey. The predators need to cooperate to surround and catch the prey, as the predators are slower than the prey. The world task involves slower cooperating agents that aim to catch faster adversaries, where adversaries desire to eat food while avoiding being captured.
We construct a variety of datasets using behavior policies of different qualities, adding noise to the MATD3 algorithm to increase diversity following previous work [10]. The random dataset is generated by rolling out a randomly initialized policy for 1 million (M) steps. We obtain the medium-replay dataset by recording all samples in the experience replay buffer during training until the policy reaches a medium level of performance. The medium dataset consists of 1M samples obtained by unrolling a partially pre-trained policy whose performance reaches a medium level. The expert dataset is constructed from 1M expert demonstrations from an online fully trained policy.
We compare OMAR against state-of-the-art offline RL algorithms, including CQL [27] and TD3+BC [11]. We also compare with a recent offline MARL algorithm, MAICQ [57]. We build all methods on independent TD3 with decentralized critics following de Witt et al. [8], and we also consider centralized critics based on MATD3 following Yu et al. [58] in Section 4.1.4. All baselines are implemented based on the open-source code.² Each algorithm is run for five random seeds, and we report the mean performance with standard deviation. A detailed description of the construction of the datasets and hyperparameters can be found in Appendix B.1.

²https://github.com/shariqiqbal2810/maddpg-pytorch
4.1.2 Performance Comparison
Table 3 summarizes the average normalized scores on different datasets in the multi-agent particle environments; learning curves are shown in Appendix B.2. The normalized score is computed as 100 × (S − S_random) / (S_expert − S_random), following Fu et al. [10]. As shown, the performance of MATD3+BC highly depends on the quality of the dataset. As MAICQ only trusts seen state-action pairs in the dataset, it does not perform well on datasets with more diverse data distributions, including the random and medium-replay datasets, while it generally matches the performance of MATD3+BC on datasets with narrower distributions, including medium and expert. MACQL matches or outperforms MATD3+BC on the lower-quality datasets, except for the expert dataset, as it does not rely on constraining the learning policy to stay close to the behavior policy. Our OMAR method significantly outperforms all baseline methods and achieves state-of-the-art performance. We attribute the performance gain to the actor rectification scheme, which is independent of data quality and improves global optimization. In addition, OMAR incurs only a small additional runtime on average compared with MACQL.
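For reference, this normalization (following the convention of Fu et al. [10]) maps raw scores so that the random policy scores 0 and the expert scores 100:

```python
def normalized_score(score, random_score, expert_score):
    """D4RL-style normalization: 0 corresponds to the random policy,
    100 to the expert policy; scores outside [0, 100] are possible."""
    return 100.0 * (score - random_score) / (expert_score - random_score)
```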
Table 3: Average normalized scores of MAICQ, MATD3+BC, MACQL, and OMAR on the random, medium-replay, medium, and expert datasets of the cooperative navigation, predator-prey, and world tasks.
4.1.3 Ablation Study
The effect of the regularization coefficient.
We first investigate the effect of the regularization coefficient λ in the actor loss in Eq. (2). Figure 4 shows the average normalized score of OMAR over different tasks with different values of λ on each kind of dataset. As shown, the performance of OMAR is sensitive to this hyperparameter, which controls how much the actor exploits the critic. We find that the best value of λ is neither close to 0 nor to 1, showing that it is the combination of policy gradients and actor rectification that performs well. We also notice that the optimal value of λ is smaller for datasets with lower quality and more diverse data distributions, including random and medium-replay, but larger for the medium and expert datasets. In addition, the performance of OMAR with all values of λ matches or outperforms that of MACQL. This is the only hyperparameter that needs to be tuned in OMAR beyond MACQL.
The effect of key hyperparameters in the sampling scheme.
Core hyperparameters of our sampling mechanism are the number of iterations, the number of sampled actions, and the initial mean and standard deviation of the Gaussian distribution. Figures 5(a)-(d) show the performance of OMAR with different values of these hyperparameters in the cooperative navigation task, where the grey dotted line corresponds to the normalized score of MACQL. As shown, our sampling mechanism is not sensitive to these hyperparameters, and we fix them to the best-performing set.
The effect of the sampling mechanism.
We now analyze the effect of the zeroth-order optimization method in OMAR, and compare it against random shooting and the cross-entropy method (CEM) [7] in the cooperative navigation task. As shown in Table 4, our sampling mechanism significantly outperforms the random sampling scheme and CEM, with a larger margin on datasets with lower quality, including random and medium-replay, as the proposed sampling technique incorporates more samples into the distribution updates more effectively.
Table 4: Average normalized scores of OMAR (random shooting), OMAR (CEM), and OMAR on the random, medium-replay, medium, and expert datasets of cooperative navigation.
We also investigate the effect of the size of the dataset on the performance of OMAR in Appendix B.5.
4.1.4 Applicability on Centralized Training with Decentralized Execution
In this section, we demonstrate the versatility of the method and show that it can also be applied to, and benefit, methods based on centralized critics under the CTDE paradigm. Specifically, all baseline methods are built upon the MATD3 algorithm [1] using centralized critics, as detailed in Section 2.1.³ Table 5 summarizes the average normalized score of different algorithms on each kind of dataset. As shown, OMAR (centralized) also significantly outperforms MAICQ (centralized) and MACQL (centralized), and matches the performance of MATD3+BC (centralized) on the expert dataset while outperforming it on the other datasets.

³A performance comparison of a centralized value function and a decentralized one can be found in Appendix B.6.
Table 5: Average normalized scores of centralized MAICQ, MATD3+BC, MACQL, and OMAR on the random, medium-replay, medium, and expert datasets.
4.2 Multi-Agent MuJoCo
In this section, we investigate whether OMAR can scale to more complex continuous control multi-agent tasks. Peng et al. [40] introduce multi-agent locomotion tasks that extend the high-dimensional MuJoCo locomotion tasks from the single-agent setting to the multi-agent case. We consider the two-agent HalfCheetah task [23], shown in Appendix B.1, where the two agents control different subsets of the robot's joints. The agents need to cooperate to make the robot run forward by coordinating their actions. We also construct different types of datasets following Fu et al. [10], in the same way as in Section 4.1.1. Table 6 summarizes the average normalized scores on each kind of dataset in multi-agent HalfCheetah. As shown, OMAR significantly outperforms the baseline methods on the random, medium-replay, and medium datasets, and matches the performance of MATD3+BC on expert, demonstrating its ability to scale to more complex control tasks.
Table 6: Average normalized scores of MAICQ, MATD3+BC, MACQL, and OMAR on the random, medium-replay, medium, and expert datasets of two-agent HalfCheetah.
4.3 StarCraft II Micromanagement Benchmark
In this section, we investigate the effectiveness of OMAR in larger-scale tasks based on the challenging StarCraft II micromanagement benchmark [48], on maps with increasing numbers of agents and difficulties, including 2s3z, 3s5z, 1c3s5z, and 2c_vs_64zg. Details of the tasks are shown in Table 7 in Appendix B.1. We compare OMAR and MACQL based on the evaluation protocol in Kumar et al. [27], Agarwal et al. [2], Gulcehre et al. [15], where datasets are constructed following Agarwal et al. [2], Gulcehre et al. [15] by recording samples observed during training. Each dataset consists of 1 million samples. We use the Gumbel-Softmax reparameterization trick [18] to generate discrete actions for MATD3, since it requires differentiable policies [32, 17, 40]. All implementations are based on open-sourced code⁴ and use the same experimental setup as in Appendix B.1.

⁴https://github.com/oxwhirl/comix
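As a sketch of how relaxed discrete actions can be obtained, a minimal Gumbel-Softmax sample looks as follows (the straight-through variant typically used in practice, and the temperature value, are assumptions not fixed by the text):

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Draw a relaxed (continuous) sample over discrete actions.

    Adding Gumbel noise to the logits and applying a temperature-scaled
    softmax yields a differentiable approximation to sampling a one-hot
    action, which deterministic-policy methods such as MATD3 require.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=logits.shape)
    gumbel = -np.log(-np.log(u))         # standard Gumbel(0, 1) noise
    z = (logits + gumbel) / temperature
    z = z - z.max()                      # numerical stability
    probs = np.exp(z)
    return probs / probs.sum()
```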
Figure 6 demonstrates the comparison results in terms of test win rates. As shown, OMAR significantly outperforms MACQL in performance and learning efficiency, with a consistent average performance gain over MACQL across all tested maps.
5 Related Work
Offline reinforcement learning.
Many recent papers improve offline RL [56, 26, 59, 22, 25] by addressing the extrapolation error. Behavior regularization typically compels the learning policy to stay close to the behavior policy; yet, its performance relies heavily on the quality of the dataset. Critic regularization approaches typically add a regularizer to the critic loss that pushes down Q-values for actions sampled from a given policy [27]. As discussed above, it can be difficult for the actor to best leverage the global information in the critic, as policy gradient methods are prone to local optima, which is particularly important in the offline multi-agent setting.
Multi-agent reinforcement learning.
A number of multi-agent policy gradient algorithms train agents based on centralized value functions [32, 9, 43, 58, 39], while another line of research focuses on decentralized training [8]. Yang et al. [57] show that the extrapolation error in offline RL can be more severe in the multi-agent setting than in the single-agent case due to the exponentially sized joint action space w.r.t. the number of agents. In addition, a critical challenge arises in the decentralized setting when the dataset for each agent only contains its own actions instead of the joint actions [19]. Jiang and Lu [19] address these challenges by building on the behavior-regularization algorithm BCQ [13], while Yang et al. [57] propose to estimate the target value based on the next action from the dataset. As a result, both methods largely depend on the quality of the dataset.
Zeroth-order optimization methods.
Evolution strategies (ES) have recently emerged as another paradigm for continuous control [51, 5, 35]. Recent research shows that it is promising to combine RL with ES to reap the best of both worlds [21, 42] in the high-dimensional parameter space of the actor. Sun et al. [52] replace the policy gradient update with supervised learning based on noise sampled via random shooting. Kalashnikov et al. [20], Lim et al. [30], Simmons-Edler et al. [50], Peng et al. [40] extend Q-learning-based approaches to handle continuous action spaces using the popular cross-entropy method (CEM) from ES.
6 Conclusion
In this paper, we identify that when conservatism-based RL algorithms are extended to offline multi-agent scenarios, their performance degrades significantly as the number of agents increases. To tackle this problem, we propose Offline Multi-Agent RL with Actor Rectification (OMAR), which combines the first-order policy gradient with zeroth-order optimization. We find that OMAR successfully helps the actor escape from bad local optima and consequently find better actions. Empirically, OMAR achieves state-of-the-art performance on multi-agent continuous control tasks.
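To make the combination concrete, the following toy, state-less sketch illustrates the idea (this is our own illustrative reconstruction with assumed constants such as tau, not the exact OMAR objective or implementation): a zeroth-order search proposes a good action under a multimodal critic, and the actor is pulled toward that proposal by a regression term alongside the usual first-order gradient, which lets it escape a bad local optimum.

```python
import random

def q(a):
    """Toy conservative critic with a bad local optimum at a = -0.5
    (value -0.3) and the global optimum at a = 0.8 (value 0)."""
    return max(-0.3 - (a + 0.5) ** 2, -((a - 0.8) ** 2))

def grad(f, x, eps=1e-4):
    """Finite-difference gradient, for illustration only."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

rng = random.Random(0)
theta = -0.5   # actor starts exactly at the bad local optimum
tau = 0.7      # weight on the rectification term (assumed value)
for _ in range(200):
    # Zeroth-order proposal: best of a handful of sampled candidate actions.
    a_hat = max((rng.uniform(-2.0, 2.0) for _ in range(32)), key=q)
    # Combined loss: first-order term on Q plus regression toward a_hat.
    loss = lambda th: -(1.0 - tau) * q(th) + tau * (th - a_hat) ** 2
    theta -= 0.05 * grad(loss, theta)

print(abs(theta - 0.8) < 0.1)  # the actor escaped the local optimum
```

Pure gradient ascent on q from theta = -0.5 would stall at the local optimum; the regression term supplies the global signal found by sampling.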
Acknowledgements
We thank Bei Peng for help with the results of MATD3 in the StarCraft II micromanagement benchmark. We also thank Qingpeng Cai, Kefan Dong, Colin Wei, Yuping Luo, and Jeff Z. Haochen for insightful discussions. The work of Ling Pan and Longbo Huang is supported in part by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grants 2020AAA0108400 and 2020AAA0108403. Ling Pan is supported by a Microsoft Research Asia Fellowship. TM acknowledges the support of a Google Faculty Award, NSF IIS 2045685, the Sloan Fellowship, and JD.com.
References
 Ackermann et al. [2019] Johannes Ackermann, Volker Gabler, Takayuki Osa, and Masashi Sugiyama. Reducing overestimation bias in multiagent domains using double centralized critics. arXiv preprint arXiv:1910.01465, 2019.
 Agarwal et al. [2020] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
 Ahmed et al. [2019] Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pages 151–160. PMLR, 2019.
 Amato [2018] Christopher Amato. Decisionmaking under uncertainty in multiagent and multirobot systems: Planning and learning. In IJCAI, pages 5662–5666, 2018.
 Conti et al. [2017] Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of noveltyseeking agents. arXiv preprint arXiv:1712.06560, 2017.
 Dauphin et al. [2014] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
 De Boer et al. [2005] PieterTjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the crossentropy method. Annals of operations research, 134(1):19–67, 2005.
 de Witt et al. [2020] Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multiagent challenge? arXiv preprint arXiv:2011.09533, 2020.
 Foerster et al. [2018] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multiagent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
 Fu et al. [2020] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep datadriven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
 Fujimoto and Gu [2021] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860, 2021.
 Fujimoto et al. [2018] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
 Fujimoto et al. [2019] Scott Fujimoto, David Meger, and Doina Precup. Offpolicy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.
 Ge et al. [2017] Rong Ge, Jason D Lee, and Tengyu Ma. Learning onehiddenlayer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
 Gulcehre et al. [2020] Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, et al. Rl unplugged: Benchmarks for offline reinforcement learning. arXiv preprint arXiv:2006.13888, 2020.
 Hu et al. [1998] Junling Hu, Michael P Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250. Citeseer, 1998.
 Iqbal and Sha [2019] Shariq Iqbal and Fei Sha. Actorattentioncritic for multiagent reinforcement learning. In International Conference on Machine Learning, pages 2961–2970. PMLR, 2019.
 Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.
 Jiang and Lu [2021] Jiechuan Jiang and Zongqing Lu. Offline decentralized multiagent reinforcement learning. arXiv preprint arXiv:2108.01832, 2021.
 Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qtopt: Scalable deep reinforcement learning for visionbased robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
 Khadka and Tumer [2018] Shauharda Khadka and Kagan Tumer. Evolutionguided policy gradient in reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 1196–1208, 2018.
 Kidambi et al. [2020] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Modelbased offline reinforcement learning. arXiv preprint arXiv:2005.05951, 2020.
 Kim et al. [2021] Dong Ki Kim, Miao Liu, Matthew D Riemer, Chuangchuang Sun, Marwa Abdulhai, Golnaz Habibi, Sebastian LopezCot, Gerald Tesauro, and Jonathan How. A policy gradient algorithm for learning to learn in multiagent reinforcement learning. In International Conference on Machine Learning, pages 5541–5550. PMLR, 2021.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kostrikov et al. [2021] Ilya Kostrikov, Rob Fergus, Jonathan Tompson, and Ofir Nachum. Offline reinforcement learning with fisher divergence critic regularization. In International Conference on Machine Learning, pages 5774–5783. PMLR, 2021.
 Kumar et al. [2019] Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing offpolicy qlearning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949, 2019.
 Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative qlearning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 1179–1191, 2020.
 Lee et al. [2021] Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offlinetoonline reinforcement learning via balanced replay and pessimistic qensemble. arXiv preprint arXiv:2107.00591, 2021.
 Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.
 Lim et al. [2018] Sungsu Lim, Ajin Joseph, Lei Le, Yangchen Pan, and Martha White. Actorexpert: A framework for using qlearning in continuous action spaces. arXiv preprint arXiv:1810.09103, 2018.
 Littman [1994] Michael L Littman. Markov games as a framework for multiagent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
 Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multiagent actorcritic for mixed cooperativecompetitive environments. Advances in Neural Information Processing Systems, 30:6379–6390, 2017.
 Lowrey et al. [2018] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan online, learn offline: Efficient learning and exploration via modelbased control. arXiv preprint arXiv:1811.01848, 2018.
 Lyu et al. [2021] Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multiagent reinforcement learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 844–852, 2021.
 Mania et al. [2018] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
 Nachum et al. [2016] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring underappreciated rewards. arXiv preprint arXiv:1611.09321, 2016.
 Nachum et al. [2019] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
 Nagabandi et al. [2020] Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pages 1101–1112. PMLR, 2020.
 Pan et al. [2021] Ling Pan, Tabish Rashid, Bei Peng, Longbo Huang, and Shimon Whiteson. Regularized softmax deep multiagent qlearning. In ThirtyFifth Conference on Neural Information Processing Systems, 2021.
 Peng et al. [2020] Bei Peng, Tabish Rashid, Christian A Schroeder de Witt, PierreAlexandre Kamienny, Philip HS Torr, Wendelin Böhmer, and Shimon Whiteson. Facmac: Factored multiagent centralised policy gradients. arXiv preprint arXiv:2003.06709, 2020.
 Pomerleau [1989] Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, 1989.
 Pourchot and Sigaud [2019] Pourchot and Sigaud. CEMRL: Combining evolutionary and gradientbased methods for policy search. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BkeU5j0ctQ.
 Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multiagent reinforcement learning. In International Conference on Machine Learning, pages 4295–4304. PMLR, 2018.
 Rubinstein and Kroese [2013] Reuven Y Rubinstein and Dirk P Kroese. The crossentropy method: a unified approach to combinatorial optimization, MonteCarlo simulation and machine learning. Springer Science & Business Media, 2013.
 Sadigh et al. [2016] Dorsa Sadigh, Shankar Sastry, Sanjit A Seshia, and Anca D Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and Systems, volume 2, pages 1–9. Ann Arbor, MI, USA, 2016.
 Safran and Shamir [2017] Itay Safran and Ohad Shamir. Spurious local minima are common in twolayer relu neural networks. arXiv preprint arXiv:1712.08968, 2017.
 Salimans et al. [2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
 Samvelyan et al. [2019] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, ChiaMan Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multiagent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2186–2188, 2019.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 SimmonsEdler et al. [2019] Riley SimmonsEdler, Ben Eisner, Eric Mitchell, Sebastian Seung, and Daniel Lee. Qlearning for continuous actions with crossentropy guided policies. arXiv preprint arXiv:1903.10605, 2019.
 Such et al. [2017] Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
 Sun et al. [2020] Hao Sun, Ziping Xu, Yuhang Song, Meng Fang, Jiechao Xiong, Bo Dai, Zhengyou Zhang, and Bolei Zhou. Zerothorder supervised policy improvement. arXiv preprint arXiv:2006.06600, 2020.
 Thomas [2015] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
 VlatakisGkaragkounis et al. [2019] EmmanouilVasileios VlatakisGkaragkounis, Lampros Flokas, and Georgios Piliouras. Efficiently avoiding saddle points with zero order methods: No gradients required. In Advances in Neural Information Processing Systems, volume 32, 2019.
 Williams et al. [2015] Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149, 2015.
 Wu et al. [2019] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
 Yang et al. [2021] Yiqin Yang, Xiaoteng Ma, Chenghao Li, Zewu Zheng, Qiyuan Zhang, Gao Huang, Jun Yang, and Qianchuan Zhao. Believe what you see: Implicit constraint approach for offline multiagent reinforcement learning. arXiv preprint arXiv:2106.03400, 2021.
 Yu et al. [2021] Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of mappo in cooperative, multiagent games. arXiv preprint arXiv:2103.01955, 2021.
 Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Modelbased offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
Appendix A Proof of Theorem 1
Theorem 1. Let be the policy obtained by optimizing Eq. (2). Then, we have that
Proof.
For OMAR, we have the following iterative update for agent :
(4) 
where if and only if
Let be the fixed point of the iterative update in Eq. (4), obtained by setting the derivative of Eq. (4) with respect to to zero; then we have that
(5) 
where is the indicator function.
Denote , and we obtain the difference between the value function and the original value function as:
(6) 
Then, the policy that minimizes the loss function defined in Eq. (2) is equivalently obtained by maximizing
(7) 
Appendix B More Details of the Experiments
B.1 Experimental Setup
Tasks.
We adopt the open-source implementations of the multi-agent particle environments (https://github.com/openai/multiagent-particle-envs) from [32] and Multi-Agent MuJoCo (https://github.com/schroederdewitt/multiagent_mujoco) from [40]. Figure 7 illustrates the tasks. The expert and random scores for cooperative navigation, predator-prey, world, and two-agent HalfCheetah are , , , and , respectively.
Name  Agents  Enemies
2s3z  2 Stalkers and 3 Zealots  2 Stalkers and 3 Zealots
3s5z  3 Stalkers and 5 Zealots  3 Stalkers and 5 Zealots
1c3s5z  1 Colossus, 3 Stalkers and 5 Zealots  1 Colossus, 3 Stalkers and 5 Zealots
2c_vs_64zg  2 Colossi  64 Zerglings
Baselines.
All baseline methods are implemented on top of an open-source MADDPG implementation (https://github.com/shariqiqbal2810/maddpg-pytorch) from [17], and we implement MATD3+BC (https://github.com/sfujim/TD3_BC), MA-CQL (https://github.com/aviralkumar2907/CQL), and MA-ICQ (https://github.com/YiqinYang/ICQ) based on the authors' open-source implementations with fine-tuned hyperparameters. For MA-CQL, we tune the best critic regularization coefficient for each task following [27]. Specifically, we use a discount factor of . We sample a minibatch of samples from the dataset to update each agent's actor and critic using the Adam [24] optimizer with a learning rate of . The target networks for the actor and critic are soft-updated with an update rate of . Both the actor and critic are feedforward networks with two hidden layers of neurons each and ReLU activations. For OMAR, the only hyperparameter that requires tuning is the regularization coefficient : we use a smaller value of for the more diverse data distributions in random and medium-replay, and larger values of and for the narrower data distributions in medium and expert, respectively. As OMAR is insensitive to the hyperparameters of the sampling mechanism, we fix them across all types of datasets and tasks: the number of iterations is , the number of samples is , the mean is , and the standard deviation is . The code will be released upon publication of the paper.
B.2 Learning Curves
Figure 8 shows the learning curves of MA-ICQ, MATD3+BC, MA-CQL, and OMAR on the different types of datasets in the multi-agent particle environments, where the solid lines and shaded regions denote means and standard deviations, respectively.
B.3 Analysis of how cooperation affects the performance of CQL in multi-agent tasks
We consider a non-cooperative version of the Spread task in Figure 1(a), which involves agents and landmarks, where each agent aims to navigate to its own unique target landmark. In contrast to the Spread task, which requires cooperation, the reward function for each agent depends only on its distance to its own target landmark. This variant of Spread thus consists of multiple independent learning agents, and performance is measured by the average return over all agents.
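The reward structure of this variant can be sketched concretely (an illustrative snippet with made-up positions, not part of our implementation): each agent's reward is simply the negative distance to its own target, with no term coupling the agents.

```python
import math

def independent_spread_rewards(agent_pos, target_pos):
    """Per-agent rewards in the non-cooperative Spread variant (sketch):
    each agent is rewarded only for reaching its own target landmark,
    independently of the other agents' positions."""
    return [-math.dist(a, t) for a, t in zip(agent_pos, target_pos)]

agents = [(0.0, 0.0), (1.0, 1.0)]
targets = [(0.0, 1.0), (2.0, 1.0)]
print(independent_spread_rewards(agents, targets))  # [-1.0, -1.0]
```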
Figure 9 shows the percentage improvement of MA-CQL over the behavior policy in the independent Spread task. As shown, the performance of CQL does not degrade with an increasing number of agents in this setting, which requires no cooperation, in contrast to the dramatic performance decrease in the cooperative Spread task in Figure 1(c). This result further confirms that the issue we discovered stems from the failure of coordination.
B.4 Additional Results of OMAR in Single-Agent Environments
Besides the single-agent version of the Spread task shown in Figure 2, we also evaluate the effectiveness of our method in single-agent tasks by comparing it with CQL on the Maze2D domain from the D4RL benchmark [10]. Table 8 shows the results in increasing order of maze complexity (maze2d-umaze, maze2d-medium, maze2d-large). Based on the results in Table 8 and Figure 2, we observe that OMAR performs much better than CQL, indicating that OMAR is also effective in offline single-agent tasks.
maze2d-umaze  maze2d-medium  maze2d-large
CQL
OMAR
B.5 The effect of the size of the dataset
In this section, we conduct an ablation study on the effect of the dataset size, following the experimental protocol of Agarwal et al. [2]. We first generate a full replay dataset by recording all samples in the replay buffer encountered during training for million steps. Then, we randomly sample experiences from the full replay dataset to obtain several smaller datasets with the same data distribution, where .
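The subsampling protocol can be sketched as follows (the dataset size and fraction values below are placeholders for illustration; the exact settings used in our experiments are given above):

```python
import random

rng = random.Random(0)
# Stand-in for the full replay dataset (transitions recorded during training);
# sizes and fractions here are placeholders, not the paper's settings.
full_replay = [("obs", a, "reward") for a in range(100_000)]

subsets = {}
for frac in (0.01, 0.1, 0.5, 1.0):
    k = int(len(full_replay) * frac)
    # Uniform sampling without replacement keeps the data distribution of
    # each smaller dataset identical (in expectation) to the full dataset.
    subsets[frac] = rng.sample(full_replay, k)

print([len(subsets[f]) for f in (0.01, 0.1, 0.5, 1.0)])  # [1000, 10000, 50000, 100000]
```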
Figure 10 shows that the performance of MA-CQL increases given more data points for . However, it does not improve further given an even larger amount of data, and it performs much worse than fully trained online agents, failing to recover their performance. In contrast, OMAR consistently outperforms MA-CQL by a large margin when , and its performance is much closer to that of fully trained online agents given more data points. Therefore, the optimality issue persists even as the dataset grows (e.g., it can take a very long time to escape bad local optima if the objective contains very flat regions [3]). In addition, with a larger amount of data and thus a more accurate value function, the zeroth-order optimization component of OMAR can better guide the actor.
B.6 Discussion of centralized and decentralized critics in offline multi-agent RL
We attribute the lower performance in Table 5 (based on a centralized value function) compared to Table 3 (based on decentralized value functions) to the base algorithm. Table 9 compares offline independent TD3 with offline multi-agent TD3 on the different types of datasets in cooperative navigation. As shown, centralized critics underperform decentralized critics in the offline setting. Recent research [8, 34] has also demonstrated benefits of decentralized value functions over a centralized one, leading to more robust performance. We attribute the performance loss of CTDE in the offline setting to the more complex and higher-dimensional value function, which conditions on all agents' actions and the global state and is thus harder to learn well without exploration.
Random  Medium-replay  Medium  Expert
ITD3
MATD3
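The dimensionality argument above can be made concrete: a centralized critic conditions on the global state and every agent's action, so its input grows linearly with the number of agents, whereas each decentralized critic conditions only on its own observation and action. A small sketch with made-up dimensions:

```python
def critic_input_dims(n_agents, obs_dim, act_dim, state_dim):
    """Input sizes of decentralized vs. centralized critics (illustrative)."""
    decentralized = obs_dim + act_dim               # Q_i(o_i, a_i)
    centralized = state_dim + n_agents * act_dim    # Q(s, a_1, ..., a_n)
    return decentralized, centralized

# As the number of agents grows, the centralized critic's input keeps
# growing while each decentralized critic's input stays fixed, making the
# centralized function harder to fit from a static dataset.
for n in (3, 6, 9):
    print(n, critic_input_dims(n, obs_dim=18, act_dim=5, state_dim=48))
```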