Multiagent Reinforcement Learning: Rollout and Policy Iteration
Dimitri Bertsekas
Journal: IEEE/CAA Journal of Automatica Sinica
Year: 2021
Volume: 8, Issue: 2, Pages: 249-272
Keywords: Dynamic programming, multiagent problems, neuro-dynamic programming, policy iteration, reinforcement learning, rollout
ISSN: 2329-9266
DOI: 10.1109/JAS.2021.1003814
Abstract: We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.

In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby, at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.

For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate form, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
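To make the one-agent-at-a-time control selection described in the abstract concrete, the following is a minimal Python sketch (not taken from the paper) that contrasts it with standard all-at-once rollout. The names stage_cost, next_state, base_policy, and rollout_cost_to_go are hypothetical placeholders for a problem model and a base-policy cost-to-go estimator (typically obtained by simulating the base policy); the sketch only illustrates how the per-stage work becomes a sum, rather than a product, of the agents' action-set sizes.

    # Illustrative sketch only, under assumed interfaces; not the paper's implementation.
    import itertools

    def multiagent_rollout_step(state, agent_action_sets, base_policy,
                                stage_cost, next_state, rollout_cost_to_go):
        """Select one multi-component control u = (u_1, ..., u_m) at the current stage.

        Agents choose sequentially: agent i optimizes its own component while the
        components of agents 1..i-1 are held at their already-selected rollout values
        and agents i+1..m are assumed to follow the base policy. The work per stage
        is the sum of the sizes of the agents' action sets.
        """
        m = len(agent_action_sets)
        chosen = list(base_policy(state))           # start from the base policy's control
        for i in range(m):                          # one agent at a time
            best_q, best_u = float("inf"), chosen[i]
            for u_i in agent_action_sets[i]:
                trial = list(chosen)
                trial[i] = u_i                      # agents i+1..m keep base-policy components
                s_next = next_state(state, tuple(trial))
                q = stage_cost(state, tuple(trial)) + rollout_cost_to_go(s_next, base_policy)
                if q < best_q:
                    best_q, best_u = q, u_i
            chosen[i] = best_u                      # fix agent i's component for this stage
        return tuple(chosen)

    def standard_rollout_step(state, agent_action_sets, base_policy,
                              stage_cost, next_state, rollout_cost_to_go):
        """All-at-once rollout over the joint action space (exponential in m)."""
        best_q, best_u = float("inf"), None
        for u in itertools.product(*agent_action_sets):
            s_next = next_state(state, u)
            q = stage_cost(state, u) + rollout_cost_to_go(s_next, base_policy)
            if q < best_q:
                best_q, best_u = q, u
        return best_u

With m agents each having n actions, the one-at-a-time step evaluates on the order of m*n Q-factors per stage instead of n^m, which is the linear-versus-exponential comparison made in the abstract, while the sequential structure is what underlies the cost improvement guarantee relative to the base policy.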
Content Type: Journal Article
Source URL: http://ir.ia.ac.cn/handle/173211/43912
Collection: Institute of Automation_Academic Journals_IEEE/CAA Journal of Automatica Sinica
Recommended Citation
GB/T 7714: Dimitri Bertsekas. Multiagent Reinforcement Learning: Rollout and Policy Iteration[J]. IEEE/CAA Journal of Automatica Sinica, 2021, 8(2): 249-272.
APA: Dimitri Bertsekas. (2021). Multiagent Reinforcement Learning: Rollout and Policy Iteration. IEEE/CAA Journal of Automatica Sinica, 8(2), 249-272.
MLA: Dimitri Bertsekas. "Multiagent Reinforcement Learning: Rollout and Policy Iteration". IEEE/CAA Journal of Automatica Sinica 8.2 (2021): 249-272.