Multiagent Reinforcement Learning: Rollout and Policy Iteration
Dimitri Bertsekas
Journal: IEEE/CAA Journal of Automatica Sinica
Year: 2021
Volume: 8, Issue: 2, Pages: 249-272
Keywords: Dynamic programming, multiagent problems, neuro-dynamic programming, policy iteration, reinforcement learning, rollout
ISSN: 2329-9266
DOI: 10.1109/JAS.2021.1003814
Abstract: We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.

In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components, each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach whereby, at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.

For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate form, are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
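To make the one-agent-at-a-time control selection described in the abstract concrete, the following is a minimal Python sketch (not taken from the paper) that contrasts it with standard all-at-once rollout. The names stage_cost, next_state, base_policy, and rollout_cost_to_go are hypothetical placeholders for a problem model and a base-policy cost-to-go estimator (typically obtained by simulating the base policy); the sketch only illustrates how the per-stage work becomes a sum, rather than a product, of the agents' action-set sizes.

    # Illustrative sketch only, under assumed interfaces; not the paper's implementation.
    import itertools

    def multiagent_rollout_step(state, agent_action_sets, base_policy,
                                stage_cost, next_state, rollout_cost_to_go):
        """Select one multi-component control u = (u_1, ..., u_m) at the current stage.

        Agents choose sequentially: agent i optimizes its own component while the
        components of agents 1..i-1 are held at their already-selected rollout values
        and agents i+1..m are assumed to follow the base policy. The work per stage
        is the sum of the sizes of the agents' action sets.
        """
        m = len(agent_action_sets)
        chosen = list(base_policy(state))           # start from the base policy's control
        for i in range(m):                          # one agent at a time
            best_q, best_u = float("inf"), chosen[i]
            for u_i in agent_action_sets[i]:
                trial = list(chosen)
                trial[i] = u_i                      # agents i+1..m keep base-policy components
                s_next = next_state(state, tuple(trial))
                q = stage_cost(state, tuple(trial)) + rollout_cost_to_go(s_next, base_policy)
                if q < best_q:
                    best_q, best_u = q, u_i
            chosen[i] = best_u                      # fix agent i's component for this stage
        return tuple(chosen)

    def standard_rollout_step(state, agent_action_sets, base_policy,
                              stage_cost, next_state, rollout_cost_to_go):
        """All-at-once rollout over the joint action space (exponential in m)."""
        best_q, best_u = float("inf"), None
        for u in itertools.product(*agent_action_sets):
            s_next = next_state(state, u)
            q = stage_cost(state, u) + rollout_cost_to_go(s_next, base_policy)
            if q < best_q:
                best_q, best_u = q, u
        return best_u

With m agents each having n actions, the one-at-a-time step evaluates on the order of m*n Q-factors per stage instead of n^m, which is the linear-versus-exponential comparison made in the abstract, while the sequential structure is what underlies the cost improvement guarantee relative to the base policy.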
Content Type: Journal Article
Source URL: http://ir.ia.ac.cn/handle/173211/43912
Collection: Institute of Automation_Academic Journals_IEEE/CAA Journal of Automatica Sinica
Recommended Citation
GB/T 7714: Dimitri Bertsekas. Multiagent Reinforcement Learning: Rollout and Policy Iteration[J]. IEEE/CAA Journal of Automatica Sinica, 2021, 8(2): 249-272.
APA: Dimitri Bertsekas. (2021). Multiagent Reinforcement Learning: Rollout and Policy Iteration. IEEE/CAA Journal of Automatica Sinica, 8(2), 249-272.
MLA: Dimitri Bertsekas. "Multiagent Reinforcement Learning: Rollout and Policy Iteration". IEEE/CAA Journal of Automatica Sinica 8.2 (2021): 249-272.