Application of the policy gradient method to 2048

Shuhei Yamashita; Tomoyuki Kaneko

https://ipsj.ixsq.nii.ac.jp/records/213453

Name / File	License
IPSJ-GPWS2021032.pdf (2.1 MB)	Copyright (c) 2021 by the Information Processing Society of Japan
Open access

Item type

Symposium(1)

Publication date

2021-11-06

Title

2048 への方策勾配法の適用

Title (English)

Application of the policy gradient method to 2048

Language

jpn

Keywords

Subject scheme

Other

Subjects

Reinforcement learning (強化学習)
2048
Policy gradient method (方策勾配法)
Proximal Policy Optimization

Resource type

conference paper

Resource type identifier

http://purl.org/coar/resource_type/c_5794

Author affiliations

Department of Interdisciplinary Sciences, The University of Tokyo (東京大学教養学部学際科学科)
Graduate School of Arts and Sciences, The University of Tokyo (東京大学大学院総合文化研究科)

Authors

Shuhei Yamashita (山下 修平)
Tomoyuki Kaneko (金子 知適)

Abstract

This paper studies the effectiveness of policy gradient methods on 2048, a stochastic game. Reinforcement learning is a framework in which an agent learns an optimal policy through trial and error in a given environment. Its methods fall broadly into two families: those that find an optimal policy by learning state or action value functions, and those that improve the policy directly according to the policy gradient theorem. In 2048, the high scores achieved by AI agents have been improved mainly with the former approach ever since Szubert et al. proposed TD-AFTERSTATE learning. In this paper, we show that a policy can also be learned with a policy gradient method in 2048. In addition, the reward given to agents has so far been exclusively the game score; we show that equal or better results are obtained when the agent instead receives a reward of +1 per step, which encourages longer episodes.
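The two reward schemes contrasted in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the keywords name Proximal Policy Optimization, whereas this example uses a plain REINFORCE-style update on the logits of a four-action softmax policy, chosen only because it is the simplest instance of a policy gradient step. The episode data (`merge_scores`), the learning rate, and all function names are hypothetical.

```python
import math


def discounted_returns(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits, action, ret, lr=0.01):
    """One REINFORCE ascent step: logits += lr * G_t * grad log pi(action)."""
    grad = grad_log_softmax(logits, action)
    return [x + lr * ret * g for x, g in zip(logits, grad)]


# Hypothetical episode: points gained from tile merges at each of five steps.
merge_scores = [0, 4, 0, 8, 16]

# Scheme 1: the reward is the game score gained at each step.
score_returns = discounted_returns([float(s) for s in merge_scores])

# Scheme 2: the reward is +1 per step, so the return is the remaining episode length.
step_returns = discounted_returns([1.0] * len(merge_scores))

# Update a uniform policy over the four moves (up, right, down, left)
# after taking action 2 at the first step under the score-based return.
new_logits = reinforce_step([0.0, 0.0, 0.0, 0.0], action=2, ret=score_returns[0])

print(score_returns)
print(step_returns)
print(softmax(new_logits))
```

With the +1-per-step scheme the return depends only on how long the episode survives, which is the property the abstract relies on when it argues that the agent is pushed toward longer episodes.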

Bibliographic information

Proceedings of the Game Programming Workshop 2021 (ゲームプログラミングワークショップ2021論文集)

Vol. 2021, pp. 179-185, issued 2021-11-06

Publisher

Information Processing Society of Japan (情報処理学会)

Cite as

Shuhei Yamashita, Tomoyuki Kaneko, 2021: Information Processing Society of Japan, pp. 179–185.