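Bayesian Q learning method with Dyna architecture and prioritized sweeping

Jun YU (于俊); Quan LIU (刘全); Qi-ming FU (傅启明); Hong-kun SUN (孙洪坤); Gui-xing CHEN (陈桂兴)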

doi:10.3969/j.issn.1000-436x.2013.11.015


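- Journal on Communications, 2013, Vol. 34, No. 11, pp. 129-139
- Author affiliations:
  
  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
- Author biographies:
  
  Jun YU (1989-), male, born in Taizhou, Jiangsu. Master's student at Soochow University. Research interests: reinforcement learning and Bayesian inference.
  Quan LIU (1969-), male, born in Yakeshi, Inner Mongolia. Professor and doctoral supervisor at Soochow University. Research interests: reinforcement learning, intelligent information processing, and automated reasoning.
  Qi-ming FU (1985-), male, born in Huai'an, Jiangsu. Doctoral student at Soochow University. Research interests: reinforcement learning, Bayesian inference, and genetic algorithms.
  Hong-kun SUN (1988-), male, born in Huai'an, Jiangsu. Master's student at Soochow University. Research interest: reinforcement learning.
  Gui-xing CHEN (1990-), male, born in Ganzhou, Jiangxi. Master's student at Soochow University. Research interests: reinforcement learning and pattern recognition.
- Funding:
  
  National Natural Science Foundation of China (61070223, 61103045, 61070122, 61272005); Natural Science Foundation of Jiangsu Province (BK2012616); Natural Science Research Foundation of Jiangsu Higher Education Institutions (09KJA520002, 09KJB520012); Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172012K04)
- DOI: 10.3969/j.issn.1000-436x.2013.11.015
  Chinese Library Classification (CLC) number: TP181
- Online publication date: 2013-11
  
  Print publication date: 2013-11-25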
Jun YU, Quan LIU, Qi-ming FU, et al. Bayesian Q learning method with Dyna architecture and prioritized sweeping[J]. Journal on Communications, 2013, 34(11): 129-139. DOI: 10.3969/j.issn.1000-436x.2013.11.015.

Abstract

Bayesian Q-learning uses probability distributions to describe the uncertainty of Q values and selects actions based on those distributions in order to balance exploration and exploitation. However, Bayesian Q-learning converges slowly and with low accuracy. To address these problems, a Bayesian Q-learning method based on the Dyna architecture with prioritized sweeping, called Dyna-PS-BayesQL, is proposed. The method has two parts. In the learning part, it models the state transition function and the reward function of the environment from the collected samples and updates the parameters of the action-value function with Bayesian Q-learning. In the planning part, it uses prioritized sweeping and dynamic programming on the constructed model to perform planning updates of the action-value function, which makes better use of historical experience and thereby improves convergence speed and accuracy. Dyna-PS-BayesQL is applied to the chain problem and the maze navigation problem. The experimental results show that the method balances exploration and exploitation well and achieves comparatively good convergence speed and accuracy.
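The two-part structure described in the abstract (a learning step that records a model and updates the value function, followed by a planning step driven by prioritized sweeping) can be illustrated with a minimal sketch. This is not the Dyna-PS-BayesQL algorithm itself: as loudly stated assumptions, it keeps point-estimate Q values with an ordinary Q-learning update in place of the Bayesian update over Q-value distributions, and it uses a deterministic last-seen model. The function name `dyna_ps_sketch` and the parameters `alpha`, `theta`, and `n_planning` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import defaultdict


def dyna_ps_sketch(transitions, n_actions, gamma=0.95, alpha=0.5,
                   theta=1e-4, n_planning=20):
    """Process (s, a, r, s_next) samples with Dyna-style prioritized sweeping.

    Assumptions (not the paper's method): point-estimate Q values instead of
    Q-value distributions, a deterministic last-seen model, and states that
    are hashable and orderable (for example, integers).
    """
    q = defaultdict(float)            # (state, action) -> value estimate
    model = {}                        # (state, action) -> (reward, next_state)
    predecessors = defaultdict(set)   # state -> pairs observed to lead into it
    queue = []                        # min-heap of (-priority, state, action)

    def greedy_value(s):
        return max(q[(s, b)] for b in range(n_actions))

    def model_target(s, a):
        # One-step backup target computed from the learned model.
        r, s_next = model[(s, a)]
        return r + gamma * greedy_value(s_next)

    for s, a, r, s_next in transitions:
        # Learning part: update the model, then update Q from the real sample.
        model[(s, a)] = (r, s_next)
        predecessors[s_next].add((s, a))
        delta = r + gamma * greedy_value(s_next) - q[(s, a)]
        q[(s, a)] += alpha * delta
        if abs(delta) > theta:
            heapq.heappush(queue, (-abs(delta), s, a))

        # Planning part: back up the highest-priority pairs from the model.
        for _ in range(n_planning):
            if not queue:
                break
            _, ps, pa = heapq.heappop(queue)
            q[(ps, pa)] = model_target(ps, pa)
            # Pairs that lead into `ps` may now be out of date; queue them
            # when their expected change exceeds the threshold.
            for qs, qa in predecessors[ps]:
                priority = abs(model_target(qs, qa) - q[(qs, qa)])
                if priority > theta:
                    heapq.heappush(queue, (-priority, qs, qa))

    return dict(q)
```

On a two-step chain with a single action, `dyna_ps_sketch([(0, 0, 0.0, 1), (1, 0, 1.0, 2)], n_actions=1, gamma=0.9)` propagates the final reward back to state 0 during the planning pass that follows the second sample, giving a value of 0.9 for `(0, 0)`. With `n_planning=0` the same two samples leave `(0, 0)` at 0.0, which is the reuse of past experience that the planning part is meant to provide.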


