1. School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China
2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, Jilin, China
[ "于俊(1989-),男,江苏泰州人,苏州大学硕士生,主要研究方向为强化学习和贝叶斯推理。" ]
[ "刘全(1969-),男,内蒙古牙克石人,苏州大学教授、博士生导师,主要研究方向为强化学习、智能信息处理和自动推理。" ]
[ "傅启明(1985-),男,江苏淮安人,苏州大学博士生,主要研究方向为强化学习、贝叶斯推理和遗传算法。" ]
[ "孙洪坤(1988-),男,江苏淮安人,苏州大学硕士生,主要研究方向为强化学习。" ]
[ "陈桂兴(1990-),男,江西赣州人,苏州大学硕士生,主要研究方向为强化学习和模式识别。" ]
Online publication date: 2013-11
Print publication date: 2013-11-25
Jun YU, Quan LIU, Qi-ming FU, et al. Bayesian Q learning method with Dyna architecture and prioritized sweeping[J]. Journal on Communications, 2013, 34(11): 129-139. DOI: 10.3969/j.issn.1000-436x.2013.11.015.
Bayesian Q-learning describes the uncertainty of Q-values with a probability distribution and selects actions according to that distribution, so as to balance exploration and exploitation. However, Bayesian Q-learning suffers from slow convergence and low convergence precision. To address these problems, a Bayesian Q-learning method based on a Dyna architecture with prioritized sweeping, called Dyna-PS-BayesQL, is proposed. The method consists of two parts. In the learning part, it models the state transition function and reward function of the environment from collected samples and updates the parameters of the action-value function with Bayesian Q-learning. In the planning part, it performs planning updates of the action-value function with prioritized sweeping and dynamic programming based on the learned model, which makes better use of historical experience and thereby improves convergence speed and precision. Dyna-PS-BayesQL was applied to the chain problem and a maze navigation problem; the experimental results show that the method balances exploration and exploitation well and achieves better convergence speed and precision.
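The abstract describes the two-part structure of Dyna-PS-BayesQL but gives no pseudocode. The sketch below is a minimal Python illustration of that structure under several assumptions not stated in the abstract: a normal-gamma posterior over each Q(s, a) in the style of Dearden et al.'s Bayesian Q-learning, Q-value sampling for action selection, a tabular maximum-likelihood model, and a priority queue keyed by the magnitude of the predicted change in the posterior mean of Q. The class name, hyperparameters, and update rules are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch of a Dyna-PS-BayesQL-style agent (assumptions: normal-gamma
# posterior per (s, a), Q-value sampling, tabular ML model, priority = |change in Q mean|).
import heapq
import random
from collections import defaultdict

import numpy as np


class DynaPSBayesQ:
    def __init__(self, n_states, n_actions, gamma=0.95,
                 n_planning_steps=20, theta=1e-3):
        self.nS, self.nA, self.gamma = n_states, n_actions, gamma
        self.n_planning_steps, self.theta = n_planning_steps, theta
        # Normal-gamma hyperparameters (mu, lam, alpha, beta) for each (s, a).
        self.mu = np.zeros((n_states, n_actions))
        self.lam = np.ones((n_states, n_actions))
        self.alpha = np.full((n_states, n_actions), 2.0)
        self.beta = np.ones((n_states, n_actions))
        # Tabular maximum-likelihood model learned from real experience.
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s' -> count
        self.reward_sum = defaultdict(float)                   # (s, a) -> sum of rewards
        self.predecessors = defaultdict(set)                   # s' -> {(s, a)}
        self.pqueue = []                                       # max-heap via negated priority

    def select_action(self, s):
        """Q-value sampling: draw one Q per action from its posterior, act greedily."""
        tau = np.random.gamma(self.alpha[s], 1.0 / self.beta[s])       # sampled precisions
        q = np.random.normal(self.mu[s], 1.0 / np.sqrt(self.lam[s] * tau))
        return int(np.argmax(q))

    def _bayes_update(self, s, a, target):
        """Treat the bootstrapped target as one observation of Q(s, a); return |change in mean|."""
        mu0, lam0 = self.mu[s, a], self.lam[s, a]
        self.mu[s, a] = (lam0 * mu0 + target) / (lam0 + 1.0)
        self.lam[s, a] = lam0 + 1.0
        self.alpha[s, a] += 0.5
        self.beta[s, a] += 0.5 * lam0 * (target - mu0) ** 2 / (lam0 + 1.0)
        return abs(self.mu[s, a] - mu0)

    def learn(self, s, a, r, s_next):
        """Learning part: update the model and the Q posterior from a real transition, then plan."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.predecessors[s_next].add((s, a))
        target = r + self.gamma * np.max(self.mu[s_next])
        priority = self._bayes_update(s, a, target)
        if priority > self.theta:
            heapq.heappush(self.pqueue, (-priority, random.random(), s, a))
        self._plan()

    def _expected_target(self, s, a):
        """One-step dynamic-programming backup under the learned model."""
        n = sum(self.counts[(s, a)].values())
        r_bar = self.reward_sum[(s, a)] / n
        next_v = sum(c * np.max(self.mu[s2])
                     for s2, c in self.counts[(s, a)].items()) / n
        return r_bar + self.gamma * next_v

    def _plan(self):
        """Planning part: prioritized sweeping over the learned model."""
        for _ in range(self.n_planning_steps):
            if not self.pqueue:
                break
            _, _, s, a = heapq.heappop(self.pqueue)
            self._bayes_update(s, a, self._expected_target(s, a))
            # Propagate the change backwards to the predecessors of s.
            for (sp, ap) in self.predecessors[s]:
                priority = abs(self._expected_target(sp, ap) - self.mu[sp, ap])
                if priority > self.theta:
                    heapq.heappush(self.pqueue, (-priority, random.random(), sp, ap))
```

A driving loop would call select_action, execute the chosen action in the environment, and pass the resulting transition to learn; n_planning_steps and the priority threshold theta control how much model-based planning is performed per real step.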