1. School of Computer Science and Technology, Soochow University, Suzhou 215006, China
2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
ZHU Fei (1978-), male, from Suzhou, Jiangsu, Ph.D., is an associate professor at Soochow University. His research interests include machine learning, artificial intelligence, and bioinformatics.
XU Zhi-peng (1991-), male, from Jingzhou, Hubei, is a master's student at Soochow University. His research interests include reinforcement learning and artificial intelligence.
LIU Quan (1969-), male, from Yakeshi, Inner Mongolia, post-doctoral, is a professor and doctoral supervisor at Soochow University. His research interests include reinforcement learning, artificial intelligence, and automated reasoning.
FU Yu-chen (1968-), male, from Xuzhou, Jiangsu, Ph.D., is a professor and master's supervisor at Soochow University. His research interests include reinforcement learning and artificial intelligence.
WANG Hui (1968-), male, from Xi'an, Shaanxi, is a lecturer at Soochow University. His research interests include reinforcement learning and artificial intelligence.
Online publication date: 2016-06
Print publication date: 2016-06-25
ZHU Fei, XU Zhi-peng, LIU Quan, et al. Online hierarchical reinforcement learning based on interrupting Option[J]. Journal on Communications, 2016, 37(6): 65-74. DOI: 10.11959/j.issn.1000-436x.2016117.
To address the sheer volume of big data, an online-updating algorithm named Macro-Q with in-place updating (MQIU), built on the Macro-Q algorithm, was proposed. MQIU updates both the value function of abstract actions and the value function of primitive actions, which improves the utilization of data samples and speeds up convergence. Because the conventional Markov decision process model and abstract actions both cope poorly with variability, an interruption mechanism was introduced, yielding a model-free interrupting Macro-Q Option learning algorithm (IMQ) based on hierarchical reinforcement learning, which can learn and improve control strategies in a dynamic environment. Simulations verify that MQIU accelerates convergence and thus scales to larger problems, and that IMQ solves tasks faster while maintaining stable learning performance.
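The abstract's two ingredients can be illustrated in isolation: the SMDP-style Macro-Q value update applied when an abstract action (Option) terminates, and the standard interruption test that cuts an Option short as soon as some other Option looks strictly better. This is a minimal sketch of those two generic rules from the Options framework, not the paper's exact MQIU/IMQ procedures; the state names, option names, and step constants below are illustrative assumptions.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1   # discount factor and learning rate (assumed values)
Q = defaultdict(float)    # Q[(state, option)] -> estimated value, default 0

def macro_q_update(s, o, r_cum, k, s_next, options):
    # SMDP Q-learning: one Macro-Q update after option o terminates,
    # having run k primitive steps from s and accumulated the
    # discounted reward r_cum, ending in state s_next.
    best_next = max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += ALPHA * (r_cum + GAMMA ** k * best_next - Q[(s, o)])

def should_interrupt(s, o, options):
    # Interruption rule: stop option o in state s when the value of
    # switching to the best available option strictly exceeds the
    # value of continuing with o.
    return max(Q[(s, o2)] for o2 in options) > Q[(s, o)]

options = ["go_left", "go_right"]
macro_q_update("s0", "go_right", r_cum=1.0, k=3, s_next="s1", options=options)
# first update moves Q[("s0", "go_right")] toward the 3-step return:
# 0.1 * (1.0 + 0.9**3 * 0 - 0) = 0.1
```

In MQIU the same experience would additionally update primitive-action values along the option's trajectory (intra-option style), which is what raises sample efficiency; that inner loop is omitted here for brevity.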