1. Internet of Things (Perception Mine) Research Center, China University of Mining and Technology, Xuzhou 221008, China
2. Institute of Electrodynamics and Microelectronics, University of Bremen, Bremen 28359, Germany
[ "丁恩杰(1962- ),男,山东青岛人,博士,中国矿业大学教授,主要研究方向为工业物联网、模式识别、人员定位等" ]
[ "刘忠育(1985- ),男,河南辉县人,中国矿业大学博士生,主要研究方向为计算机视觉、自然语言处理等" ]
[ "刘亚峰(1985- ),男,江苏徐州人,博士,中国矿业大学助理研究员,主要研究方向为机器学习、计算机视觉、行为识别等" ]
[ "郁万里(1987- ),男,江苏徐州人,博士,不来梅大学在站博士后,主要研究方向为工业物联网、网络优化、移动边缘计算等" ]
Online publication date: 2020-02
Print publication date: 2020-02-25
Citation: Enjie DING, Zhongyu LIU, Yafeng LIU, et al. Video description method based on multidimensional and multimodal information[J]. Journal on Communications, 2020, 41(2): 36-43. DOI: 10.11959/j.issn.1000-436x.2020037.
To solve the problem of complex information representation in automatic video description tasks, a multi-dimensional and multi-modal visual feature extraction and fusion method was proposed. Firstly, multi-dimensional features such as the static and dynamic attributes of the video sequence were extracted by transfer learning, and an image captioning algorithm was applied to the key frames of the video to extract semantic information, completing the feature representation of the video. Then, multi-layer long short-term memory (LSTM) networks were used to fuse the multi-dimensional and multi-modal information and finally generate a natural language description of the video content. Simulation results show that the proposed method achieves better performance than existing methods on the automatic video description task.
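The pipeline outlined in the abstract (pretrained extractors producing static, dynamic, and key-frame semantic features, which a multi-layer LSTM fuses while decoding the description word by word) can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's reference implementation: the module names, feature dimensions, concatenation-based fusion, and teacher-forcing setup are all assumptions, and the per-video features are taken to be precomputed by off-the-shelf networks.

```python
# Minimal sketch of multi-modal fusion with a multi-layer LSTM decoder.
# All dimensions and names are illustrative assumptions; static features
# might come from a 2D CNN, dynamic features from a 3D CNN, and semantic
# features from an image captioning model run on key frames.
import torch
import torch.nn as nn

class MultiModalCaptioner(nn.Module):
    def __init__(self, static_dim=2048, dynamic_dim=1024, semantic_dim=512,
                 hidden_dim=512, vocab_size=10000, num_layers=2):
        super().__init__()
        # Project each modality into a shared space before fusion.
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.dynamic_proj = nn.Linear(dynamic_dim, hidden_dim)
        self.semantic_proj = nn.Linear(semantic_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        # The multi-layer LSTM consumes the fused visual context together
        # with the previous word embedding at every decoding step.
        self.lstm = nn.LSTM(4 * hidden_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, static_feat, dynamic_feat, semantic_feat, captions):
        # static_feat:   (B, static_dim), e.g. pooled 2D-CNN features
        # dynamic_feat:  (B, dynamic_dim), e.g. 3D-CNN clip features
        # semantic_feat: (B, semantic_dim), key-frame caption embedding
        # captions:      (B, T) ground-truth word ids (teacher forcing)
        B, T = captions.shape
        fused = torch.cat([self.static_proj(static_feat),
                           self.dynamic_proj(dynamic_feat),
                           self.semantic_proj(semantic_feat)], dim=-1)
        # Repeat the fused visual context at every decoding step.
        fused = fused.unsqueeze(1).expand(B, T, -1)
        words = self.embed(captions)
        out, _ = self.lstm(torch.cat([fused, words], dim=-1))
        return self.classifier(out)  # (B, T, vocab_size) logits

# Shape check with random inputs (hypothetical dimensions).
model = MultiModalCaptioner()
logits = model(torch.randn(4, 2048), torch.randn(4, 1024),
               torch.randn(4, 512), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 10000])
```

Plain concatenation keeps the sketch short; an attention mechanism over per-frame features, common in the video captioning literature, would be a natural refinement of this fusion step.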