1. School of Computer Science and Engineering, Central South University, Changsha 410083, China
2. Institute of Microelectronics, Hangzhou Dianzi University, Hangzhou 310005, China
3. College of Computer Science and Technology, Zhejiang University, Hangzhou 310012, China
LIU Liangzhen (1998- ), male, born in Shaoyang, Hunan, is a Ph.D. candidate at Central South University. His research interests include video action recognition and video behavior supervision.
YANG Yang (1999- ), male, born in Hefei, Anhui, is a Ph.D. candidate at Central South University. His research interests include code generation, intelligent operations and maintenance, and multimodal learning.
XIA Yingjie (1982- ), male, born in Fenghua, Zhejiang, Ph.D., is a distinguished professor at Hangzhou Dianzi University and an adjunct professor at Zhejiang University. His research interests include intelligent transportation and information security.
KUANG Li (1982- ), female, born in Changsha, Hunan, Ph.D., is a professor and doctoral supervisor at Central South University. Her research interests include intelligent software engineering and service supervision.
Received: 2024-07-29; Revised: 2024-12-05; Published in print: 2024-12-25
LIU Liangzhen,YANG Yang,XIA Yingjie,et al.Study on video action recognition based on augment negative example multi-granularity discrimination model[J].Journal on Communications,2024,45(12):28-43. DOI: 10.11959/j.issn.1000-436x.2024268.
To improve the model's fine-grained discrimination of video actions, an augmented negative example discrimination paradigm based on contrastive learning was proposed. For each video sample, an augmented negative example set was generated to supply the most challenging video-text negative pairs. Based on this paradigm, a multi-granularity discrimination model for video action recognition was proposed to further distinguish between positive and negative examples. In this model, video features were extracted by the video representation module under the guidance of textual positive-example features, while self-correlation relationships between positive and negative semantics were established by the semantic discriminator equipped with a self-attention mechanism. The model achieves both a coarse-grained distinction between the video modality and the augmented negative example set, and a fine-grained distinction between positive examples and the augmented negative example set within the text modality. Experimental results demonstrate that the augmented negative example set significantly improves the model's recognition ability on fine-grained class labels, and that the multi-granularity discrimination model outperforms current representative methods on the Kinetics-400, HMDB51, and UCF101 datasets.
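The contrastive objective underlying the paradigm described above can be sketched as an InfoNCE-style loss, in which a video embedding is scored against one positive caption and its augmented negative captions. This is a minimal illustrative sketch, not the paper's exact formulation; the function names, vectors, and temperature value are assumptions for illustration only.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(video_vec, text_vecs, tau=0.07):
    """InfoNCE-style loss: text_vecs[0] is the positive caption,
    the remaining entries are (augmented) negative captions.
    Lower loss means the video is matched to its positive caption."""
    sims = [cosine(video_vec, t) / tau for t in text_vecs]
    m = max(sims)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    return -math.log(exps[0] / sum(exps))
```

As a sanity check, pairing a video embedding with a semantically close positive caption yields a much lower loss than pairing it with a hard negative, which is exactly the gap the augmented negative set is meant to widen during training.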