Video anomaly detection via cross-modal fusion and hyperbolic graph attention mechanism

JIANG Di; LAI Huicheng; WANG Liejun

doi:10.11959/j.issn.1000-436x.2025110


Research article | Updated: 2025-07-04

- Video anomaly detection via cross-modal fusion and hyperbolic graph attention mechanism
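- Journal on Communications, 2025, Vol. 46, No. 6, pp. 136-152
- Author affiliations:
  
  1. School of Computer Science and Technology, Xinjiang University, Urumqi 830017, Xinjiang, China
  2. Xinjiang Uygur Autonomous Region Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830017, Xinjiang, China
  3. Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Urumqi 830017, Xinjiang, China
- Author biographies:
  
  JIANG Di (1997- ), male, born in Jining, Shandong, is a Ph.D. candidate at Xinjiang University. His research interests include video anomaly detection, deep learning, and object detection.
  LAI Huicheng (1963- ), male, born in Deyang, Sichuan, is a professor and doctoral supervisor at Xinjiang University. His research interests include video/image information processing and image understanding and recognition.
  WANG Liejun (1975- ), male, born in Meishan, Sichuan, Ph.D., is a professor and doctoral supervisor at Xinjiang University. His research interests include video communication processing and image recognition and processing.
- Funding:
  
  The Joint Funds of the National Natural Science Foundation of China (U1903213); The Key Research and Development Program of Xinjiang Uygur Autonomous Region (2022B01008)
- DOI: 10.11959/j.issn.1000-436x.2025110
  CLC number: TP391.41
- Received: 2025-04-17
  
  Revised: 2025-06-04
  
  Published in print: 2025-06-25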
JIANG Di, LAI Huicheng, WANG Liejun. Video anomaly detection via cross-modal fusion and hyperbolic graph attention mechanism[J]. Journal on Communications, 2025, 46(06): 136-152. DOI: 10.11959/j.issn.1000-436x.2025110.

Abstract

To address modality information imbalance, non-uniform audio-visual noise, and modality asynchrony in video anomaly detection, a multimodal video anomaly detection method named CM-HVAD was proposed, which combines a dynamic cross-modal fusion module with a hyperbolic graph attention mechanism to detect abnormal behavior accurately. Firstly, a novel dynamic cross-modal fusion module was introduced to dynamically compress multimodal features, learn cross-modal weights autonomously, and balance visual and audio features for enhanced fusion. Secondly, to address modality asynchrony in multimodal data, a modality consistency alignment module was proposed, which aligned modal semantics along the temporal frame sequence to ensure both temporal and semantic consistency. Finally, a hyperbolic graph attention mechanism was incorporated to capture the hierarchical relationships between normal and abnormal representations through the pattern separation property of hyperbolic space, thereby improving detection accuracy. Experimental results show that the proposed method achieves 86.47% AP on XD-Violence and 87.12% AUC on UCF-Crime, outperforming the baseline methods.
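
The abstract does not spell out how the hyperbolic graph attention is computed, so the following is only an illustrative sketch and not the CM-HVAD implementation. It assumes the standard Poincaré ball model of hyperbolic space, where the distance between two points $x$ and $y$ inside the unit ball is $d(x, y) = \operatorname{arcosh}\left(1 + \frac{2\lVert x - y\rVert^2}{(1 - \lVert x\rVert^2)(1 - \lVert y\rVert^2)}\right)$, and it assumes attention weights are obtained by a softmax over negative distances. The function names `poincare_distance` and `hyperbolic_attention` and the `temperature` parameter are hypothetical.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_x = sum(v * v for v in x)
    sq_norm_y = sum(v * v for v in y)
    if sq_norm_x >= 1.0 or sq_norm_y >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y))
    return math.acosh(arg)


def hyperbolic_attention(query, neighbors, temperature=1.0):
    """Assumed scoring rule: softmax over negative hyperbolic distances.

    Neighbors that are closer to the query in hyperbolic space receive
    larger weights. The weights sum to 1.
    """
    scores = [-poincare_distance(query, n) / temperature for n in neighbors]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two pairs with the same Euclidean gap (0.1) but different hyperbolic gaps:
    # the pair near the boundary of the ball is pushed much further apart.
    near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
    near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
    print("near origin:", near_origin)
    print("near boundary:", near_boundary)

    # Attention of one node over two hypothetical neighbor embeddings.
    weights = hyperbolic_attention([0.1, 0.0], [[0.2, 0.0], [0.0, 0.9]])
    print("attention weights:", weights)
```

The demo illustrates why hyperbolic space can help separate patterns: the same Euclidean gap corresponds to a larger geodesic distance near the boundary of the ball, so embeddings pushed toward the boundary become easier to tell apart. Scoring by negative distance is one plausible choice among several; the paper may use a different scoring function.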


