1. School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2. Key Laboratory of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
[ "李昂(1995- ),男,河南周口人,南京邮电大学博士生,主要研究方向为多媒体通信、人工智能" ]
[ "陈建新(1973- ),男,江苏南通人,博士,南京邮电大学副教授、硕士生导师,主要研究方向为无线通信、人机交互" ]
[ "魏昕(1983- ),男,江苏南京人,博士,南京邮电大学教授、硕士生导师,主要研究方向为多媒体通信" ]
[ "周亮(1981- ),男,安徽芜湖人,博士,南京邮电大学教授、博士生导师,主要研究方向为多媒体通信" ]
Online publication date: 2022-06
Print publication date: 2022-06-25
Ang LI, Jianxin CHEN, Xin WEI, et al. 6G-oriented cross-modal signal reconstruction technology[J]. Journal on Communications, 2022, 43(6): 28-40. DOI: 10.11959/j.issn.1000-436x.2022093.
Objectives: Multimodal services combining audio, video, and haptics, such as mixed reality, digital twins, and the metaverse, are widely expected to become killer applications in the 6G era. However, the large volume of multimodal data generated by these services places a heavy burden on the signal processing, transmission, and storage capabilities of existing communication systems. To meet users' demand for immersive experiences while guaranteeing low-latency, high-reliability, high-capacity communication, a cross-modal signal reconstruction scheme that reduces the amount of transmitted data is urgently needed to support 6G immersive multimodal services.

Methods: First, by controlling a robot to touch various materials, a dataset containing audio, visual, and haptic signals, VisTouch, was constructed to lay the foundation for subsequent research on cross-modal problems. Second, by exploiting the semantic correlation among multimodal signals, a universal and robust end-to-end cross-modal signal reconstruction architecture was designed, comprising three parts: a feature extraction module, a reconstruction module, and an evaluation module. The feature extraction module maps the source-modal signal into a semantic feature vector in a common semantic space, and the reconstruction module inverse-transforms this vector into the target-modal signal; the cascade of these two modules is the key to crossing the modality "barrier". The evaluation module assesses reconstruction quality along both the semantic dimension and the spatio-temporal dimension of the signal itself, and during training feeds optimization information back to the feature extraction and reconstruction modules, forming a closed loop that achieves accurate signal reconstruction through continuous iteration. Third, taking the reconstruction of haptic signals from video signals as an example, a video-assisted haptic reconstruction model was built, consisting of a 3D CNN-based video feature extraction network, a fully convolutional GAN generator, and a CNN-based GAN discriminator. Further, a teleoperation platform was designed, with the haptic reconstruction model deployed in the codec, to verify the model's operational efficiency in practice. Finally, experimental results verify the reliability of the cross-modal signal reconstruction architecture and the accuracy of the haptic reconstruction model.

Results: The constructed VisTouch dataset covers three modalities (audio, video, and haptics) and contains 47 sheet-like material samples common in daily life. Data were collected by a scripted robotic hand sliding over each material: the sliding friction force generated between the fingertip and the material was recorded as the haptic signal, while a high-definition camera and a unidirectional microphone mounted on the robotic hand captured the video and audio signals, all synchronized by timestamps. The video-assisted haptic reconstruction model achieved a mean absolute error of 0.0135 and an accuracy of 0.78 on VisTouch. To bring the proposed cross-modal signal reconstruction framework into a practical application scenario, a teleoperation platform for remote object grasping in industrial settings was further built using a robot and an NVIDIA development board. On this platform, the actual mean absolute error was 0.0126, the total end-to-end delay was 127 ms, and the reconstruction model delay was 98 ms. A questionnaire survey was also used to assess user satisfaction: haptic realism satisfaction had a mean of 4.43 with a variance of 0.72, and delay satisfaction had a mean of 3.87 with a variance of 1.07.

Conclusions: The experimental results fully demonstrate the practicality of the constructed VisTouch dataset and the accuracy of the video-assisted haptic reconstruction model. Meanwhile, the teleoperation platform tests show that users found the generated haptic signals close to the real signals, but were only moderately satisfied with the algorithm's running time; that is, the model's complexity needs further optimization.
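The Methods section describes a three-module pipeline: feature extraction into a common semantic space, inverse transformation back into the target modality, and an evaluation module whose score is fed back in a closed loop. The sketch below is a deliberately minimal pure-Python illustration of that control flow, not the authors' implementation (which uses a 3D CNN extractor and a GAN generator/discriminator); the linear maps, hill-climbing update, and all function names are illustrative assumptions.

```python
import random

def extract_features(signal, W):
    # Feature extraction module: map a source-modal signal into a
    # semantic feature vector in the common semantic space (toy linear map).
    return [sum(w * s for w, s in zip(row, signal)) for row in W]

def reconstruct(z, V):
    # Reconstruction module: inverse-transform the semantic feature
    # vector into the target-modal signal (toy linear map).
    return [sum(v * x for v, x in zip(row, z)) for row in V]

def mae(pred, target):
    # Evaluation module: mean absolute error, the metric reported in Results.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def train(pairs, dim_src, dim_sem, dim_tgt, iters=300, seed=0):
    # Closed-loop training: the evaluation score is fed back to the
    # extraction and reconstruction modules; here a simple hill climb
    # stands in for the paper's gradient-based adversarial training.
    rnd = random.Random(seed)
    W = [[rnd.uniform(-1, 1) for _ in range(dim_src)] for _ in range(dim_sem)]
    V = [[rnd.uniform(-1, 1) for _ in range(dim_sem)] for _ in range(dim_tgt)]

    def loss():
        return sum(mae(reconstruct(extract_features(s, W), V), t)
                   for s, t in pairs) / len(pairs)

    init_loss = best_loss = loss()
    for _ in range(iters):
        i, j = rnd.randrange(dim_sem), rnd.randrange(dim_src)
        k, l = rnd.randrange(dim_tgt), rnd.randrange(dim_sem)
        old_w, old_v = W[i][j], V[k][l]
        W[i][j] += rnd.uniform(-0.2, 0.2)
        V[k][l] += rnd.uniform(-0.2, 0.2)
        trial = loss()
        if trial < best_loss:
            best_loss = trial                 # keep an improving perturbation
        else:
            W[i][j], V[k][l] = old_w, old_v   # revert and iterate again
    return W, V, init_loss, best_loss

# Hypothetical usage: "video" features of dimension 2 mapped to a
# 1-dimensional "haptic" signal through a 2-dimensional semantic space.
pairs = [([1.0, 0.0], [0.5]), ([0.0, 1.0], [1.0]), ([1.0, 1.0], [1.5])]
W, V, init_loss, best_loss = train(pairs, dim_src=2, dim_sem=2, dim_tgt=1)
```

The cascade `reconstruct(extract_features(...))` mirrors the paper's key idea that chaining the two modules through the shared semantic space is what crosses the modality "barrier", while the accept/revert step plays the role of the evaluation module's feedback path.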