School of Information Systems Engineering, Information Engineering University, Zhengzhou 450001, China
[ "张文林(1982- ),男,湖北黄冈人,博士,信息工程大学副教授,主要研究方向为语音信号处理、语音识别、机器学习" ]
[ "刘雪鹏(1996- ),男,山东泰安人,信息工程大学硕士生,主要研究方向为智能信息处理、无监督学习、语音表示学习" ]
[ "牛铜(1984- ),男,河南安阳人,博士,信息工程大学副教授,主要研究方向为深度学习、语音信号处理和语音识别" ]
[ "陈琦(1974- ),男,河南郑州人,信息工程大学副教授,主要研究方向为语音信号处理、语音识别和音频水印" ]
[ "屈丹(1974- ),女,吉林九台人,博士,信息工程大学教授,主要研究方向为机器学习、深度学习和语音识别" ]
Online publication date: 2022-06
Print publication date: 2022-07-25
Wenlin ZHANG, Xuepeng LIU, Tong NIU, et al. Self-supervised speech representation learning based on positive sample comparison and masking reconstruction[J]. Journal on Communications, 2022, 43(7): 163-171. DOI: 10.11959/j.issn.1000-436x.2022142.
Existing contrastive-prediction-based self-supervised speech representation learning methods need to construct a large number of negative samples, so their performance depends on large training batches and consumes substantial computing resources. To solve this problem, a speech contrastive learning method using only positive samples was proposed, and it was combined with a masked-reconstruction task to obtain a multi-task self-supervised speech representation learning method that improves representation quality while reducing training cost. Inspired by the SimSiam method in image self-supervised representation learning, the positive-sample contrastive task adopted a Siamese network architecture: two random augmentations of the input speech signal were processed by the same encoder, a feed-forward network was then applied on one branch while a stop-gradient operation was applied on the other, and the model parameters were adjusted to maximize the similarity between the outputs of the two branches. Since no negative samples were constructed during training, small batches could be used, which greatly improved learning efficiency. Representations were pre-trained on the LibriSpeech corpus and fine-tuned on a variety of downstream tasks. Comparative experiments show that the model obtained by the proposed method matches or exceeds the performance of existing mainstream speech representation learning models on multiple downstream tasks.
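To make the training objective concrete, below is a minimal PyTorch sketch of a SimSiam-style positive-only contrastive loss combined with a masked-reconstruction term, as described in the abstract. The encoder, the time-masking augmentation, the reconstruction head, the equal loss weighting, and all dimensions are illustrative assumptions, not the authors' exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy frame-level encoder (assumption: the paper's model is a Transformer)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, x):                    # x: (batch, frames, feat_dim)
        return self.net(x)                   # frame-level representations

class Predictor(nn.Module):
    """Feed-forward head applied to one branch only, as in SimSiam."""
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                 nn.Linear(bottleneck, dim))

    def forward(self, z):
        return self.net(z)

def mask_frames(x, mask_prob=0.15):
    """Placeholder SpecAugment-like time masking; returns masked view and mask."""
    keep = torch.rand(x.shape[:2], device=x.device) > mask_prob
    return x * keep.unsqueeze(-1).float(), ~keep

def neg_cosine(p, z):
    """Negative cosine similarity; stop-gradient on the target branch."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

feat_dim, hidden = 80, 256
encoder, predictor = Encoder(feat_dim, hidden), Predictor(hidden)
reconstructor = nn.Linear(hidden, feat_dim)  # masked-reconstruction head

x = torch.randn(8, 100, feat_dim)            # small batch: no negatives needed
v1, m1 = mask_frames(x)                      # two random views of one utterance
v2, _ = mask_frames(x)
h1, h2 = encoder(v1), encoder(v2)

# Positive-only contrastive term on pooled utterance embeddings (symmetrized).
z1, z2 = h1.mean(dim=1), h2.mean(dim=1)
p1, p2 = predictor(z1), predictor(z2)
contrastive = 0.5 * (neg_cosine(p1, z2) + neg_cosine(p2, z1))

# Masked-reconstruction term: recover original features at masked frames.
recon = F.l1_loss(reconstructor(h1)[m1], x[m1])

loss = contrastive + recon                   # equal weighting is an assumption
loss.backward()
```

The stop-gradient (`z.detach()`) is what prevents the two branches from collapsing to a constant output in the absence of negative samples, which is why training remains stable even with small batch sizes.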
CHEN H J. Low-resource speech representation learning and its applications[D]. Xi'an: Northwestern Polytechnical University, 2018.
ZHU Y. Research on deep learning-based representation learning algorithms[D]. Hefei: Hefei University of Technology, 2018.
LIU X P, ZHANG W L. An overview of self-supervised speech representation learning[C]//Proceedings of the 16th National Conference on Man-Machine Speech Communication. Beijing: Chinese Information Processing Society of China, 2021: 284-293.
YANG S W, CHI P H, CHUANG Y S, et al. SUPERB: speech processing universal performance benchmark[C]//Proceedings of Interspeech 2021. Piscataway: IEEE Press, 2021: 1194-1198.
OORD A V D, LI Y Z, VINYALS O. Representation learning with contrastive predictive coding[J]. arXiv preprint, arXiv:1807.03748, 2018.
SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. wav2vec: unsupervised pre-training for speech recognition[C]//Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 3465-3469.
BAEVSKI A, SCHNEIDER S, AULI M. vq-wav2vec: self-supervised learning of discrete speech representations[J]. arXiv preprint, arXiv:1910.05453, 2019.
BAEVSKI A, ZHOU H, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[J]. Advances in Neural Information Processing Systems, 2020, 33: 12449-12460.
GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: a new approach to self-supervised learning[J]. Advances in Neural Information Processing Systems, 2020, 33: 21271-21284.
CHEN X L, HE K M. Exploring simple Siamese representation learning[C]//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2021: 15745-15753.
HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Massachusetts: MIT Press, 2017: 6000-6010.
PANAYOTOV V, CHEN G G, POVEY D, et al. Librispeech: an ASR corpus based on public domain audio books[C]//Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 5206-5210.
HSU W N, TSAI Y H H, BOLTE B, et al. HuBERT: how much can a bad teacher benefit ASR pre-training?[C]//Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2021: 6533-6537.
CHEN S Y, WANG C Y, CHEN Z Y, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing[J]. arXiv preprint, arXiv:2110.13900, 2021.
CHUNG Y A, HSU W N, TANG H, et al. An unsupervised autoregressive model for speech representation learning[C]//Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 146-150.
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. [S.l.: s.n.], 2019: 4171-4186.
CHUNG Y A, TANG H, GLASS J. Vector-quantized autoregressive predictive coding[C]//Proceedings of Interspeech 2020. Piscataway: IEEE Press, 2020: 3760-3764.
LIU A T, YANG S W, CHI P H, et al. Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders[C]//Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2020: 6419-6423.
LIU A T, LI S W, LEE H Y. TERA: self-supervised learning of transformer encoder representation for speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 2351-2366.
YUE X H, LI H Z. Phonetically motivated self-supervised speech representation learning[C]//Proceedings of Interspeech 2021. Piscataway: IEEE Press, 2021: 746-750.
JIANG D W, LI W B, CAO M, et al. Speech SimCLR: combining contrastive and reconstruction objective for self-supervised speech representation learning[C]//Proceedings of Interspeech 2021. Piscataway: IEEE Press, 2021: 1544-1548.
CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]//Proceedings of the 37th International Conference on Machine Learning. New York: PMLR, 2020: 1597-1607.
ZAIEM S, PARCOLLET T, ESSID S. Pretext tasks selection for multitask self-supervised speech representation learning[J]. arXiv preprint, arXiv:2107.00594, 2021.
CHICCO D. Siamese neural networks: an overview[J]. Artificial Neural Networks, 2021: 73-94.
HE K M, FAN H Q, WU Y X, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 9726-9735.
PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition[C]//Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 2613-2617.
GAROFOLO J S, LAMEL L F, FISHER W M, et al. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1[R]. NASA STI/Recon Technical Report N, 1993.
GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition[C]//Proceedings of Interspeech 2020. Piscataway: IEEE Press, 2020: 5036-5040.
WATANABE S, HORI T, KARITA S, et al. ESPnet: end-to-end speech processing toolkit[C]//Proceedings of Interspeech 2018. Piscataway: IEEE Press, 2018: 2207-2211.
LUGOSCH L, RAVANELLI M, IGNOTO P, et al. Speech model pre-training for end-to-end spoken language understanding[C]//Proceedings of Interspeech 2019. Piscataway: IEEE Press, 2019: 814-818.
NAGRANI A, CHUNG J S, XIE W D, et al. Voxceleb: large-scale speaker verification in the wild[J]. Computer Speech & Language, 2020, 60: 101027.
ANGUERA X, RODRIGUEZ-FUENTES L J, BUZO A, et al. QUESST2014: evaluating query-by-example speech search in a zero-resource setting with real-life queries[C]//Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 5833-5837.
SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: robust DNN embeddings for speaker recognition[C]//Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2018: 5329-5333.