End-to-end scene text detection and recognition algorithm based on Transformer decoders

Jinzhi ZHENG; Ruyi JI; Libo ZHANG; Chen ZHAO

doi:10.11959/j.issn.1000-436x.2023070

Academic paper | Updated: 2024-06-06

- Original title: 基于Transformer解码的端到端场景文本检测与识别算法
- English title: End-to-end scene text detection and recognition algorithm based on Transformer decoders
- Journal on Communications (通信学报), 2023, Vol. 44, No. 5, pp. 64-78
- Author affiliations:
  
  1. Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences, Beijing 100190
  2. University of Chinese Academy of Sciences, Beijing 100190
  3. State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190
- About the authors:
  
  Jinzhi ZHENG (1989- ), male, born in Zhoukou, Henan. Ph.D. candidate at the University of Chinese Academy of Sciences. His research interests include machine vision and natural language processing.
  Ruyi JI (1988- ), male, born in Rizhao, Shandong. Ph.D., assistant researcher at the Institute of Software, Chinese Academy of Sciences. His research interests include machine learning, computer vision, image processing, and pattern recognition.
  Libo ZHANG (1989- ), male, born in Fuyang, Anhui. Ph.D., associate researcher and master's supervisor at the Institute of Software, Chinese Academy of Sciences. His research interests include image processing and pattern recognition.
  Chen ZHAO (1967- ), male, born in Pu'er, Yunnan. Ph.D., researcher and doctoral supervisor at the Institute of Software, Chinese Academy of Sciences. His research interests include compiler technology, operating systems, and network software.
- CLC number: TP391
- Published online: 2023-05
  
  Print publication: 2023-05-25
- Citation: Jinzhi ZHENG, Ruyi JI, Libo ZHANG, et al. End-to-end scene text detection and recognition algorithm based on Transformer decoders[J]. Journal on Communications, 2023, 44(5): 64-78. DOI: 10.11959/j.issn.1000-436x.2023070.

Abstract

To detect and recognize scene text of arbitrary shape, a new end-to-end trainable scene text detection and recognition algorithm is proposed. First, a segmentation-based detection branch built around a text-aware module detects scene text from the visual features extracted by a convolutional network. Then, a recognition branch composed of a Transformer-based vision module and a Transformer-based language module encodes text features for the detected regions. Finally, a fusion gate in the recognition branch fuses the encoded text features and outputs the scene text. Experiments on three benchmark datasets, Total-Text, ICDAR2013, and ICDAR2015, show that the proposed algorithm performs well in recall, precision, and F-score, and has some advantage in time efficiency.
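The abstract names a fusion gate that combines the features encoded by the vision module and the language module, but it does not give the gate's formula. The sketch below shows one common form of gated fusion, under stated assumptions: a sigmoid gate computed from the concatenated features, followed by an element-wise weighted sum. The function name `fusion_gate`, the parameters `weights` and `bias`, and the gate formula itself are illustrative choices, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(visual: Vector, language: Vector,
                weights: List[Vector], bias: Vector) -> Vector:
    """Fuse two feature vectors of length d with an element-wise gate.

    Assumed form (hypothetical, not confirmed by the paper):
        g     = sigmoid(W @ [visual; language] + b)   # W has shape d x 2d
        fused = g * visual + (1 - g) * language
    """
    d = len(visual)
    if len(language) != d or len(weights) != d or len(bias) != d:
        raise ValueError("visual, language, weights and bias must agree on d")
    concat = visual + language  # list concatenation, length 2d
    fused = []
    for i in range(d):
        if len(weights[i]) != 2 * d:
            raise ValueError("each weight row must have length 2d")
        z = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(z)  # how much to trust the visual feature at position i
        fused.append(g * visual[i] + (1.0 - g) * language[i])
    return fused


if __name__ == "__main__":
    # With zero weights and zero bias the gate is 0.5 everywhere,
    # so the result is the plain average of the two inputs.
    zero_w = [[0.0] * 4, [0.0] * 4]
    print(fusion_gate([1.0, 0.0], [0.0, 2.0], zero_w, [0.0, 0.0]))  # [0.5, 1.0]
```

In a trained model the weights and bias would be learned, so the gate can lean on the visual feature where the image is clear and on the language feature where context is more reliable.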
