Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF

Xiuzhang YANG (杨秀璋); Guojun PENG (彭国军); Zichuan LI (李子川); Yangqi LYU (吕杨琦); Side LIU (刘思德); Chenguang LI (李晨光)

doi:10.11959/j.issn.1000-436x.2022116


Academic paper | Updated: 2024-06-06

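- Original title (Chinese): 基于Bert和BiLSTM-CRF的APT攻击实体识别及对齐研究
- English title: Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF
- Journal on Communications (通信学报), 2022, Vol. 43, No. 6, pp. 58-70
- Author affiliations:
  
  1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan University, Wuhan 430072, Hubei
  2. School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei
- About the authors:
  
  Xiuzhang YANG (1991- ), male, from Kaili, Guizhou; PhD candidate at Wuhan University; main research interest: network and information system security.
  Guojun PENG (1979- ), male, from Jingzhou, Hubei; PhD; professor and doctoral supervisor at Wuhan University; main research interest: network and information system security.
  Zichuan LI (1999- ), male, from Handan, Hebei; master's student at Wuhan University; main research interests: IoT security, automated vulnerability discovery and exploitation.
  Yangqi LYU (1997- ), female, from Xiaogan, Hubei; master's student at Wuhan University; main research interest: network and information system security.
  Side LIU (1997- ), male, from Jingzhou, Hubei; PhD candidate at Wuhan University; main research interests: malicious code detection and system security.
  Chenguang LI (1999- ), male, from Shiyan, Hubei; master's student at Wuhan University; main research interest: network and information system security.
- Funding:
  
  National Natural Science Foundation of China (62172308, U1626107, 61972297, 62172144)
- DOI: 10.11959/j.issn.1000-436x.2022116
  CLC number: TP309
- Online publication date: 2022-06
  
  Print publication date: 2022-06-25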
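Xiuzhang YANG, Guojun PENG, Zichuan LI, et al. Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF[J]. Journal on Communications, 2022, 43(6): 58-70. DOI: 10.11959/j.issn.1000-436x.2022116.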

Abstract (translated from Chinese)

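Purpose: In today's complex and constantly changing cybersecurity environment, countering advanced persistent threat (APT) attacks has become an urgent problem for the whole security community. The large volume of APT analysis reports and threat intelligence produced by security companies has great research value: it tracks the activity of APT groups and thereby supports attribution analysis of network attack events. These reports remain underused, however, and automated methods for turning them into structured knowledge and feature profiles of hacker groups are lacking. This paper therefore proposes an automatic APT attack knowledge extraction method that combines entity recognition with entity alignment, aiming to extract entities from APT analysis reports automatically and build structured knowledge about APT groups.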

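Methods: An automatic APT attack knowledge extraction method combining entity recognition and entity alignment is designed. First, 12 entity categories are defined according to the characteristics of APT attacks; a preprocessing layer applies lowercase conversion, data cleaning and data annotation to the corpus, and the preprocessed APT text sequences are represented as vectors. Second, Bert pre-training encodes each word and generates the corresponding vector; a BiLSTM model captures long-distance and contextual semantic features, and an attention mechanism highlights key features and converts the vector sequence into a matrix of label probabilities. Third, a CRF decodes the relationships between the predicted output labels to produce the optimal label sequence. Finally, an entity alignment method based on semantic similarity and Birch is built; knowledge matching improves the quality of the extracted APT attack knowledge, which is then merged into a knowledge infobox for each APT group.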
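The preprocessing step described above (lowercase conversion, cleaning, annotation) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the BIO scheme, the tag names `ORG`, `TOOL`, `METHOD` and `VUL`, the cleaning rules, and the lexicon-based tagging (standing in for the paper's data annotation, whose procedure the abstract does not describe) are all assumptions.

```python
import re

# Hypothetical lexicon: lowercase phrase -> assumed tag name (not the paper's label set).
LEXICON = {
    "apt28": "ORG",
    "cobalt strike": "TOOL",
    "spearphishing": "METHOD",
    "cve-2017-11882": "VUL",
}

def clean_and_lowercase(text):
    """Lowercase the text, drop URLs, and split it into tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    # Keep hyphens and dots that sit inside a token, e.g. cve-2017-11882.
    return re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", text)

def bio_annotate(tokens, lexicon):
    """Assign BIO labels by matching lexicon phrases, longest phrase first."""
    labels = ["O"] * len(tokens)
    phrases = sorted(lexicon.items(), key=lambda item: -len(item[0].split()))
    for phrase, category in phrases:
        words = phrase.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            # Only label a match if none of its tokens is already labeled.
            if tokens[i:i + n] == words and all(l == "O" for l in labels[i:i + n]):
                labels[i] = "B-" + category
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + category
    return labels

tokens = clean_and_lowercase(
    "APT28 used Cobalt Strike and Spearphishing to exploit CVE-2017-11882."
)
for token, label in zip(tokens, bio_annotate(tokens, LEXICON)):
    print(token, label)
```

The resulting token and label pairs are the kind of sequence a tagger such as Bert with BiLSTM-CRF would be trained on; the vectorization step itself is left out here.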

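Results: For entity recognition, the proposed method outperforms six common baselines (CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF and Bert-CRF), reaching a precision of 0.9296, a recall of 0.8733 and an F1-score of 0.9006. Its F1-score is 14.32% higher than that of CRF, 6.92% higher than that of CNN-CRF (which incorporates a convolutional neural network), 8.43% and 5.30% higher than those of LSTM-CRF and BiLSTM-CRF respectively, 8.74% higher than that of GRU-CRF, and 7.03% higher than that of Bert-CRF. The model's accuracy is 0.9004, which is 9.85% above the average of the other six models. Training is also more stable: the accuracy curve converges faster and reaches high accuracy within fewer training batches, and the error converges faster over the training epochs with a smoother curve. Among the entity categories, the model predicts "attack method" best, with an F1-score of 0.9275. One reason is that this category contains many entities; another is that such entities appear widely in semantically rich APT attack events and carry the action characteristics of attack behavior, which makes them easier to recognize.

For entity recognition with small-sample annotation, the method reaches a precision of 0.7800, a recall of 0.5894 and an F1-score of 0.6714. Its F1-score is 27.42% higher than that of CRF, 18.78% higher than that of LSTM-CRF, 23.62% higher than that of GRU-CRF, 13.25% higher than that of BiLSTM-CRF, 14.88% higher than that of CNN-CRF, and 14.46% higher than that of Bert-CRF. This experiment shows that the method can use Bert pre-training on a small-sample corpus to improve entity recognition.

For entity alignment and knowledge fusion, the experiment automatically extracts the most frequent named entities in each entity category; these entities appear regularly in APT attack events. Common APT groups include "APT29", "APT32", "APT28" and "Turla"; common attack tools include "PowerShell", "Cobalt Strike" and "Mimikatz"; common attack methods include "Spearphishing", "C2", "Watering Hole Attack" and "Backdoor"; common vulnerabilities include "CVE-2017-11882", "CVE-2017-0199" and "CVE-2012-0158". APT group names are fused using corpus titles and keywords, and knowledge infoboxes are built for the common APT groups in the dataset, giving structured knowledge for each group. The attack domain knowledge of APT28 and APT32 is presented in detail.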
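The reported F1-scores are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The sketch below checks that and also shows one common way to score BIO-tagged output at the entity level. Whether the paper scores whole entities or individual tokens is not stated in the abstract, so the span-matching part is an assumption, and all function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def extract_spans(labels):
    """Turn a BIO label sequence into a set of (start, end, category) spans."""
    spans = set()
    start, category = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and label[2:] == category
        if start is not None and not continues:
            spans.add((start, i, category))
            start, category = None, None
        if label.startswith("B-"):
            start, category = i, label[2:]
    return spans

def precision_recall_f1(gold, pred):
    """Entity-level scores: a predicted span counts only if it matches exactly."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall, f1_score(precision, recall)

# Values reported in the abstract.
print(round(f1_score(0.9296, 0.8733), 4))  # 0.9006 (full annotation)
print(round(f1_score(0.7800, 0.5894), 4))  # 0.6714 (small-sample annotation)
```

Under exact span matching, a prediction that truncates a two-token entity to one token counts as both a false positive and a false negative, which is why entity-level scores are usually lower than token-level ones.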

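Conclusion: Based on the characteristics of APT attacks, this paper designs and implements an automatic APT attack knowledge extraction method that combines entity recognition and entity alignment. The method identifies APT attack entities effectively, extracts advanced persistent threat knowledge automatically with only a small number of annotated samples, and generates structured feature profiles of common APT groups, which will support later APT attack knowledge graph construction and attack attribution analysis.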

Abstract

Objectives: In the face of the complex and changing network security environment, how to fight against Advanced Persistent Threat (APT) attacks has become an urgent problem for the entire security community. The massive APT attack analysis reports and threat intelligence generated by security companies have significant research value. They can effectively provide information on APT organizations, thereby assisting in the traceability analysis of network attack events. To address the problem that APT analysis reports have not been fully utilized, and that automated methods to generate structured knowledge and construct feature portraits of hacker organizations are lacking, an automatic knowledge extraction method for APT attacks combining entity recognition and entity alignment is proposed. The proposed method can automatically extract entities from APT analysis reports and construct structured knowledge of APT organizations.

Methods: An automatic extraction method of APT attack knowledge that integrates entity recognition and entity alignment is designed. Firstly, 12 entity categories are designed according to the characteristics of APT attacks. Then, lowercase conversion, data cleaning, and data annotation are performed on the corpus through the preprocessing layer, and the preprocessed APT text sequence is represented as a vector. Secondly, the Bert model is built to pre-train the annotated corpus, encode each word, and generate the corresponding word vector. The BiLSTM model is then constructed to capture long-distance and contextual semantic features, and the attention mechanism is built to highlight key features and convert the vector sequence into an annotation probability matrix. Thirdly, the CRF algorithm is utilized to decode the relationship between the output predicted labels and generate the optimal label sequence. Finally, the entity alignment method based on semantic similarity and Birch is constructed, which improves the quality of the extracted APT attack knowledge through knowledge matching and merges it into the infobox of each APT organization.
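The CRF decoding step, which picks the best label sequence given per-token label scores and label-to-label transition scores, is commonly done with the Viterbi algorithm. The sketch below is a generic pure-Python version with made-up scores; in the described model the per-token scores would come from the BiLSTM and attention layers and the transition scores would be learned, and nothing here is taken from the authors' code.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence and its score.

    emissions[t][k]   : score of label k at token t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(labels)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            best_prev = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    best_last = max(range(n_labels), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[k] for k in path], score[best_last]

LABELS = ["O", "B-TOOL", "I-TOOL"]
# Illustrative scores for the tokens "used cobalt strike" (invented values).
EMISSIONS = [
    [2.0, 0.1, 0.3],
    [0.5, 1.5, 0.2],
    [1.0, 0.2, 0.9],
]
# O -> I-TOOL is heavily penalized; B-TOOL -> I-TOOL is rewarded.
TRANSITIONS = [
    [0.5, 0.0, -1e4],
    [-0.5, -1.0, 1.0],
    [0.0, -1.0, 0.5],
]
print(viterbi_decode(EMISSIONS, TRANSITIONS, LABELS))
```

Picking the best label per token independently would tag "strike" as `O`, since 1.0 beats 0.9. The transition scores let the decoder prefer `I-TOOL` after `B-TOOL`, which is the kind of label dependency the CRF layer is meant to capture.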

Results: In terms of entity recognition, the proposed APT attack entity recognition method is superior to the existing entity recognition methods (i.e., CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF). Its precision, recall, and F1-score are 0.9296, 0.8733, and 0.9006, respectively. Compared with CRF, the F1-score of the proposed model is increased by 14.32%. Compared with CNN-CRF, which integrates convolutional neural networks, the F1-score is increased by 6.92%. Compared with LSTM-CRF and BiLSTM-CRF, the F1-score is increased by 8.43% and 5.30%, respectively. Compared with GRU-CRF, the F1-score is increased by 8.74%. Compared with Bert-CRF, the F1-score is increased by 7.03%. In addition, the accuracy of the proposed model is 0.9004, which is 9.85% higher than the average of the other six models. The proposed model's training process is also more stable: the accuracy curve converges faster and achieves higher accuracy with fewer training batches, and the model's error converges faster over the training epochs with a smoother curve. Moreover, the proposed model has the best prediction effect on the "attack method" entity category, with an F1-score of 0.9275. On the one hand, a large number of entities exist in this category. On the other hand, this category of entities widely exists in semantic-rich APT attack events and has the action characteristics of attack behavior, which leads to a better recognition effect.

In terms of entity recognition with small-sample annotation, the proposed method's precision, recall, and F1-score are 0.7800, 0.5894, and 0.6714, respectively. Compared with the CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF models, the F1-score of the proposed model is improved by 27.42%, 18.78%, 23.62%, 13.25%, 14.88%, and 14.46%, respectively. This experiment demonstrates that the proposed method can perform pre-training on a small-sample corpus through the Bert model, thereby improving the effect of entity recognition.

In terms of entity alignment and knowledge fusion, the experiment automatically extracts the high-frequency named entities of each entity category, which often exist in APT attack events. For example, common APT organizations include "APT29", "APT32", "APT28", and "Turla"; common attack equipment includes "PowerShell", "Cobalt Strike", and "Mimikatz"; common attack methods include "Spearphishing", "C2", "Watering Hole Attack", and "Backdoor"; common vulnerabilities include "CVE-2017-11882", "CVE-2017-0199", and "CVE-2012-0158". The proposed method combines the corpus titles and keywords to carry out entity fusion of APT organization names. Finally, the infobox of common APT organizations in this dataset is constructed, and the structured knowledge of each APT organization is formed. The attack domain knowledge of APT28 and APT32 is shown in detail.
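To make the alignment and infobox step concrete, here is a small sketch that merges variant spellings of the same entity and groups the results per organization. It is a simplified stand-in, not the paper's method: it uses normalized string similarity from Python's `difflib` with greedy grouping instead of semantic similarity with Birch clustering, and the group names, category names, threshold, and records are all invented for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(mention):
    """Lowercase a mention and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())

def align(mentions, threshold=0.9):
    """Greedily group mentions whose normalized forms are similar enough.

    The first mention seen in each group acts as its canonical name.
    """
    clusters = []
    for mention in mentions:
        key = normalize(mention)
        for cluster in clusters:
            canonical = normalize(cluster[0])
            if SequenceMatcher(None, key, canonical).ratio() >= threshold:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters

def build_infobox(records, threshold=0.9):
    """Build {organization: {category: [canonical entity, ...]}} from
    (organization, category, mention) records."""
    grouped = defaultdict(lambda: defaultdict(list))
    for organization, category, mention in records:
        grouped[organization][category].append(mention)
    return {
        organization: {
            category: [cluster[0] for cluster in align(mentions, threshold)]
            for category, mentions in categories.items()
        }
        for organization, categories in grouped.items()
    }

# Invented toy records: the group-to-entity pairings are NOT from the paper.
RECORDS = [
    ("GroupA", "attack_tool", "Mimikatz"),
    ("GroupA", "attack_tool", "mimikatz"),
    ("GroupA", "attack_tool", "PowerShell"),
    ("GroupA", "vulnerability", "CVE-2017-11882"),
    ("GroupA", "vulnerability", "cve 2017 11882"),
    ("GroupB", "attack_tool", "Cobalt Strike"),
    ("GroupB", "attack_tool", "CobaltStrike"),
]
print(build_infobox(RECORDS))
```

String similarity alone is a weak signal for names like "APT28" and "APT29", which differ by one character but denote different groups; that is one reason a method based on semantic similarity plus additional context such as titles and keywords, as described in the abstract, is the more plausible choice for organization names.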

Conclusions: According to the characteristics of APT attacks, an automatic extraction method of APT attack knowledge based on entity recognition and entity alignment is designed and implemented. This method can effectively identify APT attack entities, automatically extract advanced persistent threat knowledge under the condition of few-sample annotation, and generate structured feature portraits of common APT organizations, which will provide support for subsequent APT attack knowledge graph construction and attack traceability analysis.
