基于信息熵与遗传算法的并行关联规则增量挖掘算法

毛伊敏; 邓千虎; 陈志刚

doi:10.11959/j.issn.1000-436x.2021052

您当前的位置：

首页 >

文章列表页 >

基于信息熵与遗传算法的并行关联规则增量挖掘算法

学术论文 | 更新时间：2024-06-06

- 基于信息熵与遗传算法的并行关联规则增量挖掘算法
- Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm
- 通信学报 2021年42卷第5期页码：122-136
- 作者机构：
  
  1. 江西理工大学信息工程学院，江西赣州 341000
  2. 中南大学计算机学院，湖南长沙 410083
- 作者简介：
  
  [ "毛伊敏（1970- ），女，新疆伊犁人，博士，江西理工大学教授、硕士生导师，主要研究方向为数据挖掘、大数据安全与隐私保护" ]
  [ "邓千虎（1998- ），男，湖北天门人，江西理工大学硕士生，主要研究方向为数据挖掘、大数据" ]
  [ "陈志刚（1964- ），男，湖南益阳人，博士，中南大学教授、博士生导师，主要研究方向为网络与分布式计算、机会网络" ]
- 基金信息：
  
  国家自然科学基金资助项目(41562019);国家自然科学基金资助项目(61762046);国家重点研发计划基金资助项目(2018YFC1504705)
- DOI：10.11959/j.issn.1000-436x.2021052
  中图分类号： TP311
- 网络出版日期：2021-05，
  
  纸质出版日期：2021-05-25
- 稿件说明：
移动端阅览
毛伊敏, 邓千虎, 陈志刚. 基于信息熵与遗传算法的并行关联规则增量挖掘算法[J]. 通信学报, 2021,42(5):122-136.

Yimin MAO, Qianhu DENG, Zhigang CHEN. Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm[J]. Journal on communications, 2021, 42(5): 122-136.
毛伊敏, 邓千虎, 陈志刚. 基于信息熵与遗传算法的并行关联规则增量挖掘算法[J]. 通信学报, 2021,42(5):122-136. DOI： 10.11959/j.issn.1000-436x.2021052.

Yimin MAO, Qianhu DENG, Zhigang CHEN. Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm[J]. Journal on communications, 2021, 42(5): 122-136. DOI： 10.11959/j.issn.1000-436x.2021052.

摘要

针对大数据环境下基于Can树的增量关联规则算法存在树结构空间占用过大、支持度阈值无法动态设置以及Map与Reduce阶段数据传输耗时等问题，提出了一种基于信息熵和遗传算法的并行关联规则增量挖掘算法MR-PARIMIEG。首先，该算法设计基于信息熵的相似项合并策略（SIM-IE）来合并相似数据项，并根据合并后的数据集进行Can树构造，从而减少树结构的空间占用；其次，提出基于遗传算法的DST-GA策略获取大数据环境下相对最优的动态支持度阈值，根据此阈值进行频繁项集挖掘，避免了冗余的频繁模式挖掘导致的时间消耗；最后，在MapReduce并行化运算过程中使用并行LZO数据压缩算法对Map端输出数据进行压缩，从而减少传输的数据规模，最终提升算法的运行速度。实验仿真结果表明，MR-PARIMIEG在大数据环境下进行频繁项集挖掘时具有较好的性能表现，适用于对较大规模的数据集进行并行化处理。

Abstract

Aiming at the problems that in the big data environment

the Can-tree based incremental association rule algorithm had problems such as too much space occupation of the tree structure

inability to dynamically set the support threshold

and too much time consumption during the data transfer process between the Map and Reduce stages

the Map Reduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm (MR-PARIMIEG)was proposed.Firstly

a similar items merging based on information entropy (SIM-IE) was designed to merge similar data items

and a Can tree based on the merged data set was constructed

thereby reducing the space occupation of the tree structure.Secondly

the dynamic support threshold obtaining using genetic algorithm (DST-GA) was proposed to obtain the relatively optimal dynamic support threshold in the big data environment

and frequent itemset mining was performed according to this threshold to avoid the unnecessary time consumption caused by mining redundant frequent patterns.Finally

in the process of MapReduce parallel operation

the parallel LZO data compression algorithm was used to compress the output data of the Map stage

thereby reducing the size of the transmitted data

and finally improving the running speed of the algorithm.Experimental simulation results show that MR-PARIMIEG has better performance when mining frequent item sets in the big data environment

and it is suitable for parallel processing of larger data sets.

关键词

Keywords

references

HAND D J , ADAMS N M . Data mining [EB ] .(2015-06-22)[2020-11-13 ] .

KAMSU-FOGUEM B , RIGAL F , MAUGET F . Mining association rules for the quality improvement of the production process [J ] . Expert Systems WithApplications , 2013 , 40 ( 4 ): 1034 - 1045 .

SÁNCHEZ D , VILA M A , CERDA L , et al . Association rules applied to credit card fraud detection [J ] . Expert Systems WithApplications , 2009 , 36 ( 2 ): 3630 - 3640 .

BHANDARI A , GUPTA A , DAS D . Improvised apriori algorithm using frequent pattern tree for real time applications in data mining [J ] . Procedia Computer Science , 2015 , 46 : 644 - 651 .

ZHANG W , LIAO H Z , ZHAO N . Research on the FP growth algorithm about association rule mining [C ] // 2008 International Seminar on Business and Information Management . Piscataway:IEEE Press , 2008 : 315 - 318 .

LI Z F , LIU X F , CAO X . A study on improved eclat data mining algorithm [J ] . Advanced Materials Research , 2011 , 328 : 1896 - 1899 .

LEUNG C K S , KHAN Q I , HOQUE T . CanTree:a tree structure for efficient incremental mining of frequent patterns [C ] // Fifth IEEE International Conference on Data Mining . Piscataway:IEEE Press , 2005 : 1 - 8 .

KUSUMAKUMARI V , SHERIGAR D , CHANDRAN R , et al . Frequent pattern mining on stream data using Hadoop CanTree-GTree [J ] . Procedia Computer Science , 2017 , 115 : 266 - 273 .

DEAN J , GHEMAWAT S . MapReduce:simplified data processing on large clusters [J ] . Communications of the ACM , 2008 , 51 ( 1 ): 107 - 113 .

SONG Y G , CUI H M , FENG X B . Parallel incremental frequent itemset mining for large data [J ] . Journal of Computer Science and Technology , 2017 , 32 ( 2 ): 368 - 385 .

胡军 , 潘皓安 . 基于Can树的关联规则增量更新算法改进 [J ] . 重庆邮电大学学报(自然科学版) , 2018 , 30 ( 4 ): 558 - 563 .

HU J , PAN H A . Improved incremental updating algorithm of association rules based on Can-tree [J ] . Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) , 2018 , 30 ( 4 ): 558 - 563 .

RAGAVENTHIRAN J , KAVITHADEVI M K . Map-optimize-reduce:CAN tree assisted FP-growth algorithm for clusters based FP mining on Hadoop [J ] . Future Generation Computer Systems , 2020 , 103 : 111 - 122 .

申玲艳 . MapReduce计算模式的性能优化设计及其应用 [J ] . 信息与电脑(理论版) , 2016 ( 14 ): 49 - 50 .

SHEN L Y . MapReduce computing model designed to performance optimization and applications [J ] . China Computer ＆ Communication , 2016 ( 14 ): 49 - 50 .

CAO Y , WANG H . Communication optimisation for intermediate data of MapReduce computing model [J ] . International Journal of Computational Science and Engineering , 2020 , 21 ( 2 ): 226 - 233 .

KIM J , HWANG B . Real-time stream data mining based on CanTree and Gtree [J ] . Information Sciences , 2016 , 367 : 512 - 528 .

LIANG J , SHI Z , LI D , et al . Information entropy,rough entropy and knowledge granulation in incomplete information systems [J ] . International Journal of General Systems , 2006 , 35 ( 6 ): 641 - 654 .

KOHLAS J , MONNEY P A . A mathematical theory of hints:an approach to the Dempster-Shafer theory of evidence [M ] . Berlin : Springer Science ＆ Business Media , 2013 .

KANE J , YANG Q . Compression speed enhancements to LZO for multi-core systems [C ] // 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing . Piscataway:IEEE Press , 2012 : 108 - 115 .

METAWA N , HASSAN M K , ELHOSENY M . Genetic algorithm based model for optimizing bank lending decisions [J ] . Expert Systems With Applications , 2017 , 80 : 75 - 82 .

BART G . Frequent itemset mining dataset repository [EB ] .(2020-03-01)[2020-11-13 ] .

浏览量

461

下载量

CSCD

文章被引用时，请邮件提醒。

提交

工具集

关联资源

Massive MIMO中通信高效的分布式预编码设计

ERDOF：基于相对熵权密度离群因子的离群点检测算法

基于Storm平台的数据迁移合并节能策略

大数据环境下差分隐私保护技术及应用

基于Shannon信息熵与BP神经网络的隐私数据度量与分级模型