浏览全部资源
扫码关注微信
1. 江西理工大学信息工程学院,江西 赣州 341000
2. 中南大学计算机学院,湖南 长沙 410083
[ "毛伊敏(1970- ),女,新疆伊犁人,博士,江西理工大学教授、硕士生导师,主要研究方向为数据挖掘、大数据安全与隐私保护" ]
[ "邓千虎(1998- ),男,湖北天门人,江西理工大学硕士生,主要研究方向为数据挖掘、大数据" ]
[ "陈志刚(1964- ),男,湖南益阳人,博士,中南大学教授、博士生导师,主要研究方向为网络与分布式计算、机会网络" ]
网络出版日期:2021-05,
纸质出版日期:2021-05-25
移动端阅览
毛伊敏, 邓千虎, 陈志刚. 基于信息熵与遗传算法的并行关联规则增量挖掘算法[J]. 通信学报, 2021,42(5):122-136.
Yimin MAO, Qianhu DENG, Zhigang CHEN. Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm[J]. Journal on communications, 2021, 42(5): 122-136.
毛伊敏, 邓千虎, 陈志刚. 基于信息熵与遗传算法的并行关联规则增量挖掘算法[J]. 通信学报, 2021,42(5):122-136. DOI: 10.11959/j.issn.1000-436x.2021052.
Yimin MAO, Qianhu DENG, Zhigang CHEN. Parallel association rules incremental mining algorithm based on information entropy and genetic algorithm[J]. Journal on communications, 2021, 42(5): 122-136. DOI: 10.11959/j.issn.1000-436x.2021052.
针对大数据环境下基于Can树的增量关联规则算法存在树结构空间占用过大、支持度阈值无法动态设置以及Map与Reduce阶段数据传输耗时等问题,提出了一种基于信息熵和遗传算法的并行关联规则增量挖掘算法MR-PARIMIEG。首先,该算法设计基于信息熵的相似项合并策略(SIM-IE)来合并相似数据项,并根据合并后的数据集进行Can树构造,从而减少树结构的空间占用;其次,提出基于遗传算法的DST-GA策略获取大数据环境下相对最优的动态支持度阈值,根据此阈值进行频繁项集挖掘,避免了冗余的频繁模式挖掘导致的时间消耗;最后,在MapReduce并行化运算过程中使用并行LZO数据压缩算法对Map端输出数据进行压缩,从而减少传输的数据规模,最终提升算法的运行速度。实验仿真结果表明,MR-PARIMIEG在大数据环境下进行频繁项集挖掘时具有较好的性能表现,适用于对较大规模的数据集进行并行化处理。
Aiming at the problems that in the big data environment
the Can-tree based incremental association rule algorithm had problems such as too much space occupation of the tree structure
inability to dynamically set the support threshold
and too much time consumption during the data transfer process between the Map and Reduce stages
the Map Reduce-based parallel association rules incremental mining algorithm using information entropy and genetic algorithm (MR-PARIMIEG)was proposed.Firstly
a similar items merging based on information entropy (SIM-IE) was designed to merge similar data items
and a Can tree based on the merged data set was constructed
thereby reducing the space occupation of the tree structure.Secondly
the dynamic support threshold obtaining using genetic algorithm (DST-GA) was proposed to obtain the relatively optimal dynamic support threshold in the big data environment
and frequent itemset mining was performed according to this threshold to avoid the unnecessary time consumption caused by mining redundant frequent patterns.Finally
in the process of MapReduce parallel operation
the parallel LZO data compression algorithm was used to compress the output data of the Map stage
thereby reducing the size of the transmitted data
and finally improving the running speed of the algorithm.Experimental simulation results show that MR-PARIMIEG has better performance when mining frequent item sets in the big data environment
and it is suitable for parallel processing of larger data sets.
HAND D J , ADAMS N M . Data mining [EB ] .(2015-06-22)[2020-11-13 ] .
KAMSU-FOGUEM B , RIGAL F , MAUGET F . Mining association rules for the quality improvement of the production process [J ] . Expert Systems WithApplications , 2013 , 40 ( 4 ): 1034 - 1045 .
SÁNCHEZ D , VILA M A , CERDA L , et al . Association rules applied to credit card fraud detection [J ] . Expert Systems WithApplications , 2009 , 36 ( 2 ): 3630 - 3640 .
BHANDARI A , GUPTA A , DAS D . Improvised apriori algorithm using frequent pattern tree for real time applications in data mining [J ] . Procedia Computer Science , 2015 , 46 : 644 - 651 .
ZHANG W , LIAO H Z , ZHAO N . Research on the FP growth algorithm about association rule mining [C ] // 2008 International Seminar on Business and Information Management . Piscataway:IEEE Press , 2008 : 315 - 318 .
LI Z F , LIU X F , CAO X . A study on improved eclat data mining algorithm [J ] . Advanced Materials Research , 2011 , 328 : 1896 - 1899 .
LEUNG C K S , KHAN Q I , HOQUE T . CanTree:a tree structure for efficient incremental mining of frequent patterns [C ] // Fifth IEEE International Conference on Data Mining . Piscataway:IEEE Press , 2005 : 1 - 8 .
KUSUMAKUMARI V , SHERIGAR D , CHANDRAN R , et al . Frequent pattern mining on stream data using Hadoop CanTree-GTree [J ] . Procedia Computer Science , 2017 , 115 : 266 - 273 .
DEAN J , GHEMAWAT S . MapReduce:simplified data processing on large clusters [J ] . Communications of the ACM , 2008 , 51 ( 1 ): 107 - 113 .
SONG Y G , CUI H M , FENG X B . Parallel incremental frequent itemset mining for large data [J ] . Journal of Computer Science and Technology , 2017 , 32 ( 2 ): 368 - 385 .
胡军 , 潘皓安 . 基于Can树的关联规则增量更新算法改进 [J ] . 重庆邮电大学学报(自然科学版) , 2018 , 30 ( 4 ): 558 - 563 .
HU J , PAN H A . Improved incremental updating algorithm of association rules based on Can-tree [J ] . Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) , 2018 , 30 ( 4 ): 558 - 563 .
RAGAVENTHIRAN J , KAVITHADEVI M K . Map-optimize-reduce:CAN tree assisted FP-growth algorithm for clusters based FP mining on Hadoop [J ] . Future Generation Computer Systems , 2020 , 103 : 111 - 122 .
申玲艳 . MapReduce计算模式的性能优化设计及其应用 [J ] . 信息与电脑(理论版) , 2016 ( 14 ): 49 - 50 .
SHEN L Y . MapReduce computing model designed to performance optimization and applications [J ] . China Computer & Communication , 2016 ( 14 ): 49 - 50 .
CAO Y , WANG H . Communication optimisation for intermediate data of MapReduce computing model [J ] . International Journal of Computational Science and Engineering , 2020 , 21 ( 2 ): 226 - 233 .
KIM J , HWANG B . Real-time stream data mining based on CanTree and Gtree [J ] . Information Sciences , 2016 , 367 : 512 - 528 .
LIANG J , SHI Z , LI D , et al . Information entropy,rough entropy and knowledge granulation in incomplete information systems [J ] . International Journal of General Systems , 2006 , 35 ( 6 ): 641 - 654 .
KOHLAS J , MONNEY P A . A mathematical theory of hints:an approach to the Dempster-Shafer theory of evidence [M ] . Berlin : Springer Science & Business Media , 2013 .
KANE J , YANG Q . Compression speed enhancements to LZO for multi-core systems [C ] // 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing . Piscataway:IEEE Press , 2012 : 108 - 115 .
METAWA N , HASSAN M K , ELHOSENY M . Genetic algorithm based model for optimizing bank lending decisions [J ] . Expert Systems With Applications , 2017 , 80 : 75 - 82 .
BART G . Frequent itemset mining dataset repository [EB ] .(2020-03-01)[2020-11-13 ] .
0
浏览量
461
下载量
0
CSCD
关联资源
相关文章
相关作者
相关机构