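1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, Jiangxi, China
2. School of Computer Science, Central South University, Changsha 410083, Hunan, China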
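Yimin MAO (1970- ), female, born in Yili, Xinjiang, Ph.D., professor and doctoral supervisor at Jiangxi University of Science and Technology. Her main research interests include data mining and big data security and privacy protection.
Dejin GAN (1997- ), male, born in Fuzhou, Jiangxi, master's student at Jiangxi University of Science and Technology. His main research interests include data mining and big data.
Liefa LIAO (1975- ), male, born in Yushan, Jiangxi, Ph.D., professor and master's supervisor at Jiangxi University of Science and Technology. His main research interests include artificial intelligence.
Zhigang CHEN (1964- ), male, born in Yiyang, Hunan, Ph.D., professor and doctoral supervisor at Central South University. His main research interests include networks and distributed computing, and opportunistic networks.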
Online publication date: 2022-03
Print publication date: 2022-03-25
Yimin MAO, Dejin GAN, Liefa LIAO, et al. Parallel division clustering algorithm based on Spark framework and ASPSO[J]. Journal on Communications, 2022, 43(3): 148-163. DOI: 10.11959/j.issn.1000-436x.2022054.
Partition clustering algorithms that process massive data suffer from a large data dispersion coefficient and poor anti-interference, difficulty in determining the number of local clusters, randomness of local cluster centroids, and low efficiency when merging local clusters in parallel. To address these problems, a parallel partition clustering algorithm based on the Spark framework and an adaptive strategy of particle swarm optimization (ASPSO), named PDC-SFASPSO, was proposed. Firstly, a grid division strategy based on the Pearson correlation coefficient and variance was proposed to obtain grid cells with a small data dispersion coefficient and to filter outliers, which reduced the data dispersion coefficient and improved anti-interference. Secondly, a grid division strategy based on a potential function and a Gaussian function was proposed, which formed regions with different sample points as the cores of clusters and obtained the number of local clusters. Then, ASPSO was proposed to obtain the local cluster centroids and avoid their randomness. Finally, a local cluster merging strategy based on cluster radius and neighbor nodes was introduced to merge highly similar clusters on the Spark parallel computing framework, which improved the efficiency of parallel merging of local clusters. Experimental results showed that the PDC-SFASPSO algorithm performed well in partition clustering in a big data environment and was suitable for parallel clustering of large-scale data sets.
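The abstract does not state how similarity between local clusters is measured, so the following Python sketch is only an illustration of a radius-based merge, not the authors' method. It assumes (hypothetically) that two local clusters are similar enough to merge when the distance between their centroids does not exceed the sum of their radii, and it groups such clusters with a union-find structure. The names `LocalCluster`, `should_merge`, and `merge_local_clusters` are invented for this example, and the neighbor-node component of the published strategy is not modeled.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LocalCluster:
    centroid: Tuple[float, ...]  # cluster center
    radius: float                # e.g. largest centroid-to-member distance
    size: int                    # number of points in the cluster


def should_merge(a: LocalCluster, b: LocalCluster) -> bool:
    # ASSUMPTION: clusters are "similar" when their hyperspheres touch or overlap.
    return math.dist(a.centroid, b.centroid) <= a.radius + b.radius


def merge_local_clusters(clusters: List[LocalCluster]) -> List[LocalCluster]:
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of clusters that satisfies the assumed criterion.
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if should_merge(clusters[i], clusters[j]):
                parent[find(j)] = find(i)

    # Collect the members of each connected group.
    groups: Dict[int, List[LocalCluster]] = {}
    for idx, cluster in enumerate(clusters):
        groups.setdefault(find(idx), []).append(cluster)

    merged = []
    for members in groups.values():
        total = sum(c.size for c in members)
        dim = len(members[0].centroid)
        # Size-weighted mean of the member centroids.
        centroid = tuple(
            sum(c.centroid[d] * c.size for c in members) / total
            for d in range(dim)
        )
        # Smallest radius around the new centroid that covers every member sphere.
        radius = max(math.dist(centroid, c.centroid) + c.radius for c in members)
        merged.append(LocalCluster(centroid, radius, total))
    return merged
```

The pairwise loop here is sequential. In a Spark job the comparisons would typically be distributed across partitions, which this sketch does not attempt to show.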