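1. Computer Center, Peking University, Beijing 100871, China
2. Peking University Changsha Institute for Computing and Digital Economy, Changsha 410000, Hunan, China
3. Huawei Technologies Co., Ltd., Nanjing 210012, Jiangsu, China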
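LONG Tingting (born 1997), male, from Shaoyang, Hunan; assistant engineer at Peking University. Main research interest: high-performance computing.
FU Zhenxin (born 1995), male, from Tianjin; engineer at Peking University. Main research interest: high-performance computing.
LI Ruomiao (born 1988), male, from Chengdu, Sichuan; senior engineer at Peking University. Main research interest: high-performance computing.
GONG Xiangyu (born 1982), male, from Wuxi, Jiangsu; technical expert at Huawei Technologies Co., Ltd. Main research interest: data center networks.
WU Tao (born 1985), male, from Nanjing, Jiangsu; technical expert at Huawei Technologies Co., Ltd. Main research interests: data center networks and network-computing collaboration.
FAN Chun (born 1977), male, from Chongqing; professor-level senior engineer at Peking University. Main research interest: high-performance computing.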
Received: 2024-10-24
Print publication date: 2024-11-30
LONG Tingting, FU Zhenxin, LI Ruomiao, et al. Performance evaluation of lossless RoCEv2 network in HPC cluster[J]. Journal on Communications, 2024, 45(Z2): 113-121. DOI: 10.11959/j.issn.1000-436x.2024244.
To evaluate how lossless RoCEv2 network technology performs in practice in high-performance computing (HPC), an HPC cluster was built with three networks under test: lossless RoCEv2, TCP/IP, and InfiniBand. Mainstream HPC benchmarks and scientific computing applications were used to compare the three networks, yielding both their basic performance data and their actual behavior in scientific computing scenarios. The HPL efficiency of a 240-node cluster based on RoCEv2 was also tested. The experimental results show that, in supercomputing clusters, lossless RoCEv2 performs roughly on par with InfiniBand, and both significantly outperform the TCP network. As the number of cluster nodes increases, the RoCEv2 network shows good linear scalability. Compared with InfiniBand, the lossless RoCEv2 network retains a cost advantage while offering approximately equivalent performance.
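The abstract reports HPL efficiency and linear scalability without stating how they are computed. The sketch below uses the conventional definitions (measured HPL performance divided by theoretical peak, and per-node throughput of a larger run relative to a smaller one). The function names and every number in the usage lines are hypothetical placeholders chosen for illustration; none of them are results from the paper.

```python
def hpl_efficiency(rmax_tflops: float, nodes: int, peak_per_node_tflops: float) -> float:
    """Ratio of measured HPL performance (Rmax) to theoretical peak (Rpeak)."""
    rpeak_tflops = nodes * peak_per_node_tflops
    return rmax_tflops / rpeak_tflops


def scaling_efficiency(rmax_small: float, nodes_small: int,
                       rmax_large: float, nodes_large: int) -> float:
    """Per-node throughput of the larger run relative to the smaller run.

    A value of 1.0 corresponds to ideal linear scaling.
    """
    return (rmax_large / nodes_large) / (rmax_small / nodes_small)


if __name__ == "__main__":
    # Placeholder inputs, NOT measurements from the paper.
    eff = hpl_efficiency(rmax_tflops=540.0, nodes=240, peak_per_node_tflops=3.0)
    scale = scaling_efficiency(rmax_small=36.0, nodes_small=16,
                               rmax_large=540.0, nodes_large=240)
    print(f"HPL efficiency: {eff:.1%}")
    print(f"Scaling efficiency (16 -> 240 nodes): {scale:.2f}")
```

Comparing the scaling ratio across RoCEv2, InfiniBand, and TCP/IP runs at the same node counts is one simple way to put the three networks on a common footing.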