Deduplication algorithm based on condensed nearest neighbor rule for deduplication metadata

Wen-bin YAO; Peng-di YE; Xiao-yong LI; Jing-kun CHANG

doi:10.11959/j.issn.1000-436x.2015226

Research article | Updated: 2024-06-05

- Chinese title: 基于压缩近邻的查重元数据去冗算法设计
- Deduplication algorithm based on condensed nearest neighbor rule for deduplication metadata
- Journal on Communications (通信学报), 2015, Vol. 36, No. 8, pp. 1-7
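- Author affiliations:
  
  1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876
  2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876
  3. Locomotive and Car Research Institute, China Academy of Railway Sciences, Beijing 100081
  4. Key Laboratory of Trustworthy Distributed Computing and Service (Ministry of Education), Beijing University of Posts and Telecommunications, Beijing 100876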
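- Author biographies:
  
  - Wen-bin YAO (1972-), male, born in Harbin, Heilongjiang; professor and doctoral supervisor at Beijing University of Posts and Telecommunications. Research interests: disaster backup and recovery technology, information security, and trusted computing.
  - Peng-di YE (1986-), male, born in Taizhou, Zhejiang; assistant researcher at China Academy of Railway Sciences. Research interest: train network control.
  - Xiao-yong LI (1975-), male, born in Tianshui, Gansu; associate professor at Beijing University of Posts and Telecommunications. Research interests: distributed computing, analysis and preprocessing of network data, trusted computing, and network security.
  - Jing-kun CHANG (1988-), male, born in Jiaozuo, Henan; Ph.D. candidate at Beijing University of Posts and Telecommunications. Research interests: service computing and cloud disaster backup.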
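- Funding:
  
  National Natural Science Foundation of China (61370069); National High Technology Research and Development Program of China ("863" Program) (2012AA012600); Fundamental Research Funds for the Central Universities (BUPT2011RCZJ16)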
- DOI: 10.11959/j.issn.1000-436x.2015226
  Chinese Library Classification (CLC) number: TP391
- Published online: 2015-08
  
  Print publication: 2015-08-25

Wen-bin YAO, Peng-di YE, Xiao-yong LI, et al. Deduplication algorithm based on condensed nearest neighbor rule for deduplication metadata[J]. Journal on Communications, 2015, 36(8): 1-7. DOI: 10.11959/j.issn.1000-436x.2015226.

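Abstract (translated from the Chinese)

As the number of deduplication runs grows, metadata such as the manifest files that store fingerprint indexes keeps accumulating in the system and causes a storage overhead that cannot be ignored. Compressing the metadata produced during deduplication, and thereby shrinking the deduplication index, without affecting the deduplication ratio is therefore an important factor in further improving deduplication efficiency and storage utilization. Because deduplication metadata contains a large amount of redundant data, a redundancy-removal algorithm for deduplication metadata based on the condensed nearest neighbor rule, called Dedup, is proposed. The algorithm first uses a clustering algorithm to divide the deduplication metadata into several classes, then uses the condensed nearest neighbor algorithm to eliminate highly similar data in the metadata and obtain a deduplication subset, and finally deduplicates data objects against this subset using file similarity. Experimental results show that Dedup can compress the deduplication index by more than 50% while keeping the deduplication ratio approximately unchanged.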

Abstract

Building an effective deduplication index in memory can reduce the number of disk accesses and speed up chunk fingerprint lookup, which is a major challenge for deduplication algorithms in massive-data environments. Because the deduplication data set contains many highly similar samples, a deduplication algorithm based on the condensed nearest neighbor rule, called Dedup, is proposed. Dedup uses a clustering algorithm to divide the original deduplication metadata into several categories. Based on these categories, it applies the condensed nearest neighbor rule to remove highly similar data from the deduplication metadata, which yields a subset of the metadata. New data objects are then deduplicated against this subset according to the principle of data similarity. Experimental results show that Dedup can reduce the size of the deduplication data set by more than 50% while maintaining a similar deduplication ratio.
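The condensing step described in the abstract can be illustrated with a minimal sketch of Hart's condensed nearest neighbor rule. This is not the paper's implementation: it assumes, purely for illustration, that each metadata entry is an integer fingerprint compared by Hamming distance and that cluster ids have already been assigned by some clustering step. The names `hamming`, `nearest_label`, `condense`, and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[int, int]  # (fingerprint, cluster_id); both are illustrative assumptions


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def nearest_label(x: int, store: Sequence[Sample]) -> int:
    """Cluster id of the stored sample closest to x (1-NN; first match wins ties)."""
    return min(store, key=lambda s: hamming(x, s[0]))[1]


def condense(samples: Sequence[Sample]) -> List[Sample]:
    """Condensed nearest neighbor rule: keep only the samples that the
    current subset would misclassify, repeating passes until none is added."""
    if not samples:
        return []
    store = [samples[0]]
    in_store = [True] + [False] * (len(samples) - 1)
    changed = True
    while changed:
        changed = False
        for i, (fp, label) in enumerate(samples):
            if in_store[i]:
                continue
            # A sample already classified correctly by the subset is redundant.
            if nearest_label(fp, store) != label:
                store.append((fp, label))
                in_store[i] = True
                changed = True
    return store


# Six hypothetical fingerprints in two clusters of near-identical entries.
samples = [
    (0b00000000, 0), (0b00000001, 0), (0b00000011, 0),
    (0b11110000, 1), (0b11110001, 1), (0b11100000, 1),
]
subset = condense(samples)
print(len(samples), "->", len(subset))
```

In this toy input every sample is still assigned to its own cluster by a 1-NN lookup over the condensed subset, which is the property the rule preserves. How closely the retained subset tracks the original deduplication ratio on real metadata depends on the clustering and the similarity measure, which are not specified here.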
