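1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, Jiangsu, China
2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing 210003, Jiangsu, China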
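Hai-yan WANG (1974-), female, born in Dongtai, Jiangsu, is a professor at Nanjing University of Posts and Telecommunications. Her main research interests are services computing, trusted computing, big data applications and cloud computing technology, and privacy protection technology.
Pan CAO (1991-), male, born in Zhenjiang, Jiangsu, is a master's student at Nanjing University of Posts and Telecommunications. His main research interests are cloud computing and Internet of Things technology.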
Online publication date: 2016-10
Print publication date: 2016-10-25
Hai-yan WANG, Pan CAO. Information extraction from massive Web pages based on node property and text content[J]. Journal on Communications, 2016, 37(10): 9-17. DOI: 10.11959/j.issn.1000-436x.2016190.
To address the problem of extracting valuable information from massive numbers of Web pages in big data environments, an information extraction method based on node properties and text content is proposed. Web pages are converted into document object model (DOM) trees, and a pruning and fusion algorithm is introduced to simplify each tree. A density property and a visual property are defined for each DOM tree node, and Web page content is preprocessed according to these property values. A MapReduce framework is employed to extract information from massive Web pages in parallel. Simulation results show that the proposed method not only achieves better performance than other methods but also offers good system scalability.
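The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the definitions of the density and visual properties or the details of the pruning and fusion step, so everything below is an assumption made for illustration. Density is taken here to be text characters per tag in a subtree, pruning drops a hypothetical `NOISE_TAGS` set and empty nodes, fusion and the visual property are omitted, and the map and reduce steps are plain in-process functions standing in for a real MapReduce job.

```python
from html.parser import HTMLParser

# Assumption: subtrees under these tags never hold readable content.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements have no closing tag, so they are not added to the tree.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; unmatched closing tags are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text = (self.current.text + " " + data.strip()).strip()


def prune(node):
    """Remove noise subtrees and nodes left with neither text nor children."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        if child.text or child.children:
            kept.append(child)
    node.children = kept


def stats(node):
    """Return (text characters, tag count) for the subtree rooted at node."""
    chars, tags = len(node.text), 1
    for child in node.children:
        c, t = stats(child)
        chars, tags = chars + c, tags + t
    return chars, tags


def density(node):
    """Assumed density: text characters per tag in the subtree."""
    chars, tags = stats(node)
    return chars / tags


def extract(html, threshold=20.0):
    """Return the text of nodes whose density reaches the threshold."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    parts, stack = [], [builder.root]
    while stack:  # pre-order walk, so text comes out in document order
        node = stack.pop()
        if node.text and density(node) >= threshold:
            parts.append(node.text)
        stack.extend(reversed(node.children))
    return " ".join(parts)


def map_page(record):
    """Map step: one (url, html) record in, one (url, text) pair out."""
    url, html = record
    return url, extract(html)


def reduce_pages(pairs):
    """Reduce step: group extracted text by URL."""
    grouped = {}
    for url, text in pairs:
        grouped.setdefault(url, []).append(text)
    return grouped
```

Calling `reduce_pages(map(map_page, pages))` on a list of `(url, html)` tuples runs the whole sketch locally. Because `map_page` handles each page independently, the same function could in principle be distributed across workers, which is the property that makes the extraction step a natural fit for MapReduce. The threshold of 20 is an arbitrary illustrative value, not one taken from the paper.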