Information extraction from massive Web pages based on node property and text content

Hai-yan WANG; Pan CAO

doi:10.11959/j.issn.1000-436x.2016190

Papers | Updated: 2024-06-05

- Journal on Communications Vol. 37, Issue 10, Pages: 9-17 (2016)
- Author affiliations:
  
  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023
  2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing, Jiangsu 210003
- Funding:
  
  The National Natural Science Foundation of China (61201163); The National Natural Science Foundation of China (61672297); Six Talent Peaks Project in Jiangsu Province (2013-JY-022); 333 High Level Personnel Training Project in Jiangsu Province
- DOI: 10.11959/j.issn.1000-436x.2016190
  CLC: TP393.07
- Online First: 2016-10
  
  Published: 25 October 2016
Hai-yan WANG, Pan CAO. Information extraction from massive Web pages based on node property and text content[J]. Journal on Communications, 2016, 37(10): 9-17. DOI: 10.11959/j.issn.1000-436x.2016190.

Abstract

To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node properties and text content is proposed. Each Web page is converted into a document object model (DOM) tree, and a pruning and fusion algorithm is introduced to simplify the tree. For each node in the DOM tree, a density property and a vision property are defined, and the Web page content is preprocessed according to these property values. A MapReduce framework is employed to extract information from massive Web pages in parallel. Simulation results demonstrate that the proposed extraction method achieves both better performance and higher scalability than other methods.
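The pipeline outlined in the abstract (build a DOM tree, prune it, score nodes by a density property, then process many pages independently) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `NOISE_TAGS` set, the characters-per-tag definition of density, the `min_density` threshold, and the function names are all hypothetical. The fusion step and the vision property are not modelled, and the per-page `map` call only stands in for the map phase that a real MapReduce job would distribute.

```python
from html.parser import HTMLParser

# Tags pruned outright as noise (an illustrative assumption, not the paper's rule).
NOISE_TAGS = {"script", "style", "nav", "footer"}
# Void elements have no closing tag, so they never become the current node.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Build a small DOM tree; assumes well-formed, properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Remove noise subtrees and nodes left with neither text nor children."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        if child.children or child.text:
            kept.append(child)
    node.children = kept


def text_length(node):
    return len(node.text) + sum(text_length(c) for c in node.children)


def tag_count(node):
    return 1 + sum(tag_count(c) for c in node.children)


def density(node):
    """Characters of text per tag in the subtree (hypothetical density measure)."""
    return text_length(node) / tag_count(node)


def extract(html, min_density=10.0):
    """Return the text of nodes whose subtree density reaches the threshold."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    parts = []

    def walk(node):
        if node.text and density(node) >= min_density:
            parts.append(node.text)
        for child in node.children:
            walk(child)

    walk(builder.root)
    return " ".join(parts)


def extract_all(pages):
    """Map step over (url, html) pairs; each page is processed independently."""
    return dict(map(lambda item: (item[0], extract(item[1])), pages.items()))
```

Calling `extract_all` with a dictionary that maps page identifiers to HTML strings returns a dictionary of extracted text per page. Because each page is handled without shared state, the same per-page function could in principle serve as the mapper in a distributed job.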
