Information extraction from massive Web pages based on node property and text content
Papers | Updated: 2024-06-05
Journal on Communications, Vol. 37, Issue 10, Pages 9-17 (2016)
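Author affiliations:
1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023
2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks, Nanjing, Jiangsu 210003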
Funding:
The National Natural Science Foundation of China (61201163, 61672297); Six Talent Peaks Project in Jiangsu Province (2013-JY-022); 333 High Level Personnel Training Project in Jiangsu Province
Hai-yan WANG, Pan CAO. Information extraction from massive Web pages based on node property and text content[J]. Journal on Communications, 2016, 37(10): 9-17. DOI: 10.11959/j.issn.1000-436x.2016190.
Abstract:
To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node property and text content was put forward. Each Web page was converted into a document object model (DOM) tree, and a pruning and fusion algorithm was introduced to simplify the tree. For each node in the DOM tree, a density property and a vision property were defined, and Web pages were preprocessed based on these property values. A MapReduce framework was employed to extract information from massive Web pages in parallel. Simulation and experimental results demonstrate that the proposed method not only achieves better performance but also scales better than other methods.
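The pipeline outlined in the abstract (DOM tree, pruning, per-node density, parallel extraction) can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the density or vision definitions, so the sketch assumes a simple density of characters per tag in a subtree, uses the whole page's density as the threshold, omits the vision property and the fusion step, and stands in a thread pool for a MapReduce cluster. All names (`Node`, `TreeBuilder`, `prune`, `text_density`, `extract`, `run_job`, `NOISE_TAGS`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Assumption: subtrees under these tags are treated as noise.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements have no closing tag, so they never become the current node.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree under a synthetic 'root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.text = (self.current.text + " " + data).strip()


def prune(node):
    """Drop noise subtrees and empty leaves, in place."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        if child.children or child.text:
            kept.append(child)
    node.children = kept


def stats(node):
    """Return (character count, tag count) for the subtree rooted at node."""
    chars, tags = len(node.text), 1
    for child in node.children:
        c, t = stats(child)
        chars, tags = chars + c, tags + t
    return chars, tags


def text_density(node):
    # Assumed density definition: characters per tag in the subtree.
    chars, tags = stats(node)
    return chars / tags


def extract(html):
    """Keep the text of nodes whose density reaches the page-level density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    threshold = text_density(builder.root)
    parts = []

    def walk(node):
        if node.text and text_density(node) >= threshold:
            parts.append(node.text)
        for child in node.children:
            walk(child)

    walk(builder.root)
    return " ".join(parts)


def run_job(pages, workers=4):
    """Map: extract each page independently. Reduce: collect results by page id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pairs = pool.map(lambda item: (item[0], extract(item[1])), pages.items())
        return dict(pairs)


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/a">About</a></div>
<div id="main"><p>Web pages were converted into a DOM tree and then pruned.</p></div>
<script>var x = 1;</script>
</body></html>"""

if __name__ == "__main__":
    print(run_job({"page-1": SAMPLE}))
```

On the sample page, the short navigation links fall below the page-level density and are dropped, while the paragraph in the main block is kept. Because each page is processed independently in the map step, the same per-page function could in principle be handed to a real MapReduce runtime instead of a thread pool.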