Frequent-pattern discovering algorithm for large-scale corpus
Issue 12, Pages: 161-166(2007)
Author affiliations:
1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
2. Graduate School of the Chinese Academy of Sciences, Beijing 100039
CLC:TP301.6
Published:2007
GONG Cai-chun, HE Min, CHEN Hai-qiang, et al. Frequent-pattern discovering algorithm for large-scale corpus[J]. 2007, (12): 161-166.
Abstract
A memory-based frequent-pattern discovery algorithm for large-scale corpora is presented. First, the original corpus is partitioned into several sub-corpora using an appropriate dividing policy. Each partition is then processed independently to produce a temporary result, and the union of all temporary results is the final frequent-pattern set. The algorithm prunes a subtree as soon as it is certain that none of the corresponding patterns can be frequent, so patterns below the frequency threshold are never processed, which further reduces memory use and increases speed. Experiments show that discovering all patterns that appear more than 100 times in a 3.6 GB Internet news corpus takes no more than 1.6 GB of memory, and a single machine processes 3.28 MB of text per second on average.
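
The partition, process, and union steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the dividing policy, so the sketch assumes patterns are contiguous token sequences and that occurrences are partitioned by a hash of their first token. The names frequent_patterns, _mine_partition, min_count, max_len, and num_parts are all hypothetical. Under that assumption every occurrence of a given pattern lands in the same partition, so per-partition counts are exact and the union of the partial results is the final set. Pruning is modelled by never extending a pattern whose count is already below the threshold.

```python
from collections import defaultdict


def _mine_partition(tokens, positions, min_count, max_len):
    """Count frequent contiguous patterns that start at the given positions."""
    found = {}
    # Map each candidate pattern to the start positions where it occurs.
    groups = defaultdict(list)
    for i in positions:
        groups[(tokens[i],)].append(i)

    length = 1
    while groups and length <= max_len:
        next_groups = defaultdict(list)
        for pattern, starts in groups.items():
            if len(starts) < min_count:
                # Prune: an extension cannot occur more often than its prefix,
                # so nothing below this node can be frequent.
                continue
            found[pattern] = len(starts)
            for i in starts:
                j = i + length
                if j < len(tokens):
                    next_groups[pattern + (tokens[j],)].append(i)
        groups = next_groups
        length += 1
    return found


def frequent_patterns(tokens, min_count, max_len, num_parts=4):
    """Return {pattern: count} for patterns occurring at least min_count times."""
    # Assumed dividing policy: partition start positions by first token.
    parts = defaultdict(list)
    for i, tok in enumerate(tokens):
        parts[hash(tok) % num_parts].append(i)

    result = {}
    for positions in parts.values():
        # Each partition is mined independently; only one partition's
        # candidate table needs to be held in memory at a time.
        result.update(_mine_partition(tokens, positions, min_count, max_len))
    return result
```

For example, calling frequent_patterns("a b c a b d a b c".split(), min_count=2, max_len=3) keeps ("a", "b", "c") with a count of 2 and discards ("b", "d"), which occurs only once. Because the partitions share no patterns, changing num_parts trades peak memory against the number of passes without changing the result.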