GONG Cai-chun, HE Min, CHEN Hai-qiang, et al. Frequent-pattern discovering algorithm for large-scale corpus[J]. 2007, (12): 161-166.
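A Fast Frequent-Pattern Discovery Algorithm for Large-Scale Corpora
Abstract (translated from the Chinese)
A fast algorithm for discovering frequent patterns in large-scale corpora is proposed. The corpus is divided into several sub-corpora using a suitable strategy, and processing each sub-corpus separately yields the frequent patterns of the original corpus. The algorithm also avoids processing patterns whose frequency is below the set threshold, which further reduces memory usage and increases processing speed. Experiments show that discovering all frequent patterns with frequency greater than 100 in a 3.6 GB Internet news corpus consumes at most 1.6 GB of memory, and a single machine processes 3.28 MB of text per second on average.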
Abstract
A memory-based frequent-pattern discovery algorithm for large-scale corpora is presented. First, the original corpus is partitioned into several parts using an appropriate dividing policy. Each partition is then processed independently to produce a temporary result, and the union of all temporary results is the final frequent-pattern set. The algorithm prunes a subtree as soon as it is certain that none of the corresponding patterns can be frequent. Experiments show that discovering all patterns that appear more than 100 times in a 3.6-gigabyte news corpus takes no more than 1.6 gigabytes of memory.
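
The partition-then-union idea can be illustrated with a minimal Python sketch. The abstract does not give the dividing policy, the pattern definition, or the data structure, so everything here is an assumption made for illustration: patterns are taken to be character substrings of bounded length, start positions are routed to partitions by their first character (so every occurrence of a pattern lands in the same partition and the per-partition results can simply be merged), and pruning is done level by level rather than on an explicit tree. The names `frequent_patterns`, `mine_partition`, `threshold`, `max_len`, and `num_parts` are hypothetical.

```python
from collections import Counter, defaultdict


def mine_partition(text, positions, threshold, max_len):
    """Count patterns starting at the given positions, growing them one
    character at a time and dropping positions whose prefix is infrequent."""
    found = {}
    alive = positions
    for n in range(1, max_len + 1):
        counts = Counter(text[i:i + n] for i in alive if i + n <= len(text))
        frequent = {p: c for p, c in counts.items() if c > threshold}
        if not frequent:
            break
        found.update(frequent)
        # Pruning: a pattern cannot be frequent if its prefix is not,
        # so positions with an infrequent prefix are never extended.
        alive = [i for i in alive if text[i:i + n] in frequent]
    return found


def frequent_patterns(text, threshold, max_len, num_parts=4):
    """Return {pattern: count} for substrings of length 1..max_len that
    appear more than `threshold` times."""
    # Assumed dividing policy: route each start position by its first
    # character, so all occurrences of a pattern share one partition.
    parts = defaultdict(list)
    for i, ch in enumerate(text):
        parts[ord(ch) % num_parts].append(i)

    result = {}
    for positions in parts.values():
        # Each partition is mined on its own; only its counters are live.
        result.update(mine_partition(text, positions, threshold, max_len))
    return result


patterns = frequent_patterns("abcabcabx", threshold=1, max_len=3)
print(sorted(patterns.items()))
```

In this sketch the whole text is held in memory for simplicity; with a corpus of several gigabytes, each partition would presumably be written to disk and loaded one at a time, which is where the memory saving described in the abstract would come from.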