Frequent-pattern discovering algorithm for large-scale corpus

GONG Cai-chun1; HE Min1; CHEN Hai-qiang1; XU Hong-bo1; CHENG Xue-qi1

Updated: 2024-10-14

- Issue 12, Pages: 161-166(2007)
- Author affiliations:
  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
  2. Graduate University of Chinese Academy of Sciences, Beijing 100039
- CLC: TP301.6
- Published: 2007
GONG Cai-chun, HE Min, CHEN Hai-qiang, et al. Frequent-pattern discovering algorithm for large-scale corpus[J]. 2007, (12): 161-166.

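Abstract (translated from the Chinese)

A fast frequent-pattern discovery algorithm for large-scale corpora is proposed. By partitioning the corpus into several sub-corpora with a suitable strategy and processing each sub-corpus separately, the frequent patterns of the original corpus can be obtained. The algorithm also avoids processing patterns whose frequency falls below the set threshold, which further reduces memory usage and increases processing speed. Experiments show that discovering all frequent patterns occurring more than 100 times in a 3.6 GB Internet news corpus consumes at most 1.6 GB of memory, with an average single-machine throughput of 3.28 MB of text per second.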

Abstract

A memory-based frequent-pattern discovery algorithm for large-scale corpora is presented. First, the original corpus is partitioned into several parts using an appropriate dividing policy. Then each partition is processed independently to produce a temporary result, and the union of all temporary results is the final frequent-pattern set. The algorithm prunes a subtree once it is certain that none of the corresponding patterns will be frequent. Experiments show that it takes no more than 1.6 gigabytes of memory to discover all patterns appearing more than 100 times in a 3.6-gigabyte news corpus; the average speed is 3.28 megabytes per second.
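
The partition, process, and union steps described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not state the dividing policy, so the sketch assumes that pattern occurrences are partitioned by a hash of their first token, which puts every occurrence of a given pattern in the same partition and makes per-partition counts equal to corpus-wide counts. The names `frequent_patterns`, `mine_partition`, `min_count`, `num_parts`, and `max_len` are hypothetical, patterns are taken to be contiguous token sequences, and the threshold is applied as "at least `min_count`" rather than "more than".

```python
from collections import defaultdict
import zlib


def mine_partition(tokens, positions, min_count, max_len):
    """Grow patterns one token at a time from the given start positions.

    A pattern is extended only while it stays frequent, so an infrequent
    prefix cuts off every longer pattern below it (the pruned subtree).
    """
    found = {}
    # Each frontier entry is (pattern, start positions where it occurs).
    frontier = [((), positions)]
    for length in range(1, max_len + 1):
        next_frontier = []
        for pattern, starts in frontier:
            groups = defaultdict(list)
            for s in starts:
                end = s + length - 1
                if end < len(tokens):
                    groups[tokens[end]].append(s)
            for tok, occ in groups.items():
                if len(occ) >= min_count:
                    extended = pattern + (tok,)
                    found[extended] = len(occ)
                    next_frontier.append((extended, occ))
                # Groups below min_count are dropped and never extended.
        frontier = next_frontier
        if not frontier:
            break
    return found


def frequent_patterns(tokens, min_count, num_parts=4, max_len=5):
    """Return {pattern: count} for patterns occurring at least min_count times."""
    # Assumed dividing policy: bucket start positions by a hash of the first
    # token, so all occurrences of one pattern share a partition.
    parts = defaultdict(list)
    for i, tok in enumerate(tokens):
        parts[zlib.crc32(tok.encode("utf-8")) % num_parts].append(i)

    result = {}
    for positions in parts.values():
        # Partitions are independent; only one needs to be in memory at a time.
        result.update(mine_partition(tokens, positions, min_count, max_len))
    return result


if __name__ == "__main__":
    corpus = "a b a b a b c".split()
    for pattern, count in sorted(frequent_patterns(corpus, min_count=2).items()):
        print(" ".join(pattern), count)
```

Because each partition holds complete counts for its own patterns under this assumed policy, the threshold can be applied locally and the final set is a plain union of the per-partition results, with no merging of partial counts.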
