Source: https://github.com/chuanconggao/PrefixSpan-py

API Usage

Alternatively, you can use the algorithms via API.

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)

For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.

Setting length limits:

ps = PrefixSpan(db)
ps.minlen = 3
ps.maxlen = 5

Mining frequent and top-k patterns. The results listed below include patterns of length 1 and 2, so they correspond to the default length limits rather than to minlen = 3:

ps = PrefixSpan(db)

print(ps.frequent(2))
# [(2, [0]),
#  (4, [1]),
#  (3, [1, 2]),
#  (2, [1, 2, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4]),
#  (2, [1, 4]),
#  (2, [1, 1]),
#  (2, [1, 1, 1]),
#  (3, [2]),
#  (2, [2, 2]),
#  (2, [3]),
#  (2, [3, 4]),
#  (2, [4])]

print(ps.topk(5))
# [(4, [1]),
#  (3, [2]),
#  (3, [1, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4])]

print(ps.frequent(2, closed=True))
print(ps.topk(5, closed=True))
print(ps.frequent(2, generator=True))
print(ps.topk(5, generator=True))
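To make clearer what frequent(2) returns, here is a minimal pure-Python sketch of prefix-projected pattern growth. It is not the library's implementation: the function name mine_frequent and its internals are assumptions chosen for illustration. On the example database it should find the same 14 patterns as the list above, though in a different order.

```python
from collections import defaultdict


def mine_frequent(db, minsup):
    """Return (support, pattern) pairs for every sequential pattern
    that occurs in at least `minsup` sequences of `db`."""
    results = []

    def grow(prefix, projected):
        # `projected` holds (sequence index, start position) pairs: the part
        # of each supporting sequence that follows the current prefix.
        occurrences = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            for pos in range(start, len(db[seq_idx])):
                item = db[seq_idx][pos]
                if item not in seen:  # count each sequence once per item
                    seen.add(item)
                    occurrences[item].append((seq_idx, pos + 1))
        for item, next_projected in sorted(occurrences.items()):
            if len(next_projected) >= minsup:
                pattern = prefix + [item]
                results.append((len(next_projected), pattern))
                grow(pattern, next_projected)

    grow([], [(i, 0) for i in range(len(db))])
    return results


db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

patterns = mine_frequent(db, 2)
for sup, pattern in patterns:
    print(sup, pattern)
```

Keeping only the earliest occurrence of each item per sequence is what lets the projected database shrink at every step; the support of a pattern is simply the number of sequences left in its projection.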

Closed Patterns and Generator Patterns

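A frequent sequential pattern is a pattern that appears in at least minsup sequences of a sequence database, where minsup (the minimum support) is a parameter set by the user.

A frequent closed sequential pattern is a frequent sequential pattern that is not included in another sequential pattern having exactly the same support.

Algorithms such as PrefixSpan find frequent sequential patterns. Algorithms such as BIDE+ find frequent closed sequential patterns. BIDE+ is usually much faster than PrefixSpan because it uses pruning techniques to avoid generating all sequential patterns. In addition, the set of closed patterns is usually much smaller than the set of all sequential patterns, so BIDE+ is also more memory-efficient.

Another important point is that closed sequential patterns are a compact and lossless representation of all sequential patterns. The set of closed sequential patterns is usually much smaller, yet the whole set of sequential patterns can be recovered from it without any loss of information, which is very convenient.

Here is a simple example.

Consider 4 sequences:

a b c d e
a b d
b e a
b c d e

Let minsup = 2.

b c is a frequent sequential pattern because it appears in two sequences (its support is 2). b c is not a closed sequential pattern because it is contained in a larger sequential pattern, b c d, which has the same support.

b c d also has a support of 2. It is not a closed sequential pattern either, because it is contained in a larger sequential pattern, b c d e, which has the same support. b c d e is a closed sequential pattern because it is not contained in any other sequential pattern with the same support.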
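The example can be checked with a small brute-force sketch. The helper names (is_subsequence, support, is_closed) are assumptions made for illustration and do not come from any library. The closedness test only tries super-patterns obtained by inserting a single item, which is enough because support can never increase when a pattern grows.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, db):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in db)


def is_closed(pattern, db):
    """True if no one-item extension of `pattern` has the same support."""
    items = {item for seq in db for item in seq}
    base = support(pattern, db)
    for pos in range(len(pattern) + 1):
        for item in items:
            extended = pattern[:pos] + [item] + pattern[pos:]
            if support(extended, db) == base:
                return False
    return True


sequences = [s.split() for s in ["a b c d e", "a b d", "b e a", "b c d e"]]

for text in ["b c", "b c d", "b c d e"]:
    pattern = text.split()
    print(text, support(pattern, sequences), is_closed(pattern, sequences))
```

This exhaustive check is only practical for tiny databases; algorithms such as BIDE+ avoid enumerating extensions this way.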

The closed patterns are much more compact due to the smaller number.

  • A pattern is closed if there is no super-pattern with the same frequency.
prefixspan-cli frequent 2 --closed test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2

The generator patterns are even more compact due to both the smaller number and the shorter lengths.

  • A pattern is generator if there is no sub-pattern with the same frequency.

  • Due to the high compactness, generator patterns are useful as features for classification, etc.

prefixspan-cli frequent 2 --generator test.dat

0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2

There are patterns that are both closed and generator.

prefixspan-cli frequent 2 --closed --generator test.dat

0 : 2
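The two definitions can be applied directly to the frequent patterns of the example database (support threshold 2). The sketch below is a brute-force filter, not how the library computes these sets, and the function names are assumptions. It also assumes that the empty pattern, which is supported by all 4 sequences, counts as a sub-pattern; that assumption is inferred from the generator output above, which omits "1 : 4".

```python
def is_subsequence(small, big):
    """True if the items of `small` appear in `big` in the same order."""
    it = iter(big)
    return all(item in it for item in small)


# Frequent patterns of the example database at support threshold 2.
frequent = [
    (2, [0]), (4, [1]), (3, [1, 2]), (2, [1, 2, 2]), (2, [1, 3]),
    (2, [1, 3, 4]), (2, [1, 4]), (2, [1, 1]), (2, [1, 1, 1]), (3, [2]),
    (2, [2, 2]), (2, [3]), (2, [3, 4]), (2, [4]),
]
num_sequences = 4  # support of the empty pattern (assumption, see above)


def closed_patterns(frequent):
    """Keep patterns with no proper super-pattern of the same support."""
    return [
        (sup, pat) for sup, pat in frequent
        if not any(sup == s and pat != p and is_subsequence(pat, p)
                   for s, p in frequent)
    ]


def generator_patterns(frequent, num_sequences):
    """Keep patterns with no proper sub-pattern of the same support."""
    candidates = frequent + [(num_sequences, [])]
    return [
        (sup, pat) for sup, pat in frequent
        if not any(sup == s and pat != p and is_subsequence(p, pat)
                   for s, p in candidates)
    ]


closed = closed_patterns(frequent)
generators = generator_patterns(frequent, num_sequences)
both = [entry for entry in closed if entry in generators]

print(closed)
print(generators)
print(both)
```

Comparing only against frequent patterns is sufficient here: any super-pattern with the same support is itself frequent, and every sub-pattern of a frequent pattern is frequent as well.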

Note: there are many algorithms for pattern mining.

SPMF offers implementations of the following data mining algorithms.

Sequential Pattern Mining

These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.

Sequential Rule Mining

These algorithms discover sequential rules in a set of sequences.

Sequence Prediction

These algorithms predict the next symbol(s) of a sequence based on a set of training sequences.

Itemset Mining

These algorithms discover interesting itemsets (sets of values) that appear in a transaction database (database records containing symbolic data). For a good overview of itemset mining, please read this survey paper.

  • algorithms for discovering frequent itemsets in a transaction database.

  • algorithms for discovering frequent closed itemsets in a transaction database.
  • algorithms for recovering all frequent itemsets from frequent closed itemsets:
    • the LevelWise algorithm (Pasquier et al., 1999) 
    • the DFI-Growth algorithm (___ et al., 2018) 
  • algorithms for discovering frequent maximal itemsets in a transaction database.
    • the FPMax algorithm (Grahne and Zhu, 2003)
    • the Charm-MFI algorithm for discovering frequent closed itemsets and maximal frequent itemsets by post-processing in a transaction database (Szathmary et al. 2006)
  • algorithms for mining frequent itemsets with multiple minimum supports
  • algorithms for mining generator itemsets in a transaction database
    • the DefMe algorithm for mining frequent generator itemsets in a transaction database (Soulet & Rioult, 2014)
    • the Pascal algorithm for mining frequent itemsets, and identifying at the same time which one are generators (Bastide et al., 2002)
    • the Zart algorithm for discovering frequent closed itemsets and their generators in a transaction database (Szathmary et al. 2007)
  • algorithms for mining rare itemsets and/or correlated itemsets in a transaction database
    • the AprioriInverse algorithm for mining perfectly rare itemsets (Koh & Rountree, 2005)
    • the AprioriRare algorithm for mining minimal rare itemsets and frequent itemsets (Szathmary et al. 2007b)
    • the CORI algorithm for mining minimal rare correlated itemsets using the support and bond measures (Bouasker et al. 2015)
    • the RP-Growth algorithm for mining rare itemsets (Tsang et al., 2011) 
  • algorithms for performing targeted and dynamic queries about association rules and frequent itemsets.
    • the Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Kubat et al, 2003)
    • the Memory-Efficient Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Fournier-Viger, 2013)
  • algorithms to discover frequent itemsets in a stream
    • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
    • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
    • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • the U-Apriori algorithm for mining frequent itemsets in uncertain data (Chui et al, 2007)
  • the VME algorithm for mining erasable itemsets (Deng & Xu, 2010)
  • algorithms to discover fuzzy frequent itemsets in a quantitative transaction database

Periodic Pattern Mining

These algorithms discover patterns that periodically appear in a sequence of complex events (also called a transaction database).

  • the PFPM algorithm (Fournier-Viger et al, 2016a) for mining frequent periodic patterns in a sequence of transactions (a transaction database)
  • the PHM algorithm (Fournier-Viger et al, 2016b) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information

Episode Mining

These algorithms discover episodes that appear in a single sequence of complex events.

  • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information 
  • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information

High-Utility Pattern Mining

These algorithms discover patterns having a high utility (importance) in different kinds of data. For a good overview of high utility itemset mining, you may read this survey paper, and the high utility-pattern mining book.

  • algorithms for mining high-utility itemsets in a transaction database having profit information

  • algorithm for efficiently mining high-utility itemsets with length constraints in a transaction database
  • algorithm for mining correlated high-utility itemsets in a transaction database
  • algorithm for mining high-utility itemsets in a transaction database containing negative unit profit values
  • algorithm for mining frequent high-utility itemsets in a transaction database
  • algorithm for mining on-shelf high-utility itemsets in a transaction database containing information about time periods of items
  • algorithm for incremental high-utility itemset mining in a transaction database
  • algorithm for mining concise representations of high-utility  itemsets in a transaction database
  • algorithm for mining the skyline high-utility itemsets in a transaction database
  • algorithm for mining the top-k high-utility itemsets in a transaction database
  • algorithms for mining the top-k high utility itemsets from a data stream with a window
  • algorithm for mining frequent skyline utility patterns in a transaction database
  • algorithm for mining quantitative high utility itemsets in a transaction database:
  • algorithm for mining high-utility sequential rules in a sequence database 
  • algorithm for mining high-utility sequential patterns in a sequence database 
    • the USPAN algorithm (Yin et al. 2012)
  • algorithm for mining high-utility probability sequential patterns in a sequence database 
  • algorithm for mining high-utility itemsets in a transaction database using evolutionary algorithms
  • algorithm for mining high average-utility itemsets in a transaction database
    • the HAUI-Miner algorithm for mining high average-utility itemsets (Lin et al, 2016)
    • the EHAUPM algorithm for mining high average-utility itemsets (Lin et al, 2017)
    • the HAUI-MMAU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2016)
    • the MEMU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2018)
  • algorithms for mining high utility episodes in a sequence of complex events (a transaction database)
    • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
    • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information
  • algorithms for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
  • algorithms for discovering irregular high utility itemsets (non periodic patterns) in a transaction database with utility information
    • the PHM_irregular algorithm, which is a simple variation of the PHM algorithm 
  • algorithm for discovering local high utility itemsets in a database with utility information and timestamps
  • algorithm for discovering peak high utility itemsets in a database with utility information and timestamps

Association Rule Mining

These algorithms discover interesting associations between symbols (values) in a transaction database (database records with binary attributes).

  • an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
  • an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
  • an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
  • an algorithm for mining perfectly sporadic association rules (Koh & Rountree, 2005)
  • an algorithm for mining closed association rules (Szathmary et al. 2006).
  • an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
  • the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et al. 2006)
  • the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
  • the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b)
  • the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules, which is described in Fournier-Viger, 2012b)
  • the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d)

Stream pattern mining

These algorithms discover various kinds of patterns in a stream (an infinite sequence of database records, i.e. transactions).

  • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
  • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
  • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • algorithms for mining the top-k high utility itemsets from a data stream with a window

Clustering

These algorithms automatically find clusters in different kinds of data

  • the original K-Means algorithm (MacQueen, 1967)
  • the Bisecting K-Means algorithm (Steinbach et al, 2000)
  • algorithms for density-based clustering
    • the DBScan algorithm (Ester et al., 1996)
    • the Optics algorithm to extract a cluster ordering of points, which can then be used to generate DBScan style clusters and more (Ankerst et al, 1999)
  • hierarchical clustering algorithm
  • a tool called Cluster Viewer for visualizing clusters
  • a tool called Instance Viewer for visualizing the input of clustering algorithms

Time series mining

These algorithms perform various tasks to analyze time series data.

    • an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series (SAX, 2007). Note that converting a set of time series with SAX yields a sequence database, to which traditional algorithms for sequential rule mining and sequential pattern mining can then be applied.
    • algorithms for calculating the prior moving average of a time series (to remove noise)
    • algorithms for calculating the cumulative moving average of a time series (to remove noise)
    • algorithms for calculating the central moving average of a time series (to remove noise)
    • an algorithm for calculating the median smoothing of a time series (to remove noise)
    • an algorithm for calculating the exponential smoothing of a time series (to remove noise) 
    • an algorithm for calculating the min max normalization of a time series 
    • an algorithm for calculating the autocorrelation function of a time series 
    • an algorithm for calculating the standardization of a time series 
    • an algorithm for calculating the first and second order differencing of a time series
    • an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
    • an algorithm for calculating the linear regression of a time series (using the least squares method) 
    • an algorithm for splitting a time series into segments of a given length
    • an algorithm for splitting a time series into a given number of segments
    • algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
    • a tool called Time Series Viewer for visualizing time series 
 
