Efficient Pattern Mining Methods

@(Pattern Discovery in Data Mining)

This post introduces several efficient algorithms for frequent pattern mining. It is organized around the Apriori idea and focuses mainly on the FP-Growth algorithm.

The Downward Closure Property of Frequent Patterns

  1. Property

    The downward closure (also called “Apriori”) property of frequent patterns:

    • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    • Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
    • Apriori: Any subset of a frequent itemset must be frequent
  2. Efficient mining methodology
    • If any subset of an itemset S is infrequent, then there is no chance for S to be frequent, so why even consider S? (This gives an efficient way to prune.)

  3. Principle

    Apriori pruning principle: If there is any itemset which is infrequent, its superset should not even be generated! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)

  4. Scalable mining Methods
    • Level-wise, join-based approach: Apriori (Agrawal & Srikant @VLDB'94)
    • Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara, Li @KDD’97)
    • Frequent pattern projection and growth: FPgrowth (Han, Pei, Yin @SIGMOD’00)

The Apriori Algorithm

  1. Outline of Apriori (level-wise, candidate generation and test)
    • Initially, scan DB once to get frequent 1-itemset
    • Repeat
      • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
      • Test the candidates against DB to find frequent (k+1)-itemsets
      • Set k := k + 1
    • Until no frequent or candidate set can be generated
    • Return all the frequent itemsets derived
  2. Pseudo code
    [Figure: Apriori pseudo-code; a runnable sketch in Python is given below, after the joining & pruning discussion]
  3. Tricks

    joining & pruning

[Figure: candidate generation by self-joining the frequent k-itemsets, followed by pruning]

Here, for each candidate produced by the join step in an iteration, we check whether every one of its (k-1)-subsets is frequent; any candidate with an infrequent (k-1)-subset is discarded. This is the pruning step shown in the figure above.
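
As a concrete illustration of the level-wise outline and the joining/pruning trick, here is a minimal Python sketch (my own, not the course's pseudo-code; the function names, helper structure, and the toy transactions are assumptions made for illustration):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori; returns {frozenset(itemset): support count}."""
    transactions = [frozenset(t) for t in transactions]
    # Scan DB once to get frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {s: c for s, c in counts.items() if c >= min_sup}
    all_freq = dict(freq)
    k = 1
    while freq:
        # Join step: self-join length-k frequent itemsets into (k+1)-candidates
        items = list(freq)
        candidates = {items[i] | items[j]
                      for i in range(len(items))
                      for j in range(i + 1, len(items))
                      if len(items[i] | items[j]) == k + 1}
        # Prune step: drop any candidate that has an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Test step: one DB scan to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        freq = {c: n for c, n in counts.items() if n >= min_sup}
        all_freq.update(freq)
        k += 1
    return all_freq

# Toy usage with min_sup = 2 (data made up for illustration)
db = [{"beer", "diaper", "nuts"}, {"beer", "diaper"},
      {"beer", "nuts"}, {"diaper", "nuts"}]
print(apriori(db, min_sup=2))
```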

Extensions or Improvements of Apriori

  • Reduce passes of transaction database scans
    • Partitioning (e.g., Savasere, et al., 1995)
    • Dynamic itemset counting (Brin, et al., 1997; Sergey Brin later co-founded Google)
  • Shrink the number of candidates
    • Hashing (e.g., DHP: Park, et al., 1995)
    • Pruning by support lower bounding (e.g., Bayardo 1998)
    • Sampling (e.g., Toivonen, 1996)
  • Exploring special data structures
    • Tree projection (Aggarwal, et al., 2001)
    • H-miner (Pei, et al., 2001)
    • Hypercube decomposition (e.g., LCM: Uno, et al., 2004)

Mining Frequent Patterns by Exploring Vertical Data Format

[Figure: mining frequent patterns with the vertical data format (ECLAT)]
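
The figure above carries this section's example; as a rough sketch of the vertical-format idea itself (my own illustration, including the toy data, not the slide's exact example), each item is mapped to its set of transaction IDs, and the support of an itemset is obtained by intersecting TID-sets:

```python
def to_vertical(transactions):
    """Horizontal DB -> vertical data format: item -> set of transaction IDs."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

def support(itemset, vertical):
    """Support of an itemset = size of the intersection of its items' TID-sets."""
    return len(set.intersection(*(vertical[i] for i in itemset)))

db = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}]  # toy data
v = to_vertical(db)
print(support({"a", "c"}, v))  # 2 transactions contain both a and c
```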

FPGrowth: A Frequent Pattern-Growth Approach

  1. Build the FP-tree, from which frequent patterns are generated by fast recursive mining

    What is an FP-tree, and how is it constructed?

    • Count the frequency of each single item
    • Sort the items in each transaction by descending frequency
    • Insert the sorted transactions into a prefix-tree-like structure, the FP-tree, in which every node represents one item (a minimal construction sketch is given after the figure below)

The resulting tree is shown below:

[Figure: FP-tree built from the example transaction database]
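
To make the three construction steps concrete, here is a minimal Python sketch of building an FP-tree (my own naming and structure, not the course implementation; the node-link lists correspond to the dashed links used later when counting supports):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_sup):
    # Step 1: count the frequency of every single item
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = FPNode(None, None)
    node_links = {}                 # item -> list of its nodes (the dashed node-links)
    for t in transactions:
        # Step 2: keep frequent items, sorted by descending frequency
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # Step 3: insert the sorted transaction as a path, like a prefix tree
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                node_links.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1
    return root, node_links, freq
```
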
  2. Generate the frequent itemsets

    Use a divide-and-conquer strategy and compute recursively.

    Procedure (let min_sup = 2, using the suffix e as an example):

    1) Obtain the prefix-path subtree of e

    [Figure: prefix-path subtree ending in e]
    2) Compute the support count of e and decide whether e is a frequent item. This is done by traversing the node-link list of e (the dashed connections) and adding up the counts of those nodes, which gives sup(e) = 3 > 2, so we continue with the steps below.

    3) Since e is frequent, find all frequent itemsets ending in e; in other words, split the problem and recurse.

    To do this, we first need the Conditional FP-Tree of e.

    4) Constructing the Conditional FP-Tree:

    The result is shown directly in the figure:

    [Figure: the conditional FP-tree of e]

    Steps:

    1 - Update the support counts in the prefix-path subtree of e

    [Figure: prefix-path subtree of e with updated support counts]
    2 - Remove the e nodes themselves
    [Figure: subtree after removing the e nodes]
    3 - Remove the infrequent nodes. Here c and d are known, from the support-counting method described above, to satisfy the minimum support condition. At this point we have obtained the Conditional FP-Tree of e.
    [Figure: the resulting conditional FP-tree of e]

    5) Using the conditional FP-tree (CFPT) of e obtained above, find all frequent itemsets ending in de, ce, and ae (be is not considered because b has already been removed). Here we simply invoke the divide-and-conquer procedure recursively: for de, for example, we find the prefix-path subtree of de inside the CFPT of e, and so on, eventually obtaining the CFPT of de.

    For example: e → de → ade, e → ce, e → ae
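
Step 5) can be written as a short recursion over conditional databases. The sketch below continues the build_fp_tree sketch above (again my own code, not the course's implementation); the example database at the end is the classic one from the FP-growth paper, assumed here because it matches the f/c/a/m discussion that follows:

```python
def fp_growth(transactions, min_sup, suffix=()):
    """Divide and conquer: grow patterns ending in `suffix` (uses build_fp_tree above)."""
    root, node_links, freq = build_fp_tree(transactions, min_sup)
    patterns = {}
    for item in freq:
        new_suffix = (item,) + suffix
        patterns[new_suffix] = freq[item]          # support of item together with suffix
        # Conditional pattern base: the prefix path of every `item` node,
        # each path repeated according to that node's count
        cond_db = []
        for node in node_links[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        # Recurse: mine the conditional FP-tree of `new_suffix`
        patterns.update(fp_growth(cond_db, min_sup, new_suffix))
    return patterns

# Classic example DB (assumed), min_sup = 3; the patterns ending in m (besides m
# itself) come out as fm, cm, am, fam, fcm, cam, fcam, matching the discussion below.
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
print(fp_growth(db, min_sup=3))
```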

Discussion: for a single-branch prefix-path subtree, all frequent patterns can be generated in one pass. For example:

[Figure: mining the single-branch m-conditional FP-tree {}-f3-c3-a3]

* The subtree selected by the red box here is m-conditional; the single branch is {}-f3-c3-a3.

* This is a recursive process: node a produces the second tree {}-f3-c3, and node c produces the tree {}-f3. Then node c in the second tree produces the last tree {}-f3, and node f cannot produce any further trees.

* The first tree is m-conditional and produces the combinations fm, cm, am.

* The second tree is am-conditional and produces fam, cam.

* The third tree is cm-conditional and produces fcm.

* The last tree produces fcam.

* Taking the union, we therefore obtain fm, cm, am, fam, fcm, cam, fcam.
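
Because the m-conditional tree is a single branch, the same seven patterns can also be read off in one shot by enumerating all combinations of the items on the path together with the suffix m (a tiny illustrative snippet, assuming the path f-c-a from the figure):

```python
from itertools import combinations

path = ["f", "c", "a"]            # the single branch {}-f3-c3-a3
patterns = ["".join(combo) + "m"
            for r in range(1, len(path) + 1)
            for combo in combinations(path, r)]
print(patterns)   # ['fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']
```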

Course exercise:

[Figure: course exercise comparing parallel projection and partition projection]

Here parallel projection uses more space, because it splits the work according to the different X-conditional (projected) databases that need to be computed, but it is faster; partition projection instead splits along the tree branches, which is a partition in the true sense.

Mining Closed Patterns

[Figure: example of mining closed patterns]
