Efficient Pattern Mining Methods
@(Pattern Discovery in Data Mining)
This post introduces several efficient pattern-mining algorithms. It is framed around the Apriori idea and focuses mainly on the FP-Growth algorithm.
The Downward Closure Property of Frequent Patterns
Property
The downward closure (also called “Apriori”) property of frequent patterns:
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}
- Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Apriori: Any subset of a frequent itemset must be frequent
- Efficient mining methodology
If any subset of an itemset S is infrequent, then S has no chance of being frequent, so we need not consider S at all. (This is an efficient way to prune.)
Principle
Apriori pruning principle: if any itemset is infrequent, its supersets should not even be generated! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
- Scalable mining methods
- Level-wise, join-based approach: Apriori (Agrawal & Srikant @VLDB'94)
- Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara, Li @KDD'97)
- Frequent pattern projection and growth: FP-growth (Han, Pei, Yin @SIGMOD'00)
The Apriori Algorithm
- Outline of Apriori (level-wise, candidate generation and test)
- Initially, scan DB once to get frequent 1-itemset
- Repeat
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against DB to find frequent (k+1)-itemsets
- Set k := k +1
- Until no frequent or candidate set can be generated
- Return all the frequent itemsets derived
- Pseudocode
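The level-wise loop above can be sketched in Python. This is a minimal illustration (the toy transactions and min_sup value are invented for demonstration), not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: generate-and-test, one DB scan per level."""
    transactions = [frozenset(t) for t in transactions]
    # Initial scan: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {s for s, c in counts.items() if c >= min_sup}
    all_freq = {s: counts[s] for s in freq}
    k = 1
    while freq:
        # Generate length-(k+1) candidates from length-k frequent itemsets,
        # pruning any candidate with an infrequent length-k subset
        items = sorted({i for s in freq for i in s})
        candidates = {frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(sub) in freq
                             for sub in combinations(c, k))}
        # Test the surviving candidates against the database
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        freq = {c for c, n in counts.items() if n >= min_sup}
        all_freq.update((c, counts[c]) for c in freq)
        k += 1
    return all_freq

patterns = apriori([['beer', 'diaper', 'nuts'],
                    ['beer', 'diaper'],
                    ['beer', 'milk'],
                    ['diaper', 'nuts']], min_sup=2)
```

With this toy database, {beer, diaper} and {diaper, nuts} come out frequent with support 2, while {beer, diaper, nuts} is pruned because its subset {beer, nuts} is infrequent.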
Tricks
joining & pruning
Here, for each candidate produced by the join step of an iteration, check whether every one of its (k−1)-subsets is frequent; if any subset is missing from the frequent set, the candidate is discarded. This is the pruning step in the figure above.
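The join-and-prune trick can be sketched as follows. This is a simplified illustration: freq_k is assumed to hold the length-k frequent itemsets as sorted tuples, and two itemsets are joined only when they share their first k−1 items:

```python
from itertools import combinations

def join_and_prune(freq_k):
    """Join two frequent k-itemsets sharing their first k-1 items,
    then prune any candidate that has an infrequent k-subset."""
    freq = sorted(tuple(sorted(s)) for s in freq_k)
    freq_set = set(freq)
    k = len(freq[0])
    candidates = []
    for i, a in enumerate(freq):
        for b in freq[i + 1:]:
            if a[:k - 1] == b[:k - 1]:                 # join step
                c = a + (b[k - 1],)
                # pruning step: every k-subset must itself be frequent
                if all(s in freq_set for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

# {a,b}, {a,c}, {b,c}, {b,d}: the join yields abc and bcd, but bcd is
# pruned because its subset {c,d} is infrequent.
cands = join_and_prune([('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd')])
```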
Extensions or Improvements of Apriori
- Reduce passes of transaction database scans
- Partitioning (e.g., Savasere, et al., 1995)
- Dynamic itemset counting (Brin, et al., 1997) (Brin later co-founded Google)
- Shrink the number of candidates
- Hashing (e.g., DHP: Park, et al., 1995)
- Pruning by support lower bounding (e.g., Bayardo, 1998)
- Sampling (e.g., Toivonen, 1996)
- Exploring special data structures
- Tree projection (Aggarwal, et al., 2001)
- H-miner (Pei, et al., 2001)
- Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
Mining Frequent Patterns by Exploring Vertical Data Format
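This refers to the Eclat-style approach: each item is stored with the set of transaction ids (tids) containing it, and the support of a longer itemset is obtained by intersecting tid-lists instead of rescanning the database. A minimal sketch (the toy vertical database is invented for illustration):

```python
def eclat(tidlists, min_sup, prefix=()):
    """Vertical-format mining: item -> set of transaction ids.
    Supports come from tid-list intersection; recursion extends the prefix."""
    patterns = {}
    items = sorted(tidlists)
    for i, a in enumerate(items):
        tids = tidlists[a]
        if len(tids) >= min_sup:
            patterns[prefix + (a,)] = len(tids)
            # conditional vertical database: intersect with remaining items
            cond = {b: tids & tidlists[b] for b in items[i + 1:]}
            patterns.update(eclat(cond, min_sup, prefix + (a,)))
    return patterns

# toy vertical database: item -> ids of transactions containing it
vertical = {'beer': {1, 2, 3}, 'diaper': {1, 2, 4}, 'nuts': {1, 4}}
pats = eclat(vertical, min_sup=2)
```

Here sup({diaper, nuts}) = |{1, 2, 4} ∩ {1, 4}| = 2, computed purely from the tid-lists.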
FPGrowth: A Frequent Pattern-Growth Approach
Construct an FP-tree, then generate frequent patterns by fast recursive mining.
What is an FP-tree, and how is it constructed?
- Count the frequency of each single item
- Sort the items in each transaction by descending frequency
- Insert each transaction into a prefix-tree-like structure, the FP-tree, in which every node represents an item
The resulting tree is shown below:
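The three construction steps can be sketched as follows. This is a bare-bones version (the five example transactions are the classic f-c-a-m-p toy data; the tie-break among equal-frequency items is an arbitrary choice here):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    """Count items, order each transaction by descending item frequency,
    then insert it along a shared prefix path; header holds node-links."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    header = {}                      # item -> list of its nodes (node-links)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header

root, header = build_fp_tree(
    [['f', 'c', 'a', 'm', 'p'], ['f', 'c', 'a', 'b', 'm'], ['f', 'b'],
     ['c', 'b', 'p'], ['f', 'c', 'a', 'm', 'p']], min_sup=3)
```

The support of any item can be recovered by summing the counts along its node-link list, which is exactly how the mining phase below checks frequency.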
Generating Frequent Itemsets
Use divide and conquer, computed recursively.
Procedure (with min_sup = 2, using suffix e as the example):
1) Obtain the prefix-path subtree of e.
2) Compute the support of e and decide whether e is frequent: traverse e's node-link list (the dashed links) and sum the node counts, giving sup(e) = 3 > 2, so we continue with the steps below.
3) Since e is frequent, find all frequent itemlists ending in e; that is, split the problem and recurse. For this we first need e's Conditional FP-tree.
4) Building the Conditional FP-tree (the result is easiest to see in the figure):
1 - Update the support counts in e's prefix-path subtree
2 - Remove the nodes containing e
3 - Remove infrequent nodes; here, by the counting method described above, c and d are known to satisfy the minimum support condition and are kept. We now have e's Conditional FP-tree.
5) Using e's Conditional FP-tree obtained above, find all frequent itemlists ending in de, ce, and ae (be is not considered because b has already been removed). This invokes the divide-and-conquer procedure recursively: for de, say, find the prefix-path subtree of de within e's Conditional FP-tree, and so on, to obtain de's Conditional FP-tree.
For example: e → de → ade, e → ce, e → ae
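Steps 1–3 of building the conditional FP-tree amount to re-counting items within e's conditional pattern base and dropping the infrequent ones. A sketch with a hypothetical pattern base (the paths and their counts below are invented for illustration, not the exact figure from these notes):

```python
def conditional_counts(prefix_paths, min_sup):
    """Given a conditional pattern base as (prefix path, count) pairs,
    keep only the items that are frequent within this base."""
    counts = {}
    for path, n in prefix_paths:
        for item in path:
            counts[item] = counts.get(item, 0) + n
    return {i: c for i, c in counts.items() if c >= min_sup}

# hypothetical conditional pattern base for suffix e
# (each path carries the count of e along it; sup(e) = 3)
base_e = [(('a', 'c', 'd'), 1), (('a', 'd'), 1), (('b', 'c'), 1)]
keep = conditional_counts(base_e, min_sup=2)
```

In this toy base, a, c, and d survive with count 2 while b (count 1) is dropped, mirroring how b is removed in the example above.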
Discussion: for a single-branch prefix-path subtree, all frequent patterns can be generated in a single pass. For example:
* The subtree selected in the red box here is m-conditional; the single branch is {}-f3-c3-a3.
* This is a recursive process: node a produces the second tree {}-f3-c3, and node c produces the tree {}-f3. Then node c of the second tree produces the final tree {}-f3, while node f cannot produce any new tree.
* The first tree is m-conditional and produces the combinations fm, cm, am
* The second tree is am-conditional and produces fam, cam
* The third tree is cm-conditional and produces fcm
* The last tree produces fcam
* So we obtain the union fm, cm, am, fam, fcm, cam, fcam.
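The single-branch shortcut can be sketched directly: every nonempty subset of the branch, combined with the suffix, is a frequent pattern, so no further recursion is needed:

```python
from itertools import combinations

def single_branch_patterns(path_counts, suffix):
    """For a single-branch conditional FP-tree, emit every nonempty subset
    of the branch items joined with the suffix in one pass."""
    items = list(path_counts)          # branch items in path order
    patterns = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            patterns.append(''.join(combo) + suffix)
    return patterns

# the single branch {}-f3-c3-a3 of the m-conditional tree
pats = single_branch_patterns({'f': 3, 'c': 3, 'a': 3}, 'm')
# yields fm, cm, am, fcm, fam, cam, fcam
```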
Course exercise:
Here Parallel Projection uses more space, because it splits the work according to the different X-conditional databases that need to be computed, but it is faster; Partition Projection instead splits along tree branches, which is partitioning in the true sense.