Efficient Pattern Mining Methods
@(Pattern Discovery in Data Mining)
This post introduces several efficient pattern-mining algorithms. It is framed around the Apriori idea and focuses mainly on the FP-Growth algorithm.
The Downward Closure Property of Frequent Patterns
Property
The downward closure (also called “Apriori”) property of frequent patterns:
- If {beer, diaper, nuts} is frequent, so is {beer, diaper}
- Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Apriori: Any subset of a frequent itemset must be frequent
- Efficient mining methodology
If any subset of an itemset S is infrequent, then S has no chance of being frequent, so we need not consider S at all. (This is an efficient way to prune.)
Principle
Apriori pruning principle: if any itemset is infrequent, its supersets should not even be generated! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
- Scalable mining methods
- Level-wise, join-based approach: Apriori (Agrawal & Srikant @VLDB'94)
- Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara, Li @KDD'97)
- Frequent pattern projection and growth: FP-growth (Han, Pei, Yin @SIGMOD'00)
The Apriori Algorithm
- Outline of Apriori (level-wise, candidate generation and test)
- Initially, scan DB once to get frequent 1-itemset
- Repeat
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against DB to find frequent (k+1)-itemsets
- Set k := k +1
- Until no frequent or candidate set can be generated
- Return all the frequent itemsets derived
- Pseudocode
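The level-wise loop above can be sketched in Python. This is a minimal illustration (the toy transactions and min_sup value are invented for demonstration), not an optimized implementation:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: generate-and-test, one DB scan per level."""
    transactions = [frozenset(t) for t in transactions]
    # Initial scan: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {s for s, c in counts.items() if c >= min_sup}
    all_freq = {s: counts[s] for s in freq}
    k = 1
    while freq:
        # Generate length-(k+1) candidates from length-k frequent itemsets,
        # pruning any candidate with an infrequent length-k subset
        items = sorted({i for s in freq for i in s})
        candidates = {frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(sub) in freq
                             for sub in combinations(c, k))}
        # Test the surviving candidates against the database
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        freq = {c for c, n in counts.items() if n >= min_sup}
        all_freq.update((c, counts[c]) for c in freq)
        k += 1
    return all_freq

patterns = apriori([['beer', 'diaper', 'nuts'],
                    ['beer', 'diaper'],
                    ['beer', 'milk'],
                    ['diaper', 'nuts']], min_sup=2)
```

With this toy database, {beer, diaper} and {diaper, nuts} come out frequent with support 2, while {beer, diaper, nuts} is pruned because its subset {beer, nuts} is infrequent.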
Tricks
joining & pruning
Here, for each candidate produced by the join step of an iteration, check whether every one of its (k−1)-subsets is frequent; if any subset is missing from the frequent set, the candidate is discarded. This is the pruning step in the figure above.
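The join-and-prune trick can be sketched as follows. This is a simplified illustration: freq_k is assumed to hold the length-k frequent itemsets as sorted tuples, and two itemsets are joined only when they share their first k−1 items:

```python
from itertools import combinations

def join_and_prune(freq_k):
    """Join two frequent k-itemsets sharing their first k-1 items,
    then prune any candidate that has an infrequent k-subset."""
    freq = sorted(tuple(sorted(s)) for s in freq_k)
    freq_set = set(freq)
    k = len(freq[0])
    candidates = []
    for i, a in enumerate(freq):
        for b in freq[i + 1:]:
            if a[:k - 1] == b[:k - 1]:                 # join step
                c = a + (b[k - 1],)
                # pruning step: every k-subset must itself be frequent
                if all(s in freq_set for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

# {a,b}, {a,c}, {b,c}, {b,d}: the join yields abc and bcd, but bcd is
# pruned because its subset {c,d} is infrequent.
cands = join_and_prune([('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd')])
```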
Extensions or Improvements of Apriori
- Reduce passes of transaction database scans
- Partitioning (e.g., Savasere, et al., 1995)
- Dynamic itemset counting (Brin, et al., 1997) (Brin later co-founded Google)
- Shrink the number of candidates
- Hashing (e.g., DHP: Park, et al., 1995)
- Pruning by support lower bounding (e.g., Bayardo, 1998)
- Sampling (e.g., Toivonen, 1996)
- Exploring special data structures
- Tree projection (Aggarwal, et al., 2001)
- H-miner (Pei, et al., 2001)
- Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
Mining Frequent Patterns by Exploring Vertical Data Format
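This refers to the Eclat-style approach: each item is stored with the set of transaction ids (tids) containing it, and the support of a longer itemset is obtained by intersecting tid-lists instead of rescanning the database. A minimal sketch (the toy vertical database is invented for illustration):

```python
def eclat(tidlists, min_sup, prefix=()):
    """Vertical-format mining: item -> set of transaction ids.
    Supports come from tid-list intersection; recursion extends the prefix."""
    patterns = {}
    items = sorted(tidlists)
    for i, a in enumerate(items):
        tids = tidlists[a]
        if len(tids) >= min_sup:
            patterns[prefix + (a,)] = len(tids)
            # conditional vertical database: intersect with remaining items
            cond = {b: tids & tidlists[b] for b in items[i + 1:]}
            patterns.update(eclat(cond, min_sup, prefix + (a,)))
    return patterns

# toy vertical database: item -> ids of transactions containing it
vertical = {'beer': {1, 2, 3}, 'diaper': {1, 2, 4}, 'nuts': {1, 4}}
pats = eclat(vertical, min_sup=2)
```

Here sup({diaper, nuts}) = |{1, 2, 4} ∩ {1, 4}| = 2, computed purely from the tid-lists.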
FPGrowth: A Frequent Pattern-Growth Approach
Construct an FP-tree, then generate frequent patterns by fast recursive mining.
What is an FP-tree, and how is it constructed?
- Count the frequency of each single item
- Sort the items in each transaction by descending frequency
- Insert each transaction into a prefix-tree-like structure, the FP-tree, in which every node represents an item
The resulting tree is shown below:
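The three construction steps can be sketched as follows. This is a bare-bones version (the five example transactions are the classic f-c-a-m-p toy data; the tie-break among equal-frequency items is an arbitrary choice here):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    """Count items, order each transaction by descending item frequency,
    then insert it along a shared prefix path; header holds node-links."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    header = {}                      # item -> list of its nodes (node-links)
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header

root, header = build_fp_tree(
    [['f', 'c', 'a', 'm', 'p'], ['f', 'c', 'a', 'b', 'm'], ['f', 'b'],
     ['c', 'b', 'p'], ['f', 'c', 'a', 'm', 'p']], min_sup=3)
```

The support of any item can be recovered by summing the counts along its node-link list, which is exactly how the mining phase below checks frequency.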
Generating Frequent Itemsets
Use divide and conquer, computed recursively.
Procedure (with min_sup = 2, using suffix e as the example):
1) Obtain the prefix-path subtree of e.
2) Compute the support of e and decide whether e is frequent: traverse e's node-link list (the dashed links) and sum the node counts, giving sup(e) = 3 > 2, so we continue with the steps below.
3) Since e is frequent, find all frequent itemlists ending in e; that is, split the problem and recurse. For this we first need e's Conditional FP-tree.
4) Building the Conditional FP-tree (the result is easiest to see in the figure):
1 - Update the support counts in e's prefix-path subtree
2 - Remove the nodes containing e
3 - Remove infrequent nodes; here, by the counting method described above, c and d are known to satisfy the minimum support condition and are kept. We now have e's Conditional FP-tree.
5) Using e's Conditional FP-tree obtained above, find all frequent itemlists ending in de, ce, and ae (be is not considered because b has already been removed). This invokes the divide-and-conquer procedure recursively: for de, say, find the prefix-path subtree of de within e's Conditional FP-tree, and so on, to obtain de's Conditional FP-tree.
For example: e → de → ade, e → ce, e → ae
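Steps 1–3 of building the conditional FP-tree amount to re-counting items within e's conditional pattern base and dropping the infrequent ones. A sketch with a hypothetical pattern base (the paths and their counts below are invented for illustration, not the exact figure from these notes):

```python
def conditional_counts(prefix_paths, min_sup):
    """Given a conditional pattern base as (prefix path, count) pairs,
    keep only the items that are frequent within this base."""
    counts = {}
    for path, n in prefix_paths:
        for item in path:
            counts[item] = counts.get(item, 0) + n
    return {i: c for i, c in counts.items() if c >= min_sup}

# hypothetical conditional pattern base for suffix e
# (each path carries the count of e along it; sup(e) = 3)
base_e = [(('a', 'c', 'd'), 1), (('a', 'd'), 1), (('b', 'c'), 1)]
keep = conditional_counts(base_e, min_sup=2)
```

In this toy base, a, c, and d survive with count 2 while b (count 1) is dropped, mirroring how b is removed in the example above.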
Discussion: for a single-branch prefix-path subtree, all frequent patterns can be generated in a single pass. For example:
* The subtree selected in the red box here is m-conditional; the single branch is {}-f3-c3-a3.
* This is a recursive process: node a produces the second tree {}-f3-c3, and node c produces the tree {}-f3. Then node c of the second tree produces the final tree {}-f3, while node f cannot produce any new tree.
* The first tree is m-conditional and produces the combinations fm, cm, am
* The second tree is am-conditional and produces fam, cam
* The third tree is cm-conditional and produces fcm
* The last tree produces fcam
* So we obtain the union fm, cm, am, fam, fcm, cam, fcam.
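The single-branch shortcut can be sketched directly: every nonempty subset of the branch, combined with the suffix, is a frequent pattern, so no further recursion is needed:

```python
from itertools import combinations

def single_branch_patterns(path_counts, suffix):
    """For a single-branch conditional FP-tree, emit every nonempty subset
    of the branch items joined with the suffix in one pass."""
    items = list(path_counts)          # branch items in path order
    patterns = []
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            patterns.append(''.join(combo) + suffix)
    return patterns

# the single branch {}-f3-c3-a3 of the m-conditional tree
pats = single_branch_patterns({'f': 3, 'c': 3, 'a': 3}, 'm')
# yields fm, cm, am, fcm, fam, cam, fcam
```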
Course exercise:
Here Parallel Projection uses more space, because it splits the work according to the different X-conditional databases that need to be computed, but it is faster; Partition Projection instead splits along tree branches, which is partitioning in the true sense.