
Recommender systems: movie recommendation

Mining association features with the Apriori algorithm

Frequent itemsets

FP-growth: a frequent-itemset mining algorithm (an improvement over Apriori)

Eclat: another frequent-itemset mining algorithm (also an improvement over Apriori)

Before mining the association rules used in affinity analysis, we first generate frequent itemsets with the Apriori algorithm, then generate association rules by testing combinations of premise and conclusion drawn from those frequent itemsets.

(1) Give the Apriori algorithm the minimum support an itemset must reach to count as frequent; (2) once the frequent itemsets are found, select association rules by their confidence.

## 1. Background

Build movie recommendations from the GroupLens team's movie data.

## 2. Getting the data

The data can be downloaded from http://grouplens.org/datasets/movielens/ . Several releases are available; the one used below (ml-20m) contains about 20 million ratings. Download it and extract it to a folder.

#import os
#data_folder = os.path.join(os.path.expanduser("~"),"ml_20m")
#ratings_filename = os.path.join(data_folder, "u.data")
           

## 3. Loading the data

ratings.csv has the header userId,movieId,rating,timestamp.

import pandas as pd
all_ratings = pd.read_csv('ratings.csv')
all_ratings['timestamp'] = pd.to_datetime(all_ratings['timestamp'], unit='s')  # convert Unix timestamps to datetimes
print all_ratings.head()       # peek at the data
print all_ratings.describe()

print all_ratings[all_ratings['userId'] == 100].sort_values('movieId')  # look at user 100's ratings
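`pd.to_datetime(..., unit='s')` above treats the raw integers as seconds since the Unix epoch. The same conversion for a single value can be sketched with the standard library (the timestamp used here is back-computed from the first row shown below and is only illustrative):

```python
from datetime import datetime, timezone

# A sample Unix timestamp (seconds since 1970-01-01 UTC),
# back-computed from the first row of the ratings file.
ts = 1112486027
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # -> 2005-04-02 23:53:47
```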


# **************************Implementing the Apriori algorithm*****************************************
all_ratings['Favorable'] = all_ratings['rating'] > 3  # ratings above 3 count as "favorable"
print all_ratings[10:15]

print all_ratings[all_ratings['userId'] == 100].head()  # user 100's reviews, now with the Favorable column

ratings = all_ratings[all_ratings['userId'].isin(range(200))]  # keep the first 200 users for training
favorable_ratings = ratings[ratings['Favorable']]  # a dataset containing only the movies each user liked

# We need to know which movies each user likes, so group by userId and walk through each user's movies.
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby('userId')['movieId'])  # frozenset() is an immutable set
# Storing v.values as a frozenset makes it fast to check whether a user rated a given movie:
# sets are much faster than lists for membership tests.
print len(favorable_reviews_by_users)

# Build a frame showing how many fans each movie has.
num_favorable_by_movie = ratings[['movieId', 'Favorable']].groupby('movieId').sum()
# The five most popular movies.
num_favorable_by_movie.sort_values('Favorable', ascending=False)[:5]
           

   userId  movieId  rating           timestamp
0       1        2     3.5 2005-04-02 23:53:47
1       1       29     3.5 2005-04-02 23:31:16
2       1       32     3.5 2005-04-02 23:33:39
3       1       47     3.5 2005-04-02 23:32:07
4       1       50     3.5 2005-04-02 23:29:40

             userId       movieId        rating
count  2.000026e+07  2.000026e+07  2.000026e+07
mean   6.904587e+04  9.041567e+03  3.525529e+00
std    4.003863e+04  1.978948e+04  1.051989e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    3.439500e+04  9.020000e+02  3.000000e+00
50%    6.914100e+04  2.167000e+03  3.500000e+00
75%    1.036370e+05  4.770000e+03  4.000000e+00
max    1.384930e+05  1.312620e+05  5.000000e+00

       userId  movieId  rating           timestamp
11049     100       14     3.0 1996-06-25 16:40:02
11050     100       25     4.0 1996-06-25 16:31:02
11051     100       32     3.0 1996-06-25 16:24:49
11052     100       39     3.0 1996-06-25 16:25:12
11053     100       50     5.0 1996-06-25 16:24:49
...
11100     100     1527     4.0 1997-06-09 16:40:04

       userId  movieId  rating           timestamp  Favorable
10          1      293     4.0 2005-04-02 23:31:43       True
11          1      296     4.0 2005-04-02 23:32:47       True
12          1      318     4.0 2005-04-02 23:33:18       True
13          1      337     3.5 2004-09-10 03:08:29       True
14          1      367     3.5 2005-04-02 23:53:00       True

       userId  movieId  rating           timestamp  Favorable
11049     100       14     3.0 1996-06-25 16:40:02      False
11050     100       25     4.0 1996-06-25 16:31:02       True
11051     100       32     3.0 1996-06-25 16:24:49      False
11052     100       39     3.0 1996-06-25 16:25:12      False
11053     100       50     5.0 1996-06-25 16:24:49       True

199

Favorable
movieId
296 80.0
356 78.0
318 76.0
593 63.0
480 58.0
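The groupby('movieId').sum() above simply tallies the True values in Favorable per movie. The same count can be sketched in plain Python (hypothetical ratings, not the real file):

```python
from collections import Counter

# Hypothetical (user_id, movie_id, favorable) rows.
rows = [
    (1, 296, True), (1, 318, True), (2, 296, True),
    (2, 356, False), (3, 296, True), (3, 318, True),
]
# Count a movie only when the user marked it favorable.
num_favorable = Counter(movie for _, movie, fav in rows if fav)
print(num_favorable.most_common(2))  # -> [(296, 3), (318, 2)]
```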
# Apriori is designed to find the frequent itemsets in a dataset. The basic flow: build new candidate
# itemsets from the frequent itemsets found in the previous step, test whether each candidate is
# frequent enough, then iterate.
# (1) Put each item into its own singleton itemset to form the initial frequent itemsets; keep only
#     the items that reach the minimum support.
# (2) Look at supersets of the current frequent itemsets to discover new frequent itemsets, and use
#     them as the new candidates.
# (3) Test how frequent the new candidates are; discard those that are not frequent enough. If no new
#     frequent itemsets were found, skip to the last step.
# (4) Store the newly found frequent itemsets and jump back to step (2).
# (5) Return all the frequent itemsets found.

# Next, a function implements steps (2) and (3): it takes the frequent itemsets found so far and
# tests which of their extensions are frequent.
from collections import defaultdict
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over every user and their favorable reviews.
    for user, reviews in favorable_reviews_by_users.items():
        # For each previously found itemset, test whether it is a subset of the current user's
        # reviews; if so, the user has rated every movie in the itemset.
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    # Finally, keep only the itemsets whose frequency reaches the minimum support.
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
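To see what steps (2) and (3) do, here is a self-contained copy of the function run on two toy users (hypothetical movie ids 10, 20, 30; min_support=2):

```python
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other in reviews - itemset:
                    counts[itemset | frozenset((other,))] += 1
    return {s: f for s, f in counts.items() if f >= min_support}

users = {1: frozenset({10, 20, 30}), 2: frozenset({10, 20})}
singletons = [frozenset({10}), frozenset({20})]
pairs = find_frequent_itemsets(users, singletons, min_support=2)
# {10, 30} and {20, 30} each get count 1 and are dropped; {10, 20} survives.
# Note a superset is counted once per qualifying subset per user, so the
# pair {10, 20} reaches a count of 4 here, not 2.
```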
           
import sys
frequent_itemsets = {}  # a dict of frequent itemsets, keyed by itemset length
min_support = 50  # minimum support; when tuning, adjust it in steps of about 10
# Step one: build a singleton itemset for each movie and test whether it is frequent.
frequent_itemsets[1] = dict((frozenset((movie_id,)), row['Favorable']) for movie_id, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
print "There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support)
sys.stdout.flush()

# Loop to run the Apriori algorithm, storing each new batch of itemsets as it is found.
# k is the length of the frequent itemsets about to be discovered; the itemsets found in the
# previous round are retrieved from the frequent_itemsets dict under key k-1, and the newly
# found itemsets are stored under key k.
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    # If no new frequent itemsets were found, break out of the loop.
    if len(cur_frequent_itemsets) == 0:
        print "Did not find any frequent itemsets of length {}".format(k)
        sys.stdout.flush()  # make sure buffered output reaches the terminal; don't overuse it, flushing slows things down
        break
    # Otherwise report what was found.
    else:
        print "I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# Length-1 itemsets are useless for generating rules, so drop them.
del frequent_itemsets[1]

print "Found a total of {0} frequent itemsets".format(sum(len(itemsets) for itemsets in frequent_itemsets.values()))
           

There are 11 movies with more than 50 favorable reviews
I found 34 frequent itemsets of length 2
I found 49 frequent itemsets of length 3
I found 36 frequent itemsets of length 4
I found 12 frequent itemsets of length 5
I found 1 frequent itemsets of length 6
Did not find any frequent itemsets of length 7
Found a total of 132 frequent itemsets

# Extracting association rules
# When Apriori finishes we have a set of frequent itemsets, not association rules.
# A frequent itemset is a group of items that reaches the minimum support, while an association
# rule has a premise and a conclusion.
# To extract rules from a frequent itemset, take some of its movies as the premise and one other
# movie as the conclusion: if a user likes every movie in the premise, they will also like the
# movie in the conclusion.
# Every itemset can generate rules this way.

# Iterate over the frequent itemsets of every length and generate rules for each itemset.
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        # Take each movie in the itemset in turn as the conclusion; the remaining movies form the
        # premise, and together they make a candidate rule.
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
# That yields a lot of candidates; look at the first five.
print "There are {} candidate rules".format(len(candidate_rules))
# The frozenset holds the movie ids of the premise; the number after it is the conclusion movie id.
candidate_rules[:5]
           

There are 425 candidate rules
[(frozenset({47}), 50), (frozenset({50}), 47), (frozenset({318}), 480), (frozenset({480}), 318), (frozenset({356}), 480)]
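The nested loops above turn each itemset into one rule per member: that member becomes the conclusion and the rest the premise. For a single hypothetical frequent itemset of length 3:

```python
# One hypothetical frequent itemset of three movie ids.
itemset = frozenset({47, 50, 318})
# Each member in turn becomes the conclusion; the rest form the premise.
rules = [(itemset - {conclusion}, conclusion) for conclusion in itemset]
for premise, conclusion in rules:
    print(sorted(premise), "->", conclusion)
```

A length-3 itemset therefore yields exactly three candidate rules, which is why 132 frequent itemsets expand to 425 candidates.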

# Compute the confidence of each rule.

# Keep separate counts of how often each rule holds (the conclusion follows) and how often it does not.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Iterate over every user and their favorable movies, and over every candidate rule.
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        # The rule applies to a user only if they liked every movie in the premise.
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Confidence = number of times the rule held / number of times the premise applied.
rule_confidence = {candidate_rule: correct_counts[candidate_rule]/float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                  for candidate_rule in candidate_rules}
print len(rule_confidence)
           

425
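Confidence is the fraction of users to whom a rule's premise applies who also liked the conclusion. A minimal sketch on hypothetical data:

```python
# Hypothetical favorite-movie sets for four users.
favorable_by_user = {
    1: frozenset({10, 20}),
    2: frozenset({10, 20}),
    3: frozenset({10}),   # premise applies but the conclusion does not follow
    4: frozenset({20}),   # premise does not apply at all
}
premise, conclusion = frozenset({10}), 20

applies = [r for r in favorable_by_user.values() if premise.issubset(r)]
holds = [r for r in applies if conclusion in r]
confidence = len(holds) / float(len(applies))
print(confidence)  # 2 of the 3 applicable users also liked movie 20
```

User 4 never enters the denominator: rules are only judged on users their premise applies to.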

# Optionally keep only the rules above a minimum confidence:
#min_confidence = 0.9
#rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
#print len(rule_confidence)

# Sort the confidence dict and print the five rules with the highest confidence.
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)

# Note: max()/min() compare the whole (rule, confidence) tuples, not the confidence values.
print max(sorted_confidence)
print min(sorted_confidence)

for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion)
    print "- Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
           

((frozenset([296, 593, 50, 318, 47]), 356), 0.11055276381909548)
((frozenset([296]), 47), 0.4020100502512563)
Rule #1
Rule: If a person recommends frozenset([296]) they will also recommend 527
- Confidence: 0.402
Rule #2
Rule: If a person recommends frozenset([296]) they will also recommend 2858
- Confidence: 0.402
Rule #3
Rule: If a person recommends frozenset([296]) they will also recommend 480
- Confidence: 0.402
Rule #4
Rule: If a person recommends frozenset([296]) they will also recommend 50
- Confidence: 0.402
Rule #5
Rule: If a person recommends frozenset([296]) they will also recommend 593
- Confidence: 0.402

# Looking up movie names
# Data file: movies.csv
# Header: movieId,title,genres
movie_name_data = pd.read_csv("movies.csv")
movie_name_data.head()

# A helper that returns a movie's title given its id.
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['movieId'] == movie_id]['title']
    title = title_object.values[0]
    return title
get_movie_name(4)
           

'Waiting to Exhale (1995)'

# Print the rules again, this time showing movie names.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print " - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
           

Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
- Confidence: 0.402
Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
- Confidence: 0.402
Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
- Confidence: 0.402
Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
- Confidence: 0.402
Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
- Confidence: 0.402

# Evaluation
# A simple look at how each rule performs on held-out data.

# All users not used for training form the test set.
test_dataset = all_ratings[~all_ratings['userId'].isin(range(200))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('userId')['movieId'])
test_dataset.head()
           
userId movieId rating timestamp Favorable
25048 200 6 5.0 1996-08-11 12:59:30 True
25049 200 10 3.0 1996-08-11 12:53:11 False
25050 200 17 4.0 1996-08-11 12:57:25 True
25051 200 19 2.0 1996-08-11 12:54:08 False
25052 200 20 4.0 1996-08-11 13:05:27 True
# Count, on the test data, how often each rule holds.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
print len(correct_counts)
           
425
           
# Compute the test-set confidence of every rule.
test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                  for candidate_rule in rule_confidence}
print len(test_confidence)

sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)
print sorted_test_confidence[:5]
           
425
[((frozenset([296]), 2858), 0.4020100502512563), ((frozenset([296]), 480), 0.4020100502512563), ((frozenset([296]), 50), 0.4020100502512563), ((frozenset([296]), 593), 0.4020100502512563), ((frozenset([296]), 47), 0.4020100502512563)]
           
# Print the best rules with movie names, comparing train and test confidence.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print "- Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1))
    print "- Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1))
    print ""
           
Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
- Train Confidence: 0.402
- Test Confidence: 0.402
           
