
Recommender systems: movie recommendation

Mining association features with the Apriori algorithm

Frequent itemsets

FP-growth: a frequent-itemset mining algorithm (an improvement over Apriori)

Eclat: another frequent-itemset mining algorithm (also an improvement over Apriori)

Before mining the association rules used in affinity analysis, we first generate frequent itemsets with the Apriori algorithm, then generate association rules by testing combinations of premise and conclusion drawn from those frequent itemsets.

(1) Give the Apriori algorithm the minimum support an itemset must reach to count as frequent; (2) once the frequent itemsets are found, select association rules by their confidence.

## 1. Background

Build movie recommendations from the GroupLens team's movie data.

## 2. Getting the data

The data can be downloaded from http://grouplens.org/datasets/movielens/ . Several releases are available; the one used below (ml-20m) contains about 20 million ratings. Download it and extract it to a folder.

#import os
#data_folder = os.path.join(os.path.expanduser("~"),"ml_20m")
#ratings_filename = os.path.join(data_folder, "u.data")
           

## 3. Loading the data

ratings.csv has the header userId,movieId,rating,timestamp.

import pandas as pd
all_ratings = pd.read_csv('ratings.csv')
all_ratings['timestamp'] = pd.to_datetime(all_ratings['timestamp'], unit='s')  # convert Unix timestamps to datetimes
print all_ratings.head()       # peek at the data
print all_ratings.describe()

print all_ratings[all_ratings['userId'] == 100].sort_values('movieId')  # look at user 100's ratings
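`pd.to_datetime(..., unit='s')` above treats the raw integers as seconds since the Unix epoch. The same conversion for a single value can be sketched with the standard library (the timestamp used here is back-computed from the first row shown below and is only illustrative):

```python
from datetime import datetime, timezone

# A sample Unix timestamp (seconds since 1970-01-01 UTC),
# back-computed from the first row of the ratings file.
ts = 1112486027
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # -> 2005-04-02 23:53:47
```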


# **************************Implementing the Apriori algorithm*****************************************
all_ratings['Favorable'] = all_ratings['rating'] > 3  # ratings above 3 count as "favorable"
print all_ratings[10:15]

print all_ratings[all_ratings['userId'] == 100].head()  # user 100's reviews, now with the Favorable column

ratings = all_ratings[all_ratings['userId'].isin(range(200))]  # keep the first 200 users for training
favorable_ratings = ratings[ratings['Favorable']]  # a dataset containing only the movies each user liked

# We need to know which movies each user likes, so group by userId and walk through each user's movies.
favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby('userId')['movieId'])  # frozenset() is an immutable set
# Storing v.values as a frozenset makes it fast to check whether a user rated a given movie:
# sets are much faster than lists for membership tests.
print len(favorable_reviews_by_users)

# Build a frame showing how many fans each movie has.
num_favorable_by_movie = ratings[['movieId', 'Favorable']].groupby('movieId').sum()
# The five most popular movies.
num_favorable_by_movie.sort_values('Favorable', ascending=False)[:5]
           

   userId  movieId  rating           timestamp
0       1        2     3.5 2005-04-02 23:53:47
1       1       29     3.5 2005-04-02 23:31:16
2       1       32     3.5 2005-04-02 23:33:39
3       1       47     3.5 2005-04-02 23:32:07
4       1       50     3.5 2005-04-02 23:29:40

             userId       movieId        rating
count  2.000026e+07  2.000026e+07  2.000026e+07
mean   6.904587e+04  9.041567e+03  3.525529e+00
std    4.003863e+04  1.978948e+04  1.051989e+00
min    1.000000e+00  1.000000e+00  5.000000e-01
25%    3.439500e+04  9.020000e+02  3.000000e+00
50%    6.914100e+04  2.167000e+03  3.500000e+00
75%    1.036370e+05  4.770000e+03  4.000000e+00
max    1.384930e+05  1.312620e+05  5.000000e+00

       userId  movieId  rating           timestamp
11049     100       14     3.0 1996-06-25 16:40:02
11050     100       25     4.0 1996-06-25 16:31:02
11051     100       32     3.0 1996-06-25 16:24:49
11052     100       39     3.0 1996-06-25 16:25:12
11053     100       50     5.0 1996-06-25 16:24:49
...
11100     100     1527     4.0 1997-06-09 16:40:04

       userId  movieId  rating           timestamp  Favorable
10          1      293     4.0 2005-04-02 23:31:43       True
11          1      296     4.0 2005-04-02 23:32:47       True
12          1      318     4.0 2005-04-02 23:33:18       True
13          1      337     3.5 2004-09-10 03:08:29       True
14          1      367     3.5 2005-04-02 23:53:00       True

       userId  movieId  rating           timestamp  Favorable
11049     100       14     3.0 1996-06-25 16:40:02      False
11050     100       25     4.0 1996-06-25 16:31:02       True
11051     100       32     3.0 1996-06-25 16:24:49      False
11052     100       39     3.0 1996-06-25 16:25:12      False
11053     100       50     5.0 1996-06-25 16:24:49       True

199

Favorable
movieId
296 80.0
356 78.0
318 76.0
593 63.0
480 58.0
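The groupby('movieId').sum() above simply tallies the True values in Favorable per movie. The same count can be sketched in plain Python (hypothetical ratings, not the real file):

```python
from collections import Counter

# Hypothetical (user_id, movie_id, favorable) rows.
rows = [
    (1, 296, True), (1, 318, True), (2, 296, True),
    (2, 356, False), (3, 296, True), (3, 318, True),
]
# Count a movie only when the user marked it favorable.
num_favorable = Counter(movie for _, movie, fav in rows if fav)
print(num_favorable.most_common(2))  # -> [(296, 3), (318, 2)]
```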
# Apriori is designed to find the frequent itemsets in a dataset. The basic flow: build new candidate
# itemsets from the frequent itemsets found in the previous step, test whether each candidate is
# frequent enough, then iterate.
# (1) Put each item into its own singleton itemset to form the initial frequent itemsets; keep only
#     the items that reach the minimum support.
# (2) Look at supersets of the current frequent itemsets to discover new frequent itemsets, and use
#     them as the new candidates.
# (3) Test how frequent the new candidates are; discard those that are not frequent enough. If no new
#     frequent itemsets were found, skip to the last step.
# (4) Store the newly found frequent itemsets and jump back to step (2).
# (5) Return all the frequent itemsets found.

# Next, a function implements steps (2) and (3): it takes the frequent itemsets found so far and
# tests which of their extensions are frequent.
from collections import defaultdict
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over every user and their favorable reviews.
    for user, reviews in favorable_reviews_by_users.items():
        # For each previously found itemset, test whether it is a subset of the current user's
        # reviews; if so, the user has rated every movie in the itemset.
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    # Finally, keep only the itemsets whose frequency reaches the minimum support.
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
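To see what steps (2) and (3) do, here is a self-contained copy of the function run on two toy users (hypothetical movie ids 10, 20, 30; min_support=2):

```python
from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other in reviews - itemset:
                    counts[itemset | frozenset((other,))] += 1
    return {s: f for s, f in counts.items() if f >= min_support}

users = {1: frozenset({10, 20, 30}), 2: frozenset({10, 20})}
singletons = [frozenset({10}), frozenset({20})]
pairs = find_frequent_itemsets(users, singletons, min_support=2)
# {10, 30} and {20, 30} each get count 1 and are dropped; {10, 20} survives.
# Note a superset is counted once per qualifying subset per user, so the
# pair {10, 20} reaches a count of 4 here, not 2.
```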
           
import sys
frequent_itemsets = {}  # a dict of frequent itemsets, keyed by itemset length
min_support = 50  # minimum support; when tuning, adjust it in steps of about 10
# Step one: build a singleton itemset for each movie and test whether it is frequent.
frequent_itemsets[1] = dict((frozenset((movie_id,)), row['Favorable']) for movie_id, row in num_favorable_by_movie.iterrows()
                            if row['Favorable'] > min_support)
print "There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support)
sys.stdout.flush()

# Loop to run the Apriori algorithm, storing each new batch of itemsets as it is found.
# k is the length of the frequent itemsets about to be discovered; the itemsets found in the
# previous round are retrieved from the frequent_itemsets dict under key k-1, and the newly
# found itemsets are stored under key k.
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k-1], min_support)
    # If no new frequent itemsets were found, break out of the loop.
    if len(cur_frequent_itemsets) == 0:
        print "Did not find any frequent itemsets of length {}".format(k)
        sys.stdout.flush()  # make sure buffered output reaches the terminal; don't overuse it, flushing slows things down
        break
    # Otherwise report what was found.
    else:
        print "I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k)
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# Length-1 itemsets are useless for generating rules, so drop them.
del frequent_itemsets[1]

print "Found a total of {0} frequent itemsets".format(sum(len(itemsets) for itemsets in frequent_itemsets.values()))
           

There are 11 movies with more than 50 favorable reviews
I found 34 frequent itemsets of length 2
I found 49 frequent itemsets of length 3
I found 36 frequent itemsets of length 4
I found 12 frequent itemsets of length 5
I found 1 frequent itemsets of length 6
Did not find any frequent itemsets of length 7
Found a total of 132 frequent itemsets

# Extracting association rules
# When Apriori finishes we have a set of frequent itemsets, not association rules.
# A frequent itemset is a group of items that reaches the minimum support, while an association
# rule has a premise and a conclusion.
# To extract rules from a frequent itemset, take some of its movies as the premise and one other
# movie as the conclusion: if a user likes every movie in the premise, they will also like the
# movie in the conclusion.
# Every itemset can generate rules this way.

# Iterate over the frequent itemsets of every length and generate rules for each itemset.
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        # Take each movie in the itemset in turn as the conclusion; the remaining movies form the
        # premise, and together they make a candidate rule.
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))
# That yields a lot of candidates; look at the first five.
print "There are {} candidate rules".format(len(candidate_rules))
# The frozenset holds the movie ids of the premise; the number after it is the conclusion movie id.
candidate_rules[:5]
           

There are 425 candidate rules
[(frozenset({47}), 50), (frozenset({50}), 47), (frozenset({318}), 480), (frozenset({480}), 318), (frozenset({356}), 480)]
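The nested loops above turn each itemset into one rule per member: that member becomes the conclusion and the rest the premise. For a single hypothetical frequent itemset of length 3:

```python
# One hypothetical frequent itemset of three movie ids.
itemset = frozenset({47, 50, 318})
# Each member in turn becomes the conclusion; the rest form the premise.
rules = [(itemset - {conclusion}, conclusion) for conclusion in itemset]
for premise, conclusion in rules:
    print(sorted(premise), "->", conclusion)
```

A length-3 itemset therefore yields exactly three candidate rules, which is why 132 frequent itemsets expand to 425 candidates.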

# Compute the confidence of each rule.

# Keep separate counts of how often each rule holds (the conclusion follows) and how often it does not.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
# Iterate over every user and their favorable movies, and over every candidate rule.
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        # The rule applies to a user only if they liked every movie in the premise.
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
# Confidence = number of times the rule held / number of times the premise applied.
rule_confidence = {candidate_rule: correct_counts[candidate_rule]/float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                  for candidate_rule in candidate_rules}
print len(rule_confidence)
           

425
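Confidence is the fraction of users to whom a rule's premise applies who also liked the conclusion. A minimal sketch on hypothetical data:

```python
# Hypothetical favorite-movie sets for four users.
favorable_by_user = {
    1: frozenset({10, 20}),
    2: frozenset({10, 20}),
    3: frozenset({10}),   # premise applies but the conclusion does not follow
    4: frozenset({20}),   # premise does not apply at all
}
premise, conclusion = frozenset({10}), 20

applies = [r for r in favorable_by_user.values() if premise.issubset(r)]
holds = [r for r in applies if conclusion in r]
confidence = len(holds) / float(len(applies))
print(confidence)  # 2 of the 3 applicable users also liked movie 20
```

User 4 never enters the denominator: rules are only judged on users their premise applies to.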

# Optionally keep only the rules above a minimum confidence:
#min_confidence = 0.9
#rule_confidence = {rule: confidence for rule, confidence in rule_confidence.items() if confidence > min_confidence}
#print len(rule_confidence)

# Sort the confidence dict and print the five rules with the highest confidence.
from operator import itemgetter
sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)

# Note: max()/min() compare the whole (rule, confidence) tuples, not the confidence values.
print max(sorted_confidence)
print min(sorted_confidence)

for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion)
    print "- Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
           

((frozenset([296, 593, 50, 318, 47]), 356), 0.11055276381909548)
((frozenset([296]), 47), 0.4020100502512563)
Rule #1
Rule: If a person recommends frozenset([296]) they will also recommend 527
- Confidence: 0.402
Rule #2
Rule: If a person recommends frozenset([296]) they will also recommend 2858
- Confidence: 0.402
Rule #3
Rule: If a person recommends frozenset([296]) they will also recommend 480
- Confidence: 0.402
Rule #4
Rule: If a person recommends frozenset([296]) they will also recommend 50
- Confidence: 0.402
Rule #5
Rule: If a person recommends frozenset([296]) they will also recommend 593
- Confidence: 0.402

# Looking up movie names
# Data file: movies.csv
# Header: movieId,title,genres
movie_name_data = pd.read_csv("movies.csv")
movie_name_data.head()

# A helper that returns a movie's title given its id.
def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data['movieId'] == movie_id]['title']
    title = title_object.values[0]
    return title
get_movie_name(4)
           

'Waiting to Exhale (1995)'

# Print the rules again, this time showing movie names.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print " - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)])
    print ""
           

Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
- Confidence: 0.402
Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
- Confidence: 0.402
Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
- Confidence: 0.402
Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
- Confidence: 0.402
Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
- Confidence: 0.402

# Evaluation
# A simple look at how each rule performs on held-out data.

# All users not used for training form the test set.
test_dataset = all_ratings[~all_ratings['userId'].isin(range(200))]
test_favorable = test_dataset[test_dataset['Favorable']]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v in test_favorable.groupby('userId')['movieId'])
test_dataset.head()
           
userId movieId rating timestamp Favorable
25048 200 6 5.0 1996-08-11 12:59:30 True
25049 200 10 3.0 1996-08-11 12:53:11 False
25050 200 17 4.0 1996-08-11 12:57:25 True
25051 200 19 2.0 1996-08-11 12:54:08 False
25052 200 20 4.0 1996-08-11 13:05:27 True
# Count, on the test data, how often each rule holds.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
print len(correct_counts)
           
425
           
# Compute the test-set confidence of every rule.
test_confidence = {candidate_rule: correct_counts[candidate_rule] / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                  for candidate_rule in rule_confidence}
print len(test_confidence)

sorted_test_confidence = sorted(test_confidence.items(), key=itemgetter(1), reverse=True)
print sorted_test_confidence[:5]
           
425
[((frozenset([296]), 2858), 0.4020100502512563), ((frozenset([296]), 480), 0.4020100502512563), ((frozenset([296]), 50), 0.4020100502512563), ((frozenset([296]), 593), 0.4020100502512563), ((frozenset([296]), 47), 0.4020100502512563)]
           
# Print the best rules with movie names, comparing train and test confidence.
for index in range(5):
    print "Rule #{0}".format(index + 1)
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print "Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name)
    print "- Train Confidence: {0:.3f}".format(rule_confidence.get((premise, conclusion), -1))
    print "- Test Confidence: {0:.3f}".format(test_confidence.get((premise, conclusion), -1))
    print ""
           
Rule #1
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Schindler's List (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #2
Rule: If a person recommends Pulp Fiction (1994) they will also recommend American Beauty (1999)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #3
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Jurassic Park (1993)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #4
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Usual Suspects, The (1995)
- Train Confidence: 0.402
- Test Confidence: 0.402

Rule #5
Rule: If a person recommends Pulp Fiction (1994) they will also recommend Silence of the Lambs, The (1991)
- Train Confidence: 0.402
- Test Confidence: 0.402
           
