基本思想
俗話說“物以類聚、人以群分”,拿看電影這個例子來說,如果你喜歡《蝙蝠俠》、《碟中諜》、《星際穿越》、《源代碼》等電影,另外有個人也都喜歡這些電影,而且他還喜歡《鋼鐵俠》,則很有可能你也喜歡《鋼鐵俠》這部電影。
是以說,當一個使用者 A 需要個性化推薦時,可以先找到和他興趣相似的使用者群體 G,然後把 G 喜歡的、并且 A 沒有聽說過的物品推薦給 A,這就是基于使用者的系統過濾算法。
原理
根據上述基本原理,我們可以将基于使用者的協同過濾推薦算法拆分為兩個步驟:
- 找到與目标使用者興趣相似的使用者集合
- 找到這個集合中使用者喜歡的、并且目标使用者沒有聽說過的物品推薦給目标使用者
1、發現興趣相似的使用者
通常用 Jaccard 公式或者餘弦相似度計算兩個使用者之間的相似度。設 N(u) 為使用者 u 喜歡的物品集合,N(v) 為使用者 v 喜歡的物品集合,那麼 u 和 v 的相似度是多少呢:
Jaccard 公式:
餘弦相似度:
假設目前共有4個使用者: A、B、C、D;共有5個物品:a、b、c、d、e。使用者與物品的關系(使用者喜歡物品)如下圖所示:
如何一下子計算所有使用者之間的相似度呢?為計算友善,通常首先需要建立“物品—使用者”的倒排表,如下圖所示:
然後對于每個物品,喜歡他的使用者,兩兩之間相同物品加1。例如喜歡物品 a 的使用者有 A 和 B,那麼在矩陣中他們兩兩加1。如下圖所示:
計算使用者兩兩之間的相似度,上面的矩陣僅僅代表的是公式的分子部分。以餘弦相似度為例,對上圖進行進一步計算:
到此,計算使用者相似度就大功告成,可以很直覺的找到與目标使用者興趣較相似的使用者。
2、推薦物品
首先需要從矩陣中找出與目标使用者 u 最相似的 K 個使用者,用集合 S(u, K) 表示,将 S 中使用者喜歡的物品全部提取出來,并去除 u 已經喜歡的物品。對于每個候選物品 i ,使用者 u 對它感興趣的程度用如下公式計算:
其中 rvi 表示使用者 v 對 i 的喜歡程度,因為使用的是單一行為的隐回報資料,在本例中 rvi 都是為 1,在一些需要使用者給予評分的推薦系統中,則要代入使用者評分。
舉個例子,假設我們要給 A 推薦物品,選取 K = 3 個相似使用者,相似使用者則是:B、C、D,那麼他們喜歡過并且 A 沒有喜歡過的物品有:c、e,那麼分别計算 p(A, c) 和 p(A, e):
看樣子使用者 A 對 c 和 e 的喜歡程度可能是一樣的,在真實的推薦系統中,隻要按得分排序,取前幾個物品就可以了。
python代碼
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 24 17:07:29 2018
@author: Administrator
"""
import random
import math
class UserBasedCF:
def __init__(self,datafile = None):
self.datafile = datafile
self.readData()
self.splitData(3,47)
def readData(self,datafile = None):
"""
read the data from the data file which is a data set
"""
self.datafile = datafile or self.datafile
self.data = []
for line in open(self.datafile):
userid,itemid,record,_ = line.split()
self.data.append((userid,itemid,int(record)))
def splitData(self,k,seed,data=None,M = 8):
"""
split the data set
testdata is a test data set
traindata is a train set
test data set : train data set = 1:M-1
"""
self.testdata = {}
self.traindata = {}
data = data or self.data
random.seed(seed)
for user,item, record in self.data:
if random.randint(0,M) == k:
self.testdata.setdefault(user,{})
self.testdata[user][item] = record
else:
self.traindata.setdefault(user,{})
self.traindata[user][item] = record
def userSimilarity(self,train = None):
"""
One method of getting user similarity matrix
"""
train = train or self.traindata
self.userSim = dict()
for u in train.keys():
for v in train.keys():
if u == v:
continue
self.userSim.setdefault(u,{})
self.userSim[u][v] = len(set(train[u].keys()) & set(train[v].keys()))
self.userSim[u][v] /=math.sqrt(len(train[u]) * len(train[v]) *1.0)
def userSimilarityBest(self,train = None):
"""
the other method of getting user similarity which is better than above
you can get the method on page 46
In this experiment,we use this method
"""
train = train or self.traindata
self.userSimBest = dict()
item_users = dict()
for u,item in train.items():
for i in item.keys():
item_users.setdefault(i,set())
item_users[i].add(u)
user_item_count = dict()
count = dict()
for item,users in item_users.items():
for u in users:
user_item_count.setdefault(u,0)
user_item_count[u] += 1
for v in users:
if u == v:continue
count.setdefault(u,{})
count[u].setdefault(v,0)
count[u][v] += 1
for u ,related_users in count.items():
self.userSimBest.setdefault(u,dict())
for v, cuv in related_users.items():
self.userSimBest[u][v] = cuv / math.sqrt(user_item_count[u] * user_item_count[v] * 1.0)
def recommend(self,user,train = None,k = 8,nitem = 40):
train = train or self.traindata
rank = dict()
interacted_items = train.get(user,{})
for v ,wuv in sorted(self.userSimBest[user].items(),key = lambda x : x[1],reverse = True)[0:k]:
for i , rvi in train[v].items():
if i in interacted_items:
continue
rank.setdefault(i,0)
rank[i] += wuv * rvi
return dict(sorted(rank.items(),key = lambda x :x[1],reverse = True)[0:nitem])
def recallAndPrecision(self,train = None,test = None,k = 8,nitem = 10):
"""
Get the recall and precision, the method you want to know is listed
in the page 43
"""
train = train or self.traindata
test = test or self.testdata
hit = 0
recall = 0
precision = 0
for user in train.keys():
tu = test.get(user,{})
rank = self.recommend(user, train = train,k = k,nitem = nitem)
for item,_ in rank.items():
if item in tu:
hit += 1
recall += len(tu)
precision += nitem
return (hit / (recall * 1.0),hit / (precision * 1.0))
def coverage(self,train = None,test = None,k = 8,nitem = 10):
train = train or self.traindata
test = test or self.testdata
recommend_items = set()
all_items = set()
for user in train.keys():
for item in train[user].keys():
all_items.add(item)
rank = self.recommend(user, train, k = k, nitem = nitem)
for item,_ in rank.items():
recommend_items.add(item)
return len(recommend_items) / (len(all_items) * 1.0)
def popularity(self,train = None,test = None,k = 8,nitem = 10):
"""
Get the popularity
the algorithm on page 44
"""
train = train or self.traindata
test = test or self.testdata
item_popularity = dict()
for user ,items in train.items():
for item in items.keys():
item_popularity.setdefault(item,0)
item_popularity[item] += 1
ret = 0
n = 0
for user in train.keys():
rank = self.recommend(user, train, k = k, nitem = nitem)
for item ,_ in rank.items():
ret += math.log(1+item_popularity[item])
n += 1
return ret / (n * 1.0)
def testRecommend():
ubcf = UserBasedCF('data.txt')
ubcf.readData()
ubcf.splitData(4,100)
ubcf.userSimilarity()
user = "345"
rank = ubcf.recommend(user,k = 3)
for i,rvi in rank.items():
items = ubcf.testdata.get(user,{})
record = items.get(i,0)
print ("%5s: %.4f--%.4f" %(i,rvi,record))
def testUserBasedCF():
cf = UserBasedCF('data.txt')
cf.userSimilarityBest()
print ("%3s%20s%20s%20s%20s" % ('K',"recall",'precision','coverage','popularity'))
for k in [5,10,20,40,80,160]:
recall,precision = cf.recallAndPrecision( k = k)
coverage = cf.coverage(k = k)
popularity = cf.popularity(k = k)
print ("%3d%19.3f%%%19.3f%%%19.3f%%%20.3f" % (k,recall * 100,precision * 100,coverage * 100,popularity))
if __name__ == "__main__":
testUserBasedCF()
data
1 111 2.5
1 222 3.5
1 333 3.0
1 444 3.5
1 555 2.5
1 666 3.0
2 111 3.0
2 222 3.5
2 333 1.5
2 444 5.0
2 666 3.0
2 555 3.5
3 111 2.5
3 222 3.0
3 444 3.5
3 666 4.0
4 222 3.5
4 333 3.0
4 666 4.5
4 444 4.0
4 555 2.5
5 111 3.0
5 222 4.0
5 333 2.0
5 444 3.0
5 666 3.0
5 555 2.0
6 111 3.0
6 222 4.0
6 666 3.0
6 444 5.0
6 555 3.5
7 222 4.5
7 555 1.0
7 444 4.0
另一份代碼(手撸過的)
def read_data(filename):
"""
從檔案中讀取資料
"""
data = []
for line in open(filename):
userid, itemid, record = line.split()
data.append([userid, itemid, float(record)])
print('data:', data)
return data
import random
def split_data(M, K, seed, data):
"""
M:将資料集分為多少分
K:[0,M]之間的随機整數
seed:種子,随意
data:待劃分資料集
"""
traindata = {}
testdata = {}
random.seed(seed)
for user, item, record in data:
if random.randint(0, M) == K:
testdata.setdefault(user, {})
testdata[user][item] = record
else:
traindata.setdefault(user, {})
traindata[user][item] = record
return traindata, testdata
import math
def UserSimilarity_1(train):
"""
計算使用者之間的相似度
"""
W = {}
for u in train.keys():
for v in train.keys():
if u == v: continue
W.setdefault(u, {})
W[u][v] = len(set(train[u].keys()) & set(train[v].keys()))
W[u][v] /= math.sqrt(len(train[u]) * len(train[v] * 1.0))
return W
def UserSimilarity_2(train):
"""
建立物品使用者倒排表
"""
item_users = {}
for u, items in train.items():
for i in items.keys():
if i not in item_users: item_users.setdefault(i, set())
item_users[i].add(u)
"""
建立稀疏矩陣C[u][v]
"""
C = {}
N = {}
for item, users in item_users.items():
for u in users:
N.setdefault(u, 0)
N[u] += 1
for v in users:
if u == v: continue
C.setdefault(u, {})
C[u].setdefault(v, 0)
C[u][v] += 1
print('N:', N)
print('C:', C)
"""
計算相似度矩陣
"""
W = {}
for u, related_users in C.items():
W.setdefault(u, {})
for v, cuv in related_users.items():
W[u][v] = cuv / math.sqrt(N[u] * N[v])
print('W:', W)
return W
from operator import itemgetter
def recommend(user, train, W, k, nitem):
"""
user:給user推薦物品
train:訓練集
W:相似度矩陣
k:推薦k個相似使用者
nitem:推薦nitem個物品
"""
rank = {}
interacted_items = train[user].keys()
for v, wuv in sorted(W[user].items(), key = itemgetter(1), reverse = True)[0:k]:
for i, rvi in train[v].items():
if i in interacted_items: continue
rank.setdefault(i, 0)
rank[i] += wuv * rvi
print('rank', dict(sorted(rank.items(), key = itemgetter(1), reverse = True)[0:nitem]))
return dict(sorted(rank.items(), key = itemgetter(1), reverse = True)[0:nitem])
def recall_and_precision(train, test, k, nitem):
"""
召回率和精度
"""
hit = 0
recall = 0
precision = 0
for user in train.keys():
tu = test[user]
rank = recommend(user, train, k, nitem)
for item, pui in rank.items():
if item in tu: hit += 1
recall += len(tu)
precision += nitem
return (hit / (recall * 1.0)), (hit / (precision * 1.0))
def coverage(train, test, k, nitem):
"""
覆寫率
"""
recommend_items = set()
all_items = set()
for user in train.keys():
for item in train[user].keys():
all_items.add(item)
rank = recommend(user, train, k, nitem)
for item in rank.keys():
recommend_items.add(item)
return len(recommend_items) / (len(all_items) * 1.0)
def popularity(train, test, k, nitem):
"""
新穎度
"""
item_popularity = {}
for user, items in train.items():
for item in items.keys():
if item not in item_popularity: item_popularity.setdefault(item, 0)
item_popularity[item] += 1
ret = 0
n = 0
for user in train.keys():
rank = recommend(user, train, k, nitem)
for item in rank.keys():
ret += math.log(1 + item_popularity[item])
n += 1
ret /= n * 1.0
return ret
if __name__=='__main__':
data = read_data('data.txt')
print('\n')
train, test = split_data(8, 2, 1, data)
print('\n')
W = UserSimilarity_2(train)
print('\n')
rank = recommend('6', train, W, 3, 5)
('data:', [['1', '111', 2.5], ['1', '222', 3.5], ['1', '333', 3.0], ['1', '444', 3.5], ['1', '555', 2.5], ['1', '666', 3.0], ['2', '111', 3.0], ['2', '222', 3.5], ['2', '333', 1.5], ['2', '444', 5.0], ['2', '666', 3.0], ['2', '555', 3.5], ['3', '111', 2.5], ['3', '222', 3.0], ['3', '444', 3.5], ['3', '666', 4.0], ['4', '222', 3.5], ['4', '333', 3.0], ['4', '666', 4.5], ['4', '444', 4.0], ['4', '555', 2.5], ['5', '111', 3.0], ['5', '222', 4.0], ['5', '333', 2.0], ['5', '444', 3.0], ['5', '666', 3.0], ['5', '555', 2.0], ['6', '111', 3.0], ['6', '222', 4.0], ['6', '666', 3.0], ['6', '444', 5.0], ['6', '555', 3.5], ['7', '222', 4.5], ['7', '555', 1.0], ['7', '444', 4.0]])
('N:', {'1': 5, '3': 4, '2': 6, '5': 6, '4': 4, '7': 2, '6': 3})
('C:', {'1': {'3': 3, '2': 5, '5': 5, '4': 3, '7': 2, '6': 3}, '3': {'1': 3, '2': 4, '5': 4, '4': 2, '7': 1, '6': 3}, '2': {'1': 5, '3': 4, '5': 6, '4': 4, '7': 2, '6': 3}, '5': {'1': 5, '3': 4, '2': 6, '4': 4, '7': 2, '6': 3}, '4': {'1': 3, '3': 2, '2': 4, '5': 4, '7': 1, '6': 1}, '7': {'1': 2, '3': 1, '2': 2, '5': 2, '4': 1, '6': 1}, '6': {'1': 3, '3': 3, '2': 3, '5': 3, '4': 1, '7': 1}})
('W:', {'1': {'3': 0.6708203932499369, '2': 0.9128709291752769, '5': 0.9128709291752769, '4': 0.6708203932499369, '7': 0.6324555320336759, '6': 0.7745966692414834}, '3': {'1': 0.6708203932499369, '2': 0.8164965809277261, '5': 0.8164965809277261, '4': 0.5, '7': 0.35355339059327373, '6': 0.8660254037844387}, '2': {'1': 0.9128709291752769, '3': 0.8164965809277261, '5': 1.0, '4': 0.8164965809277261, '7': 0.5773502691896258, '6': 0.7071067811865476}, '5': {'1': 0.9128709291752769, '3': 0.8164965809277261, '2': 1.0, '4': 0.8164965809277261, '7': 0.5773502691896258, '6': 0.7071067811865476}, '4': {'1': 0.6708203932499369, '3': 0.5, '2': 0.8164965809277261, '5': 0.8164965809277261, '7': 0.35355339059327373, '6': 0.2886751345948129}, '7': {'1': 0.6324555320336759, '3': 0.35355339059327373, '2': 0.5773502691896258, '5': 0.5773502691896258, '4': 0.35355339059327373, '6': 0.4082482904638631}, '6': {'1': 0.7745966692414834, '3': 0.8660254037844387, '2': 0.7071067811865476, '5': 0.7071067811865476, '4': 0.2886751345948129, '7': 0.4082482904638631}})
('rank', {'555': 4.411365407256625, '333': 3.3844501795042716, '444': 6.566622819178273})