AutoRec (Autoencoders Meet Collaborative Filtering): Reading Notes and a PyTorch Implementation
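Contents
- 1. A brief reading of the AutoRec paper
- 2. PyTorch implementation
- (1) Building the AutoRec neural network
- (2) Training, testing, and prediction
- (3) Dataset setup
- (4) Data loading and the main function
- 3. Summary of results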
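1. A brief reading of the AutoRec paper
Collaborative filtering is essentially a matrix completion problem, and AutoRec tackles it with an autoencoder. The idea is to let the autoencoder directly learn a compressed vector representation of the rows or columns of the rating matrix, which gives two variants: user-based and item-based. For the item-based variant, the encoder input is each column of the rating matrix R, so every item is described by the vector of scores that the users gave it; the user-based variant uses each row of R instead. The paper illustrates the network structure of item-based AutoRec in a figure.
The results reported in the paper show that item-based AutoRec outperforms user-based AutoRec.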
The objective function of item-based AutoRec is:

$$\min_{\theta} \sum_{i=1}^{n}\left\|\mathbf{r}^{i}-h\left(\mathbf{r}^{i} ; \theta\right)\right\|_{\mathcal{O}}^{2}+\frac{\lambda}{2} \cdot\left(\|\mathbf{W}\|_{F}^{2}+\|\mathbf{V}\|_{F}^{2}\right)$$

From this we can see how AutoRec improves on a plain autoencoder (AE):
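- In the objective, $\|\cdot\|_{\mathcal{O}}^{2}$ means that only the observed ratings contribute. In other words, the loss is computed only over entries that actually exist in the matrix. Missing values are filled in by hand with a default value (for example, 3 in a 5-point rating system).
- The second half of the objective adds the regularization term $\frac{\lambda}{2} \cdot\left(\|\mathbf{W}\|_{F}^{2}+\|\mathbf{V}\|_{F}^{2}\right)$, which penalizes the weight matrices $\mathbf{W}$ and $\mathbf{V}$ to prevent overfitting. Here the regularisation strength satisfies $\lambda > 0$.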
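The two points above can be sketched in PyTorch as follows. This is only an illustration of the objective, not the code used later in this post: `masked_autorec_loss` is a hypothetical helper, the tensors are made-up toy values, and `W` and `V` simply stand in for the encoder and decoder weight matrices.

```python
import torch


def masked_autorec_loss(pred, target, mask, W, V, lam):
    """Squared error on observed entries plus (lam / 2) * Frobenius penalties."""
    # mask is 1 where a rating was observed and 0 elsewhere,
    # so unobserved entries contribute nothing to the error term.
    sq_err = ((pred - target) * mask).pow(2).sum()
    # Squared Frobenius norm = sum of squared entries of the matrix.
    reg = 0.5 * lam * (W.pow(2).sum() + V.pow(2).sum())
    return sq_err + reg


# Tiny example: one item vector over 3 users, the middle rating is missing.
target = torch.tensor([[4.0, 0.0, 2.0]])
mask = torch.tensor([[1.0, 0.0, 1.0]])
pred = torch.tensor([[3.0, 5.0, 2.0]])
W = torch.ones(2, 3)  # placeholder encoder weights
V = torch.ones(3, 2)  # placeholder decoder weights

loss = masked_autorec_loss(pred, target, mask, W, V, lam=0.1)
print(loss.item())  # about 1.6: 1.0 from the error term, 0.6 from the penalty
```

Note that the prediction of 5.0 for the missing middle entry has no effect on the loss, which is exactly what the $\mathcal{O}$ subscript expresses.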
2. PyTorch implementation
(1) Building the AutoRec neural network
# Saved as 'network.py'
from collections import OrderedDict

import torch
from torch import nn


# AutoEncoder
class AutoEncoder(nn.Module):
    def __init__(self, hidden, dropout=0.1):
        super().__init__()
        # Encoder: hidden[0] -> hidden[1] -> ... -> hidden[-1]
        d1 = OrderedDict()
        for i in range(len(hidden) - 1):
            # nn.Linear(in_features, out_features, bias=True)
            d1['enc_linear' + str(i)] = nn.Linear(hidden[i], hidden[i + 1])
            d1['enc_bn' + str(i)] = nn.BatchNorm1d(hidden[i + 1])
            d1['enc_drop' + str(i)] = nn.Dropout(dropout)
            d1['enc_relu' + str(i)] = nn.ReLU()
        self.encoder = nn.Sequential(d1)
        # Decoder: mirrors the encoder, hidden[-1] -> ... -> hidden[0]
        d2 = OrderedDict()
        for i in range(len(hidden) - 1, 0, -1):
            d2['dec_linear' + str(i)] = nn.Linear(hidden[i], hidden[i - 1])
            d2['dec_bn' + str(i)] = nn.BatchNorm1d(hidden[i - 1])
            d2['dec_drop' + str(i)] = nn.Dropout(dropout)
            # Despite the 'dec_relu' key, this layer is a Sigmoid,
            # so the decoder output is bounded to the range (0, 1).
            d2['dec_relu' + str(i)] = nn.Sigmoid()
        self.decoder = nn.Sequential(d2)

    def forward(self, x):
        x = self.decoder(self.encoder(x))
        return x
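One detail of the decoder above is worth checking: its last activation is a Sigmoid, so every output is confined to the interval from 0 to 1, while the rating targets used later are on a 1 to 5 scale. The sketch below is my own illustration, not part of the original code. It demonstrates the bound with a single stand-in layer and shows one possible workaround, rescaling ratings into [0, 1] before training and back afterwards. The constant `MAX_RATING = 5.0` assumes a 5-star scale.

```python
import torch
from torch import nn

torch.manual_seed(0)
MAX_RATING = 5.0  # assumption: ratings are on a 1-5 scale

# A stand-in for the final decoder stage: Linear followed by Sigmoid.
decoder_tail = nn.Sequential(nn.Linear(8, 4), nn.Sigmoid())
out = decoder_tail(torch.randn(16, 8) * 10)

# Sigmoid outputs never leave [0, 1], so raw targets such as 4 or 5 are unreachable.
print(out.min().item() >= 0.0, out.max().item() <= 1.0)

# Possible workaround: train on ratings / MAX_RATING, then scale predictions back.
ratings = torch.tensor([1.0, 3.0, 5.0])
scaled = ratings / MAX_RATING      # values in [0.2, 1.0]
restored = scaled * MAX_RATING     # back on the original scale
print(scaled, restored)
```

Whether this fully explains a gap to the paper's numbers is not something this snippet can show; it only confirms the output range.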
(2) Training, testing, and prediction
# Saved as 'model.py'
import math

import torch
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader

import network as nets


class Model:
    def __init__(self, hidden, learning_rate, batch_size):
        self.batch_size = batch_size
        self.net = nets.AutoEncoder(hidden)
        self.opt = optim.SGD(self.net.parameters(), learning_rate, momentum=0.9, weight_decay=1e-4)
        self.feature_size = hidden[0]

    def run(self, trainset, testlist, num_epoch):
        for epoch in range(1, num_epoch + 1):
            train_loader = DataLoader(trainset, self.batch_size, shuffle=True, pin_memory=True)
            self.train(train_loader, epoch)
            self.test(trainset, testlist)

    # Mini-batch training
    def train(self, train_loader, epoch):
        self.net.train()
        for bid, (feature, mask) in enumerate(train_loader):
            self.opt.zero_grad()
            output = self.net(feature)
            # Multiply by the mask so that unobserved entries add zero error.
            # Note that mse_loss still averages over all entries, observed or not.
            loss = F.mse_loss(output * mask, feature * mask)
            # loss = F.mse_loss(output, feature)
            loss.backward()
            self.opt.step()
        print("Epoch %d, train end." % epoch)

    def test(self, trainset, testlist):
        self.net.eval()
        x_mat, mask, user_based = trainset.get_mat()
        # No gradients are needed for evaluation
        with torch.no_grad():
            xc = self.net(x_mat)
        if not user_based:
            xc = xc.t()  # back to shape (n_user, n_item)
        xc = xc.cpu().numpy()
        rmse = 0.0
        for (i, j, r) in testlist:
            rmse += (xc[i][j] - r) * (xc[i][j] - r)
        rmse = math.sqrt(rmse / len(testlist))
        print(" Test RMSE = %f" % rmse)
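The test loop above accumulates squared errors one rating at a time. As a small cross-check, the same RMSE can be computed in vectorized form with NumPy. The function name `rmse_from_triples` and the toy numbers below are hypothetical and only serve as an illustration.

```python
import numpy as np


def rmse_from_triples(pred_matrix, triples):
    """RMSE over (user_index, item_index, rating) triples."""
    users, items, ratings = (np.array(col) for col in zip(*triples))
    # Fancy indexing picks the predicted rating for every (user, item) pair at once.
    errors = pred_matrix[users, items] - ratings
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy predictions for 2 users and 2 items, plus three held-out ratings.
pred_matrix = np.array([[4.0, 2.0],
                        [3.0, 5.0]])
test_triples = [(0, 0, 5), (1, 1, 5), (0, 1, 4)]

rmse = rmse_from_triples(pred_matrix, test_triples)
print(rmse)  # errors are -1, 0 and -2, so this is sqrt(5 / 3)
```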
(3) Dataset setup
# Saved as 'setdata.py'
import numpy as np
import torch
from torch.utils import data


class Dataset(data.Dataset):
    def __init__(self, rating_list, n_user, n_item, user_based=True):
        self.data = rating_list
        self.user_based = user_based
        self.n_user = n_user
        self.n_item = n_item
        # Rating matrix (0 for missing entries) and mask (1 where a rating exists)
        self.x_mat = np.zeros((n_user, n_item))
        self.mask = np.zeros((n_user, n_item))
        for u, v, r in self.data:
            self.x_mat[u][v] = r
            self.mask[u][v] = 1
        self.x_mat = torch.from_numpy(self.x_mat).float()
        self.mask = torch.from_numpy(self.mask).float()
        # Item-based: transpose so that each sample is one item's rating vector
        if not self.user_based:
            self.x_mat = self.x_mat.t()
            self.mask = self.mask.t()

    def __getitem__(self, index):
        return self.x_mat[index], self.mask[index]

    def __len__(self):
        if self.user_based:
            return self.n_user
        return self.n_item

    def get_mat(self):
        return self.x_mat, self.mask, self.user_based
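To see what the training loop receives from this dataset, here is a minimal sketch that builds the rating matrix and mask from a few made-up triples and iterates over them with a `DataLoader`. It uses `TensorDataset` as a stand-in for the `Dataset` class above so that the snippet runs on its own; the sizes and ratings are invented.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

n_user, n_item = 3, 4
rating_list = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 3, 1)]  # (user, item, rating)

# Fill the rating matrix and the observation mask, as in the class above.
x_mat = np.zeros((n_user, n_item), dtype=np.float32)
mask = np.zeros((n_user, n_item), dtype=np.float32)
for u, v, r in rating_list:
    x_mat[u, v] = r
    mask[u, v] = 1.0

# Item-based layout: transpose so that each sample is one item's rating vector.
x_items = torch.from_numpy(x_mat).t().contiguous()
m_items = torch.from_numpy(mask).t().contiguous()

loader = DataLoader(TensorDataset(x_items, m_items), batch_size=2, shuffle=False)
for feature, batch_mask in loader:
    # Each batch holds 2 items, each described by a vector of n_user ratings.
    print(feature.shape, batch_mask.sum().item())
```

Each batch pairs the rating vectors with a mask of the same shape, which is what the masked loss in the training loop relies on.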
(4) Data loading and the main function
import os
import sys
from datetime import datetime

import numpy as np

sys.path.append('../')
import setdata as sd
import model

path_prefix = '../datasets/'


def load_data(dataset='ratings', train_ratio=0.9):
    fname = path_prefix + dataset + '.dat'
    max_uid = 0
    max_vid = 0
    records = []
    if not os.path.exists(fname):
        print('[Error] File %s not found!' % fname)
        sys.exit(-1)
    first_line_flag = True
    with open(fname, encoding="ISO-8859-1") as f:
        for line in f:
            tks = line.strip().split('::')  # split the line into a list of fields
            if first_line_flag:
                # The first line only initializes the maxima; it is not added to records
                max_uid = int(tks[0])
                max_vid = int(tks[1])
                first_line_flag = False
                continue
            max_uid = max(max_uid, int(tks[0]))
            max_vid = max(max_vid, int(tks[1]))
            # Store zero-based (user, item, rating) triples
            records.append((int(tks[0]) - 1, int(tks[1]) - 1, int(tks[2])))
    print("Max user ID {0}. Max item ID {1}. In total {2} ratings.".format(
        max_uid, max_vid, len(records)))
    # Random train/test split
    np.random.shuffle(records)
    train_list = records[0:int(len(records) * train_ratio)]
    test_list = records[int(len(records) * train_ratio):]
    return train_list, test_list, max_uid, max_vid


# __main__
# parameters
rank = 100
batch_size = 128
user_based = False

if __name__ == '__main__':
    start = datetime.now()
    train_list, test_list, n_user, n_item = load_data('u', 0.9)
    trainset = sd.Dataset(train_list, n_user, n_item, user_based)
    # The input size is the length of one sample vector
    if user_based:
        h = n_item
    else:
        h = n_user
    mod = model.Model(hidden=[h, rank * 3],
                      learning_rate=0.2,
                      batch_size=batch_size)
    mod.run(trainset, test_list, num_epoch=500)
    end = datetime.now()
    print("Total time: %s" % str(end - start))
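The loader above expects each line to hold `::`-separated fields, as in the MovieLens 1M `ratings.dat` format (`UserID::MovieID::Rating::Timestamp`). The sketch below parses a few in-memory lines the same way and converts IDs to zero-based indices. The sample lines and the `parse_line` helper are only for illustration and are not part of the original script.

```python
def parse_line(line):
    """Turn 'user::item::rating::timestamp' into a zero-based (user, item, rating) triple."""
    tks = line.strip().split('::')
    return int(tks[0]) - 1, int(tks[1]) - 1, int(tks[2])


# Illustrative lines in the UserID::MovieID::Rating::Timestamp layout.
sample_lines = [
    "1::1193::5::978300760\n",
    "2::661::3::978302109\n",
    "2::914::4::978301968\n",
]
records = [parse_line(line) for line in sample_lines]

# The largest zero-based index plus one gives each matrix dimension.
n_user = max(u for u, _, _ in records) + 1
n_item = max(v for _, v, _ in records) + 1
print(records[0], n_user, n_item)
```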
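3. Summary of results
I tested the code on two MovieLens datasets, ml-100k and ml-1M. Unfortunately, neither run reached the results described in the paper. I have put the code on my GitHub and hope that someone more experienced will see it and offer some pointers. GitHub link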
————————————————
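Copyright notice: This is an original article by CSDN blogger "StudyLess", licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/studyless/article/details/70880829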