Using Word2Vec for Text-Similarity-Based Recommendation
The earlier text-similarity-based recommendation used one-hot word vectors. Although a sparse format can store only the non-zero entries, this representation still has serious problems:
- Sparse vectors are an inefficient representation, and the dimensionality of the word vectors needs to be reduced.
- It is hard to discover relationships between words, or to capture the connection between sentence structure and semantics.
With Word2Vec, each word gets a dense vector; distances between these vectors reflect how related the words are, so semantically similar words are mapped to nearby points in the embedding space, as the sketch below illustrates.
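Why this matters, as a minimal sketch (the toy vectors are hypothetical, not taken from the model trained below): one-hot vectors of two different words are always orthogonal, so their cosine similarity is 0 no matter how related the words are, while dense embeddings can place related words close together.

import numpy as np

def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# One-hot: two different words occupy different slots of a huge vocabulary,
# so their cosine similarity is always exactly 0
vocab_size = 69204
onehot_a = np.zeros(vocab_size)
onehot_a[8] = 1.0
onehot_b = np.zeros(vocab_size)
onehot_b[20] = 1.0
print(cosine(onehot_a, onehot_b))  # 0.0

# Dense embeddings (toy 4-dimensional values): related words can score high
dense_a = np.array([0.8, 0.1, -0.3, 0.5])
dense_b = np.array([0.7, 0.2, -0.2, 0.4])
print(cosine(dense_a, dense_b))    # ~0.99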
Classical Poem Recommendation Based on Word2Vec
import pandas as pd
# Load the poem corpus and keep only the id, title, and body columns
df = pd.read_csv("/home/liang/Desktop/python_file/source.csv")
df = df[["poemId", "poemTitle", "poemContent"]]
df
| | poemId | poemTitle | poemContent |
|---|---|---|---|
| 0 | 1 | 關雎 | 關關雎鸠,在河之洲。窈窕淑女,君子好逑。\n參差荇菜,左右流之。窈窕淑女,寤寐求之。\n求之... |
| 1 | 2 | 葛覃 | 葛之覃兮,施于中谷,維葉萋萋。黃鳥于飛,集于灌木,其鳴喈喈。\n葛之覃兮,施于中谷,維葉莫莫... |
| 2 | 3 | 卷耳 | 采采卷耳,不盈頃筐。嗟我懷人,寘彼周行。\n陟彼崔嵬,我馬虺隤。我姑酌彼金罍,維以不永懷。\... |
| 3 | 4 | 樛木 | 南有樛木,葛藟累之。樂隻君子,福履綏之。\n南有樛木,葛藟荒之。樂隻君子,福履将之。\n南有... |
| 4 | 5 | 螽斯 | 螽斯羽,诜诜兮。宜爾子孫,振振兮。\n螽斯羽,薨薨兮。宜爾子孫,繩繩兮。\n螽斯羽,揖揖兮。... |
| ... | ... | ... | ... |
| 72406 | 73277 | 題八詠樓 | 千古風流八詠樓,江山留與後人愁。\n水通南國三千裡,氣壓江城十四州。 |
| 72407 | 73278 | 偶成 | 十五年前花月底,相從曾賦賞花詩。\n今看花月渾相似,安得情懷似往時。 |
| 72408 | 73279 | 江行 | 暝色蒹葭外,蒼茫旅眺情。\n殘雪和雁斷,新月帶潮生。\n天到水中盡,舟随樹杪行。\n離家今幾... |
| 72409 | 73280 | 芙蓉池作 | 乘辇夜行遊,逍遙步西園。\n雙渠相溉灌,嘉木繞通川。\n卑枝拂羽蓋,修條摩蒼天。\n驚風扶輪... |
| 72410 | 73281 | 晚泊 | 半世無歸似轉蓬,今年作夢到巴東。\n身遊萬死一生地,路入千峰百嶂中。\n鄰舫有時來乞火,叢祠... |

72411 rows × 3 columns
from pyspark.sql import SQLContext, SparkSession
import numpy as np
import os
# Point Spark at the local JDK, Spark installation, and Python interpreter
os.environ['JAVA_HOME'] = "/usr/local/src/jdk1.8.0_172"
os.environ["SPARK_HOME"] = "/usr/local/src/spark-2.2.0-bin-hadoop2.6"
os.environ["PYTHONPATH"] = "/home/liang/miniconda3/bin/python"
spark = SparkSession.builder.appName("abc").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
# Convert the pandas DataFrame into a Spark DataFrame
values = df.values.tolist()
columns = df.columns.tolist()
item_info = spark.createDataFrame(values, columns)
item_info.show()
+------+---------+--------------------+
|poemId|poemTitle| poemContent|
+------+---------+--------------------+
| 1| 關雎|關關雎鸠,在河之洲。窈窕淑女,君子...|
| 2| 葛覃|葛之覃兮,施于中谷,維葉萋萋。黃鳥...|
| 3| 卷耳|采采卷耳,不盈頃筐。嗟我懷人,寘彼...|
| 4| 樛木|南有樛木,葛藟累之。樂隻君子,福履...|
| 5| 螽斯|螽斯羽,诜诜兮。宜爾子孫,振振兮。...|
| 6| 桃夭|桃之夭夭,灼灼其華。之子于歸,宜其...|
| 7| 兔罝|肅肅兔罝,椓之丁丁。赳赳武夫,公侯...|
| 8| 芣苢|采采芣苢,薄言采之。采采芣苢,薄...|
| 9| 漢廣|南有喬木,不可休思。漢有遊女,不...|
| 10| 汝墳|遵彼汝墳,伐其條枚。未見君子,惄如...|
| 11| 麟之趾|麟之趾,振振公子,于嗟麟兮。麟之...|
| 12| 鵲巢|維鵲有巢,維鸠居之。之子于歸,百兩...|
| 13| 采蘩|于以采蘩?于沼于沚。于以用之?公侯...|
| 14| 草蟲|喓喓草蟲,趯趯阜螽。未見君子,憂心...|
| 15| 采蘋|于以采蘋?南澗之濱。于以采藻?于彼...|
| 16| 甘棠|蔽芾甘棠,勿翦勿伐,召伯所茇。蔽...|
| 17| 行露|厭浥行露,豈不夙夜,謂行多露。誰...|
| 18| 羔羊|羔羊之皮,素絲五紽。退食自公,委蛇...|
| 19| 殷其雷|殷其雷,在南山之陽。何斯違斯,莫敢...|
| 20| 摽有梅|摽有梅,其實七兮。求我庶士,迨其吉...|
+------+---------+--------------------+
only showing top 20 rows
from pyspark.sql.functions import concat_ws
# Concatenate each poem's title and body into one string
sentence_df = item_info.select("poemId",
concat_ws(",",
item_info.poemTitle,
item_info.poemContent,
).alias("concat_string")
)
sentence_df.show()
+------+--------------------+
|poemId| concat_string|
+------+--------------------+
| 1|關雎,關關雎鸠,在河之洲。窈窕淑女...|
| 2|葛覃,葛之覃兮,施于中谷,維葉萋萋...|
| 3|卷耳,采采卷耳,不盈頃筐。嗟我懷人...|
| 4|樛木,南有樛木,葛藟累之。樂隻君子...|
| 5|螽斯,螽斯羽,诜诜兮。宜爾子孫,振...|
| 6|桃夭,桃之夭夭,灼灼其華。之子于歸...|
| 7|兔罝,肅肅兔罝,椓之丁丁。赳赳武夫...|
| 8|芣苢,采采芣苢,薄言采之。采采芣...|
| 9|漢廣,南有喬木,不可休思。漢有遊...|
| 10|汝墳,遵彼汝墳,伐其條枚。未見君子...|
| 11|麟之趾,麟之趾,振振公子,于嗟麟兮...|
| 12|鵲巢,維鵲有巢,維鸠居之。之子于歸...|
| 13|采蘩,于以采蘩?于沼于沚。于以用之...|
| 14|草蟲,喓喓草蟲,趯趯阜螽。未見君子...|
| 15|采蘋,于以采蘋?南澗之濱。于以采藻...|
| 16|甘棠,蔽芾甘棠,勿翦勿伐,召伯所茇...|
| 17|行露,厭浥行露,豈不夙夜,謂行多露...|
| 18|羔羊,羔羊之皮,素絲五紽。退食自公...|
| 19|殷其雷,殷其雷,在南山之陽。何斯違...|
| 20|摽有梅,摽有梅,其實七兮。求我庶士...|
+------+--------------------+
only showing top 20 rows
import re
import pyspark.sql.functions as F
from pyspark.sql.types import *
# Strip the special characters (newlines and punctuation) from the text
def _filter(arg):
arg = re.sub('[\n\r()。,、?!,]', '', arg)
return arg
use_reg = F.udf(_filter, StringType())
sentence_df = sentence_df.select(sentence_df.poemId, \
use_reg(sentence_df.concat_string).alias("all_words"))
sentence_df.show()
+------+--------------------+
|poemId| all_words|
+------+--------------------+
| 1|關雎關關雎鸠在河之洲窈窕淑女君子好...|
| 2|葛覃葛之覃兮施于中谷維葉萋萋黃鳥于...|
| 3|卷耳采采卷耳不盈頃筐嗟我懷人寘彼周...|
| 4|樛木南有樛木葛藟累之樂隻君子福履綏...|
| 5|螽斯螽斯羽诜诜兮宜爾子孫振振兮螽斯...|
| 6|桃夭桃之夭夭灼灼其華之子于歸宜其室...|
| 7|兔罝肅肅兔罝椓之丁丁赳赳武夫公侯幹...|
| 8|芣苢采采芣苢薄言采之采采芣苢薄言有...|
| 9|漢廣南有喬木不可休思漢有遊女不可求...|
| 10|汝墳遵彼汝墳伐其條枚未見君子惄如調...|
| 11|麟之趾麟之趾振振公子于嗟麟兮麟之定...|
| 12|鵲巢維鵲有巢維鸠居之之子于歸百兩禦...|
| 13|采蘩于以采蘩于沼于沚于以用之公侯之...|
| 14|草蟲喓喓草蟲趯趯阜螽未見君子憂心忡...|
| 15|采蘋于以采蘋南澗之濱于以采藻于彼行...|
| 16|甘棠蔽芾甘棠勿翦勿伐召伯所茇蔽芾甘...|
| 17|行露厭浥行露豈不夙夜謂行多露誰謂雀...|
| 18|羔羊羔羊之皮素絲五紽退食自公委蛇委...|
| 19|殷其雷殷其雷在南山之陽何斯違斯莫敢...|
| 20|摽有梅摽有梅其實七兮求我庶士迨其吉...|
+------+--------------------+
only showing top 20 rows
import jieba
def get_words(partitions):
    # Classical-Chinese function words treated as stop words
    stop_words = ['而', '何', '乎', '乃', '其', '且', '若', '所', '為', '焉', '以',
                  '因', '于', '與','也','則','者','之','不','自','得','一','來','去',
                  '無', '可', '是', '已', '此', '的', '上', '中', '兮', '三', '汝', '非']
    def cut_sentence(sentence):
        # cut_all=True keeps every possible segmentation, so overlapping tokens appear
        return [i for i in jieba.cut(sentence, cut_all=True) if i not in stop_words]
    for row in partitions:
        yield row.poemId, cut_sentence(row.all_words)
# Segment the poems partition by partition
sentence_df = sentence_df.rdd.mapPartitions(get_words).toDF(["poemId", "word_list"])
sentence_df.show()
+------+--------------------+
|poemId| word_list|
+------+--------------------+
| 1|[關, 雎, 關關, 關關雎, 雎...|
| 2|[葛, 覃, 葛, 覃, 施, 中...|
| 3|[卷, 耳, 采采, 卷, 耳, ...|
| 4|[樛, 木, 南, 有, 樛, 木...|
| 5|[螽斯, 螽斯, 羽, 诜, 诜,...|
| 6|[桃, 夭, 桃之夭夭, 灼灼, ...|
| 7|[兔, 罝, 肅, 肅, 兔, 罝...|
| 8|[芣, 苢, 采采, 芣, 苢, ...|
| 9|[漢, 廣南, 有, 喬木, 不可...|
| 10|[墳, 遵, 彼, 墳, 伐, 條...|
| 11|[麟, 趾, 麟, 趾, 振振, ...|
| 12|[鵲巢, 維, 鵲, 有, 巢, ...|
| 13|[采, 蘩, 采, 蘩, 沼, 沚...|
| 14|[草蟲, 喓, 喓, 草蟲, 趯,...|
| 15|[采, 蘋, 采, 蘋, 南澗, ...|
| 16|[甘, 棠, 蔽, 芾, 甘, 棠...|
| 17|[行, 露, 厭, 浥, 行, 露...|
| 18|[羔羊, 羊羔, 羔羊, 皮, 素...|
| 19|[殷, 雷, 殷, 雷, 在, 南...|
| 20|[摽, 有, 梅, 摽, 有, 梅...|
+------+--------------------+
only showing top 20 rows
# To compute TF-IDF values we need each word's term frequency and its inverse document frequency
from pyspark.ml.feature import CountVectorizer
# Count term frequencies per poem
cv = CountVectorizer(inputCol="word_list", outputCol="word_frequency", minDF=1.0)
cv_model = cv.fit(sentence_df)
cv_result= cv_model.transform(sentence_df)
cv_result.show()
+------+--------------------+--------------------+
|poemId| word_list| word_frequency|
+------+--------------------+--------------------+
| 1|[關, 雎, 關關, 關關雎, 雎...|(69204,[9,94,262,...|
| 2|[葛, 覃, 葛, 覃, 施, 中...|(69204,[7,13,53,7...|
| 3|[卷, 耳, 采采, 卷, 耳, ...|(69204,[2,5,13,56...|
| 4|[樛, 木, 南, 有, 樛, 木...|(69204,[3,46,91,1...|
| 5|[螽斯, 螽斯, 羽, 诜, 诜,...|(69204,[408,460,5...|
| 6|[桃, 夭, 桃之夭夭, 灼灼, ...|(69204,[3,7,120,1...|
| 7|[兔, 罝, 肅, 肅, 兔, 罝...|(69204,[100,1170,...|
| 8|[芣, 苢, 采采, 芣, 苢, ...|(69204,[3,594,550...|
| 9|[漢, 廣南, 有, 喬木, 不可...|(69204,[0,1,3,7,9...|
| 10|[墳, 遵, 彼, 墳, 伐, 條...|(69204,[13,15,18,...|
| 11|[麟, 趾, 麟, 趾, 振振, ...|(69204,[137,506,6...|
| 12|[鵲巢, 維, 鵲, 有, 巢, ...|(69204,[3,7,46,12...|
| 13|[采, 蘩, 采, 蘩, 沼, 沚...|(69204,[7,28,242,...|
| 14|[草蟲, 喓, 喓, 草蟲, 趯,...|(69204,[15,22,108...|
| 15|[采, 蘋, 采, 蘋, 南澗, ...|(69204,[3,14,61,6...|
| 16|[甘, 棠, 蔽, 芾, 甘, 棠...|(69204,[798,1016,...|
| 17|[行, 露, 厭, 浥, 行, 露...|(69204,[13,14,36,...|
| 18|[羔羊, 羊羔, 羔羊, 皮, 素...|(69204,[137,474,7...|
| 19|[殷, 雷, 殷, 雷, 在, 南...|(69204,[7,9,55,71...|
| 20|[摽, 有, 梅, 摽, 有, 梅...|(69204,[3,13,123,...|
+------+--------------------+--------------------+
only showing top 20 rows
print(len(cv_model.vocabulary))
cv_model.vocabulary
69204
['',
' ',
'人',
'有',
'春',
'雲',
'花',
'歸',
'月',
'在',
'君',
'時',
'風',
'我',
'誰',
'見',
'日',
'玉',
'如',
...]
from pyspark.ml.feature import IDF
idf = IDF(inputCol="word_frequency", outputCol="IDF_value")
idfModel = idf.fit(cv_result)
rescaledData = idfModel.transform(cv_result)
rescaledData.select("word_list", "IDF_value").show()
+--------------------+--------------------+
| word_list| IDF_value|
+--------------------+--------------------+
|[關, 雎, 關關, 關關雎, 雎...|(69204,[9,94,262,...|
|[葛, 覃, 葛, 覃, 施, 中...|(69204,[7,13,53,7...|
|[卷, 耳, 采采, 卷, 耳, ...|(69204,[2,5,13,56...|
|[樛, 木, 南, 有, 樛, 木...|(69204,[3,46,91,1...|
|[螽斯, 螽斯, 羽, 诜, 诜,...|(69204,[408,460,5...|
|[桃, 夭, 桃之夭夭, 灼灼, ...|(69204,[3,7,120,1...|
|[兔, 罝, 肅, 肅, 兔, 罝...|(69204,[100,1170,...|
|[芣, 苢, 采采, 芣, 苢, ...|(69204,[3,594,550...|
|[漢, 廣南, 有, 喬木, 不可...|(69204,[0,1,3,7,9...|
|[墳, 遵, 彼, 墳, 伐, 條...|(69204,[13,15,18,...|
|[麟, 趾, 麟, 趾, 振振, ...|(69204,[137,506,6...|
|[鵲巢, 維, 鵲, 有, 巢, ...|(69204,[3,7,46,12...|
|[采, 蘩, 采, 蘩, 沼, 沚...|(69204,[7,28,242,...|
|[草蟲, 喓, 喓, 草蟲, 趯,...|(69204,[15,22,108...|
|[采, 蘋, 采, 蘋, 南澗, ...|(69204,[3,14,61,6...|
|[甘, 棠, 蔽, 芾, 甘, 棠...|(69204,[798,1016,...|
|[行, 露, 厭, 浥, 行, 露...|(69204,[13,14,36,...|
|[羔羊, 羊羔, 羔羊, 皮, 素...|(69204,[137,474,7...|
|[殷, 雷, 殷, 雷, 在, 南...|(69204,[7,9,55,71...|
|[摽, 有, 梅, 摽, 有, 梅...|(69204,[3,13,123,...|
+--------------------+--------------------+
only showing top 20 rows
The fitted IDF model exposes the per-term weights (idfModel.idf.toArray()), aligned index-by-index with cv_model.vocabulary:
array([ 0.15462493, 0.16263591, 1.66914546, ..., 10.49698013,
       10.49698013, 10.49698013])
keywords_list_with_idf = list(zip(cv_model.vocabulary, idfModel.idf.toArray()))
keywords_list_with_idf
[('', 0.15462493099127095),
(' ', 0.16263591402670433),
('人', 1.6691454619329604),
('有', 1.799717916778991),
('春', 1.7968822184651163),
('雲', 1.8183485926421212),
('花', 1.9757949172501144),
('歸', 1.966772001442157),
('月', 1.9429729653969354),
('在', 2.016036071194773),
('君', 2.165393886305136),
('時', 2.0749771255234006),
('風', 2.066434745245324),
...]
from functools import partial
def _tfidf(partition, kw_list):
    for row in partition:
        words_length = len(set(row.word_list)) # number of distinct words in this poem
        for index in row.word_frequency.indices:
            word, idf = kw_list[int(index)]
            tf = row.word_frequency[int(index)]/words_length # term frequency
            tfidf = float(tf)*float(idf) # TF-IDF score for this word
            yield row.poemId, word, tfidf
# Use partial to pre-bind the keyword/IDF list argument
tfidf = partial(_tfidf, kw_list=keywords_list_with_idf)
keyword_tfidf = cv_result.rdd.mapPartitions(tfidf)
keyword_tfidf = keyword_tfidf.toDF(["poemId","keyword", "tfidf"])
keyword_tfidf.show()
+------+-------+--------------------+
|poemId|keyword| tfidf|
+------+-------+--------------------+
| 1| 在|0.057601030605564936|
| 1| 思| 0.07921524240863861|
| 1| 流| 0.09759119429122505|
| 1| 關| 0.10977667106706135|
| 1| 采| 0.1198127377805513|
| 1| 不得| 0.11853473276288638|
| 1| 君子| 0.14131315456663138|
| 1| 參差| 0.40177176553544663|
| 1| 友| 0.1353286374743755|
| 1| 服| 0.13801581262352733|
| 1| 窈窕| 0.6544337720537399|
| 1| 左右| 0.4968428369265834|
| 1| 鐘鼓| 0.1849036696914497|
| 1| 悠哉| 0.3774377046757293|
| 1| 菜| 0.5753820425157967|
| 1| 荇| 0.5824866990705768|
| 1| 琴瑟| 0.2073800479363288|
| 1| 關關| 0.21292736263464188|
| 1| 寤寐| 0.45325890128424867|
| 1| 雎| 0.24643365580097992|
+------+-------+--------------------+
only showing top 20 rows
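The highest-scoring keywords across the whole corpus follow; the producing cell is missing, but it was presumably a sort on the tfidf column, along the lines of:

keyword_tfidf.orderBy(keyword_tfidf.tfidf.desc()).show()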
+------+-------+------------------+
|poemId|keyword| tfidf|
+------+-------+------------------+
| 8| 苢| 7.347886090955122|
| 8| 芣| 6.706482578643214|
| 8| 薄言|5.1751067718205785|
| 46406| 耶| 4.948944803640301|
| 36| 式微| 4.750594433822543|
| 8| 采采| 4.598260071527804|
| 84| 萚|4.4562877042617925|
| 47827| 阿房| 4.178456983219809|
| 610| 沒了| 4.037300049975342|
| 11| 麟| 3.801857875632987|
| 48| 彊| 3.759347136507112|
| 18| 委蛇| 3.759347136507112|
| 5| 螽斯| 3.483887053840631|
| 46611| 丹徒| 3.450071181213719|
| 46555| 舉子| 3.367015435302422|
| 46616| 徐聞|3.2679443164586477|
| 47675| 段幹木|3.1935631326872445|
| 48| 奔|3.1468240211335554|
| 65160| 囗| 3.132789280422593|
| 70885| 蓮葉| 3.114219717233316|
+------+-------+------------------+
only showing top 20 rows
from pyspark.ml.feature import Word2Vec
# Train 10-dimensional word embeddings on the segmented poems
word2Vec = Word2Vec(vectorSize=10, inputCol="word_list", outputCol="model")
model = word2Vec.fit(sentence_df)
# Mapping from each vocabulary word to its embedding vector
vectors = model.getVectors()
vectors.show()
+----+--------------------+
|word| vector|
+----+--------------------+
| 半日閑|[-0.0561875551939...|
| 箭頭|[0.12245123833417...|
| 臨風|[-0.2168009430170...|
| 琴書|[-0.2475093156099...|
| 邱|[0.05485768243670...|
| 蓮社|[0.09266380965709...|
| 人物|[-0.0869999676942...|
| 長幼|[0.18391117453575...|
| 冉|[0.60794192552566...|
| 石道|[0.09541463851928...|
| 婺|[0.52122080326080...|
| 瀉出|[-0.0349056571722...|
| 黃泉|[-0.0684615746140...|
| 本源|[-0.0193176493048...|
| 自養|[0.06587809324264...|
| 吾國|[-0.0718164145946...|
| 命作|[0.02833748422563...|
| 相憐|[1.22657394967973...|
| 疏狂|[-0.0754516199231...|
| 溱|[0.28383380174636...|
+----+--------------------+
only showing top 20 rows
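As a quick sanity check (not in the original notebook), the trained model can be probed for a word's nearest neighbours in the embedding space; the probe word "月" is an arbitrary choice, and the neighbours will vary from run to run:

# Hypothetical probe: the 5 words closest to "月" by cosine similarity
model.findSynonyms("月", 5).show()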
# Join each keyword's TF-IDF weight with its word vector
df1 = keyword_tfidf.join(vectors, keyword_tfidf.keyword==vectors.word, "inner")
df1.show()
+------+-------+--------------------+----+--------------------+
|poemId|keyword| tfidf|word| vector|
+------+-------+--------------------+----+--------------------+
| 1| 在|0.057601030605564936| 在|[-0.0991405770182...|
| 1| 思| 0.07921524240863861| 思|[-0.2970317900180...|
| 1| 流| 0.09759119429122505| 流|[-0.0167482625693...|
| 1| 關| 0.10977667106706135| 關|[0.00378159806132...|
| 1| 采| 0.1198127377805513| 采|[-0.0207617487758...|
| 1| 不得| 0.11853473276288638| 不得|[-0.0765498951077...|
| 1| 君子| 0.14131315456663138| 君子|[-0.0510748177766...|
| 1| 參差| 0.40177176553544663| 參差|[0.17438517510890...|
| 1| 友| 0.1353286374743755| 友|[-0.1573343873023...|
| 1| 服| 0.13801581262352733| 服|[0.43713137507438...|
| 1| 窈窕| 0.6544337720537399| 窈窕|[0.03826018050312...|
| 1| 左右| 0.4968428369265834| 左右|[0.15486374497413...|
| 1| 鐘鼓| 0.1849036696914497| 鐘鼓|[0.12827932834625...|
| 1| 悠哉| 0.3774377046757293| 悠哉|[-0.2879377007484...|
| 1| 菜| 0.5753820425157967| 菜|[0.29936558008193...|
| 1| 荇| 0.5824866990705768| 荇|[0.49636653065681...|
| 1| 琴瑟| 0.2073800479363288| 琴瑟|[0.10872857272624...|
| 1| 關關| 0.21292736263464188| 關關|[0.02008588984608...|
| 1| 寤寐| 0.45325890128424867| 寤寐|[0.22766476869583...|
| 1| 雎| 0.24643365580097992| 雎|[0.00680365599691...|
+------+-------+--------------------+----+--------------------+
only showing top 20 rows
# print(keyword_tfidf.count())
# print(df1.count())
# Scale each word's vector by that word's TF-IDF weight
df2 = df1.rdd.map(lambda r:(r.poemId, r.keyword, r.tfidf*r.vector)).toDF(["poemId", "keyword","vector"])
df2.show()
+------+-------+--------------------+
|poemId|keyword| vector|
+------+-------+--------------------+
| 1| 在|[-0.0057105994110...|
| 1| 思|[-0.0235294452493...|
| 1| 流|[-0.0016344829464...|
| 1| 關|[4.15131246485710...|
| 1| 采|[-0.0024875219619...|
| 1| 不得|[-0.0090738213596...|
| 1| 君子|[-0.0072175436189...|
| 1| 參差|[0.07006303968671...|
| 1| 友|[-0.0212918482614...|
| 1| 服|[0.06033104195413...|
| 1| 窈窕|[0.02503875424612...|
| 1| 左右|[0.07694294239002...|
| 1| 鐘鼓|[0.02371931855677...|
| 1| 悠哉|[-0.1086785448600...|
| 1| 菜|[0.17224957892647...|
| 1| 荇|[0.28912690197140...|
| 1| 琴瑟|[0.02254813662401...|
| 1| 關關|[0.00427683555109...|
| 1| 寤寐|[0.10319108292020...|
| 1| 雎|[0.00167664982013...|
+------+-------+--------------------+
only showing top 20 rows
# Group by poem and average the weighted word vectors
# Register a temporary table so Hive aggregate functions can be used
df2.registerTempTable("tempTable")
def avg_vectors(row):
    x = 0
    for v in row.vectors:
        x += v
    # Use the mean weighted word vector as the poem's document vector
    return row.poemId, x/len(row.vectors)
# collect_set is a Hive aggregate; note that it deduplicates, so identical weighted
# vectors within one poem would be merged -- collect_list would avoid that
group_vector = spark.sql("select poemId, collect_set(vector) vectors from tempTable group by poemId").rdd.map(avg_vectors).toDF(["poemId", "vector"])
# Cosine similarity between two poems' document vectors
v1 = group_vector.where("poemId=1").select("vector").first().vector
v2 = group_vector.where("poemId=2").select("vector").first().vector
np.dot(v1,v2)/(np.linalg.norm(v1)*(np.linalg.norm(v2)))
0.378789818447398
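The cells below index into a pandas DataFrame named group_df that is never defined in the notebook; presumably it is the pandas materialization of group_vector:

# Assumption: group_df is group_vector pulled down to pandas
group_df = group_vector.toPandas()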
group_df
| | poemId | vector |
|---|---|---|
| 0 | 26 | [0.008604375373175262, 0.025260454274981203, -... |
| 1 | 29 | [0.017804960879639398, 0.04290195911035433, 0.... |
| 2 | 474 | [0.008394777557195702, 0.025892256004034084, 0... |
| 3 | 964 | [4.312995210865666e-05, 0.04748191617755371, -... |
| 4 | 1677 | [0.014743320294977227, -0.031115929525233794, ... |
| ... | ... | ... |
| 72406 | 73102 | [0.007128699103552195, 0.008001016583454229, 0... |
| 72407 | 73148 | [0.018791418906944357, -0.026507278938189598, ... |
| 72408 | 73179 | [-0.002067103872106531, -0.021090942436392375,... |
| 72409 | 73240 | [-0.005216105655755288, -0.018684123660531675,... |
| 72410 | 73245 | [0.0012883507518144945, -0.005777893401733697,... |

72411 rows × 2 columns
0.004892947822319301
def cosine_similarity(v1, v2):
    return np.dot(v1,v2)/(np.linalg.norm(v1)*(np.linalg.norm(v2)))
# Brute-force all-pairs comparison (73281 x 73281) is far too expensive, so it
# stays commented out; the candidate pool is narrowed down first instead
# for i in range(1, 73282):
#     v1 = group_df[group_df["poemId"] == i]["vector"].values[0]
#     for j in range(1, 73282):
#         if i != j:
#             cosine_similarity(v1, group_df[group_df["poemId"] == j]["vector"].values[0])
DenseVector([0.0302, 0.032, -0.0349, 0.0162, -0.0161, -0.0174, -0.0619, 0.0057, -0.0074, 0.0124])
73281
group_df[group_df["poemId"] == 49]["vector"].values
array([], dtype=object)
# Some poemIds have no vector at all (poem 49 above, for example); collect the missing ids
none_list = []
for i in range(1, 73282):
if not group_df[group_df["poemId"] == i]["vector"].values:
none_list.append(i)
(72411, 870)
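exist_id is used below but never defined in the notebook; the (72411, 870) output above (870 of the 73281 ids have no vector, leaving 72411) suggests it is the complement of none_list, along these lines:

# Assumption: exist_id holds every poemId that actually has a vector
none_set = set(none_list)
exist_id = [i for i in range(1, 73282) if i not in none_set]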
from tqdm import tqdm
import random
content_base_dict = {}
# First draft: score each poem against a random sample of candidates
# (superseded by the tag-based filtering further below)
for i in tqdm(exist_id):
    v1 = group_df[group_df["poemId"] == i]["vector"].values[0]
    random_id = random.choices(exist_id, k=1000)  # random sample; to be replaced by tag-based filtering
    store_list = []
    for j in random_id:
        if i != j:
            v2 = group_df[group_df["poemId"] == j]["vector"].values[0]
            sim = cosine_similarity(v1, v2)
            store_list.append((j, sim))
    # Keep the 100 most similar poems
    value = sorted(store_list, key=lambda x:x[1], reverse=True)[:100]
    content_base_dict[i] = value
    break  # trial run: process only the first poem
# Reload the source file for each poem's dynasty, author id, and tag names
poem_to_tag = pd.read_csv("/home/liang/Desktop/python_file/source.csv")
poem_to_tag = poem_to_tag[["poemId", "poemDynasty", "poemAuthorId", "poemTagNames"]]
poem_to_tag
| | poemId | poemDynasty | poemAuthorId | poemTagNames |
|---|---|---|---|---|
| 0 | 1 | 先秦 | 0 | 古詩三百首,國中古詩,詩經,愛情 |
| 1 | 2 | 先秦 | 0 | 詩經,寫人 |
| 2 | 3 | 先秦 | 0 | 詩經,懷人 |
| 3 | 4 | 先秦 | 0 | 詩經,祝福 |
| 4 | 5 | 先秦 | 0 | 詩經,寫鳥,祝福 |
| ... | ... | ... | ... | ... |
| 72406 | 73277 | 宋代 | 536 | 歌頌,古人,傷懷,國家 |
| 72407 | 73278 | 宋代 | 536 | 抒情,追憶,思念 |
| 72408 | 73279 | 宋代 | 196 | 羁旅,抒情,思鄉 |
| 72409 | 73280 | 魏晉 | 616 | 荷花,寫景,夜晚,心情 |
| 72410 | 73281 | 宋代 | 272 | 晚上,乘船,寫景,抒情 |

72411 rows × 4 columns
poem_to_tag["poemAuthorId"] += 100000
poem_to_tag
| | poemId | poemDynasty | poemAuthorId | poemTagNames |
|---|---|---|---|---|
| 0 | 1 | 先秦 | 100000 | 古詩三百首,國中古詩,詩經,愛情 |
| 1 | 2 | 先秦 | 100000 | 詩經,寫人 |
| 2 | 3 | 先秦 | 100000 | 詩經,懷人 |
| 3 | 4 | 先秦 | 100000 | 詩經,祝福 |
| 4 | 5 | 先秦 | 100000 | 詩經,寫鳥,祝福 |
| ... | ... | ... | ... | ... |
| 72406 | 73277 | 宋代 | 100536 | 歌頌,古人,傷懷,國家 |
| 72407 | 73278 | 宋代 | 100536 | 抒情,追憶,思念 |
| 72408 | 73279 | 宋代 | 100196 | 羁旅,抒情,思鄉 |
| 72409 | 73280 | 魏晉 | 100616 | 荷花,寫景,夜晚,心情 |
| 72410 | 73281 | 宋代 | 100272 | 晚上,乘船,寫景,抒情 |

72411 rows × 4 columns
import pymongo
client = pymongo.MongoClient("localhost", 27017)
# 'RS' database; the online_recall collection holds pre-built candidate sets per category
db = client['RS']
houxuanji = db['online_recall']  # houxuanji: "candidate set"
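findFromMongoDB below assumes each document in online_recall carries a category key, the ids recalled under it, and a precomputed length. A hypothetical document (field names taken from the query code, values invented):

sample_doc = {
    "cateId": "詩經",       # a tag name, a dynasty, or a shifted author id
    "incluted": [1, 2, 3],  # poem ids recalled under this category ("incluted" is the stored field name)
    "len": 3,               # precomputed size of the incluted list
}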
from tqdm import tqdm
import random
content_base_dict = {}
def findFromMongoDB(collection, tag):
temp_set = set()
query_dict = {
"cateId":tag,
}
found_json = collection.find(query_dict)
for _json in found_json:
items = _json.get("incluted")
items_len = _json.get("len")
        if items_len < 100:  # categories with fewer than 100 items are kept whole
            pass
        else:
            items = random.choices(items, k=100)  # otherwise randomly sample 100 per category
temp_set = temp_set.union(items)
return temp_set
# First narrow the candidate pool using this poem's tags, dynasty, and author
for i in tqdm(exist_id):
v1 = group_df[group_df["poemId"] == i]["vector"].values[0]
    # poem id -> its tags -> pull the tag-based candidate sets from MongoDB
temp = poem_to_tag[poem_to_tag["poemId"] == i]
tags = temp["poemTagNames"].values[0]
poemId_set = set()
    if pd.notna(tags):  # skip poems whose tags are NaN
tags = tags.split(',')
for tag in tags:
ids = findFromMongoDB(houxuanji, tag)
poemId_set = poemId_set.union(ids)
dynasty = temp["poemDynasty"].values[0]
author = temp["poemAuthorId"].values[0]
ids1 = findFromMongoDB(houxuanji, dynasty)
ids2 = findFromMongoDB(houxuanji, int(author))
poemId_set = poemId_set.union(ids1)
poemId_set = poemId_set.union(ids2)
print("length:",len(poemId_set))
store_list = []
for j in poemId_set:
if i != j:
v2 = group_df[group_df["poemId"] == j]["vector"].values[0]
sim = cosine_similarity(v1, v2)
store_list.append((j, sim))
value = sorted(store_list, key=lambda x:x[1], reverse=True)[:100]
content_base_dict[i] = [tup[0] for tup in value]
'古詩三百首,國中古詩,詩經,愛情'
content_base_dict
{1: [48013,
48005,
42372,
606,
21,
70919,
6525,
69012,
48107,
47991,
15,
47642,
47682,
10521,
72453,
47981,
34,
47873,
70886,
69221,
127,
45792,
70933,
137,
7734,
69077,
10374,
71030,
2369,
47685,
47692,
42412,
2271,
70978,
7,
68055,
68709,
20285,
69086,
47708,
69068,
48147,
9849,
47728,
71708,
9098,
78,
71043,
30793,
39490,
69053,
47813,
42180,
71002,
70940,
70897,
9315,
172,
69071,
6606,
47044,
70999,
21173,
39469,
30,
223,
59,
67634,
16,
69055,
45951,
5749,
151,
47696,
24030,
71203,
213,
71780,
48134,
150,
69088,
62369,
71205,
69067,
159,
21137,
58,
47733,
59106,
47966,
52809,
67872,
111,
6137,
70765,
69142,
71646,
48041,
67880,
4933]}
import pickle
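pickle is imported here but the cell that uses it is missing; presumably the recall results were persisted and reloaded roughly as follows (the path and the inspected key are hypothetical), which would produce an id list like the one below:

# Hypothetical persistence round-trip for the recall results
path = "/home/liang/Desktop/python_file/content_base_dict.pkl"
with open(path, "wb") as f:
    pickle.dump(content_base_dict, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
restored[2]  # e.g. inspect one poem's top-100 similar ids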
[117,
53,
181,
47848,
72271,
32543,
186,
72503,
47667,
72504,
47698,
71161,
47860,
47963,
69979,
47841,
72196,
96,
71128,
185,
71968,
10470,
72474,
85,
165,
47711,
42342,
273,
155,
267,
14,
33,
72518,
69403,
100,
32515,
72662,
72962,
7914,
153,
72404,
71154,
71165,
114,
69481,
109,
72236,
251,
162,
274,
224,
11266,
71124,
21281,
184,
112,
262,
192,
95,
64,
71643,
62,
47799,
51,
221,
47665,
17,
226,
46660,
303,
231,
32970,
58,
72245,
160,
23,
72180,
46977,
47700,
47814,
55,
72844,
18309,
47834,
253,
261,
147,
46401,
210,
47788,
168,
71268,
72213,
83,
47824,
47783,
72373,
61,
286,
18220]