
Using Word2Vec for Text-Similarity-Based Recommendation

The earlier text-similarity recommendation used one-hot word vectors. Although the non-zero entries can be stored as a sparse vector, this representation has real problems:

  • Sparse vectors are an inefficient representation, so the dimensionality of the word vectors needs to be reduced
  • It is hard to discover relationships between words, or to capture the relationship between sentence structure and semantics

Word2Vec produces a dense, low-dimensional vector for every word. Semantically similar words are mapped to nearby points in the vector space, so the distance between two word vectors reflects how closely related the words are.
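That closeness claim is just cosine similarity between dense vectors. A minimal sketch with made-up 4-dimensional vectors (the words and all values are illustrative, not trained):

```python
import numpy as np

def cosine(v1, v2):
    # cosine similarity: dot product divided by the product of the norms
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# toy dense vectors; real Word2Vec vectors are learned from the corpus
vec = {
    "月":  np.array([0.9, 0.1, 0.3, 0.0]),
    "明月": np.array([0.8, 0.2, 0.4, 0.1]),
    "戰馬": np.array([-0.5, 0.9, -0.2, 0.7]),
}

print(cosine(vec["月"], vec["明月"]))   # close to 1: related words
print(cosine(vec["月"], vec["戰馬"]))   # negative here: unrelated words
```

The same formula is used later in this post to compare whole poems, once each poem has been collapsed to a single vector.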

Classical Poem Recommendation Based on Word2Vec

import pandas as pd

df = pd.read_csv("/home/liang/Desktop/python_file/source.csv")
           
df = df[["poemId", "poemTitle", "poemContent"]]
df
           
poemId poemTitle poemContent
0 1 關雎 關關雎鸠,在河之洲。窈窕淑女,君子好逑。\n參差荇菜,左右流之。窈窕淑女,寤寐求之。\n求之...
1 2 葛覃 葛之覃兮,施于中谷,維葉萋萋。黃鳥于飛,集于灌木,其鳴喈喈。\n葛之覃兮,施于中谷,維葉莫莫...
2 3 卷耳 采采卷耳,不盈頃筐。嗟我懷人,寘彼周行。\n陟彼崔嵬,我馬虺隤。我姑酌彼金罍,維以不永懷。\...
3 4 樛木 南有樛木,葛藟累之。樂隻君子,福履綏之。\n南有樛木,葛藟荒之。樂隻君子,福履将之。\n南有...
4 5 螽斯 螽斯羽,诜诜兮。宜爾子孫,振振兮。\n螽斯羽,薨薨兮。宜爾子孫,繩繩兮。\n螽斯羽,揖揖兮。...
... ... ... ...
72406 73277 題八詠樓 千古風流八詠樓,江山留與後人愁。\n水通南國三千裡,氣壓江城十四州。
72407 73278 偶成 十五年前花月底,相從曾賦賞花詩。\n今看花月渾相似,安得情懷似往時。
72408 73279 江行 暝色蒹葭外,蒼茫旅眺情。\n殘雪和雁斷,新月帶潮生。\n天到水中盡,舟随樹杪行。\n離家今幾...
72409 73280 芙蓉池作 乘辇夜行遊,逍遙步西園。\n雙渠相溉灌,嘉木繞通川。\n卑枝拂羽蓋,修條摩蒼天。\n驚風扶輪...
72410 73281 晚泊 半世無歸似轉蓬,今年作夢到巴東。\n身遊萬死一生地,路入千峰百嶂中。\n鄰舫有時來乞火,叢祠...

72411 rows × 3 columns

from pyspark.sql import SQLContext, SparkSession
import numpy as np
import os

# point Spark at the local JDK, Spark distribution, and Python interpreter
os.environ['JAVA_HOME'] = "/usr/local/src/jdk1.8.0_172"
os.environ["SPARK_HOME"] = "/usr/local/src/spark-2.2.0-bin-hadoop2.6"
os.environ["PYTHONPATH"] = "/home/liang/miniconda3/bin/python"

spark = SparkSession.builder.appName("abc").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# convert the pandas DataFrame into a Spark DataFrame
values = df.values.tolist()
columns = df.columns.tolist()
item_info = spark.createDataFrame(values, columns)
item_info.show()
           
+------+---------+--------------------+
|poemId|poemTitle|         poemContent|
+------+---------+--------------------+
|     1|       關雎|關關雎鸠,在河之洲。窈窕淑女,君子...|
|     2|       葛覃|葛之覃兮,施于中谷,維葉萋萋。黃鳥...|
|     3|       卷耳|采采卷耳,不盈頃筐。嗟我懷人,寘彼...|
|     4|       樛木|南有樛木,葛藟累之。樂隻君子,福履...|
|     5|       螽斯|螽斯羽,诜诜兮。宜爾子孫,振振兮。...|
|     6|       桃夭|桃之夭夭,灼灼其華。之子于歸,宜其...|
|     7|       兔罝|肅肅兔罝,椓之丁丁。赳赳武夫,公侯...|
|     8|       芣苢|采采芣苢,薄言采之。采采芣...|
|     9|       漢廣|南有喬木,不可休思。漢有遊女,不...|
|    10|       汝墳|遵彼汝墳,伐其條枚。未見君子,惄如...|
|    11|      麟之趾|麟之趾,振振公子,于嗟麟兮。麟之...|
|    12|       鵲巢|維鵲有巢,維鸠居之。之子于歸,百兩...|
|    13|       采蘩|于以采蘩?于沼于沚。于以用之?公侯...|
|    14|       草蟲|喓喓草蟲,趯趯阜螽。未見君子,憂心...|
|    15|       采蘋|于以采蘋?南澗之濱。于以采藻?于彼...|
|    16|       甘棠|蔽芾甘棠,勿翦勿伐,召伯所茇。蔽...|
|    17|       行露|厭浥行露,豈不夙夜,謂行多露。誰...|
|    18|       羔羊|羔羊之皮,素絲五紽。退食自公,委蛇...|
|    19|      殷其雷|殷其雷,在南山之陽。何斯違斯,莫敢...|
|    20|      摽有梅|摽有梅,其實七兮。求我庶士,迨其吉...|
+------+---------+--------------------+
only showing top 20 rows
           
from pyspark.sql.functions import concat_ws

# Concatenate each poem's title and content
sentence_df = item_info.select("poemId", 
    concat_ws(",", 
              item_info.poemTitle,
              item_info.poemContent,
             ).alias("concat_string")
)
sentence_df.show()
           
+------+--------------------+
|poemId|       concat_string|
+------+--------------------+
|     1|關雎,關關雎鸠,在河之洲。窈窕淑女...|
|     2|葛覃,葛之覃兮,施于中谷,維葉萋萋...|
|     3|卷耳,采采卷耳,不盈頃筐。嗟我懷人...|
|     4|樛木,南有樛木,葛藟累之。樂隻君子...|
|     5|螽斯,螽斯羽,诜诜兮。宜爾子孫,振...|
|     6|桃夭,桃之夭夭,灼灼其華。之子于歸...|
|     7|兔罝,肅肅兔罝,椓之丁丁。赳赳武夫...|
|     8|芣苢,采采芣苢,薄言采之。采采芣...|
|     9|漢廣,南有喬木,不可休思。漢有遊...|
|    10|汝墳,遵彼汝墳,伐其條枚。未見君子...|
|    11|麟之趾,麟之趾,振振公子,于嗟麟兮...|
|    12|鵲巢,維鵲有巢,維鸠居之。之子于歸...|
|    13|采蘩,于以采蘩?于沼于沚。于以用之...|
|    14|草蟲,喓喓草蟲,趯趯阜螽。未見君子...|
|    15|采蘋,于以采蘋?南澗之濱。于以采藻...|
|    16|甘棠,蔽芾甘棠,勿翦勿伐,召伯所茇...|
|    17|行露,厭浥行露,豈不夙夜,謂行多露...|
|    18|羔羊,羔羊之皮,素絲五紽。退食自公...|
|    19|殷其雷,殷其雷,在南山之陽。何斯違...|
|    20|摽有梅,摽有梅,其實七兮。求我庶士...|
+------+--------------------+
only showing top 20 rows
           
import re
import pyspark.sql.functions as F
from pyspark.sql.types import *

# Strip newlines and punctuation from the text
def _filter(arg):
    arg = re.sub('[\n\r()。,、?!,]', '', arg)
    return arg

use_reg = F.udf(_filter, StringType())
sentence_df = sentence_df.select(sentence_df.poemId, \
                                     use_reg(sentence_df.concat_string).alias("all_words"))

sentence_df.show()
           
+------+--------------------+
|poemId|           all_words|
+------+--------------------+
|     1|關雎關關雎鸠在河之洲窈窕淑女君子好...|
|     2|葛覃葛之覃兮施于中谷維葉萋萋黃鳥于...|
|     3|卷耳采采卷耳不盈頃筐嗟我懷人寘彼周...|
|     4|樛木南有樛木葛藟累之樂隻君子福履綏...|
|     5|螽斯螽斯羽诜诜兮宜爾子孫振振兮螽斯...|
|     6|桃夭桃之夭夭灼灼其華之子于歸宜其室...|
|     7|兔罝肅肅兔罝椓之丁丁赳赳武夫公侯幹...|
|     8|芣苢采采芣苢薄言采之采采芣苢薄言有...|
|     9|漢廣南有喬木不可休思漢有遊女不可求...|
|    10|汝墳遵彼汝墳伐其條枚未見君子惄如調...|
|    11|麟之趾麟之趾振振公子于嗟麟兮麟之定...|
|    12|鵲巢維鵲有巢維鸠居之之子于歸百兩禦...|
|    13|采蘩于以采蘩于沼于沚于以用之公侯之...|
|    14|草蟲喓喓草蟲趯趯阜螽未見君子憂心忡...|
|    15|采蘋于以采蘋南澗之濱于以采藻于彼行...|
|    16|甘棠蔽芾甘棠勿翦勿伐召伯所茇蔽芾甘...|
|    17|行露厭浥行露豈不夙夜謂行多露誰謂雀...|
|    18|羔羊羔羊之皮素絲五紽退食自公委蛇委...|
|    19|殷其雷殷其雷在南山之陽何斯違斯莫敢...|
|    20|摽有梅摽有梅其實七兮求我庶士迨其吉...|
+------+--------------------+
only showing top 20 rows
           
import jieba
           
# Segment each poem with jieba (full mode) and filter out stop words
def get_words(partitions):
    stop_words = ['而', '何', '乎', '乃', '其', '且', '若', '所', '為', '焉', '以', 
              '因', '于', '與','也','則','者','之','不','自','得','一','來','去',
              '無', '可', '是', '已', '此', '的', '上', '中', '兮', '三', '汝', '非']
    
    def cut_sentence(sentence):
        return [i for i in jieba.cut(sentence, cut_all=True) if i not in stop_words]
    
    for row in partitions:
        yield row.poemId, cut_sentence(row.all_words)
        
sentence_df = sentence_df.rdd.mapPartitions(get_words).toDF(["poemId", "word_list"])
sentence_df.show()
           
+------+--------------------+
|poemId|           word_list|
+------+--------------------+
|     1|[關, 雎, 關關, 關關雎, 雎...|
|     2|[葛, 覃, 葛, 覃, 施, 中...|
|     3|[卷, 耳, 采采, 卷, 耳, ...|
|     4|[樛, 木, 南, 有, 樛, 木...|
|     5|[螽斯, 螽斯, 羽, 诜, 诜,...|
|     6|[桃, 夭, 桃之夭夭, 灼灼, ...|
|     7|[兔, 罝, 肅, 肅, 兔, 罝...|
|     8|[芣, 苢, 采采, 芣, 苢, ...|
|     9|[漢, 廣南, 有, 喬木, 不可...|
|    10|[墳, 遵, 彼, 墳, 伐, 條...|
|    11|[麟, 趾, 麟, 趾, 振振, ...|
|    12|[鵲巢, 維, 鵲, 有, 巢, ...|
|    13|[采, 蘩, 采, 蘩, 沼, 沚...|
|    14|[草蟲, 喓, 喓, 草蟲, 趯,...|
|    15|[采, 蘋, 采, 蘋, 南澗, ...|
|    16|[甘, 棠, 蔽, 芾, 甘, 棠...|
|    17|[行, 露, 厭, 浥, 行, 露...|
|    18|[羔羊, 羊羔, 羔羊, 皮, 素...|
|    19|[殷, 雷, 殷, 雷, 在, 南...|
|    20|[摽, 有, 梅, 摽, 有, 梅...|
+------+--------------------+
only showing top 20 rows
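jieba handles the segmentation itself; the stop-word filtering inside cut_sentence is plain Python and can be sketched on an already-segmented token list (the tokens below are illustrative). Using a set instead of a list makes each membership test O(1):

```python
# abbreviated stop-word set; the full list is in get_words above
stop_words = {'而', '何', '乎', '之', '不', '兮', '也', '以', '于'}

def remove_stop_words(tokens):
    # keep only the tokens that are not stop words
    return [t for t in tokens if t not in stop_words]

tokens = ['關關', '雎鸠', '在', '河', '之', '洲', '窈窕', '淑女']
print(remove_stop_words(tokens))
# ['關關', '雎鸠', '在', '河', '洲', '窈窕', '淑女']
```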
           
# Computing TF-IDF requires each word's term frequency and inverse document frequency
from pyspark.ml.feature import CountVectorizer


# Count term frequencies
cv = CountVectorizer(inputCol="word_list", outputCol="word_frequency", minDF=1.0)

cv_model = cv.fit(sentence_df)
cv_result= cv_model.transform(sentence_df)
cv_result.show()
           
+------+--------------------+--------------------+
|poemId|           word_list|      word_frequency|
+------+--------------------+--------------------+
|     1|[關, 雎, 關關, 關關雎, 雎...|(69204,[9,94,262,...|
|     2|[葛, 覃, 葛, 覃, 施, 中...|(69204,[7,13,53,7...|
|     3|[卷, 耳, 采采, 卷, 耳, ...|(69204,[2,5,13,56...|
|     4|[樛, 木, 南, 有, 樛, 木...|(69204,[3,46,91,1...|
|     5|[螽斯, 螽斯, 羽, 诜, 诜,...|(69204,[408,460,5...|
|     6|[桃, 夭, 桃之夭夭, 灼灼, ...|(69204,[3,7,120,1...|
|     7|[兔, 罝, 肅, 肅, 兔, 罝...|(69204,[100,1170,...|
|     8|[芣, 苢, 采采, 芣, 苢, ...|(69204,[3,594,550...|
|     9|[漢, 廣南, 有, 喬木, 不可...|(69204,[0,1,3,7,9...|
|    10|[墳, 遵, 彼, 墳, 伐, 條...|(69204,[13,15,18,...|
|    11|[麟, 趾, 麟, 趾, 振振, ...|(69204,[137,506,6...|
|    12|[鵲巢, 維, 鵲, 有, 巢, ...|(69204,[3,7,46,12...|
|    13|[采, 蘩, 采, 蘩, 沼, 沚...|(69204,[7,28,242,...|
|    14|[草蟲, 喓, 喓, 草蟲, 趯,...|(69204,[15,22,108...|
|    15|[采, 蘋, 采, 蘋, 南澗, ...|(69204,[3,14,61,6...|
|    16|[甘, 棠, 蔽, 芾, 甘, 棠...|(69204,[798,1016,...|
|    17|[行, 露, 厭, 浥, 行, 露...|(69204,[13,14,36,...|
|    18|[羔羊, 羊羔, 羔羊, 皮, 素...|(69204,[137,474,7...|
|    19|[殷, 雷, 殷, 雷, 在, 南...|(69204,[7,9,55,71...|
|    20|[摽, 有, 梅, 摽, 有, 梅...|(69204,[3,13,123,...|
+------+--------------------+--------------------+
only showing top 20 rows
           
print(len(cv_model.vocabulary))
cv_model.vocabulary
           
69204
['',
 ' ',
 '人',
 '有',
 '春',
 '雲',
 '花',
 '歸',
 '月',
 '在',
 '君',
 '時',
 '風',
 '我',
 '誰',
 '見',
 '日',
 '玉',
 '如',

 ...]
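CountVectorizer orders its vocabulary by corpus-wide frequency (which is why the empty string and the space, the most frequent "tokens" left after cleaning, sit at indices 0 and 1) and maps each document to a sparse index-to-count vector. A pure-Python approximation of that behavior, not Spark's actual implementation:

```python
from collections import Counter

docs = [['花', '月', '花'], ['月', '風'], ['花', '風', '風']]

# vocabulary sorted by total corpus frequency, most frequent first
total = Counter(t for doc in docs for t in doc)
vocabulary = [w for w, _ in total.most_common()]
index = {w: i for i, w in enumerate(vocabulary)}

def to_counts(doc):
    # sparse representation: {vocabulary_index: count}
    return {index[w]: n for w, n in Counter(doc).items()}

print(vocabulary)
print(to_counts(docs[0]))
```

A practical follow-up for this corpus would be to add '' and ' ' to the stop-word list, or drop empty tokens during segmentation, so they stop dominating the vocabulary.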
           
from pyspark.ml.feature import IDF
idf = IDF(inputCol="word_frequency", outputCol="IDF_value")

idfModel = idf.fit(cv_result)
rescaledData = idfModel.transform(cv_result)

rescaledData.select("word_list", "IDF_value").show()
           
+--------------------+--------------------+
|           word_list|           IDF_value|
+--------------------+--------------------+
|[關, 雎, 關關, 關關雎, 雎...|(69204,[9,94,262,...|
|[葛, 覃, 葛, 覃, 施, 中...|(69204,[7,13,53,7...|
|[卷, 耳, 采采, 卷, 耳, ...|(69204,[2,5,13,56...|
|[樛, 木, 南, 有, 樛, 木...|(69204,[3,46,91,1...|
|[螽斯, 螽斯, 羽, 诜, 诜,...|(69204,[408,460,5...|
|[桃, 夭, 桃之夭夭, 灼灼, ...|(69204,[3,7,120,1...|
|[兔, 罝, 肅, 肅, 兔, 罝...|(69204,[100,1170,...|
|[芣, 苢, 采采, 芣, 苢, ...|(69204,[3,594,550...|
|[漢, 廣南, 有, 喬木, 不可...|(69204,[0,1,3,7,9...|
|[墳, 遵, 彼, 墳, 伐, 條...|(69204,[13,15,18,...|
|[麟, 趾, 麟, 趾, 振振, ...|(69204,[137,506,6...|
|[鵲巢, 維, 鵲, 有, 巢, ...|(69204,[3,7,46,12...|
|[采, 蘩, 采, 蘩, 沼, 沚...|(69204,[7,28,242,...|
|[草蟲, 喓, 喓, 草蟲, 趯,...|(69204,[15,22,108...|
|[采, 蘋, 采, 蘋, 南澗, ...|(69204,[3,14,61,6...|
|[甘, 棠, 蔽, 芾, 甘, 棠...|(69204,[798,1016,...|
|[行, 露, 厭, 浥, 行, 露...|(69204,[13,14,36,...|
|[羔羊, 羊羔, 羔羊, 皮, 素...|(69204,[137,474,7...|
|[殷, 雷, 殷, 雷, 在, 南...|(69204,[7,9,55,71...|
|[摽, 有, 梅, 摽, 有, 梅...|(69204,[3,13,123,...|
+--------------------+--------------------+
only showing top 20 rows
           
idfModel.idf.toArray()
           
array([ 0.15462493,  0.16263591,  1.66914546, ..., 10.49698013,
       10.49698013, 10.49698013])
           
keywords_list_with_idf = list(zip(cv_model.vocabulary, idfModel.idf.toArray()))
keywords_list_with_idf
           
[('', 0.15462493099127095),
 (' ', 0.16263591402670433),
 ('人', 1.6691454619329604),
 ('有', 1.799717916778991),
 ('春', 1.7968822184651163),
 ('雲', 1.8183485926421212),
 ('花', 1.9757949172501144),
 ('歸', 1.966772001442157),
 ('月', 1.9429729653969354),
 ('在', 2.016036071194773),
 ('君', 2.165393886305136),
 ('時', 2.0749771255234006),
 ('風', 2.066434745245324),
 ...]
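Spark ML's IDF estimator uses the smoothed formula idf = ln((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; that is why very common tokens like '' score near 0 above. A quick check with toy numbers:

```python
import math

def spark_idf(num_docs, doc_freq):
    # Spark ML's smoothed IDF: ln((N + 1) / (df + 1))
    return math.log((num_docs + 1) / (doc_freq + 1))

# a term present in every document scores 0: it carries no signal
print(spark_idf(100, 100))   # 0.0
# a term present in a single document scores high
print(spark_idf(100, 1))     # ~3.92
```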
           
from functools import partial

def _tfidf(partition, kw_list):
    for row in partition:
        words_length = len(set(row.word_list))    # number of distinct words in the document

        for index in row.word_frequency.indices:
            word, idf = kw_list[int(index)] 
            tf = row.word_frequency[int(index)]/words_length   # term frequency
            tfidf = float(tf)*float(idf)    # TF-IDF score for this word
            yield row.poemId, word, tfidf

# partial pre-binds kw_list, so the mapped function only receives the partition
tfidf = partial(_tfidf, kw_list=keywords_list_with_idf)            

keyword_tfidf = cv_result.rdd.mapPartitions(tfidf)
keyword_tfidf = keyword_tfidf.toDF(["poemId","keyword", "tfidf"])
keyword_tfidf.show()
           
+------+-------+--------------------+
|poemId|keyword|               tfidf|
+------+-------+--------------------+
|     1|      在|0.057601030605564936|
|     1|      思| 0.07921524240863861|
|     1|      流| 0.09759119429122505|
|     1|      關| 0.10977667106706135|
|     1|      采|  0.1198127377805513|
|     1|     不得| 0.11853473276288638|
|     1|     君子| 0.14131315456663138|
|     1|     參差| 0.40177176553544663|
|     1|      友|  0.1353286374743755|
|     1|      服| 0.13801581262352733|
|     1|     窈窕|  0.6544337720537399|
|     1|     左右|  0.4968428369265834|
|     1|     鐘鼓|  0.1849036696914497|
|     1|     悠哉|  0.3774377046757293|
|     1|      菜|  0.5753820425157967|
|     1|      荇|  0.5824866990705768|
|     1|     琴瑟|  0.2073800479363288|
|     1|     關關| 0.21292736263464188|
|     1|     寤寐| 0.45325890128424867|
|     1|      雎| 0.24643365580097992|
+------+-------+--------------------+
only showing top 20 rows
           
keyword_tfidf.orderBy(keyword_tfidf.tfidf.desc()).show()
           
+------+-------+------------------+
|poemId|keyword|             tfidf|
+------+-------+------------------+
|     8|      苢| 7.347886090955122|
|     8|      芣| 6.706482578643214|
|     8|     薄言|5.1751067718205785|
| 46406|      耶| 4.948944803640301|
|    36|     式微| 4.750594433822543|
|     8|     采采| 4.598260071527804|
|    84|      萚|4.4562877042617925|
| 47827|     阿房| 4.178456983219809|
|   610|     沒了| 4.037300049975342|
|    11|      麟| 3.801857875632987|
|    48|      彊| 3.759347136507112|
|    18|     委蛇| 3.759347136507112|
|     5|     螽斯| 3.483887053840631|
| 46611|     丹徒| 3.450071181213719|
| 46555|     舉子| 3.367015435302422|
| 46616|     徐聞|3.2679443164586477|
| 47675|    段幹木|3.1935631326872445|
|    48|      奔|3.1468240211335554|
| 65160|      囗| 3.132789280422593|
| 70885|     蓮葉| 3.114219717233316|
+------+-------+------------------+
only showing top 20 rows
           
from pyspark.ml.feature import Word2Vec

# train 10-dimensional word vectors on the segmented poems
word2Vec = Word2Vec(vectorSize=10, inputCol="word_list", outputCol="model")
model = word2Vec.fit(sentence_df)
           
vectors = model.getVectors()
vectors.show()
           
+----+--------------------+
|word|              vector|
+----+--------------------+
| 半日閑|[-0.0561875551939...|
|  箭頭|[0.12245123833417...|
|  臨風|[-0.2168009430170...|
|  琴書|[-0.2475093156099...|
|   邱|[0.05485768243670...|
|  蓮社|[0.09266380965709...|
|  人物|[-0.0869999676942...|
|  長幼|[0.18391117453575...|
|   冉|[0.60794192552566...|
|  石道|[0.09541463851928...|
|   婺|[0.52122080326080...|
|  瀉出|[-0.0349056571722...|
|  黃泉|[-0.0684615746140...|
|  本源|[-0.0193176493048...|
|  自養|[0.06587809324264...|
|  吾國|[-0.0718164145946...|
|  命作|[0.02833748422563...|
|  相憐|[1.22657394967973...|
|  疏狂|[-0.0754516199231...|
|   溱|[0.28383380174636...|
+----+--------------------+
only showing top 20 rows
           
# Join the TF-IDF table with the word vectors
df1 = keyword_tfidf.join(vectors, keyword_tfidf.keyword==vectors.word, "inner")
df1.show()
           
+------+-------+--------------------+----+--------------------+
|poemId|keyword|               tfidf|word|              vector|
+------+-------+--------------------+----+--------------------+
|     1|      在|0.057601030605564936|   在|[-0.0991405770182...|
|     1|      思| 0.07921524240863861|   思|[-0.2970317900180...|
|     1|      流| 0.09759119429122505|   流|[-0.0167482625693...|
|     1|      關| 0.10977667106706135|   關|[0.00378159806132...|
|     1|      采|  0.1198127377805513|   采|[-0.0207617487758...|
|     1|     不得| 0.11853473276288638|  不得|[-0.0765498951077...|
|     1|     君子| 0.14131315456663138|  君子|[-0.0510748177766...|
|     1|     參差| 0.40177176553544663|  參差|[0.17438517510890...|
|     1|      友|  0.1353286374743755|   友|[-0.1573343873023...|
|     1|      服| 0.13801581262352733|   服|[0.43713137507438...|
|     1|     窈窕|  0.6544337720537399|  窈窕|[0.03826018050312...|
|     1|     左右|  0.4968428369265834|  左右|[0.15486374497413...|
|     1|     鐘鼓|  0.1849036696914497|  鐘鼓|[0.12827932834625...|
|     1|     悠哉|  0.3774377046757293|  悠哉|[-0.2879377007484...|
|     1|      菜|  0.5753820425157967|   菜|[0.29936558008193...|
|     1|      荇|  0.5824866990705768|   荇|[0.49636653065681...|
|     1|     琴瑟|  0.2073800479363288|  琴瑟|[0.10872857272624...|
|     1|     關關| 0.21292736263464188|  關關|[0.02008588984608...|
|     1|     寤寐| 0.45325890128424867|  寤寐|[0.22766476869583...|
|     1|      雎| 0.24643365580097992|   雎|[0.00680365599691...|
+------+-------+--------------------+----+--------------------+
only showing top 20 rows
           
# print(keyword_tfidf.count())
# print(df1.count())
           
# Multiply each word's vector by that word's TF-IDF weight
df2 = df1.rdd.map(lambda r:(r.poemId, r.keyword, r.tfidf*r.vector)).toDF(["poemId", "keyword","vector"])
df2.show()
           
+------+-------+--------------------+
|poemId|keyword|              vector|
+------+-------+--------------------+
|     1|      在|[-0.0057105994110...|
|     1|      思|[-0.0235294452493...|
|     1|      流|[-0.0016344829464...|
|     1|      關|[4.15131246485710...|
|     1|      采|[-0.0024875219619...|
|     1|     不得|[-0.0090738213596...|
|     1|     君子|[-0.0072175436189...|
|     1|     參差|[0.07006303968671...|
|     1|      友|[-0.0212918482614...|
|     1|      服|[0.06033104195413...|
|     1|     窈窕|[0.02503875424612...|
|     1|     左右|[0.07694294239002...|
|     1|     鐘鼓|[0.02371931855677...|
|     1|     悠哉|[-0.1086785448600...|
|     1|      菜|[0.17224957892647...|
|     1|      荇|[0.28912690197140...|
|     1|     琴瑟|[0.02254813662401...|
|     1|     關關|[0.00427683555109...|
|     1|     寤寐|[0.10319108292020...|
|     1|      雎|[0.00167664982013...|
+------+-------+--------------------+
only showing top 20 rows
           
# Group by poem and average the weighted word vectors
# Register a temporary table
df2.registerTempTable("tempTable")

def avg_vector(row):
    x = 0
    for v in row.vectors:
        x += v
    # use the average of the weighted word vectors as the poem's vector
    return row.poemId, x/len(row.vectors)

# collect_set is a Hive aggregate function
group_vector = spark.sql("select poemId, collect_set(vector) vectors from tempTable group by poemId").rdd.map(avg_vector).toDF(["poemId", "vector"])
           
# Compute the cosine similarity between two poems
v1 = group_vector.where("poemId=1").select("vector").first().vector
v2 = group_vector.where("poemId=2").select("vector").first().vector

np.dot(v1,v2)/(np.linalg.norm(v1)*(np.linalg.norm(v2)))
           
0.378789818447398
           
group_df = group_vector.toPandas()
group_df
           
poemId vector
0 26 [0.008604375373175262, 0.025260454274981203, -...
1 29 [0.017804960879639398, 0.04290195911035433, 0....
2 474 [0.008394777557195702, 0.025892256004034084, 0...
3 964 [4.312995210865666e-05, 0.04748191617755371, -...
4 1677 [0.014743320294977227, -0.031115929525233794, ...
... ... ...
72406 73102 [0.007128699103552195, 0.008001016583454229, 0...
72407 73148 [0.018791418906944357, -0.026507278938189598, ...
72408 73179 [-0.002067103872106531, -0.021090942436392375,...
72409 73240 [-0.005216105655755288, -0.018684123660531675,...
72410 73245 [0.0012883507518144945, -0.005777893401733697,...

72411 rows × 2 columns

           
def cosine_similarity(v1, v2):
    return np.dot(v1,v2)/(np.linalg.norm(v1)*(np.linalg.norm(v2)))

# for i in range(1, 73282):
#     v1 = group_df[group_df["poemId"] == 1]["vector"].values[0]
#     for j in range(1, 73282):
#         if i != j:
            
# cosine_similarity(group_df[group_df["poemId"] == 1]["vector"].values[0], group_df[group_df["poemId"] == 2]["vector"].values[0])
           
           
group_df[group_df["poemId"] == 49]["vector"].values
           
array([], dtype=object)
           
none_list = []

# Collect the poem ids that ended up without a vector
for i in range(1, 73282):
    if not group_df[group_df["poemId"] == i]["vector"].values:
        none_list.append(i)

# ids that do have a vector (72411 of 73281)
exist_id = [i for i in range(1, 73282) if i not in none_list]
           
(72411, 870)
           
from tqdm import tqdm
import random

content_base_dict = {}

# First narrow the candidate set (random sampling for now; to be replaced by tag-based filtering)
for i in tqdm(exist_id):
    v1 = group_df[group_df["poemId"] == i]["vector"].values[0]
    random_id = random.choices(exist_id, k=1000)   # TODO: select candidates by tag instead
    store_list = []
    for j in random_id:
        if i != j:
            v2 = group_df[group_df["poemId"] == j]["vector"].values[0]
            sim = cosine_similarity(v1, v2)
            store_list.append((j, sim))
    value = sorted(store_list, key=lambda x:x[1], reverse=True)[:100]
    content_base_dict[i] = value
    break   # smoke test: only process the first poem
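Sorting the whole similarity list just to keep the top 100 is O(n log n); heapq.nlargest produces the same result in O(n log k), which adds up when this runs once per poem. A sketch on dummy (id, similarity) pairs:

```python
import heapq

store_list = [(101, 0.42), (102, 0.91), (103, 0.10), (104, 0.77), (105, 0.55)]

# same result as sorted(store_list, key=lambda x: x[1], reverse=True)[:3]
top3 = heapq.nlargest(3, store_list, key=lambda x: x[1])
print(top3)   # [(102, 0.91), (104, 0.77), (105, 0.55)]
```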
           
poem_to_tag = pd.read_csv("/home/liang/Desktop/python_file/source.csv")
poem_to_tag = poem_to_tag[["poemId", "poemDynasty", "poemAuthorId", "poemTagNames"]]
poem_to_tag
           
poemId poemDynasty poemAuthorId poemTagNames
0 1 先秦 0 古詩三百首,國中古詩,詩經,愛情
1 2 先秦 0 詩經,寫人
2 3 先秦 0 詩經,懷人
3 4 先秦 0 詩經,祝福
4 5 先秦 0 詩經,寫鳥,祝福
... ... ... ... ...
72406 73277 宋代 536 歌頌,古人,傷懷,國家
72407 73278 宋代 536 抒情,追憶,思念
72408 73279 宋代 196 羁旅,抒情,思鄉
72409 73280 魏晉 616 荷花,寫景,夜晚,心情
72410 73281 宋代 272 晚上,乘船,寫景,抒情

72411 rows × 4 columns

poem_to_tag["poemAuthorId"] += 100000   # offset author ids into their own range
poem_to_tag
           
poemId poemDynasty poemAuthorId poemTagNames
0 1 先秦 100000 古詩三百首,國中古詩,詩經,愛情
1 2 先秦 100000 詩經,寫人
2 3 先秦 100000 詩經,懷人
3 4 先秦 100000 詩經,祝福
4 5 先秦 100000 詩經,寫鳥,祝福
... ... ... ... ...
72406 73277 宋代 100536 歌頌,古人,傷懷,國家
72407 73278 宋代 100536 抒情,追憶,思念
72408 73279 宋代 100196 羁旅,抒情,思鄉
72409 73280 魏晉 100616 荷花,寫景,夜晚,心情
72410 73281 宋代 100272 晚上,乘船,寫景,抒情

72411 rows × 4 columns

import pymongo

client = pymongo.MongoClient("localhost", 27017)
mycol = client['RS']
houxuanji = mycol['online_recall']
           
from tqdm import tqdm
import random


content_base_dict = {}

def findFromMongoDB(collection, tag):
    temp_set = set()
    query_dict = {
        "cateId":tag,
    }
    found_json = collection.find(query_dict)
    for _json in found_json:
        items = _json.get("incluted")
        items_len = _json.get("len")
        if items_len < 100: # keep small categories whole; sample 100 ids from larger ones
            pass
        else:
            items = random.choices(items, k=100)
        temp_set = temp_set.union(items)
        
    return temp_set
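findFromMongoDB only depends on the shape of the returned documents, so the sampling logic can be exercised against an in-memory stand-in for the collection; FakeCollection and find_candidates below are test scaffolding, while the cateId / incluted / len field names follow the code above:

```python
import random

class FakeCollection:
    """Minimal stand-in for a pymongo collection."""
    def __init__(self, docs):
        self.docs = docs

    def find(self, query):
        return [d for d in self.docs if d["cateId"] == query["cateId"]]

def find_candidates(collection, tag, k=100):
    temp_set = set()
    for doc in collection.find({"cateId": tag}):
        items = doc.get("incluted")
        if doc.get("len") >= k:
            # large category: keep a random sample of k ids
            items = random.choices(items, k=k)
        temp_set |= set(items)
    return temp_set

col = FakeCollection([{"cateId": "詩經", "incluted": [1, 2, 3], "len": 3}])
print(find_candidates(col, "詩經"))   # {1, 2, 3}
print(find_candidates(col, "抒情"))   # set()
```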

# First narrow the candidate set using the poem's tags, dynasty, and author
for i in tqdm(exist_id):
    v1 = group_df[group_df["poemId"] == i]["vector"].values[0]
    # poem id -> poem tags -> filter candidates by tag
    temp = poem_to_tag[poem_to_tag["poemId"] == i]
    tags = temp["poemTagNames"].values[0]
    poemId_set = set()
    if pd.notna(tags):    # poems without tags have NaN here
        tags = tags.split(',')
        for tag in tags:
            ids = findFromMongoDB(houxuanji, tag)
            poemId_set = poemId_set.union(ids)
    
    dynasty = temp["poemDynasty"].values[0]
    author = temp["poemAuthorId"].values[0]
    ids1 = findFromMongoDB(houxuanji, dynasty)
    ids2 = findFromMongoDB(houxuanji, int(author))
    poemId_set = poemId_set.union(ids1)
    poemId_set = poemId_set.union(ids2)
    
    print("length:",len(poemId_set))
    store_list = []
    for j in poemId_set:
        if i != j:
            v2 = group_df[group_df["poemId"] == j]["vector"].values[0]
            sim = cosine_similarity(v1, v2)
            store_list.append((j, sim))
            
    value = sorted(store_list, key=lambda x:x[1], reverse=True)[:100]
    content_base_dict[i] = [tup[0] for tup in value]
           
           
content_base_dict
           
{1: [48013,
  48005,
  42372,
  606,
  21,
  70919,
  6525,
  69012,
  48107,
  47991,
  15,
  47642,
  47682,
  10521,
  72453,
  47981,
  34,
  47873,
  70886,
  69221,
  127,
  45792,
  70933,
  137,
  7734,
  69077,
  10374,
  71030,
  2369,
  47685,
  47692,
  42412,
  2271,
  70978,
  7,
  68055,
  68709,
  20285,
  69086,
  47708,
  69068,
  48147,
  9849,
  47728,
  71708,
  9098,
  78,
  71043,
  30793,
  39490,
  69053,
  47813,
  42180,
  71002,
  70940,
  70897,
  9315,
  172,
  69071,
  6606,
  47044,
  70999,
  21173,
  39469,
  30,
  223,
  59,
  67634,
  16,
  69055,
  45951,
  5749,
  151,
  47696,
  24030,
  71203,
  213,
  71780,
  48134,
  150,
  69088,
  62369,
  71205,
  69067,
  159,
  21137,
  58,
  47733,
  59106,
  47966,
  52809,
  67872,
  111,
  6137,
  70765,
  69142,
  71646,
  48041,
  67880,
  4933]}
           
import pickle
           
           
