A Detailed Look at the Doc2Vec Model in the Gensim Library
models.doc2vec – Doc2vec paragraph embeddings
TaggedDocument: an input document text is converted to the form TaggedDocument(text, [i]), where i is the document's tag (index).
class gensim.models.doc2vec.Doc2Vec, with its main parameters:
documents=None: the input corpus, a list of TaggedDocument objects.
corpus_file=None: path to a corpus file in LineSentence format.
dm_mean=None: 0 uses the sum of the context word vectors, 1 uses their mean; only applies when dm is used in non-concatenative mode (dm_concat=0).
dm=1: choice of training algorithm; dm=1 uses PV-DM, dm=0 uses PV-DBOW.
dbow_words=0: 1 trains word vectors (in skip-gram fashion) at the same time as the DBOW document-vector training; 0 trains only document vectors (faster).
dm_concat=0: 1 uses concatenation of the context vectors; 0 uses their sum/mean.
negative=0: if >0, negative sampling is used (the number of noise words is usually between 5 and 20); if 0, no noise words are used.
epochs (int, optional): number of training iterations.
hs ({1, 0}, optional): 1 uses hierarchical softmax; 0, with negative non-zero, uses negative sampling.
dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs: remaining parameters.
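As a reference, here is a minimal training sketch based on the parameters above; the toy corpus and the specific values chosen for vector_size, window, min_count and epochs are illustrative assumptions, not part of the original text:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_texts = [
    "human interface computer",
    "survey of graph minors",
    "user response time on the system",
]

# Each document becomes TaggedDocument(words, [i]), with i as the document tag.
documents = [TaggedDocument(text.split(), [i]) for i, text in enumerate(raw_texts)]

# dm=1 selects PV-DM; the remaining hyperparameters are illustrative choices.
model = Doc2Vec(documents, dm=1, vector_size=50, window=2, min_count=1, epochs=40)
print(model.docvecs[0])  # paragraph vector of the first training document

The later snippets in this article reuse this toy model.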
A model trained with Doc2Vec() contains the following objects:
(1) wv (Word2VecKeyedVectors):
This word2vec object stores the mapping between words and vectors. It is used for vector lookups, distance and similarity computations, and similar operations.
Methods (a short usage sketch follows this list):
① closer_than(entity1, entity2)
Get all entities that are closer to entity1 than entity2 is to entity1.
② cosine_similarities(vector_1, vectors_all)
Compute cosine similarities between one vector and a set of other vectors.
③ distance(w1, w2)
Compute cosine distance between two words.
④ distances(word_or_vector, other_words=())
Compute cosine distances from given word or vector to all words in other_words.
If other_words is empty, return distances between word_or_vector and all words
in the vocabulary.
⑤ get_vector(word)
Get the entity’s representations in vector space, as a 1D numpy array.
⑥ most_similar_cosmul(positive=None, negative=None, topn=10)
Find the top-N most similar words, using the multiplicative combination objective.
⑦ most_similar_to_given(entity1, entities_list)
Get the entity from entities_list most similar to entity1.
⑧ n_similarity(ws1, ws2)
Compute cosine similarity between two sets of words.
⑨ relative_cosine_similarity(wa, wb, topn=10)
Compute the relative cosine similarity between two words, given the top-n most similar words.
⑩ save(path) Save the KeyedVectors. load(path) Load the KeyedVectors.
⑪ wmdistance(document1, document2)
Compute the Word Mover's Distance between two documents.
⑫ word_vec(word, use_norm=False)
Get word representations in vector space, as a 1D numpy array.
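A short usage sketch for a few of these wv methods, assuming the toy model trained above and that the queried words appear in its vocabulary (the specific words are illustrative):

vec = model.wv.get_vector("human")               # word vector as a 1D numpy array
dist = model.wv.distance("human", "computer")    # cosine distance between two words
sims = model.wv.most_similar_cosmul(positive=["human"], topn=3)
print(sims)

model.wv.save("doc2vec_words.kv")                # persist only the word KeyedVectors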
(2) docvecs (Doc2VecKeyedVectors):
This object contains the paragraph vectors. Remember that the only difference between this model and word2vec is that, in addition to the word vectors, we also include paragraph embeddings to capture each paragraph.
Its methods are essentially the same as those of wv.
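A sketch of working with the paragraph vectors, again assuming the toy model above; the unseen sentence is an illustrative assumption:

first_doc_vec = model.docvecs[0]   # paragraph vector of the document tagged 0

# infer_vector (a method of the Doc2Vec model itself) produces a vector for unseen text;
# the most similar training documents can then be looked up by tag.
new_vec = model.infer_vector("human computer interface".split())
similar_docs = model.docvecs.most_similar(positive=[new_vec], topn=2)
print(similar_docs)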
(3) vocabulary (Doc2VecVocab):
This object represents the model's vocabulary (dictionary). Besides keeping track of all unique words, it also provides extra functionality, such as sorting words by frequency or discarding very rare words.
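In the gensim 3.x API documented here, the word-frequency information tracked by this vocabulary object can be inspected through model.wv.vocab; the snippet below is a small sketch of that, and printing the five most frequent words is an arbitrary choice:

# Words sorted by descending corpus frequency (each vocab entry carries a .count attribute).
for word, entry in sorted(model.wv.vocab.items(), key=lambda kv: -kv[1].count)[:5]:
    print(word, entry.count)

Very rare words are dropped at build time, e.g. passing min_count=2 to Doc2Vec discards words seen fewer than 2 times (the value 2 is for illustration).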
Methods of Doc2Vec itself:
①most_similar(**kwargs) Deprecated, use self.wv.most_similar() instead.
②most_similar_cosmul(**kwargs) Deprecated, use self.wv.most_similar_cosmul() instead.
③n_similarity(**kwargs) Deprecated, use self.wv.n_similarity() instead.
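Since these model-level calls are deprecated, the preferred forms go through model.wv directly; a brief sketch, reusing the illustrative vocabulary from the earlier examples:

sims = model.wv.most_similar("human", topn=3)                                  # instead of model.most_similar(...)
group_sim = model.wv.n_similarity(["human", "computer"], ["graph", "minors"])  # instead of model.n_similarity(...)
print(sims, group_sim)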