A Detailed Guide to the Doc2Vec Model in Gensim

models.doc2vec – Doc2vec paragraph embeddings

TaggedDocument: each input document text is wrapped as TaggedDocument(text, [i]), where i is the document index (its tag).

class gensim.models.doc2vec.Doc2Vec(documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)

Main parameters:

documents: the training corpus, a list of TaggedDocument objects.

corpus_file: path to a corpus file in LineSentence format.

dm: the training algorithm; dm=1 uses PV-DM, dm=0 uses PV-DBOW.

dm_mean: 0 to use the sum of the context word vectors, 1 to use their mean; only applies when PV-DM is used in non-concatenative mode (dm_concat=0).

dbow_words: 1 to train word vectors (in skip-gram fashion) alongside the DBOW document vectors; 0 to train only document vectors (faster).

dm_concat: 1 to use concatenation of the context vectors, 0 to use their sum/mean.

negative: if > 0, negative sampling is used with this many noise words (typically 5 to 20); if 0, no noise words are used.

epochs (int, optional): number of training iterations.

hs ({1, 0}, optional): 1 to use hierarchical softmax; 0, with negative non-zero, to use negative sampling.
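
A minimal training sketch following the signature above; the toy corpus, the integer tag scheme, and the hyperparameter values (vector_size, window, min_count, epochs) are illustrative assumptions rather than part of the original post, and a gensim 3.x API is assumed:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_texts = [
    "human machine interface for lab abc computer applications",
    "a survey of user opinion of computer system response time",
    "the eps user interface management system",
]

# Each document text becomes TaggedDocument(words, [i]), i being the document index.
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(raw_texts)]

# dm=1 selects PV-DM; negative=5 enables negative sampling with 5 noise words.
model = Doc2Vec(documents, dm=1, vector_size=50, window=2,
                min_count=1, epochs=40, negative=5)

print(model.docvecs[0][:5])   # first 5 dimensions of the vector for document 0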

A model trained with Doc2Vec() contains the following objects:

(1) wv (Word2VecKeyedVectors):

This word2vec object stores the mapping between words and their vectors. It is used for vector lookups, and for distance, similarity, and other such computations.

Methods (a short usage sketch follows this list):

① closer_than(entity1, entity2)

Get all entities that are closer to entity1 than entity2 is to entity1.

② cosine_similarities(vector_1, vectors_all)

Compute cosine similarities between one vector and a set of other vectors.

③ distance(w1, w2)

Compute cosine distance between two words.

④ distances(word_or_vector, other_words=())

Compute cosine distances from the given word or vector to all words in other_words. If other_words is empty, return distances between word_or_vector and all words in the vocabulary.

⑤ get_vector(word)

Get the entity’s representations in vector space, as a 1D numpy array.

⑥ most_similar_cosmul(positive=None, negative=None, topn=10)

Find the top-N most similar words, using the multiplicative combination objective.

⑦ most_similar_to_given(entity1, entities_list)

Get the entity from entities_list most similar to entity1.

⑧ n_similarity(ws1, ws2)

Compute cosine similarity between two sets of words.

⑨ relative_cosine_similarity(wa, wb, topn=10)

Compute the relative cosine similarity between two words, given their top-n most similar words.

⑩ save(path), load(path)

Save or load the KeyedVectors object.

⑪ wmdistance(document1, document2)

Compute the Word Mover’s Distance between two documents.

⑫ word_vec(word, use_norm=False)

Get word representations in vector space, as a 1D numpy array.
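
An illustrative sketch of a few of the calls above, run against the toy model trained earlier; the specific words are assumptions that only make sense for that toy vocabulary:

vec = model.wv.get_vector("computer")           # 1D numpy vector for a word
d = model.wv.distance("computer", "system")     # cosine distance between two words
s = model.wv.n_similarity(["user", "interface"], ["computer", "system"])
top = model.wv.most_similar_cosmul(positive=["computer"], topn=3)

model.wv.save("doc2vec_words.kv")               # persist only the word vectors
from gensim.models import KeyedVectors
wv = KeyedVectors.load("doc2vec_words.kv")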

(2) docvecs (Doc2VecKeyedVectors):

This object holds the paragraph (document) vectors. Remember that the only difference between this model and word2vec is that, in addition to word vectors, paragraph embeddings are included to capture each paragraph.

Its methods are essentially the same as those of wv.
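
A small sketch of document-vector lookups, using the gensim 3.x naming where this object is model.docvecs; infer_vector and the integer tags reuse the toy setup above:

new_doc = "computer system response time".split()
inferred = model.infer_vector(new_doc)          # vector for an unseen document

# Trained documents most similar to the inferred vector (tags are the ints used above).
print(model.docvecs.most_similar([inferred], topn=2))

# Cosine similarity between two trained documents, looked up by tag.
print(model.docvecs.similarity(0, 1))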

(3) vocabulary (Doc2VecVocab):

This object represents the model's vocabulary (dictionary). Besides keeping track of all unique words, it provides extra functionality such as sorting words by frequency or discarding very rare words.
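
A brief sketch of how the rare-word trimming shows up in practice; min_count and the per-word count attribute follow the gensim 3.x API and are assumptions about the setup, not part of the original post:

# Words occurring fewer than min_count times are dropped from the vocabulary.
model2 = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40)
print(len(model2.wv.vocab), "words kept after trimming")

# Per-word frequency counts collected while building the vocabulary.
counts = {w: v.count for w, v in model2.wv.vocab.items()}
print(sorted(counts.items(), key=lambda kv: -kv[1])[:5])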

Methods on Doc2Vec itself:

① most_similar(**kwargs): Deprecated, use self.wv.most_similar() instead.

② most_similar_cosmul(**kwargs): Deprecated, use self.wv.most_similar_cosmul() instead.

③ n_similarity(**kwargs): Deprecated, use self.wv.n_similarity() instead.
