I have long been fascinated by the LDA model, and while learning the gensim library I recently found fairly complete and meaningful tutorials on it — hence this series, which covers three models: Latent Dirichlet Allocation, the Author-Topic Model, and Dynamic Topic Models.
pyLDA series model | Explanation | Features |
---|---|---|
ATM (Author-Topic Model) | Adds supervised 'author' labels: each author's preference over topics. Drawbacks: chained topics, intruded words, random topics, and unbalanced topics (see Mimno and co-authors, 2011) | Author topic preferences, word topic preferences, similar-author recommendation, visualization |
LDA (Latent Dirichlet Allocation) | Topic model | Document topic preferences, word topic preferences, topic content display, topic-content matrix |
DTM (Dynamic Topic Models) | Adds a time dimension: topics shift over time | Time-topic term matrix, topic-time term matrix, document topic preferences, new-document prediction, cross-time document similarity by topic |
The examples and data mainly come from the jupyter notebook on gensim's official GitHub.
For a detailed explanation see: Dynamic Topic Modeling in Python
.
1. Theoretical introduction
Paper sources:
David Blei does a good job explaining the theory behind this in this Google talk, or you can directly read the paper on DTM by Blei and Lafferty.
Reference blog post: This
What it does:
Dynamic Topic Models (DTM) leverages the knowledge of different documents belonging to a different time-slice in an attempt to map how the words in a topic change over time.
In addition, DTM keeps the same K topics in every time period, and its performance is 5-7 times that of the C++ version of the DTM model.
- (1) You want to find semantically similar documents; one from the very beginning of your time-line, and one at the very end. That is: which documents are similar between one time point and another? Within a time-based topic, the keywords themselves change accordingly. Traditional similarity techniques cannot achieve this: a topic's core content stays the same while its keywords drift over time, hence the name Time-corrected Document Similarity.
- (2) The second capability: within a topic, observe how the keywords change over time. Early on the words in a topic are rather scattered; over time they grow more and more mature. For the related theory, see:

- (3) The usual capability: the points of interest would be in what the topics are and how the documents are made up of these topics.
Function / model | Purpose |
---|---|
print_topics | The 5 topics for a given time period |
print_topic_times | For one topic, its key terms in each of the 3 time periods |
doc_topics | Per-document topic preferences (the usual output), equivalent to get_document_topics in LDA. Returns e.g.: [ 5.46298825e-05 2.18468637e-01 5.46298825e-05 5.46298825e-05 7.81367474e-01] — referred to below as the topic-preference vector |
model[] new-document prediction | ldaseq[dictionary.doc2bow(['economy', 'bank', 'mobile', 'phone', 'markets', 'buy', 'football', 'united', 'giggs'])] returns: [ 0.00110497 0.00110497 0.00110497 0.00110497 0.99558011] |
hellinger — cross-time document similarity by topic | hellinger(doc_football_1, doc_football_2) returns: 0.0062680905375190245 |
.
2. Inputs required by the Dynamic Topic Models model
Input | Explanation | Example |
---|---|---|
corpus | Familiar to anyone who has used gensim | [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2)]] |
dictionary | Familiar to anyone who has used gensim; dictionary = Dictionary(docs) | Format of docs: each document becomes a token list like the following, and all of them are gathered into one list: ['probabilistic', 'characterization', 'of', 'neural', 'model', 'computation', 'richard', 'david_rumelhart', '-PRON-_grateful', 'helpful_discussion', 'early_version', 'like_thank', 'brown_university'] |
id2word | Mapping from each word ID to its word, built from dictionary: id2word = dictionary.id2token | {0: ' 0', 1: ' American nstitute of Physics 1988 ', 2: ' Fig', 3: ' The', 4: '1 1', 5: '2 2', 6: '2 3', 7: 'CA 91125 ', 8: 'CONCLUSIONS '} |
time-slices | Time record, in chronological order | time_slice = [438, 430, 456] means that over three months, the first month has 438 documents, the second 430, and the third 456. Any time unit (year, month, day, etc.) works. |
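To make the time-slices input concrete, here is a minimal sketch of building an ordered time_slice by counting documents per period. The dated documents below are hypothetical stand-ins, not from the tutorial corpus:

```python
from collections import Counter

# Hypothetical dated corpus: (period, tokenized document) pairs,
# already sorted by period.
dated_docs = [
    ("2005-01", ["blair", "labour", "election"]),
    ("2005-01", ["film", "awards"]),
    ("2005-02", ["government", "minister"]),
    ("2005-03", ["football", "united"]),
]

# time_slice is simply the document count per period, in chronological order
counts = Counter(period for period, _ in dated_docs)
time_slice = [counts[p] for p in sorted(counts)]
print(time_slice)  # -> [2, 1, 1]
```

The only requirement is that the corpus itself is sorted in the same chronological order, so that the counts line up with contiguous blocks of documents.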
.
3. Dynamic Topic Models function reference
3.1 The main function
LdaSeqModel(corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)
For the common parameters see the earlier post in this series: pyLDA series | Topic models in gensim (Latent Dirichlet Allocation)
Parameters that differ:
- time_slice (the most important): time_slice = [438, 430, 456] means that over three months, the first month has 438 documents, the second 430, and the third 456. Any time unit (year, month, day, etc.) works.
- chain_variance: how quickly topics evolve is governed by this variance parameter; it is in effect the Gaussian variance on LDA's Beta parameter. The default chain_variance is 0.005; increasing it speeds up topic evolution.
- initialize: two ways to train a DTM. The first trains directly from the corpus; the second feeds in selected sufficient-statistics matrices from an already-trained LDA model.
The simplest training call:
%time ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5)
3.2 Two training modes
- The first is the minimal mode shown above, i.e. initialize='gensim':
ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5)
- The second mode, initialize='own': you already have a trained LDA model and can extract some of its parameters to feed into the DTM, which needs a little adjustment:
You can use your own sstats of an LDA model previously trained as
well by specifying 'own' and passing a np matrix through sstats. If
you wish to just pass a previously used LDA model, pass it through
lda_model. Shape of sstats is (vocab_len, num_topics).
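As a sketch of the initialize='own' path: a trained gensim LdaModel stores its sufficient statistics in lda.state.sstats with shape (num_topics, vocab_len), so a transpose is needed to match the (vocab_len, num_topics) shape quoted above. The matrix below is a random stand-in, not real statistics:

```python
import numpy as np

# Random stand-in for lda.state.sstats of a previously trained gensim
# LdaModel, which has shape (num_topics, vocab_len).
num_topics, vocab_len = 5, 100
lda_sstats = np.random.rand(num_topics, vocab_len)

# LdaSeqModel's sstats argument expects shape (vocab_len, num_topics),
# so transpose before passing it in.
sstats = lda_sstats.T
assert sstats.shape == (vocab_len, num_topics)

# The actual training call would then look like (not run here):
# ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary,
#                                  time_slice=time_slice, num_topics=num_topics,
#                                  initialize='own', sstats=sstats)
```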
.
4. Helper functions and features
4.1 Helper function 1: print_topics
print_topics(time=0, top_terms=20)
An overview of the 5 topics in one period, where time is the period index. The official example trains over three periods (three months), so time can be 0, 1 or 2. The returned format is (word, word_probability).
from gensim.models import ldaseqmodel
from gensim.corpora import Dictionary, bleicorpus
import numpy
from gensim.matutils import hellinger
%time ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5)
ldaseq.print_topics(time=1)
>>> [[(u'blair', 0.0062483544617615841),
(u'labour', 0.0059223974769828398)],
[(u'film', 0.0050860317914258307),
(u'minister', 0.0044210871797581213)],
[(u'government', 0.0039312390246381002),
(u'election', 0.0038787664682510613)],
[(u'prime', 0.0038773564950156151),
(u'party', 0.0036428824115890975)],
[(u'brown', 0.0034964052703373195),
(u'howard', 0.0032628247571913969)]]
The return shows, for one period, the key words of each of the 5 topics — here for the period labelled '1' in the call above.
4.2 Helper function 2: print_topic_times
print_topic_times(topic, top_terms=20)
The same topic across the three periods:
ldaseq.print_topic_times(topic=0)
>>> [[(u'blair', 0.0061528696567048772),
(u'labour', 0.0054905202853533239)],
[(u'film', 0.0051444037762632807),
(u'minister', 0.0043556939573101399)],
[(u'government', 0.0038839073759585367),
(u'election', 0.0037979240057325865)]]
The return shows, for the chosen topic, its important terms in each of the 3 periods.
4.3 Helper function 3: doc_topics
doc_topics(doc_number)
Per-document topic preferences (the usual output):
doc = ldaseq.doc_topics(558) # check the topic distribution of the 558th document in the corpus
print (doc)
>>> [ 5.46298825e-05 2.18468637e-01 5.46298825e-05 5.46298825e-05 7.81367474e-01]
Among the five topics, document 558 prefers topics 1 and 4 (0-indexed).
This can be verified by decoding the corpus entry:
words = [dictionary[word_id] for word_id, count in corpus[558]]
print (words)
>>> [u'set', u'time,"', u'chairman', u'decision', u'news', u'director', u'former', u'vowed', u'"it', u'results', u'club', u'third', u'home', u'paul', u'saturday.', u'south', u'conference']
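To turn the raw topic-preference vector into ranked topic ids, a small numpy snippet works (using the values printed above for document 558):

```python
import numpy as np

# topic-preference vector returned above for document 558 (copied verbatim)
doc = np.array([5.46298825e-05, 2.18468637e-01, 5.46298825e-05,
                5.46298825e-05, 7.81367474e-01])

ranked = np.argsort(doc)[::-1]  # topic ids, most preferred first
print(ranked[:2])  # topics 4 and 1 dominate
```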
4.4 Predicting a new document
When a new document arrives, how do we predict its topics?
doc_football_1 = ['economy', 'bank', 'mobile', 'phone', 'markets', 'buy', 'football', 'united', 'giggs']
doc_football_1 = dictionary.doc2bow(doc_football_1)
doc_football_1 = ldaseq[doc_football_1]
print (doc_football_1)
>>> [ 0.00110497 0.00110497 0.00110497 0.00110497 0.99558011]
The steps: first tokenize the document doc_football_1, vectorize it with dictionary.doc2bow, and then predict with the ldaseq model. As the result shows (similar to doc_topics), the new document fits topic 4 best.
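A quick sanity check on the prediction above: the returned vector is a probability distribution over the 5 topics, so it sums to 1, and its argmax identifies the best-fitting topic (values copied from the output above):

```python
import numpy as np

# prediction vector printed above for the new document (copied verbatim)
pred = np.array([0.00110497, 0.00110497, 0.00110497, 0.00110497, 0.99558011])

assert abs(pred.sum() - 1.0) < 1e-6  # a proper probability distribution
best_topic = int(pred.argmax())
print(best_topic)  # 4: the topic the new document fits best
```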
4.5 Cross-time document similarity by topic (the core feature)
One of the handier uses of DTM topic modelling is that we can compare documents across different time ranges and see how similar they are in topic terms. This is very useful when the words in those time slices do not necessarily overlap.
doc_football_2 = ['arsenal', 'fourth', 'wenger', 'oil', 'middle', 'east', 'sanction', 'fluctuation']
doc_football_2 = dictionary.doc2bow(doc_football_2)
doc_football_2 = ldaseq[doc_football_2]
>>> array([ 0.00141844, 0.00141844, 0.00141844, 0.00141844, 0.99432624])
hellinger(doc_football_1, doc_football_2)
>>> 0.0062680905375190245
The steps are much like the prediction above: vectorize each document with dictionary.doc2bow, turn it into a topic-preference vector (1x5), then compute the similarity of the two documents' topic-preference vectors with the Hellinger distance.
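For dense vectors, gensim's hellinger can be reproduced in a few lines of numpy; plugging in the two topic-preference vectors printed above recovers the small distance reported, confirming the two documents are topically very similar:

```python
import numpy as np

def hellinger_dense(p, q):
    """Hellinger distance between two discrete probability distributions
    (the same formula gensim.matutils.hellinger applies to dense input)."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

# topic-preference vectors printed above for the two documents
doc_football_1 = [0.00110497, 0.00110497, 0.00110497, 0.00110497, 0.99558011]
doc_football_2 = [0.00141844, 0.00141844, 0.00141844, 0.00141844, 0.99432624]

print(hellinger_dense(doc_football_1, doc_football_2))  # ~0.00627 -> very similar
```

A distance near 0 means near-identical topic mixtures; a distance near 1 means the documents put their probability mass on disjoint topics.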
4.6 Visualising the model with DTMvis
from gensim.models.wrappers.dtmmodel import DtmModel
from gensim.corpora import Dictionary, bleicorpus
import pyLDAvis
# dtm_path = "/Users/bhargavvader/Downloads/dtm_release/dtm/main"
# dtm_model = DtmModel(dtm_path, corpus, time_slice, num_topics=5, id2word=dictionary, initialize_lda=True)
# dtm_model.save('dtm_news')
# if we've saved before simply load the model
# ldaseq_chain.save('dtm_news')
dtm_model = DtmModel.load('dtm_news')
doc_topic, topic_term, doc_lengths, term_frequency, vocab = dtm_model.dtm_vis(time=0, corpus=corpus)
vis_wrapper = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)
pyLDAvis.display(vis_wrapper)
.
5. Topic coherence evaluation
from gensim.models.coherencemodel import CoherenceModel
import pickle
# we just have to specify the time-slice we want to find coherence for.
topics_wrapper = dtm_model.dtm_coherence(time=0) # the wrapper (reference) model
topics_dtm = ldaseq.dtm_coherence(time=2) # the model trained here
# running u_mass coherence on our models
cm_wrapper = CoherenceModel(topics=topics_wrapper, corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm_DTM = CoherenceModel(topics=topics_dtm, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print ("U_mass topic coherence")
print ("Wrapper coherence is ", cm_wrapper.get_coherence())
print ("DTM Python coherence is", cm_DTM.get_coherence())
# to use 'c_v' we need texts, which we have saved to disk.
texts = pickle.load(open('Corpus/texts', 'rb'))
cm_wrapper = CoherenceModel(topics=topics_wrapper, texts=texts, dictionary=dictionary, coherence='c_v')
cm_DTM = CoherenceModel(topics=topics_dtm, texts=texts, dictionary=dictionary, coherence='c_v')
print ("C_v topic coherence")
print ("Wrapper coherence is ", cm_wrapper.get_coherence())
print ("DTM Python coherence is", cm_DTM.get_coherence())
There is little explanation for this part; for details see:
We also, however, have the option of passing our own model or suff stats values. Our final DTM results are heavily influenced by what we pass over here. We already know what a "Good" or "Bad" LDA model is (if not, read about it here).