Chinese Word Frequency Statistics and Word Cloud Generation

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2822

Chinese Word Frequency Statistics

1. Download a full-length Chinese novel.

The novel chosen is 《射雕英雄傳》 (The Legend of the Condor Heroes).

2. Read the text to be analyzed from a file.
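
A minimal sketch of this step, assuming the novel is saved as a plain text file. The helper name `read_text` and the fallback to GB18030 (common for Chinese text files that are not UTF-8) are illustrative assumptions, not part of the assignment.

```python
from pathlib import Path


def read_text(file_path, encodings=("utf-8", "gb18030")):
    """Return the file's contents, trying each encoding in turn."""
    raw = Path(file_path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"could not decode {file_path} with {encodings}")
```

For example, `text = read_text("zhongwen.txt")` would load the file used later in this post.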

3. Install jieba and use it for Chinese word segmentation.

pip install jieba

import jieba

jieba.lcut(text)

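4. Update the dictionary by adding terms specific to the text being analyzed.

jieba.add_word('天罡北鬥陣')  # add terms one at a time

jieba.load_userdict(word_dict)  # word_dict: path to a dictionary text file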

Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/

Conversion code (Sogou .scel to plain text): scel_to_text
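
The file passed to `jieba.load_userdict` is plain text with one entry per line: the word, optionally followed by a frequency and a part-of-speech tag, separated by spaces. The sketch below writes such a file from a list of terms. The helper name `write_userdict` is hypothetical, and it writes only the word column.

```python
def write_userdict(words, file_path):
    """Write one dictionary entry per line and return how many were written."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:  # skip blanks and duplicates
            seen.add(word)
            lines.append(word)
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

The resulting file could then be loaded with `jieba.load_userdict(file_path)`.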

5. Generate the word frequency counts.

6. Sort by frequency.
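
Steps 5 and 6 can be sketched with `collections.Counter` from the standard library, which counts every token in a single pass and sorts by frequency. The helper name `top_words` and the sample tokens are illustrative assumptions; in practice the token list would come from `jieba.lcut`.

```python
from collections import Counter


def top_words(tokens, n=20):
    """Count each token and return the n most frequent as (word, count) pairs."""
    return Counter(tokens).most_common(n)


sample = ["郭靖", "黃蓉", "郭靖", "的", "黃蓉", "郭靖"]
print(top_words(sample, 2))  # [('郭靖', 3), ('黃蓉', 2)]
```

Counting in one pass avoids calling `list.count` once per distinct word, which gets slow on a full-length novel.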

7. Exclude function words such as pronouns, articles, and conjunctions (stop words).

stops
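
As a rough sketch of the filtering step, assuming the stop-word file holds one word per line: the helper names `parse_stopwords` and `remove_stopwords` are hypothetical. A set is used so that each membership test takes constant time.

```python
def parse_stopwords(raw_text):
    """Turn the contents of a one-word-per-line file into a set."""
    return {line.strip() for line in raw_text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are neither blank nor listed as stop words."""
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]
```

For example, with the stop words 的 and 了, the tokens `["郭靖", "的", "黃蓉"]` reduce to `["郭靖", "黃蓉"]`.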

import jieba
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from os import path
from PIL import Image

# Load the text to be segmented
with open(r'zhongwen.txt', 'r', encoding='UTF-8') as f:
    txt = f.read()

# Load the stop-word file (one word per line)
with open(r'stops_chinese.txt', 'r', encoding='UTF-8') as f:
    stops = set(f.read().split('\n'))

txt = txt.replace("\n", '')
txt = txt.replace(" ", '')
jieba.load_userdict(r'worddict.txt')  # add the custom dictionary
cuttxt = jieba.lcut(txt)  # segment the text

# Count every token in one pass, then drop the stop words
cutdict = {w: c for w, c in Counter(cuttxt).items() if w not in stops}
cutsort = list(cutdict.items())
cutsort.sort(key=lambda x: x[1], reverse=True)

# Print the top 20
for item in cutsort[:20]:
    print(item)

# Write the sorted counts to a file
pd.DataFrame(data=cutsort).to_csv(r'sort.csv', encoding='utf-8')

from wordcloud import WordCloud, ImageColorGenerator
back_coloring_path = "photo.png"  # path of the background (mask) image
d = path.dirname(__file__)  # directory of the current script
font_path = r'C:\Windows\Fonts\SIMLI.TTF'  # path of a font that contains Chinese glyphs
back_coloring = np.array(Image.open(path.join(d, back_coloring_path)))  # load the background image as an array

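# Configure the word cloud
wc = WordCloud(font_path=font_path,  # font
               background_color="white",  # background color
               max_words=100,  # maximum number of words shown
               mask=back_coloring,  # background (mask) image
               max_font_size=400,  # largest font size
               random_state=42,
               width=100, height=86, margin=2,  # default image size; when a mask is given, the saved image takes the mask's size instead. margin is the spacing around words
               )
wc.generate_from_frequencies(cutdict)
# Color generator built from the background image; pass it to wc.recolor(color_func=image_colors) to apply it
image_colors = ImageColorGenerator(back_coloring)

# Display the image
plt.imshow(wc)
plt.axis("off")
plt.show()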

8. Output the 20 most frequent words and save the results to a file.
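
One possible way to save the results without pandas is the standard-library `csv` module, sketched below. The helper name `save_top_words`, the header row, and the choice of the `utf-8-sig` encoding (so that spreadsheet programs detect the encoding of the Chinese text) are assumptions for illustration.

```python
import csv


def save_top_words(pairs, file_path, n=20):
    """Write the first n (word, count) pairs to a CSV file with a header row."""
    rows = list(pairs)[:n]
    with open(file_path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(rows)
    return rows
```

The `pairs` argument is expected to be already sorted by descending count, as the sorted list in the script above is.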

Top 20:

9. Generate the word cloud.

The generated word cloud is shown below: