042 Example 10: Text Word Frequency Statistics

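1. "Text Word Frequency Statistics": Problem Analysis

1.1 Problem Analysis

2. "Hamlet English Word Frequency Statistics": Example Walkthrough
3. "Romance of the Three Kingdoms Character Appearance Statistics": Example Walkthrough (Part 1)
4. "Romance of the Three Kingdoms Character Appearance Statistics": Example Walkthrough (Part 2)

4.1 Romance of the Three Kingdoms Character Appearance Statistics

5. "Text Word Frequency Statistics": Generalizing the Idea

5.1 Extending the Application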

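1. "Text Word Frequency Statistics": Problem Analysis

1.1 Problem Analysis

Text word frequency statistics

Requirement: given an article, which words appear in it, and which words appear most often?
How should we go about it?

English text --> Chinese text

English text: Hamlet, analyze word frequency

Chinese text: Romance of the Three Kingdoms (《三國演義》), analyze characters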

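2. "Hamlet English Word Frequency Statistics": Example Walkthrough

Denoise and normalize the text
Use a dictionary to represent word frequencies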

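# CalHamletV1.py


def getText():
    # Read the whole file; the with-block closes it when done.
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # normalize case so "The" and "the" count as one word
    # Replace punctuation with spaces so it does not stick to words.
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt


hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # start from 0 the first time a word is seen
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))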

the         948
and         855
to          650
of          581
you         494
a           468
my          447
i           443
in          373
hamlet      361

The results are sorted from highest to lowest count
Observe how many times each word appears
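
The same denoise, normalize, count, and sort steps can also be sketched with the standard library's `collections.Counter`. This is only an illustrative variant, not the course's code: the function name `count_words`, the `top_n` parameter, and the sample sentence are assumptions made for the example, and `string.punctuation` also strips the apostrophe, which the character list in CalHamletV1.py leaves alone.

```python
import string
from collections import Counter


def count_words(text, top_n=3):
    """Return the top_n (word, count) pairs, ignoring case and punctuation."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # most_common() sorts by count, highest first.
    return Counter(words).most_common(top_n)


# A short sample stands in for hamlet.txt so the sketch runs on its own.
sample = "To be, or not to be: that is the question."
top_words = count_words(sample)
print(top_words)
```

`Counter` replaces the manual `counts.get(word, 0) + 1` update, and `most_common` replaces the explicit `list(...)` plus `sort(...)` pair; the dictionary version above shows more clearly what happens underneath.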

3. "Romance of the Three Kingdoms Character Appearance Statistics": Example Walkthrough (Part 1)

Chinese text segmentation

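# CalThreeKingdomsV1.py

import jieba  # third-party Chinese word segmentation library

# Read the whole novel; the with-block closes the file when done.
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens, including punctuation
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))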

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.030 seconds.
Prefix dict has been built succesfully.


曹操          953
孔明          836
将軍          772
卻說          656
玄德          585
關公          510
丞相          491
二人          469
不可          440
荊州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
張飛          358

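4. "Romance of the Three Kingdoms Character Appearance Statistics": Example Walkthrough (Part 2)

4.1 Romance of the Three Kingdoms Character Appearance Statistics

Tie word frequencies to characters, so the program answers the actual question

Word frequency statistics --> character statistics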

# CalThreeKingdomsV2.py
import jieba  # third-party Chinese word segmentation library

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words from the V1 output that are not character names.
excludes = {"将軍", "卻說", "荊州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge the different names used for the same character.
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # unlike del, does not fail if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))

曹操         1451
孔明         1383
劉備         1252
關羽          784
張飛          358
商議          344
如何          338
主公          331
軍士          317
呂布          300

Extend the program to solve the problem
Refine it further based on the results
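
One way to make that refinement loop easier is to keep the alias rules in a lookup table rather than in a growing `elif` chain, so each new round of results only adds entries to the data. The sketch below is an illustrative rearrangement, not the course's code: the names `ALIASES`, `EXCLUDES`, and `count_characters` are hypothetical, and a short hand-written token list stands in for the output of `jieba.lcut` so the example runs without the novel's text.

```python
# Hypothetical lookup table: alias -> canonical character name
# (the same merges as the elif chain in CalThreeKingdomsV2.py).
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}
# Frequent words that are not character names.
EXCLUDES = {"将軍", "卻說", "荊州", "二人", "不可", "不能", "如此"}


def count_characters(words):
    """Count appearances per character, merging aliases and skipping noise."""
    counts = {}
    for word in words:
        if len(word) == 1 or word in EXCLUDES:
            continue  # drop single characters and excluded words up front
        name = ALIASES.get(word, word)  # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hand-written tokens standing in for jieba.lcut(txt).
tokens = ["玄德", "曰", "孔明", "諸葛亮", "卻說", "玄德曰", "丞相", "曹操"]
result = count_characters(tokens)
print(result)
```

Skipping excluded words inside the loop gives the same counts as deleting them afterwards, and extending the analysis means editing only `ALIASES` and `EXCLUDES`.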

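Proudly presenting the top 20 characters of Romance of the Three Kingdoms, ranked by appearances: Cao Cao (曹操), Kongming (孔明), Liu Bei (劉備), Guan Yu (關羽), Zhang Fei (張飛), Lü Bu (呂布), Zhao Yun (趙雲), Sun Quan (孫權), Sima Yi (司馬懿), Zhou Yu (周瑜), Yuan Shao (袁紹), Ma Chao (馬超), Wei Yan (魏延), Huang Zhong (黃忠), Jiang Wei (姜維), Ma Dai (馬岱), Pang De (龐德), Meng Huo (孟獲), Liu Biao (劉表), Xiahou Dun (夏侯惇)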

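5. "Text Word Frequency Statistics": Generalizing the Idea

5.1 Extending the Application

Dream of the Red Chamber (《紅樓夢》), Journey to the West (《西遊記》), Water Margin (《水浒傳》), ...
Government work reports, research papers, news reports, ...
And beyond that? Word clouds are still to come...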

