Learning Python: jieba and wordcloud

2023-03-20 14:21:28

In the previous post we scraped a large amount of data about Taobao MM. Here we use that data to generate a word-cloud image.

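We will use jieba for Chinese word segmentation, WordCloud to build the word cloud, and the plotting library matplotlib to display it. Straight to the code: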

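# coding: utf-8
from os import path

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

stopwords = set()


def importStopword(filename=''):
    """Read one stop word per line from filename into the global set."""
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            word = line.strip()
            if word:  # skip blank lines instead of stopping at the first one
                stopwords.add(word)


def processChinese(textContent):
    # Optional: jieba.enable_parallel(4) speeds up segmentation,
    # but jieba only supports parallel mode on POSIX systems (not Windows).
    seg_generator = jieba.cut(textContent)  # segment with jieba (optional)
    # Drop stop words and whitespace-only tokens
    seg_list = [i for i in seg_generator if i.strip() and i not in stopwords]
    return ' '.join(seg_list)


importStopword(filename='./stopwords1.txt')

# Directory of the current file.
# __file__ is the current file; this line fails when run inside an IDE console,
# in which case it can be changed to:
# d = path.dirname('.')
d = path.dirname(__file__)

with open(path.join(d, 'data.txt'), encoding='utf-8') as f:
    text = f.read()

# Chinese text has no spaces between words, so segment it with jieba first
text = processChinese(text)

# Load the mask image as a NumPy array
# (scipy.misc.imread has been removed from SciPy, so Pillow is used instead)
back_coloring = np.array(Image.open(path.join(d, "./image/nv.png")))

wc = WordCloud(font_path='./font/葉立群幾何體.ttf',  # font (must contain Chinese glyphs)
               background_color="white",  # background color
               max_words=2000,  # maximum number of words shown in the cloud
               mask=back_coloring,  # mask image
               # max_font_size=100,  # maximum font size
               random_state=42,
               )
# Generate the word cloud. generate() takes the full (already segmented) text;
# alternatively, compute word frequencies yourself and call generate_from_frequencies().
wc.generate(text)
# wc.generate_from_frequencies(txt_freq)
# In recent wordcloud releases txt_freq is a dict, e.g. {'詞a': 100, '詞b': 90, '詞c': 80}
# Build a color function from the mask image
image_colors = ImageColorGenerator(back_coloring)
# To color the words from the mask image, show wc.recolor(color_func=image_colors) instead
# Draw the word cloud
plt.figure()
# The following lines display the image
plt.imshow(wc)
plt.axis("off")
plt.show()

# Save the image
wc.to_file(path.join(d, "名稱.png"))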
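The comment about generate_from_frequencies in the code above can be made concrete. What follows is only a minimal sketch, assuming the text has already been segmented into a list of tokens (for example by jieba.cut). The helper name count_frequencies and the sample tokens are hypothetical and not part of the original script. Recent wordcloud releases expect a dict that maps each word to its frequency.

```python
from collections import Counter


def count_frequencies(tokens, stopwords=frozenset()):
    """Count how often each token occurs, skipping stop words and blank tokens."""
    return dict(Counter(t for t in tokens if t.strip() and t not in stopwords))


# Hypothetical sample tokens standing in for the output of jieba.cut(text).
tokens = ['杭州', '模特', ' ', 'name', '杭州', '學生', '模特', '杭州']
txt_freq = count_frequencies(tokens, stopwords={'name'})
print(txt_freq)

# With a WordCloud instance `wc` configured as in the script, the dict could be used as:
# wc.generate_from_frequencies(txt_freq)
```

Counting the frequencies yourself makes it easy to inspect or adjust them before drawing, which generate(text) does not allow.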

stopwords1.txt contains the unwanted words to filter out, one per line:

name
order
age
place
birthday
style
size
place
fans
flow
active
profession
blood_type
school
&
nbsp
公曆
blood
type
name
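The entries & and nbsp in this list suggest that the scraped data still contains HTML entities such as &nbsp;. That is an assumption, since the data file itself is not shown. Under that assumption, an alternative to listing the fragments as stop words is to decode the entities before segmentation. A minimal sketch using only the standard library (clean_scraped_text and the sample string are hypothetical):

```python
import html


def clean_scraped_text(raw):
    """Decode HTML entities and collapse whitespace runs into single spaces."""
    decoded = html.unescape(raw)      # '&nbsp;' becomes U+00A0, '&amp;' becomes '&'
    return ' '.join(decoded.split())  # str.split() also splits on U+00A0


# Hypothetical sample line imitating the scraped data.
sample = 'place:&nbsp;杭州市&nbsp;&nbsp;style:&nbsp;甜美'
cleaned = clean_scraped_text(sample)
print(cleaned)
```

Field labels such as place and style would still need to stay in the stop-word list.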

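The background (mask) image used is shown below. Use an image with sharply defined edges; otherwise the generated word cloud tends to come out rectangular.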

[Image: background mask picture]
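The advice about sharp edges follows from how wordcloud documents its mask parameter: only pure white pixels are treated as masked out, and every other pixel can be drawn on. An off-white or soft-edged background therefore leaves the whole rectangle available. One way to work around this, sketched here with NumPy, is to binarize the image before passing it as the mask. The function name binarize_mask and the threshold of 240 are hypothetical choices, not part of the original script.

```python
import numpy as np


def binarize_mask(image, threshold=240):
    """Return a 2-D uint8 mask: 255 where the image is near-white, 0 elsewhere."""
    arr = np.asarray(image)
    if arr.ndim == 3:
        arr = arr[..., :3].mean(axis=2)  # average the RGB channels, ignore alpha
    return np.where(arr >= threshold, 255, 0).astype(np.uint8)


# Hypothetical 4x4 grayscale image: off-white background (250), dark 2x2 shape (30).
img = np.full((4, 4), 250, dtype=np.uint8)
img[1:3, 1:3] = 30
mask = binarize_mask(img)
print(mask)
```

The resulting array could be passed as mask= to WordCloud. Note that ImageColorGenerator still needs the original color image, not the binarized mask.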

The program generates the following image:

[Image: generated word cloud]

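Compared against the collected data, the image does tell us something. The data includes each MM's location, blood type, profession, style, and similar fields, and the words drawn largest in the image are the values that occur most often in the data.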
