
Python: QQ Chat Log Analysis with jieba + wordcloud

A Simple Analysis of QQ Chat Logs

0. Description

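  I exported my QQ message history with a friend, covering 2016-08-25 to 2017-11-18. At 85,874 lines it is a fair amount of text, so I decided to do a rough analysis and visualization. The steps are roughly as follows:

  • Preprocess the message log file
  • Segment the text with jieba
  • Generate word clouds with wordcloud
  • Generate a simple chart

  The results look roughly like this: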

[Images: three result figures]

1. Preprocessing

  The exported file has roughly the following format (extra blank lines removed):

2016-08-26 11:02:56 PM 少平
這……
2016-08-26 11:03:02 PM 少平
這bug都被你發現了
2016-08-26 11:03:04 PM C
反駁呀
2016-08-26 11:03:25 PM C
too young
2016-08-26 11:04:43 PM C
我去刷鞋子
2016-08-26 11:04:58 PM 少平
嗯嗯
好的

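Observations & Notes:

  • Every message is preceded by a line with its send time and sender
  • A single message may contain line breaks

  Therefore:

  • Messages can be separated by sender into the two sides of the conversation.
  • Each side's content goes into its own file, which makes the later segmentation and word-cloud steps easier.
  • All message times can be extracted, so the chat hours can be analyzed and charted.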
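
The observations above can be sketched in a few lines. This is only an illustration, not the script used in this post: the name `group_messages` and the `HEADER` pattern are assumptions introduced here, and the sketch assumes the header line is exactly "timestamp, space, sender" as in the sample.

```python
import re

# Assumed header shape: the timestamp from the sample, a space, then the sender.
HEADER = re.compile(r'^(\d{4}-\d\d-\d\d \d{1,2}:\d\d:\d\d [AP]M) (.+)$')


def group_messages(lines):
    """Yield (time, sender, text) tuples; text may span several lines."""
    current = None
    for raw in lines:
        line = raw.rstrip('\n')
        match = HEADER.match(line)
        if match:
            # A new header closes the previous message, if there was one.
            if current is not None:
                yield current[0], current[1], '\n'.join(current[2])
            current = (match.group(1), match.group(2), [])
        elif current is not None and line.strip():
            # Non-blank lines up to the next header belong to this message.
            current[2].append(line)
    if current is not None:
        yield current[0], current[1], '\n'.join(current[2])


sample = [
    '2016-08-26 11:04:43 PM C\n',
    '我去刷鞋子\n',
    '2016-08-26 11:04:58 PM 少平\n',
    '嗯嗯\n',
    '好的\n',
]
messages = list(group_messages(sample))
```

Each tuple keeps the sender next to the text, so routing a message to one of the two output files becomes a simple comparison on the sender field.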

Arguments:

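   infile ⇒ the original exported message log file

   outfile1 ⇒ file name for one party's messages

   outfile2 ⇒ file name for the other party's messages

Outputs:

  Two preprocessed message files, saved separately (each containing only one party's messages), plus one file of message times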

# -*- coding: utf-8 -*-
"""Split the exported chat log into per-sender content files and a time file."""

import re
import codecs

IN_FILE = './data.txt'
OUT_CONTENT_FILE_1 = './her_words.txt'
OUT_CONTENT_FILE_2 = './my_words.txt'
OUT_TIME_FILE = './time.txt'
UTF8 = 'utf-8'
MY_NAME_PATTERN = u'少平'
TIME_PATTERN = r'\d{4}-\d\d-\d\d \d{1,2}:\d\d:\d\d [AP]M'
TEST_TYPE_LINE = u'2017-10-14 1:13:49 AM 少平'  # example of a header line


def split(infile, outfile1, outfile2):
    """Write the friend's lines to outfile1, mine to outfile2, times to OUT_TIME_FILE."""
    # Mode 'a' appends, so delete old output files before running again.
    out_content_file_1 = codecs.open(outfile1, 'a', encoding=UTF8)
    out_content_file_2 = codecs.open(outfile2, 'a', encoding=UTF8)
    out_time_file = codecs.open(OUT_TIME_FILE, 'a', encoding=UTF8)

    try:
        with codecs.open(infile, 'r', encoding=UTF8) as inf:
            line = inf.readline()
            while line:
                match = re.search(TIME_PATTERN, line)
                if match is None:
                    # Not a header line (e.g. a leading blank line): move on.
                    line = inf.readline()
                    continue
                out_time_file.write(u'{}\n'.format(match.group()))

                # The header names the sender: 1 stands for my words, 0 for hers.
                flag = 1 if re.search(MY_NAME_PATTERN, line) else 0
                content_line = inf.readline()
                # Everything up to the next header belongs to this message.
                while content_line and re.search(TIME_PATTERN, content_line) is None:
                    if flag == 0:
                        out_content_file_1.write(content_line)
                    else:
                        out_content_file_2.write(content_line)
                    content_line = inf.readline()
                line = content_line
    except (IOError, OSError):
        print('error occurred here.')
    finally:
        out_time_file.close()
        out_content_file_1.close()
        out_content_file_2.close()


if __name__ == '__main__':
    split(IN_FILE, OUT_CONTENT_FILE_1, OUT_CONTENT_FILE_2)
           

2. Get word segmentations using jieba

  Segment each party's messages into words with jieba.

import codecs
import jieba

UTF8 = 'utf-8'
IN_FILE_NAME = ('./her_words.txt', './my_words.txt')
OUT_FILE_NAME = ('./her_words_out.txt', './my_words_out.txt')

def split(in_files, out_files):
    """Cut the lines into segmentations and save to files, one word per line"""
    for in_file, out_file in zip(in_files, out_files):
        # Mode 'a' appends, so delete old output files before running again.
        with codecs.open(in_file, 'r', encoding=UTF8) as inf, \
                codecs.open(out_file, 'a', encoding=UTF8) as outf:
            for line in inf:
                # cut_all=True selects full mode, which emits every word jieba
                # can find; HMM only affects the default accurate mode.
                seg_list = jieba.cut(line.strip(), cut_all=True, HMM=True)
                for word in seg_list:
                    if word.strip():  # skip empty segments
                        outf.write(word + u'\n')

if __name__ == '__main__':
    split(IN_FILE_NAME, OUT_FILE_NAME)
           

3. Make wordclouds using wordcloud

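  Extract the top 120 words from the segmentation results and generate word clouds with wordcloud. The ranking comes from jieba.analyse.extract_tags, which scores words by TF-IDF weight rather than by raw count. In addition, some words are blocked (STOP_WORDS) and some are replaced (ALTER_WORDS).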

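# Words to leave out of the cloud (QQ placeholders and filler such as "hahaha").
STOP_WORDS = [u'圖檔', u'表情', u'視窗', u'抖動', u'我要', u'小姐', u'哈哈哈', u'哈哈哈哈', u'啊啊啊', u'嘿嘿嘿']
# Placeholder entries: each key is a word to replace, each value its replacement.
ALTER_WORDS = {u'被替換詞1': u'替換詞1', u'被替換詞2': u'替換詞2'}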
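
To make the block-and-replace step concrete, here is a minimal sketch that uses plain word counts from `collections.Counter` instead of the TF-IDF weights used in this post. The function name `count_words` and the small word lists are made up for illustration; they are not the real STOP_WORDS and ALTER_WORDS.

```python
from collections import Counter

# Illustrative stand-ins for STOP_WORDS and ALTER_WORDS (not the real lists).
stop_words = {u'哈哈哈', u'表情'}
alter_words = {u'bug': u'BUG'}


def count_words(words, stop_words, alter_words, top_k):
    """Count words, dropping stop words and applying replacements."""
    counts = Counter()
    for word in words:
        word = word.strip()
        if not word or word in stop_words:
            continue
        # Replacement happens after the stop-word check, as in the post.
        counts[alter_words.get(word, word)] += 1
    # Keep only the top_k most frequent words.
    return dict(counts.most_common(top_k))


words = [u'bug', u'哈哈哈', u'鞋子', u'bug', u'表情', u'鞋子', u'bug', u'好的']
frequencies = count_words(words, stop_words, alter_words, top_k=2)
```

A word-to-number dictionary like `frequencies` has the same shape as the `text` dictionary that the script below passes to `generate_from_frequencies`.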
           

Arguments:

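   in_files ⇒ the result files produced by segmentation

   out_files ⇒ the paths where the word clouds are saved

   shape_files ⇒ the image files that define the word-cloud shapes

Output:

  One word cloud for each party's messages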

import codecs

import jieba.analyse
import numpy as np
from PIL import Image
from wordcloud import WordCloud

UTF8 = 'utf-8'
OUT_FILE_NAME = ('./her_words_out.txt', './my_words_out.txt')
OUT_IMG_NAME = ('./her_wordcloud.png', './my_wordcloud.png')
SHAPE_IMG_NAME = ('./YRY.png', './FBL.png')
# STOP_WORDS and ALTER_WORDS are the definitions given above.

def make_wordcloud(in_files, out_files, shape_files):
    """Make one word cloud per (word file, output image, shape image) triple."""
    for in_file, out_file, shape_file in zip(in_files, out_files, shape_files):
        shape = np.array(Image.open(shape_file))
        with codecs.open(in_file, 'r', encoding=UTF8) as inf:
            content = inf.read()
        # Top 120 keywords together with their TF-IDF weights.
        tags = jieba.analyse.extract_tags(content, topK=120, withWeight=True)
        text = {}
        for word, freq in tags:
            if word not in STOP_WORDS:
                if word in ALTER_WORDS:
                    word = ALTER_WORDS[word]
                text[word] = freq
        # When a mask is given, WordCloud takes the canvas size from the mask
        # image, so width and height are not passed here.
        wordcloud = WordCloud(background_color='white', font_path='./font.ttf',
                              mask=shape).generate_from_frequencies(text)
        wordcloud.to_file(out_file)

if __name__ == '__main__':
    make_wordcloud(OUT_FILE_NAME, OUT_IMG_NAME, SHAPE_IMG_NAME)
           

The word-cloud shapes used (the mask parameter of WordCloud()):

[Images: the two shape masks]

The generated word clouds:

[Images: the two generated word clouds]

4. Generate a simple bar plot about time

  Build a simple bar chart from the time file produced during preprocessing.

# -*- coding: utf-8 -*-
""" make a simple bar plot """


import codecs
import matplotlib.pyplot as plt

FILE = 'time.txt'

def make_bar_plot(file_name):
    """Plot the number of messages sent in each hour of the day."""
    time_list = {}  # hour (0-23) -> number of messages
    message_cnt = 0
    with codecs.open(file_name, 'r', encoding='utf-8') as infile:
        for line in infile:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            # Each line holds: date, 12-hour time, AM/PM.
            time_in_12, apm = fields[1:3]
            time_in_24 = time_format(time_in_12, apm)
            time_list[time_in_24] = time_list.get(time_in_24, 0) + 1
            message_cnt = message_cnt + 1
    hours = sorted(time_list)
    counts = [time_list[hour] for hour in hours]
    # Figure size and bar width are left at matplotlib's defaults.
    plt.figure()
    plt.bar(hours, counts, facecolor='lightskyblue', edgecolor='white')
    plt.xticks(hours)

    for x_axis, y_axis in zip(hours, counts):
        # Share of all messages sent in this hour (one decimal is an assumed precision).
        label = '{}%'.format(round(y_axis * 100.0 / message_cnt, 1))
        plt.text(x_axis, y_axis, label, ha='center', va='bottom')
    plt.title('#message in each hour')
    plt.savefig('time.png')

def time_format(time_in_12, apm):
    """Convert a 12-hour time string plus 'AM'/'PM' into an hour in 0-23."""
    hour = int(time_in_12.split(':')[0])
    # 12 AM is hour 0 and 12 PM is hour 12, so reduce modulo 12 first.
    hour = hour % 12
    if apm == 'PM':
        hour = hour + 12
    return hour

if __name__ == '__main__':
    make_bar_plot(FILE)
           

The generated bar chart:

[Image: the generated bar chart]
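
The 12-hour to 24-hour conversion is easy to get wrong around 12 AM and 12 PM, so it can be worth cross-checking with the standard library. The sketch below is an alternative, not the code used for the chart above: it parses whole timestamps with `datetime.strptime` and counts messages per hour. The names `hour_of` and `messages_per_hour` are introduced here for illustration, and the `%p` directive assumes a locale that writes AM/PM in English.

```python
from collections import Counter
from datetime import datetime

# %I is the 12-hour clock hour, %p the AM/PM marker (locale dependent).
TIME_FORMAT = '%Y-%m-%d %I:%M:%S %p'


def hour_of(timestamp):
    """Return the hour (0-23) of a timestamp such as '2016-08-26 11:02:56 PM'."""
    return datetime.strptime(timestamp.strip(), TIME_FORMAT).hour


def messages_per_hour(timestamps):
    """Count timestamps per hour of day, ignoring blank lines."""
    return Counter(hour_of(stamp) for stamp in timestamps if stamp.strip())


stamps = [
    '2016-08-26 11:02:56 PM',
    '2017-10-14 1:13:49 AM',
    '2017-10-14 12:00:01 AM',
    '2017-10-14 12:30:00 PM',
    '2016-08-26 11:03:02 PM',
]
per_hour = messages_per_hour(stamps)
```

With these samples, the midnight message lands in hour 0 and the half-past-noon message in hour 12, which is the behaviour the chart needs.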