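Python Web Scraping in Practice: Analyzing Reviews of the Latest Movies on Douban

I have not been using Python for long, so I build small projects for practice. A few days ago I watched Wolf Warrior 2 (《戰狼2》) and noticed that it ranked first among the movies currently showing on Douban, so I decided to analyze its Douban reviews.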

Overview

The project has three steps:

Fetch the web page data
Clean the data
Display the result as a word cloud

The Python version used is 3.5.

1. Fetching the Web Page Data

Step one is to request the page. In Python this is done with the urllib library:

from urllib import request
resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')

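Here https://movie.douban.com/nowp… is Douban's page of movies currently showing; you can open it in a browser to take a look.

html_data is a string variable that holds the HTML source of the page.

Run print(html_data) to inspect it.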
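Some sites respond differently to clients that do not send a browser-like User-Agent header. If the plain urlopen() call above is rejected, a sketch along the following lines may help. The names build_request and fetch_html and the header value are hypothetical, chosen only for illustration; they are not part of the original code.

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0 (compatible; demo-crawler)"):
    """Return a Request object that carries an explicit User-Agent header.

    The default user_agent string is an assumed placeholder.
    """
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8 (performs a real network call)."""
    with request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://movie.douban.com/nowplaying/hangzhou/")
print(req.full_url)
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```

Calling fetch_html(url) would then play the role of the two lines that read and decode the response.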


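Step two is to parse the HTML and extract the data we need.

In Python, the BeautifulSoup library is used to parse HTML.

(Note: if the library is not installed, install it with pip install beautifulsoup4.)

BeautifulSoup is used as follows: the first argument is the HTML to extract data from, the second argument selects the parser, and find_all() then reads the content of the matching HTML tags.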
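To make the calling pattern concrete, here is a small sketch that parses an inline HTML snippet instead of a live page. The markup, ids, and titles in sample_html are invented for illustration and only loosely imitate the kind of structure discussed in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page is larger and may differ.
sample_html = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="1001"><img alt="Movie A" src="a.jpg"></li>
    <li class="list-item" data-subject="1002"><img alt="Movie B" src="b.jpg"></li>
  </ul>
</div>
"""

# First argument: the HTML text. Second argument: the parser to use.
soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a list of every tag that matches the filters.
items = soup.find_all("li", class_="list-item")

# Attributes of a tag are read with dictionary-style indexing.
movie_ids = [li["data-subject"] for li in items]
movie_names = [li.find("img")["alt"] for li in items]

print(movie_ids)
print(movie_names)
```

The keyword is spelled class_ because class is a reserved word in Python.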

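But with so many tags in the HTML, which ones should we read? The simplest approach is to open the HTML source of the page being scraped, find which tags contain the data we need, and read those.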


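The page source shows that the data we want starts at the tag <div id="nowplaying">, which contains each movie's title, rating, cast, and so on. The corresponding code is: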

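from bs4 import BeautifulSoup as bs
soup = bs(html_data, 'html.parser')
# find_all() returns a list; its first element is the <div id="nowplaying"> container
nowplaying_movie = soup.find_all('div', id='nowplaying')
# each <li class="list-item"> inside the container describes one movie
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')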

nowplaying_movie_list is a list; print(nowplaying_movie_list[0]) shows its first element.


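In that element, the data-subject attribute holds the movie's id and the alt attribute of the img tag holds the movie's title, so we use these two attributes to get the id and the name. (Note: the id is needed later to open the movie's short-review page, which is why we extract it.) The code is: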

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']    # movie id
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']    # movie title
        nowplaying_list.append(nowplaying_dict)

The list nowplaying_list now holds the id and name of each of the latest movies; use print(nowplaying_list) to check it.


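The result matches what the Douban page shows, so we now have the information for the latest movies. Next we analyze the short reviews of these movies. For example, the short-review URL for Wolf Warrior 2 is https://movie.douban.com/subject/26363254/comments?start=0&limit=20

Here 26363254 is the movie id, and start=0 means the listing starts from the first comment (offset 0).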
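The URL pattern just described can be wrapped in a small helper, which makes the relation between a page number and the start offset explicit. This is a sketch only: comments_url and its page_size parameter are hypothetical names, and the pattern is taken from the example URL above rather than from any official API description.

```python
from urllib.parse import urlencode


def comments_url(movie_id, page, page_size=20):
    """Build the short-review URL for a 1-based page number.

    Page 1 maps to start=0, page 2 to start=page_size, and so on.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    query = urlencode({"start": (page - 1) * page_size, "limit": page_size})
    return "https://movie.douban.com/subject/{}/comments?{}".format(movie_id, query)


print(comments_url("26363254", 1))
print(comments_url("26363254", 3))
```

Using urlencode avoids hand-assembling the ? and & separators.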

Next we parse that page. In the HTML source of the short-review page, the comment data sits inside div tags whose class attribute is comment.


So we parse those tags as follows:

# nowplaying_list[0] is the first movie in the list
requrl = 'https://movie.douban.com/subject/' + nowplaying_list[0]['id'] + '/comments' + '?' + 'start=0' + '&limit=20'
resp = request.urlopen(requrl)
html_data = resp.read().decode('utf-8')
soup = bs(html_data, 'html.parser')
# class is a reserved word in Python, so BeautifulSoup uses the keyword class_
comment_div_lits = soup.find_all('div', class_='comment')

The list comment_div_lits now holds the HTML of every div tag whose class is comment. Inside each of these, the text of the user's review is stored in a p tag.


So we parse the HTML in comment_div_lits one step further:

eachCommentList = []
for item in comment_div_lits:
    # .string is None when the <p> tag has more than one child (for example, text plus a nested tag)
    if item.find_all('p')[0].string is not None:
        eachCommentList.append(item.find_all('p')[0].string)

Use print(eachCommentList) to look at the contents of eachCommentList; it holds the review texts we were after.
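One detail of the loop above is worth illustrating: a tag's .string attribute is None whenever the tag has more than one child, so a review containing nested markup would be skipped. The sketch below shows an alternative based on get_text(), run against invented sample markup. The function name extract_comments and the sample HTML are assumptions for illustration, not the author's code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the text.
sample_html = """
<div class="comment"><p>Great action scenes</p></div>
<div class="comment"><p>Too long, <b>but</b> fun</p></div>
<div class="comment"><span>no paragraph here</span></div>
"""


def extract_comments(html):
    """Return the text of the first <p> inside every <div class="comment">."""
    soup = BeautifulSoup(html, "html.parser")
    comments = []
    for div in soup.find_all("div", class_="comment"):
        p = div.find("p")          # None if the div has no <p> at all
        if p is None:
            continue
        text = p.get_text().strip()  # joins all nested text, unlike .string
        if text:
            comments.append(text)
    return comments


comments_found = extract_comments(sample_html)
print(comments_found)
```

With .string, the second sample comment would have been dropped because its p tag contains both text and a b tag.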


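At this point we have scraped the review data for a movie currently showing on Douban. Next come data cleaning and the word cloud.

2. Cleaning the Data

To make cleaning easier, we first join the items in the list into a single string: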

comments = ''
for k in range(len(eachCommentList)):
    comments = comments + (str(eachCommentList[k])).strip()

Use print(comments) to check the result.


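All the comments are now one string, but it still contains a lot of punctuation. Punctuation is useless for word-frequency statistics, so we remove it with a regular expression, which Python provides through the re module. The pattern below keeps only runs of Chinese characters (U+4E00 to U+9FA5), so digits and Latin letters are dropped as well. The code is: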

import re

pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)

Use print(cleaned_comments) to check the result.


The comment data no longer contains punctuation and is much "cleaner".
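The effect of the character-class filter is easy to check on a short string. The sketch below wraps the same pattern in a hypothetical helper named keep_chinese; the sample sentence is made up.

```python
import re

# Same character range as in the article: common CJK ideographs.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fa5]+")


def keep_chinese(text):
    """Join every run of Chinese characters and drop everything else."""
    return "".join(CHINESE_RUN.findall(text))


sample = "好看！燃，值得一看 10/10"
print(keep_chinese(sample))  # prints: 好看燃值得一看
```

Full-width punctuation, spaces, digits, and the slash all fall outside the range and disappear.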

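Because we want word-frequency statistics, the text must first be segmented into Chinese words. I use jieba for this. If jieba is not installed, run pip install jieba in a console. (Note: pip list shows whether these libraries are installed.) The code is: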

import jieba    # Chinese word segmentation package
import pandas as pd

# lcut() returns the segmented words as a list
segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

pandas is loaded here to hold the segmentation result in a DataFrame for the later steps. Use words_df.head() to view the result of the segmentation.


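The output contains function words (stopwords) such as “看”, “太”, and “的”. These words are frequent in almost any text and carry little meaning, so we remove them.

I put the stopwords in a file named stopwords.txt and filter our data against it (note: searching for stopwords.txt on Baidu turns up a downloadable copy). The stopword-removal code is: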

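# quoting=3 (csv.QUOTE_NONE) tells pandas not to treat quote characters specially
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
# keep only the rows whose word is not in the stopword list
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]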

Running words_df.head() again shows that the stopwords have been removed.
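The same filtering idea can be tried without pandas, which may be handy for checking a stopword file by itself. The sketch below is a plain-Python alternative, not the author's method: it assumes a file with one stopword per line, and the helper names load_stopwords and remove_stopwords are hypothetical. It writes a tiny temporary file so that it runs anywhere.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line into a set, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep the tokens that are not stopwords, preserving their order."""
    return [t for t in tokens if t not in stopwords]


# Demo with a three-word stopword file created on the fly.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("的\n看\n太\n")
    demo_stopwords = load_stopwords(demo_path)

tokens = ["太", "好", "看", "的", "电影"]
filtered = remove_stopwords(tokens, demo_stopwords)
print(filtered)
```

A set gives constant-time membership tests, which matters once the stopword list grows to thousands of entries.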


Next comes the word-frequency count:

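# groupby(...).size() counts how often each word occurs;
# the original dict-style agg({"計數": numpy.size}) is rejected by pandas 1.0 and later
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)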

Use words_stat.head() to view the result.
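For a quick sanity check of the counts, the standard library's collections.Counter does the same tally without a DataFrame. This is offered as a lightweight alternative sketch, not as the method used in this article; the token list is made up.

```python
from collections import Counter

# Hypothetical tokens, standing in for the segmented and filtered words.
tokens = ["燃", "好看", "燃", "吴京", "好看", "燃"]

# Counter maps each distinct token to the number of times it occurs.
word_counts = Counter(tokens)

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
top_words = word_counts.most_common(2)
print(top_words)
```

The resulting mapping can be turned into an ordinary dictionary with dict(word_counts) if a later step needs one.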


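Because we only scraped the first page of comments above, the data set is small. The complete code at the end scrapes 10 pages of comments, so its results are more meaningful.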
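When several pages are fetched in a row, one failed request should not abort the whole run. The sketch below shows one way to structure such a loop. It is an illustration under assumptions: collect_comments, its parameters, and fake_fetch are hypothetical, and the page fetcher is passed in as a function so that the example runs without network access.

```python
import time


def collect_comments(fetch_page, pages=10, delay=0.0):
    """Call fetch_page(page) for pages 1..pages and merge the results.

    fetch_page is any function that returns a list of comment strings.
    Pages that raise OSError (urllib's URLError is a subclass) are skipped.
    """
    all_comments = []
    for page in range(1, pages + 1):
        try:
            batch = fetch_page(page)
        except OSError as exc:
            print("page {} failed: {}".format(page, exc))
            continue
        all_comments.extend(batch)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return all_comments


def fake_fetch(page):
    """Stand-in fetcher: page 2 fails, every other page yields two comments."""
    if page == 2:
        raise OSError("simulated network error")
    return ["comment {}-{}".format(page, i) for i in range(2)]


result = collect_comments(fake_fetch, pages=3)
print(result)
```

In a real run, fetch_page would wrap the request and parsing steps shown earlier, and a non-zero delay keeps the request rate modest.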

3. Displaying the Word Cloud

The code is as follows:

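import matplotlib.pyplot as plt
%matplotlib inline
# (IPython/Jupyter magic; remove the line above when running as a plain script)

import matplotlib
# the figure size was lost in the source; 10 x 5 inches is an assumed value
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud    # word cloud package

# set the font file, background color, and maximum font size
# (the max_font_size value was lost in the source; 80 is an assumed value)
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# map each word to its count
# (the row limit passed to head() was lost in the source; 1000 is an assumed value)
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

# recent wordcloud releases expect a dict of frequencies rather than a list of tuples
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)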

simhei.ttf specifies the font used to render the words. Search for simhei.ttf on Baidu, download it, and put it in the program's root directory.


The complete code is as follows:

#coding:utf-8
__author__ = 'hang'

import warnings
warnings.filterwarnings("ignore")
import jieba    # Chinese word segmentation package
import codecs   # codecs.open() opens a file with an explicit encoding and decodes it to unicode on read
import re
import pandas as pd
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs
%matplotlib inline
# (IPython/Jupyter magic; remove the line above when running as a plain script)

import matplotlib
# the figure size was lost in the source; 10 x 5 inches is an assumed value
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud    # word cloud package

# Parse the now-playing page and return a list of {'id': ..., 'name': ...} dicts
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list

# Fetch one page of short reviews for a movie; pageNum starts at 1
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20    # 20 comments per page, matching limit=20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList

def main():
    # Fetch the first 10 pages of comments for the first movie in the list
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

    # Convert the data in the list into one string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # Use a regular expression to keep only Chinese characters (this removes punctuation)
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # Segment the Chinese text with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stopwords; quoting=3 (csv.QUOTE_NONE) disables special handling of quote characters
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # Count word frequencies
    # (the original dict-style agg({"計數": numpy.size}) is rejected by pandas 1.0 and later)
    words_stat = words_df.groupby('segment').size().reset_index(name='count')
    words_stat = words_stat.sort_values(by='count', ascending=False)

    # Display the word cloud
    # (max_font_size and the head() row limit were lost in the source; 80 and 1000 are assumed values)
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

    # recent wordcloud releases expect a dict of frequencies rather than a list of tuples
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)

# Entry point
main()

Running the script prints each request URL and then draws the word cloud for the first movie in the list.
