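Python Web Scraping in Practice, Part 1: Analyzing Reviews of the Latest Movies on Douban

Introduction

I picked up Python only recently, so I built a small project for practice. A few days ago I watched Wolf Warrior 2 (《戰狼2》) and noticed that it ranked first among the movies now showing, as the screenshot below shows. I decided to analyze its reviews (short comments) on Douban.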

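Overview

The project does three things:

Scrape the web page data

Clean the data

Display the result as a word cloud

The Python version used is 3.5.

Runtime environment: Jupyter Notebook. If the code raises errors in another environment, check the discussion in the comments section, which contains some workarounds.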

1. Scraping the web page data

The first step is to request the page. In Python this is done with the urllib library. The code is as follows:

from urllib import request

resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
html_data = resp.read().decode('utf-8')

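The address https://movie.douban.com/nowp... can be entered in a browser to view the page.

html_data is a string variable that holds the HTML source of the page.

Run print(html_data) to inspect it, as the screenshot below shows: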
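The bare urlopen call above sends urllib's default User-Agent, and some sites answer such requests with an error status. Here is a minimal sketch of a more defensive fetch, assuming the site accepts a browser-like User-Agent. The helper names build_request and fetch_html, the header string, and the 10-second timeout are illustrative choices, not part of the original code.

```python
from urllib import request
from urllib.error import HTTPError, URLError

NOWPLAYING_URL = 'https://movie.douban.com/nowplaying/hangzhou/'
# Assumed header value: any browser-like string; not taken from the original post.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}


def build_request(url, headers=None):
    """Return a Request carrying browser-like headers (hypothetical helper)."""
    return request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url, timeout=10):
    """Download a page and decode it as UTF-8; return None on an HTTP or URL error."""
    try:
        with request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except (HTTPError, URLError) as err:
        print('request failed:', err)
        return None
```

With these helpers, html_data = fetch_html(NOWPLAYING_URL) would replace the two-line fetch, and a None result signals that the page could not be retrieved.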

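The second step is to parse the returned HTML and extract the data we need.

In Python, the BeautifulSoup library handles HTML parsing.

(Note: if the library is not installed, run pip install beautifulsoup4. The package is then imported as bs4.)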

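BeautifulSoup is called like this:

BeautifulSoup(html, "html.parser")

The first argument is the HTML to extract data from and the second names the parser. After that, find_all() reads the content of the HTML tags.

With so many tags in the HTML, which ones should be read? The simplest approach is to open the HTML source of the page being scraped, find which tag holds the data we need, and read that tag. See the screenshot below:

The screenshot shows that the data we want starts at the div id="nowplaying" tag, which contains each movie's title, rating, leading actors, and so on. The corresponding code is: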

from bs4 import BeautifulSoup as bs

soup = bs(html_data, 'html.parser')
nowplaying_movie = soup.find_all('div', id='nowplaying')
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')

nowplaying_movie_list is a list. print(nowplaying_movie_list[0]) shows its first element, as in the screenshot below:


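The screenshot shows that the data-subject attribute holds the movie's id and the alt attribute of the img tag holds the movie's title, so we use these two attributes to get the id and the title. (Note: the id is needed later to open the movie's short-comment page, which is why we extract it.) The code is: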

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
    nowplaying_list.append(nowplaying_dict)

The list nowplaying_list now holds the id and title of each movie now showing. print(nowplaying_list) displays it, as in the screenshot below:
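To see the id and title extraction run without network access or bs4, here is a sketch that uses the standard library's html.parser in place of BeautifulSoup. The HTML fragment is a made-up sample that mimics the structure described above (li elements with class list-item and a data-subject attribute, each containing an img with an alt attribute). The class name NowPlayingParser and the second entry's id are hypothetical.

```python
from html.parser import HTMLParser

# Made-up fragment shaped like the now-playing list; not real page content.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul>
    <li class="list-item" data-subject="26363254">
      <img src="poster1.jpg" alt="戰狼2">
    </li>
    <li class="list-item" data-subject="00000001">
      <img src="poster2.jpg" alt="Sample Movie">
    </li>
  </ul>
</div>
"""


class NowPlayingParser(HTMLParser):
    """Collect {'id': ..., 'name': ...} for every li with class list-item."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._current = None  # dict for the <li> being read, or None outside one

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and 'list-item' in (attrs.get('class') or '').split():
            self._current = {'id': attrs.get('data-subject')}
        elif tag == 'img' and self._current is not None:
            self._current['name'] = attrs.get('alt')

    def handle_endtag(self, tag):
        if tag == 'li' and self._current is not None:
            self.movies.append(self._current)
            self._current = None


parser = NowPlayingParser()
parser.feed(SAMPLE_HTML)
print(parser.movies)
```

The result has the same shape as nowplaying_list: one dict per movie with an id key and a name key.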

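This matches what the Douban site shows, so we now have the information for the latest movies. Next comes the analysis of their short comments. For example, the short-comment URL for Wolf Warrior 2 is: https://movie.douban.com/subject/26363254/comments?start=0&limit=20

Here 26363254 is the movie's id, start=0 means the listing starts at comment 0 (the first comment), and limit=20 asks for 20 comments per page.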
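The URL pattern can be captured in a small helper. This is a sketch that assumes 20 comments per page, as in the URL above. The function name comment_url and the 1-based page numbering are illustrative choices.

```python
from urllib.parse import urlencode

COMMENTS_PER_PAGE = 20  # matches limit=20 in the URL above


def comment_url(movie_id, page):
    """Build the short-comment URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page must be 1 or greater')
    query = urlencode({'start': (page - 1) * COMMENTS_PER_PAGE,
                       'limit': COMMENTS_PER_PAGE})
    return 'https://movie.douban.com/subject/{}/comments?{}'.format(movie_id, query)


print(comment_url('26363254', 1))  # first page, start=0
print(comment_url('26363254', 3))  # third page, start=40
```

Building the query with urlencode keeps the parameters in one place instead of concatenating '?' and '&' by hand.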

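Next we parse this page. Opening the HTML source of the short-comment page, we find that the comment data sits inside div tags whose class is comment, as the screenshot below shows:

So we parse those tags with the following code: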

requrl = 'https://movie.douban.com/subject/' + nowplaying_list[0]['id'] + '/comments' + '?' + 'start=0' + '&limit=20'
resp = request.urlopen(requrl)
html_data = resp.read().decode('utf-8')
soup = bs(html_data, 'html.parser')
comment_div_lits = soup.find_all('div', class_='comment')

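comment_div_lits now holds the HTML of every div tag with class comment. The screenshot above also shows that the p tag inside each of them contains the user's comment on the movie, as shown below:

So we keep parsing the HTML stored in comment_div_lits: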

eachCommentList = []
for item in comment_div_lits:
    # .string is None when the <p> has more than one child node, so skip those
    comment_text = item.find_all('p')[0].string
    if comment_text is not None:
        eachCommentList.append(comment_text)

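print(eachCommentList) shows the contents of the eachCommentList list, which now holds the reviews we want. See the screenshot below:

At this point we have scraped the comment data for a movie now showing on Douban. Next come data cleaning and the word cloud.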

2. Cleaning the data

To make cleaning easier, we join the items of the list into a single string:

comments = ''
for k in range(len(eachCommentList)):
    comments = comments + (str(eachCommentList[k])).strip()

Use print(comments) to inspect it, as the screenshot below shows:
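The loop above builds the string one piece at a time. An equivalent and more idiomatic form strips each item and joins once. The sketch below compares the two on a made-up list standing in for eachCommentList.

```python
# Made-up sample standing in for eachCommentList.
sample_comments = ['  太燃了！ ', '特效不錯，劇情一般。\n', '  吳京很拼 ']

# Loop form used in the article.
comments_loop = ''
for k in range(len(sample_comments)):
    comments_loop = comments_loop + str(sample_comments[k]).strip()

# Equivalent one-liner: strip each item, then join once.
comments_join = ''.join(str(c).strip() for c in sample_comments)

print(comments_join)
```

Both forms produce the same string. The join version avoids creating a new intermediate string on every iteration.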

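All the comments are now one string, but it still contains a lot of punctuation. These symbols are useless for word-frequency statistics, so they need to be removed. The tool for this is regular expressions, which Python provides through the re module. The code is: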

import re

pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)

Use print(cleaned_comments) to inspect the result, as the screenshot below shows:

The punctuation is gone and the data is much "cleaner".
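To make the effect of the pattern concrete, here is a small sketch run on a made-up comment string. The constant name CHINESE_RUN is illustrative. Note that the pattern keeps only characters in the range U+4E00 to U+9FA5, so Latin letters and digits are dropped along with the punctuation.

```python
import re

# Matches runs of characters in the range U+4E00..U+9FA5 (common CJK ideographs).
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')

sample = '太燃了！！！Wolf Warrior 2，好看。10/10'
runs = CHINESE_RUN.findall(sample)   # one list item per unbroken run of Chinese text
cleaned = ''.join(runs)              # glue the runs back together, as the article does

print(runs)
print(cleaned)
```

Because the runs are joined with an empty string, the last character of one sentence ends up next to the first character of the next, which the word segmenter may occasionally read as a word. Joining with a space instead is one way to avoid that.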

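Because we want word-frequency statistics, the text first has to be segmented into Chinese words. I use jieba for this. If jieba is not installed, run pip install jieba in a console. (Note: pip list shows whether these libraries are installed.) The code is: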

import jieba  # Chinese word segmentation
import pandas as pd

segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

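pandas is loaded here to hold the segmentation result in a DataFrame for the later filtering and counting steps (jieba itself does not need pandas). words_df.head() shows the result of the segmentation, as in the screenshot below: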

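The screenshot shows that the data contains function words (stop words) such as "看" (watch), "太" (too), and "的" (a grammatical particle). These words are frequent in any context and carry no real meaning, so we remove them.

I keep the stop words in a file named stopwords.txt and match our data against it. (Note: searching for stopwords.txt on Baidu turns up a file to download.) The stop-word removal code is: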

# quoting=3 is csv.QUOTE_NONE: quote characters are read literally
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

Use words_df.head() again to check the result. As the screenshot below shows, the stop words have been removed.
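The same filtering can be sketched without pandas, using a plain set for the stop words. This is a swapped-in technique, not the article's code. The helper names load_stopwords and remove_stopwords are hypothetical, and the demo writes a tiny temporary file that stands in for stopwords.txt (one word per line is assumed).

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(words, stopwords):
    """Return the words that are not stop words, keeping their order."""
    return [w for w in words if w not in stopwords]


# Demo with a temporary file standing in for stopwords.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'stopwords.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('的\n太\n看\n')
    demo_stopwords = load_stopwords(demo_path)

segmented = ['特效', '太', '好', '的', '電影', '看']
filtered = remove_stopwords(segmented, demo_stopwords)
print(filtered)
```

A set gives constant-time membership tests, which is the same idea as isin() in the DataFrame version.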

Next comes the word-frequency count:

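# groupby().size() counts the rows per word; the original agg({"計數": numpy.size})
# dict-renaming form raises SpecificationError on pandas 1.0 and later
words_stat = words_df.groupby('segment').size().reset_index(name="計數")
words_stat = words_stat.sort_values(by=["計數"], ascending=False)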

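Use words_stat.head() to inspect it. The result is:

Since we have only scraped the first page of comments so far, the data is a bit thin. The complete code at the end scrapes 10 pages of comments, so its result has some reference value.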
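For a quick check of the frequency count without pandas, the standard library's collections.Counter does the same job. This is a swapped-in technique shown on a made-up word list, not the article's code.

```python
from collections import Counter

# Made-up segmented words standing in for the filtered jieba output.
words = ['吳京', '動作', '愛國', '吳京', '動作', '吳京']

word_counts = Counter(words)            # maps each word to its number of occurrences
top_words = word_counts.most_common(2)  # the two most frequent words with their counts

print(top_words)
```

dict(word_counts) yields a plain word-to-count mapping, the same shape as the word_frequence dict built later for the word cloud.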

3. Displaying the word cloud

The code is:

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package

# set the font file, background color, and maximum font size
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# map each word to its count for the 1000 most frequent words
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
# recent wordcloud releases expect a dict here; older ones took a list of (word, count) tuples
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)

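simhei.ttf specifies the font, which is needed to render Chinese characters. Search for simhei.ttf on Baidu, download it, and put it in the program's root directory. The resulting image is: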

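That concludes the walkthrough of the project. I am still a beginner who has not used Python for long, so the code is not great. This is also my first technical blog post and the writing is a bit wordy, so please bear with me and point out anything that is wrong. I plan to write up my future small projects in the same way and share them here. The complete code follows.

Complete code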

#coding:utf-8
__author__ = 'hang'

import warnings
warnings.filterwarnings("ignore")
import jieba  # Chinese word segmentation
import codecs  # codecs.open opens a file with a given encoding and decodes it to unicode on read
import re
import pandas as pd
import matplotlib.pyplot as plt
from urllib import request
from bs4 import BeautifulSoup as bs
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
from wordcloud import WordCloud  # word cloud package

# Parse the now-playing page
def getNowPlayingMovie_list():
    resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
        nowplaying_list.append(nowplaying_dict)
    return nowplaying_list

# Scrape one page of comments
def getCommentsById(movieId, pageNum):
    eachCommentList = []
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = request.urlopen(requrl)
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        if item.find_all('p')[0].string is not None:
            eachCommentList.append(item.find_all('p')[0].string)
    return eachCommentList

def main():
    # Loop over the first 10 comment pages of the first movie
    commentList = []
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentList.append(commentList_temp)

    # Convert the list data into one string
    comments = ''
    for k in range(len(commentList)):
        comments = comments + (str(commentList[k])).strip()

    # Use a regular expression to drop punctuation
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filterdata = re.findall(pattern, comments)
    cleaned_comments = ''.join(filterdata)

    # Segment the Chinese text with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stop words (quoting=3 is csv.QUOTE_NONE)
    stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

    # Count word frequencies (the original agg({"計數": numpy.size}) fails on pandas 1.0 and later)
    words_stat = words_df.groupby('segment').size().reset_index(name="計數")
    words_stat = words_stat.sort_values(by=["計數"], ascending=False)

    # Display the word cloud (recent wordcloud releases expect a dict of word to count)
    wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)

# Entry point
main()
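The loop in main() requests ten pages back to back and stores each page as a nested list. Below is a hedged sketch of a gentler variant that pauses between requests, stops at the first empty page, and returns one flat list. The function collect_comments, its delay parameter, and the fake_fetch stand-in are all hypothetical. In real use, getCommentsById would be passed as fetch_page.

```python
import time


def collect_comments(fetch_page, movie_id, pages=10, delay=1.0):
    """Gather comments from pages 1..pages into one flat list.

    fetch_page(movie_id, page) should return a list of comment strings,
    or False when the page number is invalid (as getCommentsById does).
    """
    all_comments = []
    for page in range(1, pages + 1):
        page_comments = fetch_page(movie_id, page)
        if not page_comments:
            break  # False or an empty page: stop requesting further pages
        all_comments.extend(page_comments)
        if page < pages:
            time.sleep(delay)  # pause between requests
    return all_comments


def fake_fetch(movie_id, page):
    """Stand-in for getCommentsById: two comments per page, nothing after page 3."""
    if page > 3:
        return []
    return ['comment {}-{}'.format(page, n) for n in (1, 2)]


demo = collect_comments(fake_fetch, '26363254', pages=10, delay=0)
print(len(demo))
```

With a flat list, the later string-building step can join the comments directly instead of calling str() on each nested page list.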

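The result looks like this:

The image above basically reflects how Wolf Warrior 2 was received. PS: I personally do not like this movie. The content is hollow and fake, patriotic for the sake of being patriotic, and not interesting. These past two years have been a real low point for Chinese cinema, without a single domestic film worth showing off. India's Dangal (《摔跤吧，爸爸》), by contrast, has real depth. It is also about patriotism, and Chinese films still have a lot to learn from other countries.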