Python Web Scraping in Practice: Simulated Douban Login + Review Scraping + Word Cloud Generation

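Project description

Scrape the short reviews of "Ne Zha" (《哪吒之魔童降世》) on Douban and build a word cloud from them.

Techniques covered:

Object-oriented Python
Simulated login and content scraping
HTML parsing with BeautifulSoup (the counterpart of JSoup in Java)
Word segmentation and word cloud generation

What you can do afterwards: scrape content you are interested in from the web, such as novels, images, music, or movies, or collect other valuable data, for example product information from e-commerce sites to build a price comparison site.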

Environment setup

Install Python 3.x using the installer from the official website;
Install the third-party packages used in this project:

pip install requests
pip install beautifulsoup4
pip install Pillow
pip install pandas
pip install numpy
pip install jieba
pip install wordcloud

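Third-party packages

requests: fetches data from URLs

beautifulsoup4: HTML parsing; extracts the useful data from a web page

Pillow (imported as PIL): loads and displays images

pandas: data processing and saving to tabular files

numpy: data processing and matrix operations

jieba: Chinese word segmentation

wordcloud: word cloud generation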
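
Before going further, it can help to confirm that every package is importable. The following is a minimal sketch, not part of the original project: the `PACKAGES` table and the `missing_packages` helper are illustrative names. It also shows that the name passed to pip and the name used in `import` differ for two packages (beautifulsoup4 is imported as `bs4`, Pillow as `PIL`).

```python
import importlib.util

# pip package name -> module name used in `import`
PACKAGES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "Pillow": "PIL",
    "pandas": "pandas",
    "numpy": "numpy",
    "jieba": "jieba",
    "wordcloud": "wordcloud",
}


def missing_packages(packages):
    """Return the pip names whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(PACKAGES)
    print("missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can be installed with `pip install <name>`.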

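Simulated Douban login

Why simulate a login?

Some sites restrict access for visitors who are not logged in. For example, without logging in, only 200 Douban short reviews can be read.

Simulated login workflow:

Open the login page;
Open Chrome DevTools (right-click the page and choose "Inspect", or press F12);
Log in;
Capture the login request in the Network panel of Chrome DevTools.

This yields the following information:

Login URL: https://accounts.douban.com/j/mobile/login/basic

Login parameters: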

{
    'ck': '',
    'name': "your Douban account",
    'password': "your Douban password",
    'remember': 'false',
    'ticket': ''
}
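
To avoid hard-coding credentials in the source file, the parameters above can be built by a small helper. This is a sketch under stated assumptions: `build_login_data` and the environment variable names `DOUBAN_NAME` and `DOUBAN_PASSWORD` are hypothetical and do not appear in the original project. The field names are the ones captured from the login request.

```python
import os


def build_login_data(name, password):
    """Return the form fields of the login request."""
    return {
        "ck": "",
        "name": name,
        "password": password,
        "remember": "false",
        "ticket": "",
    }


if __name__ == "__main__":
    # DOUBAN_NAME / DOUBAN_PASSWORD are hypothetical variable names
    data = build_login_data(
        os.environ.get("DOUBAN_NAME", ""),
        os.environ.get("DOUBAN_PASSWORD", ""),
    )
    print(sorted(data))  # field names only, never print the password
```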

Reference login code:

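import requests

class DouBan:
    """Keeps a logged-in requests session for Douban."""

    def __init__(self):
        self.login_url = 'https://accounts.douban.com/j/mobile/login/basic'
        # Send a desktop browser User-Agent instead of the requests default
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
        }
        # Form fields captured from the browser's login request
        self.login_data = {
            'ck': '',
            'name': "your Douban account",
            'password': "your Douban password",
            'remember': 'false',
            'ticket': ''
        }
        # The Session keeps the login cookies for all later requests
        self.session = requests.Session()
        self.login()

    def login(self):
        response = self.session.post(self.login_url, data=self.login_data, headers=self.headers)
        # Print the JSON reply so that a failed login is visible
        print(response.json())

    def get_html(self, url):
        # Fetch a page through the logged-in session
        return self.session.get(url, headers=self.headers)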
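
The login method only prints the reply, so a failed login goes unnoticed by the rest of the program. A possible check is sketched below. It rests on an assumption that is not confirmed by the text above: that the JSON reply is an object whose "status" field equals "success" after a successful login. Verify this against the reply you actually see in DevTools before relying on it; `login_succeeded` is an illustrative name.

```python
def login_succeeded(payload):
    """Return True if the decoded login reply reports success.

    Assumption: the reply is a JSON object with a "status" field
    that equals "success" after a successful login.
    """
    return isinstance(payload, dict) and payload.get("status") == "success"


print(login_succeeded({"status": "success"}))  # prints True
print(login_succeeded({"status": "failed"}))   # prints False
```

Inside `login`, the result of `response.json()` could be passed to this helper, raising an error when it returns False.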

Review scraping

Find the review page for "Ne Zha" on Douban.
Inspect the short-review page and decide which fields to scrape:

Username ('.comment-info a')[0].text
Star rating ('.rating')[0]['class'][0][7:8]
Review text ('.short')[0].text
Time ('.comment-time')[0].text
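
The `[7:8]` slice in the star-rating expression is the least obvious part of the list above. The sketch below shows what it does, assuming the first class of the rating element looks like `allstar50` (five stars) or `allstar10` (one star); that class format is an assumption inferred from the slice, and `rating_from_class` is an illustrative helper.

```python
def rating_from_class(css_class):
    """Extract the star count from a class name such as 'allstar50'."""
    prefix = "allstar"
    if not css_class.startswith(prefix):
        return None
    # len("allstar") == 7, so this is the same as css_class[7:8]:
    # the first digit after the prefix
    return css_class[len(prefix):len(prefix) + 1]


print(rating_from_class("allstar50"))  # prints 5
print(rating_from_class("allstar10"))  # prints 1
print(rating_from_class("rating"))     # prints None
```

Note that the result is a string, so it needs `int()` before any arithmetic.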

Pagination

1) Determine the paging URL

https://movie.douban.com/subject/26794435/comments?start=0&limit=20&sort=new_score&status=P

2) Determine the total number of reviews (i.e. when to stop)

Only 500 reviews are scraped.
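
The paging scheme can be sketched on its own: the `start` parameter is the offset of the first review on a page and `limit` is the page size, so 500 reviews at 20 per page means 25 requests. `page_urls` is an illustrative helper, not part of the original code.

```python
COMMENT_URL = (
    "https://movie.douban.com/subject/26794435/comments"
    "?start=%d&limit=%d&sort=new_score&status=P"
)


def page_urls(total=500, page_size=20):
    """Return one URL per page, covering offsets 0 .. total - 1."""
    return [
        COMMENT_URL % (start, page_size)
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))  # prints 25
print(urls[0])
print(urls[-1])
```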

from nezha.douban2 import DouBan
import time
import random
from bs4 import BeautifulSoup
import pandas as pd
import jieba
from wordcloud import WordCloud
import numpy as np
from PIL import Image

class nezha2:
    def __init__(self):
        # %d is replaced by the offset of the first review on each page
        self.comment_url = 'https://movie.douban.com/subject/26794435/comments?start=%d&limit=20&sort=new_score&status=P'
        self.comment_count = 500
        self.douban = DouBan()

    def get_comments(self):
        comments = {'users': [], 'ratings': [], 'shorts': [], 'times': []}
        # 20 reviews per page: start = 0, 20, ..., 480
        for i in range(0, self.comment_count, 20):
            # Pause for up to one second between requests
            time.sleep(random.random())
            url = self.comment_url % i
            response = self.douban.get_html(url)
            print('Progress:', i, 'reviews, status:', response.status_code)
            # Name the parser explicitly instead of letting BeautifulSoup guess
            soup = BeautifulSoup(response.text, 'html.parser')
            for comment in soup.select('.comment-item'):
                try:
                    user = comment.select('.comment-info a')[0].text
                    rating = comment.select('.rating')[0]['class'][0][7:8]
                    short = comment.select('.short')[0].text
                    t = comment.select('.comment-time')[0].text.strip()
                except (IndexError, KeyError):
                    # Skip reviews with a missing field, e.g. no star rating
                    continue
                else:
                    comments['users'].append(user)
                    comments['ratings'].append(rating)
                    comments['shorts'].append(short)
                    comments['times'].append(t)

        comments_pd = pd.DataFrame(comments)
        # Save the complete review data
        comments_pd.to_csv('comments.csv')
        # Save only the review text, as input for word segmentation
        comments_pd['shorts'].to_csv('shorts.csv', index=False)

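Word segmentation

Use jieba for word segmentation. Be sure to filter out meaningless words (stop words); otherwise the result is dominated by words such as "我", "是", and "一" ("I", "is", "one").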

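    def word_cut(self):
        # Register custom words so that jieba keeps them as single tokens
        with open('data/mywords.txt', encoding='utf8') as f:
            jieba.load_userdict(f)

        # Read the review text
        with open('shorts.csv', 'r', encoding='utf8') as f:
            comments = f.read()

        # A set makes the membership test below fast
        with open('data/stop.txt', encoding='utf8') as f:
            stop_words = set(f.read().splitlines())

        words = []
        # Filter out meaningless words
        for word in jieba.cut(comments):
            if word not in stop_words:
                words.append(word)

        # WordCloud expects one string with space-separated words
        words = ' '.join(words)
        return words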
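
The filtering step can be tried in isolation, without jieba or the data files. The sketch below uses a hand-written token list in place of the output of `jieba.cut`; `remove_stop_words` is an illustrative helper. Unlike the method above, it also drops whitespace-only tokens, which is an addition and not something the original code does.

```python
def remove_stop_words(tokens, stop_words):
    """Keep tokens that are neither stop words nor pure whitespace."""
    stop_set = set(stop_words)  # set lookup is O(1) per token
    return [t for t in tokens if t.strip() and t not in stop_set]


# Hand-written tokens standing in for the output of jieba.cut
tokens = ["我", "喜欢", "哪吒", "，", "是", "好", "电影"]
stop_words = ["我", "是", "，"]

print(" ".join(remove_stop_words(tokens, stop_words)))  # prints 喜欢 哪吒 好 电影
```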

Word cloud

Use wordcloud to generate the word cloud.

    def generate_wordcount(self):
        word_cloud = WordCloud(
            background_color='white',
            # A font with Chinese glyphs is required to render Chinese (macOS path)
            font_path='/System/Library/Fonts/PingFang.ttc',
            # The image defines the shape of the cloud
            mask=np.array(Image.open('data/nezha.jpg')),
            max_font_size=100
        )
        word_cloud.generate(self.word_cut())
        # Show the result, then save it to a file
        word_cloud.to_image().show()
        word_cloud.to_file('word.jpg')
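
A word cloud sizes each word by how often it occurs, so counting the words first is a quick way to check whether the stop-word list is good enough before rendering an image. A minimal sketch using only the standard library; `top_words` is an illustrative helper that expects the space-separated string returned by the segmentation step.

```python
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent space-separated words with their counts."""
    return Counter(text.split()).most_common(n)


sample = "哪吒 好看 哪吒 感动 好看 哪吒"
print(top_words(sample, 2))  # prints [('哪吒', 3), ('好看', 2)]
```

If meaningless words dominate this list, add them to the stop-word file and run the segmentation again.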

Word cloud result