Python Web Scraping in Practice | (1) Scraping the Maoyan Movies TOP100 Board

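In this post we use requests plus regular expressions to scrape the TOP100 board on the Maoyan Movies site, extracting each movie's title, leading actors, release date, score, and poster image URL.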

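Open the Maoyan Top100 board and watch how the URL changes. The board spans 10 pages of 10 movies each, and the page URLs follow a regular pattern: page 2 is https://maoyan.com/board/4?offset=10 and page 3 is https://maoyan.com/board/4?offset=20. It follows that page n is https://maoyan.com/board/4?offset=(n-1)*10, so page 1 corresponds to offset=0. Next we scrape the board with the usual four-step crawler workflow: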
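The URL pattern described above can be sketched as a small helper. This is only an illustration: the names BASE_URL and page_url are hypothetical and do not appear in the original code, which builds the URL inline from a 0-based loop index.

```python
# Hypothetical helper illustrating the offset rule: page n -> offset=(n-1)*10.
BASE_URL = 'https://maoyan.com/board/4'


def page_url(n):
    """Return the URL of page n (1-based) of the Top100 board."""
    if not 1 <= n <= 10:
        raise ValueError('page number must be between 1 and 10')
    return f'{BASE_URL}?offset={(n - 1) * 10}'


# All ten page URLs, in order.
urls = [page_url(n) for n in range(1, 11)]
print(urls[0])   # https://maoyan.com/board/4?offset=0
print(urls[-1])  # https://maoyan.com/board/4?offset=90
```

The main loop later in the post achieves the same thing with range(10) and i*10, which is simply the 0-based form of the same rule.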

First, set up the main skeleton of the program:

import csv
import json
import time
import re
import requests
from requests import RequestException

def get_one_page(url):
    pass

def parse_one_page(html):
    pass

def write_tofile(content):
    pass


if __name__ == '__main__':
    for i in range(10):
        url = 'https://maoyan.com/board/4?offset='+str(i*10)
        # Send the request and get the response
        html = get_one_page(url)
        # Parse the response content
        content = parse_one_page(html)
        # Store the data
        write_tofile(content)
        # Wait 1 s between pages
        time.sleep(1)

Then fill in the functions above one by one. First, the function that requests a page:

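def get_one_page(url):
    try:
        # Add a User-Agent to the headers so the request looks like it comes from a browser
        headers = {
            'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # A timeout keeps the script from hanging forever on an unresponsive server
        response = requests.get(url,headers=headers,timeout=10)
        if response.status_code == 200:
            # Use the encoding detected from the body to avoid garbled text
            response.encoding = response.apparent_encoding
            return response.text
        return None
    except RequestException:
        return None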

To find a User-Agent value, open any page in Chrome, right-click and choose Inspect, select a request in the Network tab, and look under Headers. Copy the value over as it is.

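Next, write the page-parsing function. Start by opening the Top100 board in Chrome, right-clicking and choosing Inspect, and locating the page elements in the Elements tab.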

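Click the arrow icon in the upper-left corner of the developer tools. Now, when you move the mouse over any part of the page, the corresponding HTML is highlighted in the Elements tab; press Esc to leave this mode. Each movie turns out to be wrapped in a dd tag, and all the information we want is inside it.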

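Next, copy all the HTML inside one dd tag, replace the irrelevant parts with .*?, keep a few markers that help locate the information we want, and put each piece of wanted information in a (.*?) capture group to extract it: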

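def parse_one_page(html):
    # get_one_page returns None when the request fails; nothing to parse then
    if not html:
        return []
    pattern = re.compile('<dd>.*?board-index.*?">(.*?)</i>' # rank
                        +'.*?src="(.*?)"' # image URL
                        +'.*?class="name"><a.*?>(.*?)</a>' # movie title
                        +'.*?class="star">(.*?)</p>' # leading actors
                        +'.*?class="releasetime">(.*?)</p>' # release date
                        +'.*?class="score"><i class="integer">(.*?)</i>' # integer part of the score
                        +'.*?class="fraction">(.*?)</i></p>' # fractional part of the score
                        +'.*?</dd>',re.S) # re.S lets . match newlines too
    items = pattern.findall(html) # returns a list of tuples, one per movie
    content = []
    for item in items:
        content.append({
            'index':item[0],
            'image':item[1],
            'title':item[2],
            'actor':item[3].strip(), # drop the surrounding whitespace and newlines
            'time':item[4],
            'score':item[5]+item[6]
        })
    return content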
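To see how the non-greedy .*? fillers and the (.*?) capture groups work together, the same pattern can be run against a small fragment. The SAMPLE markup below is invented and simplified for illustration (the film, actors, and URL are placeholders), so the real Maoyan page may differ in its details.

```python
import re

# Hypothetical, simplified <dd> fragment; the real page markup may differ.
SAMPLE = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="#" class="image-link">
    <img data-src="https://example.com/poster.jpg" alt="Example Film" class="board-img" />
  </a>
  <p class="name"><a href="#" title="Example Film">Example Film</a></p>
  <p class="star">
    Starring: Actor A, Actor B
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Same pattern as in the post: .*? skips as little as possible up to the next
# marker, and each (.*?) captures the text between two markers.
pattern = re.compile('<dd>.*?board-index.*?">(.*?)</i>'
                     + '.*?src="(.*?)"'
                     + '.*?class="name"><a.*?>(.*?)</a>'
                     + '.*?class="star">(.*?)</p>'
                     + '.*?class="releasetime">(.*?)</p>'
                     + '.*?class="score"><i class="integer">(.*?)</i>'
                     + '.*?class="fraction">(.*?)</i></p>'
                     + '.*?</dd>', re.S)

# findall returns one tuple of seven captured strings per <dd> block.
matches = pattern.findall(SAMPLE)
rank, image, title, actor, release, integer, fraction = matches[0]

print(rank, title, integer + fraction)
print(repr(actor))          # raw capture still contains newlines and indentation
print(repr(actor.strip()))  # cleaned value
```

Without re.S the . would stop at line breaks and the pattern would find nothing in multi-line markup like this, which is why the flag is passed to re.compile.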

Finally, store the data:

def write_tofile(content):
    # Alternative: store each record as one JSON object per line in a text file
    # with open('result.txt','a',encoding='utf-8') as f:
    #     for item in content:
    #         f.write(json.dumps(item,ensure_ascii=False)+'\n')

    # Store as a CSV file. Open it once per page; newline='' avoids blank rows
    # on Windows, and the with block closes the file automatically.
    with open('result.csv','a',encoding='utf-8',newline='') as f:
        writer = csv.writer(f)
        for item in content:
            print(item)
            writer.writerow([item['index'],item['image'],item['title'],item['actor'],item['time'],item['score']])
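As a variation on the storage step, the sketch below uses csv.DictWriter so that the column order is declared once and the header is written only when the file is new. It is not the approach used in this post: the names FIELDS and write_rows are hypothetical, the sample record is made up, and the demo writes to a temporary directory rather than result.csv.

```python
import csv
import os
import tempfile

# Column order shared by the header and every row (hypothetical name).
FIELDS = ['index', 'image', 'title', 'actor', 'time', 'score']


def write_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings (no blank rows on Windows).
    with open(path, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demo with a made-up record, written twice to show the header appears once.
demo_path = os.path.join(tempfile.mkdtemp(), 'result.csv')
sample = {'index': '1', 'image': 'https://example.com/poster.jpg',
          'title': 'Example Film', 'actor': 'Starring: Actor A, Actor B',
          'time': 'Release date: 1993-01-01', 'score': '9.5'}
write_rows(demo_path, [sample])
write_rows(demo_path, [sample])

with open(demo_path, encoding='utf-8', newline='') as f:
    lines = list(csv.reader(f))
print(len(lines))  # 3: one header row plus two data rows
```

Because the dictionaries returned by parse_one_page already use these keys, such a helper could take its output directly, at the cost of tying the CSV layout to those key names.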

Complete code:

import csv
import json
import time
import re
import requests
from requests import RequestException

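def get_one_page(url):
    try:
        # Add a User-Agent to the headers so the request looks like it comes from a browser
        headers = {
            'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        # A timeout keeps the script from hanging forever on an unresponsive server
        response = requests.get(url,headers=headers,timeout=10)
        if response.status_code == 200:
            # Use the encoding detected from the body to avoid garbled text
            response.encoding = response.apparent_encoding
            return response.text
        return None
    except RequestException:
        return None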


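def parse_one_page(html):
    # get_one_page returns None when the request fails; nothing to parse then
    if not html:
        return []
    pattern = re.compile('<dd>.*?board-index.*?">(.*?)</i>' # rank
                        +'.*?src="(.*?)"' # image URL
                        +'.*?class="name"><a.*?>(.*?)</a>' # movie title
                        +'.*?class="star">(.*?)</p>' # leading actors
                        +'.*?class="releasetime">(.*?)</p>' # release date
                        +'.*?class="score"><i class="integer">(.*?)</i>' # integer part of the score
                        +'.*?class="fraction">(.*?)</i></p>' # fractional part of the score
                        +'.*?</dd>',re.S) # re.S lets . match newlines too
    items = pattern.findall(html) # returns a list of tuples, one per movie
    content = []
    for item in items:
        content.append({
            'index':item[0],
            'image':item[1],
            'title':item[2],
            'actor':item[3].strip(), # drop the surrounding whitespace and newlines
            'time':item[4],
            'score':item[5]+item[6]
        })
    return content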

def write_tofile(content):
    # Alternative: store each record as one JSON object per line in a text file
    # with open('result.txt','a',encoding='utf-8') as f:
    #     for item in content:
    #         f.write(json.dumps(item,ensure_ascii=False)+'\n')

    # Store as a CSV file. Open it once per page; newline='' avoids blank rows
    # on Windows, and the with block closes the file automatically.
    with open('result.csv','a',encoding='utf-8',newline='') as f:
        writer = csv.writer(f)
        for item in content:
            print(item)
            writer.writerow([item['index'],item['image'],item['title'],item['actor'],item['time'],item['score']])


if __name__ == '__main__':
    # Write the CSV header row; mode 'w' also clears results from earlier runs
    with open('result.csv','w',encoding='utf-8',newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Rank', 'Image URL', 'Title', 'Starring', 'Release date', 'Score'])
    for i in range(10):
        url = 'https://maoyan.com/board/4?offset='+str(i*10)
        print(url)
        # Send the request and get the response
        html = get_one_page(url)
        # Parse the response content
        content = parse_one_page(html)
        # Store the data
        write_tofile(content)
        # Wait 1 s between pages
        time.sleep(1)
