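[Python 3 Web Scraping] Scraping the Maoyan Movie Ranking

# Scraping the top 100 films of the Maoyan movie ranking

# Goal: extract the title, release time, score, poster image, and other information for each film in the Maoyan TOP 100

# Target site: http://maoyan.com/board/4 (the extracted results are saved to a file)

# Knowledge used: web page basics, networking basics, urllib, requests, regular expressions

1. Scraping analysis:

1. Useful information on the page: film title, lead actors, release date, release region, score, and poster image, with 10 entries per page.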

2. Clicking through to the second page changes the URL in the address bar to http://maoyan.com/board/4?offset=10

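So offset is the offset into the ranking: each page lists 10 films starting from that position, and the TOP 100 spans offsets 0, 10, ..., 90.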
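As a small sketch of this URL pattern, the ten page URLs can be generated from the offset. The helper name `build_page_urls` and the constant `BASE_URL` are hypothetical and only used for illustration.

```python
BASE_URL = 'http://maoyan.com/board/4'


def build_page_urls(pages=10, page_size=10):
    """Return the URL of every ranking page; the offset grows by page_size per page."""
    return [f'{BASE_URL}?offset={page * page_size}' for page in range(pages)]


urls = build_page_urls()
print(urls[0])   # first page, offset 0
print(urls[-1])  # last page, offset 90
```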

2. Fetching the first page

import requests

def get_one_page(url):
    headers ={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36',
        }
    response = requests.get(url,headers = headers)
    if response.status_code ==200:
        return response.text
    return None

def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    print(html)
    
main()

3. Regex extraction

First, inspect the source of a single entry (each film sits in its own <dd> node):

1. Extracting the rank:

The regular expression here can be written as: <dd>.*?board-index.*?>(.*?)</i>  Note that for a pair of tags, only the closing tag needs to be written out.
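The capture group is best kept non-greedy, `(.*?)`, so that it stops at the first `</i>`. The sketch below compares the greedy and non-greedy forms on a made-up, simplified snippet; the markup in `sample` is an assumption for illustration and is much shorter than the real page.

```python
import re

# Hypothetical, simplified markup with two entries; the real page has more attributes.
sample = (
    '<dd><i class="board-index board-index-1">1</i><p class="name">A</p></dd>\n'
    '<dd><i class="board-index board-index-2">2</i><p class="name">B</p></dd>'
)

# re.S lets "." match newlines, because the entries span several lines.
greedy = re.findall(r'<dd>.*?board-index.*?>(.*)</i>', sample, re.S)
lazy = re.findall(r'<dd>.*?board-index.*?>(.*?)</i>', sample, re.S)

print(len(greedy))  # the greedy group runs on to the last </i>, merging both entries
print(lazy)         # the non-greedy group yields one rank per entry
```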

2. Extracting the film image

There are two image links; a quick test shows that data-src is the film poster.

The regular expression can be extended to: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)"

3. Extracting the film title

Regular expression: <dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>

4. Extracting the lead actors, release time, score, and other fields:

<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>

This fourth expression is the complete one and captures seven pieces of information. Next, call the findall method to extract every match.

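import re

def parse_one_page(html):
    # re.S lets "." match newlines, since each <dd> entry spans several lines
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
        r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # drop the 3-character label in front of the actor names
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            # drop the 5-character label in front of the release time
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            # the score is split into an integer part and a fraction part
            'score': item[5].strip() + item[6].strip()
        }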

This extracts each film's rank, image, title, actors, release time, and score, and yields them as a dictionary, which gives structured data.
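To check the extraction without touching the network, the same seven-group pattern can be run against a single hand-written entry. The sketch below is only illustrative: `SAMPLE_HTML` is a made-up entry modeled on the structure described above (its URLs, names, and values are placeholders), and `parse_sample` is a hypothetical name.

```python
import re

# Hypothetical entry modeled on the page structure; every value is a placeholder.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" class="image-link">
    <img src="//example.com/loading.png" class="poster-default" />
    <img data-src="//example.com/poster.jpg" class="board-img" />
  </a>
  <p class="name"><a href="/films/1">Example Film</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)


def parse_sample(html):
    """Yield one dictionary per <dd> entry found in html."""
    for item in PATTERN.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            'actor': item[3].strip()[3:],  # drop the 3-character "主演：" label
            'time': item[4].strip()[5:],   # drop the 5-character "上映时间：" label
            'score': item[5].strip() + item[6].strip(),
        }


result = list(parse_sample(SAMPLE_HTML))
print(result)
```

Because the pattern asks for data-src rather than src, the placeholder loading image in the first img tag is skipped and only the poster URL is captured.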

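4. Writing to a file

Here the results are written straight to a text file. The dumps() method of the json library serializes each dictionary, and the ensure_ascii parameter is set to False so that the output contains readable Chinese characters rather than \uXXXX Unicode escapes.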

import json

def write_to_file(content):
    # append mode: every call adds one JSON object on its own line
    with open('result.txt','a',encoding='utf-8') as f:
        f.write(json.dumps(content,ensure_ascii=False)+'\n')
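The effect of ensure_ascii can be seen by serializing the same dictionary both ways. This is a minimal sketch; the record below is a placeholder, not scraped data.

```python
import json

# Placeholder record, used only to show the two output forms.
record = {'title': '电影名称', 'index': '1'}

escaped = json.dumps(record)                       # default ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the Chinese characters

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # non-ASCII characters appear as written

# Both strings decode back to the same dictionary.
print(json.loads(escaped) == json.loads(readable))
```

Since the readable form contains non-ASCII characters, the file should be opened with encoding='utf-8', as write_to_file does.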

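5. Putting the code together

Finally, implement a main() method that calls the functions defined above and writes the films from a single page to the file. The code is as follows: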

def main():
    url = 'http://maoyan.com/board/4'
    html = get_one_page(url)
    for item in parse_one_page(html):
        write_to_file(item)

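6. Paginated crawling

Since we need the TOP 100 films, we still have to loop over the pages, passing the offset parameter into the URL to crawl the other 90 films. Add the following call: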

if __name__=='__main__':
    for i in range(10):
        main(offset=i*10)

The main() method also needs to be changed to accept an offset value, which it uses to build the URL before crawling. The implementation is as follows:

def main(offset):
    url = 'http://maoyan.com/board/4?offset='+str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)

7. Complete code:

import json
import requests
from requests.exceptions import RequestException
import re
import time


def get_one_page(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    # raw strings keep \d from being read as an invalid string escape
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?src="(.*?)".*?name"><a'
                         + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                         + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }


def write_to_file(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)
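Because write_to_file appends one JSON object per line, the results can be loaded back line by line. The sketch below assumes that format; `read_results` is a hypothetical helper, and the demonstration writes placeholder records to a temporary file rather than using real scraped data.

```python
import json
import os
import tempfile


def read_results(path):
    """Load a file holding one JSON object per line into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Demonstration on a temporary file with placeholder records.
demo_path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with open(demo_path, 'a', encoding='utf-8') as f:
    for record in ({'index': '1', 'title': '示例一'}, {'index': '2', 'title': '示例二'}):
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

movies = read_results(demo_path)
print(len(movies), movies[0]['title'])
```

Note that result.txt is opened in append mode in the crawler, so running the crawler twice leaves duplicate lines in the file unless it is deleted first.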

I wrote this crawler while following Mr. Cui's book and decided to rewrite it on my own; next I will try crawling a different site.
