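Python Crawler in Practice: Scraping the Douban Top 250 Movies with Regular Expressions
1. Project goal
Scrape the rating, number of ratings, short review, and other details for each movie in the Douban Top 250 and save them to a txt file. The HTML is parsed with regular expressions.
2. Identify the page content
Crawl URL: https://movie.douban.com/top250
Fields to scrape: movie link, movie title, director/cast names, rating, synopsis, and number of ratings
Open the page and press F12 to inspect its structure in the browser's developer tools.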

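Inspecting the markup shows that each movie's details sit inside an li tag:
The movie title is in span tags whose class is title; there may be more than one (Chinese and English titles).
The rating is in the (only) span tag whose class is rating_num within that li.
The number of ratings is the last number inside the div tag whose class is star within that li.
The link is in an a tag within that li.
The synopsis is in the tag whose class is inq within that li.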
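As a minimal sketch of the observations above, the snippet below applies non-greedy regular expressions to a single hand-written li fragment. The HTML is a simplified, made-up stand-in for Douban's markup (the URL, titles, and numbers are placeholders), and `parse_item` is a hypothetical helper, so the patterns are illustrative rather than guaranteed to match the live page.

```python
import re

# Hypothetical, simplified <li> fragment; not copied from the real page.
SAMPLE_LI = '''
<li>
  <a href="https://example.com/subject/1/" class="">
    <span class="title">Sample Title</span>
    <span class="title">&nbsp;/&nbsp;Sample Original Title</span>
  </a>
  <div class="star">
    <span class="rating_num" property="v:average">9.0</span>
    <span>12345 ratings</span>
  </div>
  <p class="quote"><span class="inq">A one-line quote.</span></p>
</li>
'''

def parse_item(li_html):
    """Pull the fields out of one <li> block with non-greedy patterns."""
    link = re.search(r'<a href="(.*?)"', li_html)
    titles = re.findall(r'<span class="title">(.*?)</span>', li_html)
    rating = re.search(r'<span class="rating_num"[^>]*>(.*?)</span>', li_html)
    # The number of ratings is the last run of digits inside the "star" div.
    star = re.search(r'<div class="star">(.*?)</div>', li_html, re.S)
    digits = re.findall(r'\d+', star.group(1)) if star else []
    quote = re.search(r'<span class="inq">(.*?)</span>', li_html)
    return {
        "url": link.group(1) if link else "",
        "titles": titles,
        "rating": rating.group(1) if rating else "",
        "votes": digits[-1] if digits else "",
        "quote": quote.group(1) if quote else "",
    }

item = parse_item(SAMPLE_LI)
```

Every field falls back to an empty value when its pattern finds nothing, which matters because not every entry on the page carries a synopsis.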
The Python code (written for Python 2, since it uses urllib2) is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2017/12/1 15:55
# @Author : gj
# @Site :
# @File : test_class.py
# @Software: PyCharm
import urllib2,re,threading
def Get_header():
    '''Build a browser-like request header.'''
    headers = {
        # The HTTP header name is "User-Agent" (hyphen, not underscore)
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
    }
    return headers
def Spider(url,header):
    '''Fetch the page content.'''
    req = urllib2.Request(url=url,headers=header)
    html = urllib2.urlopen(req)
    info = html.read()
    return info
def Analyse(infos):
    '''Parse one page and return a list of rows, one per movie.'''
    # Isolate the ordered list that holds the movies, then split it into <li> items
    pattern = re.compile('<ol class="grid_view">(.*?)</ol>',re.S)
    info = pattern.findall(infos)
    pattern = re.compile("<li>(.*?)</li>",re.S)
    movie_infos = pattern.findall(info[0])
    movie=[]
    for movie_info in movie_infos:
        movie_temp=[]
        url = ""
        title=""
        director=""
        score=""
        peoples=""
        inq=""
        # Get the link (use pattern_url here, not the <li> pattern)
        pattern_url = re.compile('<a href="(.*?)" class="">')
        movie_urls = pattern_url.findall(movie_info)
        for movie_url in movie_urls:
            url = url+movie_url
        movie_temp.append(url)
        # Get the movie title(s)
        pattern_title = re.compile('<span class="title">(.*?)</span>')
        movie_titles = pattern_title.findall(movie_info)
        for movie_title in movie_titles:
            title = title+movie_title
        movie_temp.append(title)
        # Get the director/cast line
        pattern_director = re.compile('<p class="">(.*?)<br>',re.S)
        movie_directors = pattern_director.findall(movie_info)
        for movie_director in movie_directors:
            director = director+movie_director
        movie_temp.append(director)
        # Get the rating and the number of ratings (first match only)
        pattern_score = re.compile('<div class="star">.*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)</span>.*?</div>',re.S)
        movie_scores = pattern_score.findall(movie_info)
        for movie_score in movie_scores:
            score = movie_score[0]
            peoples = movie_score[1]
            break
        movie_temp.append(score)
        movie_temp.append(peoples)
        # Get the synopsis; some movies do not have one
        pattern_inq = re.compile('<p class="quote">.*?<span class="inq">(.*?)</span>.*?</p>',re.S)
        movie_inqs = pattern_inq.findall(movie_info)
        if len(movie_inqs)>0:
            inq = movie_inqs[0]
        else:
            inq ='No synopsis for this movie'
        movie_temp.append(inq)
        movie.append(movie_temp)
    return movie
def write_file(infos):
    '''Append the parsed rows to the output file.'''
    # Hold the lock so that concurrent threads do not interleave their writes
    with mutex:
        with open("./movie.txt","ab") as f:
            for info in infos:
                write_info = ""
                for i in range(0,len(info)):
                    info[i] = info[i].replace("\n","")
                    write_info = write_info+info[i]+" "
                write_info= write_info+"\n"
                f.write(write_info)
def start(i):
    # Each page lists 25 movies; page i starts at offset i*25
    url = "https://movie.douban.com/top250?start=%d&filter="%(i*25)
    headers = Get_header()
    infos= Spider(url,headers)
    movie_infos = Analyse(infos)
    write_file(movie_infos)
def main():
    # Create one thread per page (10 pages in total)
    Thread = []
    for i in range(0,10):
        t=threading.Thread(target=start,args=(i,))
        Thread.append(t)
    for i in range(0,10):
        Thread[i].start()
    for i in range(0,10):
        Thread[i].join()
if __name__ == "__main__":
    # Lock shared by all threads when writing the file
    mutex = threading.Lock()
    main()
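One consequence of the listing above: the ten threads append to movie.txt in whatever order they finish, so the lines are not guaranteed to follow the ranking. A possible variation, sketched here under the assumption that a per-page function returns a list of rows, stores each page's result in a slot indexed by page number and flattens the slots once all threads have joined. `fetch_page` and `crawl_in_order` are hypothetical names, and `fetch_page` fakes the download so the sketch runs without network access.

```python
import threading

PAGE_SIZE = 25   # movies per page, as in the start=... query parameter
PAGE_COUNT = 10  # 10 pages x 25 movies = 250

def fetch_page(page):
    """Hypothetical stand-in for download + parse: returns fake rank numbers."""
    first = page * PAGE_SIZE + 1
    return list(range(first, first + PAGE_SIZE))

def crawl_in_order(fetch=fetch_page, page_count=PAGE_COUNT):
    results = [None] * page_count  # one slot per page

    def worker(page):
        # Each thread writes only its own slot, so no lock is needed here.
        results[page] = fetch(page)

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(page_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Flatten page by page so the rows come out in rank order.
    return [row for page_rows in results for row in page_rows]

ranks = crawl_in_order()
```

With this layout the file can be written once, after the join, by a single thread.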
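The script creates movie.txt in the current directory, with one line per movie and the fields separated by spaces. The file format has not been polished further; the data could instead be written to MySQL or to Excel for easier viewing.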
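For the spreadsheet option, one possibility is the standard-library csv module, since Excel opens CSV files directly. The sketch below assumes Python 3 and assumes each row is a list of six strings in the order the parser builds them (link, title, director/cast, rating, number of ratings, synopsis). The column names, the `write_csv` helper, and the sample row are invented for illustration.

```python
import csv
import io

# Hypothetical column names matching the order used by the parser.
COLUMNS = ["url", "title", "director", "score", "votes", "quote"]

def write_csv(rows, fileobj):
    """Write a header line followed by one CSV record per movie."""
    writer = csv.writer(fileobj)
    writer.writerow(COLUMNS)
    for row in rows:
        # Collapse newlines and repeated spaces left over from the HTML.
        writer.writerow([" ".join(field.split()) for field in row])

# Placeholder data, not a real Douban record.
sample_rows = [["https://example.com/subject/1/", "Sample Title",
                "Director: Someone\n   Cast: Someone Else", "9.0",
                "12345 ratings", "A one-line, quoted summary."]]

buffer = io.StringIO()
write_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
```

To produce a file instead of an in-memory buffer, pass `open("movie.csv", "w", newline="", encoding="utf-8-sig")` as `fileobj`; the utf-8-sig encoding adds a byte-order mark that helps Excel detect UTF-8.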
That covers the principle behind, and a simple script for, scraping the Douban Top 250 movie information with regular expressions.
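The listing relies on urllib2, which exists only in Python 2. For readers on Python 3, a rough equivalent of the URL-building and download steps might look like the sketch below. The helper names (`page_urls`, `build_request`, `fetch`) are hypothetical, and decoding the response as UTF-8 is an assumption about the page's encoding.

```python
import urllib.request

BASE_URL = "https://movie.douban.com/top250?start=%d&filter="
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) "
              "AppleWebKit/536.5 (KHTML, like Gecko) "
              "Chrome/19.0.1084.54 Safari/536.5")

def page_urls(pages=10, page_size=25):
    """Build the URL of every result page (start=0, 25, ..., 225)."""
    return [BASE_URL % (i * page_size) for i in range(pages)]

def build_request(url):
    """Prepare a request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch(url, timeout=10):
    """Download one page and decode it (assumes the page is UTF-8)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")

urls = page_urls()
request = build_request(urls[0])  # nothing is sent until fetch() is called
```

Because `fetch` returns text rather than bytes, the regular expressions from the listing can be applied to its result unchanged.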
Posted on 2017-12-06 10:57 by 神經質的狗