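I have been learning XPath recently. While looking for material online, I noticed a project that beginners often use for practice: scraping the Maoyan top-100 movie board. Many of the write-ups closely resemble Cui Qingcai's version and are more or less copied from it. So I wrote my own program using XPath. It also scrapes Maoyan and extracts the same information, and is offered here as an alternative solution.
Honestly, for matching information in web pages I still recommend XPath. Regular expressions can certainly do the job, but the patterns get verbose, and one small slip means nothing matches. Beginners in particular, who are not yet familiar with regular expressions, cannot find the mistake and are easily discouraged. I mostly use regular expressions for processing files, where they are a fantastic tool.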
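To make the comparison concrete, here is a minimal sketch that extracts movie names from a small HTML fragment with both approaches. The fragment is hypothetical: it is only shaped to fit the `p[@class='name']/a` selector used in this post, and the real page markup may differ.

```python
import re
from lxml import etree

# Hypothetical fragment (an assumption, not copied from the real site).
# The second <a> tag has its attributes in a different order and is
# split across two lines.
SAMPLE = """
<dl>
  <dd>
    <i class="board-index">1</i>
    <p class="name"><a href="/films/1" title="Movie A">Movie A</a></p>
  </dd>
  <dd>
    <i class="board-index">2</i>
    <p class="name"><a title="Movie B"
       href="/films/2">Movie B</a></p>
  </dd>
</dl>
"""

# XPath: one short expression that ignores attribute order and line breaks.
names_xpath = etree.HTML(SAMPLE).xpath("//p[@class='name']/a/text()")

# A naive regex that hard-codes the attribute order misses the second entry.
names_naive = re.findall(
    r'<p class="name"><a href="[^"]*" title="[^"]*">(.*?)</a>', SAMPLE)

# A more careful regex works, but has to anticipate every variation by hand.
names_regex = re.findall(
    r'<p class="name">\s*<a[^>]*>(.*?)</a>', SAMPLE, re.S)

print(names_xpath)
print(names_naive)
print(names_regex)
```

The naive pattern silently returns fewer results instead of raising an error, which is the kind of failure that is hard for a beginner to track down.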
The full code follows.
import requests
from requests.exceptions import RequestException
from lxml import etree
import csv
import re


def get_page(url):
    """
    Fetch the HTML source of a page.
    :param url: page URL
    :return: response text, or None if the request fails
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/76.0.3809.100 Safari/537.36',
        }
        # A timeout keeps the crawler from hanging forever on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_page(text):
    """
    Parse the page source.
    :param text: HTML source of one ranking page
    :return: iterator of (ranking, name, actors, release time, score) tuples
    """
    html = etree.HTML(text)
    movie_name = html.xpath("//p[@class='name']/a/text()")
    actor = html.xpath("//p[@class='star']/text()")
    # Remove all whitespace (newlines and indentation) from the cast strings
    actor = [re.sub(r'\s+', '', item) for item in actor]
    release_time = html.xpath("//p[@class='releasetime']/text()")
    # The score is split across two <i> tags: integer part and fraction part
    grade1 = html.xpath("//p[@class='score']/i[@class='integer']/text()")
    grade2 = html.xpath("//p[@class='score']/i[@class='fraction']/text()")
    score = [a + b for a, b in zip(grade1, grade2)]
    ranking = html.xpath("//dd/i/text()")
    return zip(ranking, movie_name, actor, release_time, score)
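

def change_page(number):
    """
    Build the URL for one page of the board.
    :param number: offset of the first entry on the page (0, 10, ..., 90)
    :return: page URL
    """
    base_url = 'https://maoyan.com/board/4'
    url = base_url + '?offset=%s' % number
    return url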
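

def save_to_csv(result, filename):
    """
    Append one row to the CSV file.
    :param result: one row of movie data
    :param filename: path of the CSV file
    :return:
    """
    # newline='' prevents blank lines between rows on Windows;
    # an explicit encoding avoids errors when writing Chinese text
    with open(filename, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        writer.writerow(result)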
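

def main():
    """
    Entry point: crawl all ten pages and save every row.
    :return:
    """
    for i in range(0, 100, 10):
        url = change_page(i)
        text = get_page(url)
        if text is None:
            # get_page returns None on failure, and etree.HTML(None) would raise
            continue
        result = parse_page(text)
        for j in result:
            save_to_csv(j, filename='message.csv')


if __name__ == '__main__':
    main()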
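
One caveat with the approach above: each field is collected into its own page-wide list and the lists are combined with `zip`, which stops at the shortest list. If any movie lacks a field, rows can be dropped or misaligned without any error. A possible alternative, sketched below, is to iterate over each `dd` node and apply relative XPath expressions inside it. The sample markup and the names `first` and `parse_rows` are assumptions made for illustration; they are not part of the original program or taken from the real page.

```python
from lxml import etree

# Hypothetical markup consistent with the XPath expressions used above.
# The second entry deliberately has no score.
SAMPLE = """
<dl>
  <dd>
    <i class="board-index">1</i>
    <p class="name"><a>Movie A</a></p>
    <p class="star"> Starring: X </p>
    <p class="releasetime">Release: 1993-01-01</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
  </dd>
  <dd>
    <i class="board-index">2</i>
    <p class="name"><a>Movie B</a></p>
    <p class="star"> Starring: Y </p>
    <p class="releasetime">Release: 1994-01-01</p>
  </dd>
</dl>
"""


def first(node, expr):
    """Return the first text match of a relative XPath, stripped, or ''."""
    found = node.xpath(expr)
    return found[0].strip() if found else ''


def parse_rows(text):
    """Yield one tuple per <dd>, so the fields of a movie stay together."""
    for dd in etree.HTML(text).xpath("//dd"):
        score = (first(dd, ".//p[@class='score']/i[@class='integer']/text()")
                 + first(dd, ".//p[@class='score']/i[@class='fraction']/text()"))
        yield (
            first(dd, "./i/text()"),                      # ranking
            first(dd, ".//p[@class='name']/a/text()"),    # movie name
            first(dd, ".//p[@class='star']/text()"),      # cast
            first(dd, ".//p[@class='releasetime']/text()"),
            score,                                        # '' when missing
        )


rows = list(parse_rows(SAMPLE))
for row in rows:
    print(row)
```

With the list-and-zip approach, this sample would yield only one row, because the score lists contain a single item. The per-node version keeps both movies and leaves the missing score empty. Note that this sketch only strips leading and trailing whitespace, whereas `parse_page` removes all whitespace from the cast string.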
Reposted from: https://www.cnblogs.com/lattesea/p/11463236.html