coding by real mind writing by genuine heart
解析
任務背景:https://www.qiushibaike.com/hot/
窺探網頁細節:觀察每一頁URL的變化
第一頁
進入第二頁
再看看第三頁
把這些URL放在一起,觀察規律
1 https://www.qiushibaike.com/hot/page/1/
2 https://www.qiushibaike.com/hot/page/2/
3 https://www.qiushibaike.com/hot/page/3/
從圖檔可以看出,該URL其他地方不變,隻有最後的數字會改變,代表頁數
推薦使用浏覽器Chrome
插件豐富,原生功能設計對爬蟲開發者非常友好
分析網頁源代碼
通過在原來的頁面上點選,選擇“檢查”,觀察規律,這裡建議當你用elements定位元素之後,就切換到network檢視相應的元素,因為elements裡面的網頁源代碼很可能是經過JS加工過的
通過圖檔,我們發現:每一個笑話内容,都包含在一個<a...class="contentHerf"
...class="content">裡面,當然這裡的屬性不止一個,這裡我們選擇contentHerf這個屬性
思考工具:什麼工具最适合解析此種規律?BeautifulSoup
編碼
根據第一步的分析,建立初步的代碼
1 import requests
2 from bs4 import BeautifulSoup
3 import time
4 import re
5
6 headers = {
7 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
8 #User-Agent可以僞裝浏覽器,鍵值為浏覽器的引擎資訊及版本
9 'Host':'www.qiushibaike.com',
10 'Cookie':'_ga=GA1.2.2026142502.1558849033; gr_user_id=5d0a35ad-3eb6-4037-9b4d-bbc5e22c9b9f; grwng_uid=9bd612b3-7d0b-4a08-a4e1-1707e33f6995; _qqq_uuid_="2|1:0|10:1617119039|10:_qqq_uuid_|56:NjUxYWRiNDFhZTYxMjk4ZGM3MTgwYjkxMGJjNjViY2ZmZGUyNDdjMw==|fdce75d742741575ef41cd8f540465fb97b5d18891a9abb0849b3a09c530f7ee"; _xsrf=2|6d1ed4a0|7de9818067dac3b8a4e624fdd75fc972|1618129183; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1617119039,1617956477,1618129185; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1618129185; ff2672c245bd193c6261e9ab2cd35865_gr_session_id=fd4b35b4-86d1-4e79-96f4-45bcbcbb6524; ff2672c245bd193c6261e9ab2cd35865_gr_session_id_fd4b35b4-86d1-4e79-96f4-45bcbcbb6524=true'
11 #Cookie裡面儲存了你的身份驗證資訊,可用于cookies反爬
12 }
13
14 for page in range(10):
15 url = f'https://www.qiushibaike.com/hot/page/{page}/' #f-string函數,{}中填的是變化的内容,也可以使用format函數
16 req = requests.get(url,headers=headers)
17 html = req.text
18
19 soup = BeautifulSoup(html,'lxml')
20 for joke in soup.select('.contentHerf .content span'):
21 if joke.string is not None:
22 joke_data = f'笑話一則:{joke.string.strip()}\n\n'
23 with open('../txt_file/joke.txt','ab') as f: #以追加二進制的形式寫入到文本檔案中,這樣就不會替換掉原先的内容
24 pattern = re.compile('檢視全文',re.S)
25 jok = re.sub(pattern,'這裡被替換了,嘻嘻!',joke_data)
26 f.write(jok.encode('utf-8'))
27 time.sleep(1) #延遲爬取時間
檢視爬取内容
上面這張圖檔被框起來的地方被我用正規表達式替換掉了,這裡原來的内容是“檢視全文”
代碼優化
1 import requests
2 from bs4 import BeautifulSoup
3 import re
4 import time
5 from requests.exceptions import RequestException
6
7
8 def get_url_html():
9 headers = {
10 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
11 'Host':'www.qiushibaike.com',
12 'Cookie':'_ga=GA1.2.2026142502.1558849033; gr_user_id=5d0a35ad-3eb6-4037-9b4d-bbc5e22c9b9f; grwng_uid=9bd612b3-7d0b-4a08-a4e1-1707e33f6995; _qqq_uuid_="2|1:0|10:1617119039|10:_qqq_uuid_|56:NjUxYWRiNDFhZTYxMjk4ZGM3MTgwYjkxMGJjNjViY2ZmZGUyNDdjMw==|fdce75d742741575ef41cd8f540465fb97b5d18891a9abb0849b3a09c530f7ee"; _xsrf=2|6d1ed4a0|7de9818067dac3b8a4e624fdd75fc972|1618129183; Hm_lvt_2670efbdd59c7e3ed3749b458cafaa37=1617119039,1617956477,1618129185; Hm_lpvt_2670efbdd59c7e3ed3749b458cafaa37=1618129185; ff2672c245bd193c6261e9ab2cd35865_gr_session_id=fd4b35b4-86d1-4e79-96f4-45bcbcbb6524; ff2672c245bd193c6261e9ab2cd35865_gr_session_id_fd4b35b4-86d1-4e79-96f4-45bcbcbb6524=true'
13
14 }
15
16 try:
17 for page in range(2,5):
18 url = f'https://www.qiushibaike.com/hot/page/{page}/'
19 req = requests.get(url,headers=headers)
20 if req in not None:
21 return req.text
22 else:
23 return None
24 except RequestException:
25 return None
26
27 def main():
28 html = get_url_html()
29 soup = BeautifulSoup(html,'lxml')
30 for joke in soup.select('.contentHerf .content span'):
31 if joke.string is not None:
32 joke_data = f'笑話一則:{joke.string.strip()}\n\n'
33 with open('../txt_file/joke.txt','ab') as f:
34 pattern = re.compile('檢視全文',re.S)
35 jok = re.sub(pattern,'這裡被替換了,嘻嘻!',joke_data)
36 f.write(joke.encode('utf-8'))
37
38
39
40 if __name__ == '__main__':
41 main()
42 time.sleep(1)
總結
請求庫requests及exceptions子產品
解析庫BeautifulSoup
标準庫re
time子產品
文本存儲
更多獨家精彩内容 請掃碼關注個人公衆号,我們一起成長,一起Coding,讓程式設計更有趣!
—— —— —— —— — END —— —— —— —— ————
歡迎掃碼關注我的公衆号
小鴻星空科技