Web Scraping
The requests module
- urllib module: an older network-request module for simulating a browser.
- requests: a module for making network requests.
- Purpose: simulate a browser visiting a page.
- Workflow for the requests module:
- specify the URL
- send the request
- get the response data (the scraped data)
- persist the data to storage
Page scraping
import requests
# 1. Scrape the page source of the Sogou homepage
url = 'https://www.sogou.com/'
response = requests.get(url=url)
page_text = response.text  # .text returns the response data as a string
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
import requests
# 2. A simple web-page collector
# Topics covered: dynamic request parameters, UA spoofing, fixing mojibake
word = input('enter a key word:')
url = 'https://www.sogou.com/web'
# Dynamic parameters: pack the request parameters into a dict and pass it
# to the params argument of get()
params = {
    'query': word
}
response = requests.get(url=url, params=params)
page_text = response.text
fileName = word + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(word, 'downloaded successfully!')
Problems with the code above:
- mojibake (garbled characters in the saved page)
- some of the scraped data is missing
import requests
# Fixing the mojibake
word = input('enter a key word:')
url = 'https://www.sogou.com/web'
# Dynamic parameters: pack the request parameters into a dict and pass it
# to the params argument of get()
params = {
    'query': word
}
response = requests.get(url=url, params=params)
# The response encoding can be changed before reading .text
response.encoding = 'utf-8'  # manually set the response object's encoding
page_text = response.text
fileName = word + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(word, 'downloaded successfully!')
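The remaining problem, missing scraped data, is typically caused by anti-crawler checks: the default `python-requests` User-Agent reveals that the client is a script, and some sites return a reduced or blocked page for it. UA spoofing (mentioned among the topics above but not yet shown) passes a browser-like `User-Agent` header via the `headers` argument of `requests.get`. A minimal sketch combining UA spoofing with the encoding fix; the UA string below is an illustrative desktop Chrome value, not a required one:

```python
import requests

# Illustrative browser User-Agent string (any current browser UA works).
UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/120.0 Safari/537.36')

def fetch_page(word, url='https://www.sogou.com/web'):
    """Fetch a Sogou result page with UA spoofing and the encoding fix."""
    headers = {'User-Agent': UA}   # UA spoofing: look like a real browser
    params = {'query': word}       # dynamic request parameter, as above
    response = requests.get(url=url, params=params, headers=headers)
    response.encoding = 'utf-8'    # same mojibake fix as above
    return response.text
```

Usage mirrors the earlier cells: `page_text = fetch_page('python')`, then write `page_text` to a `.html` file with `encoding='utf-8'`.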