Web Scraping
The requests module
- urllib module: an older network-request module for simulating a browser.
- requests: a module for making network requests.
- Purpose: simulate a browser visiting a page.
- Workflow for the requests module:
- specify the URL
- send the request
- get the response data (the scraped data)
- persist the data to storage
Page scraping
import requests
# 1. Scrape the page source of the Sogou homepage
url = 'https://www.sogou.com/'
response = requests.get(url=url)
page_text = response.text  # .text returns the response data as a string
with open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
import requests
# 2. A simple web-page collector
# Topics covered: dynamic request parameters, UA spoofing, fixing mojibake
word = input('enter a key word:')
url = 'https://www.sogou.com/web'
# Dynamic parameters: pack the request parameters into a dict and pass it
# to the params argument of get()
params = {
    'query': word
}
response = requests.get(url=url, params=params)
page_text = response.text
fileName = word + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(word, 'downloaded successfully!')
Problems with the code above:
- mojibake (garbled characters in the saved page)
- some of the scraped data is missing
import requests
# Fixing the mojibake
word = input('enter a key word:')
url = 'https://www.sogou.com/web'
# Dynamic parameters: pack the request parameters into a dict and pass it
# to the params argument of get()
params = {
    'query': word
}
response = requests.get(url=url, params=params)
# The response encoding can be changed before reading .text
response.encoding = 'utf-8'  # manually set the response object's encoding
page_text = response.text
fileName = word + '.html'
with open(fileName, 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print(word, 'downloaded successfully!')
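The remaining problem, missing scraped data, is typically caused by anti-crawler checks: the default `python-requests` User-Agent reveals that the client is a script, and some sites return a reduced or blocked page for it. UA spoofing (mentioned among the topics above but not yet shown) passes a browser-like `User-Agent` header via the `headers` argument of `requests.get`. A minimal sketch combining UA spoofing with the encoding fix; the UA string below is an illustrative desktop Chrome value, not a required one:

```python
import requests

# Illustrative browser User-Agent string (any current browser UA works).
UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/120.0 Safari/537.36')

def fetch_page(word, url='https://www.sogou.com/web'):
    """Fetch a Sogou result page with UA spoofing and the encoding fix."""
    headers = {'User-Agent': UA}   # UA spoofing: look like a real browser
    params = {'query': word}       # dynamic request parameter, as above
    response = requests.get(url=url, params=params, headers=headers)
    response.encoding = 'utf-8'    # same mojibake fix as above
    return response.text
```

Usage mirrors the earlier cells: `page_text = fetch_page('python')`, then write `page_text` to a `.html` file with `encoding='utf-8'`.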