Using the Basic Libraries

The urllib module in Python 3

Contents

1. urllib.request.urlopen()

2. urllib.request.Request()

3. Advanced classes in urllib.request

4. Exception handling

5. Parsing URLs

6. Analyzing the Robots protocol

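urllib is Python's built-in HTTP request library and needs no installation. It contains four submodules:

request: the most basic HTTP request module, used to send requests

error: the exception-handling module; it defines the exceptions raised by request so that they can be caught

parse: a utility module with many URL-handling functions, such as splitting, parsing and joining

robotparser: parses a site's robots.txt file and determines which pages may be crawled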

1. urllib.request.urlopen()

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Opens the given URL. For HTTP and HTTPS URLs it returns an http.client.HTTPResponse object with the following methods and attributes:

Methods: read(), readinto(), getheader(name), getheaders(), fileno()

Attributes: msg, version, status, reason, debuglevel, closed

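import urllib.request

# Request the site and get an HTTPResponse object
response = urllib.request.urlopen('https://www.python.org')

# Uncomment one line at a time to try each method or attribute
#print(response.read().decode('utf-8'))  # page content
#print(response.getheader('server'))     # value of the Server response header
#print(response.getheaders())            # response headers as a list of (name, value) tuples
#print(response.fileno())                # file descriptor
#print(response.version)                 # HTTP protocol version
#print(response.status)                  # status code, e.g. 200 (404 means the page was not found)
#print(response.debuglevel)              # debug level
#print(response.closed)                  # whether the object is closed (bool)
#print(response.geturl())                # URL that was retrieved
#print(response.info())                  # response headers
#print(response.getcode())               # HTTP status code of the response
#print(response.msg)                     # OK if the request succeeded
#print(response.reason)                  # status reason phrase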

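urlopen() accepts the following parameters:

url: the address to open, either a str or a Request object

data: optional. It must be a bytes object (a byte stream). If data is passed, urlopen() sends the request with POST instead of GET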

from urllib.request import urlopen
import urllib.parse

# data must be bytes: urlencode() from urllib.parse turns the parameter dict
# into a query string, and bytes() encodes that string as UTF-8
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urlopen('http://httpbin.org/post', data=data)
print(response.read())

#output
b'{
"args":{},
"data":"",
"files":{},
"form":{"word":"hello"}, # the form field shows the data was submitted as a form, i.e. sent with POST
"headers":{"Accept-Encoding":"identity",
"Connection":"close",
"Content-Length":"10",
"Content-Type":"application/x-www-form-urlencoded",
"Host":"httpbin.org",
"User-Agent":"Python-urllib/3.5"},
"json":null,
"origin":"114.245.157.49",
"url":"http://httpbin.org/post"}\n'
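The effect of data on the request method can be checked without touching the network. The following is a minimal sketch rather than part of the original example: it builds Request objects for httpbin URLs and only inspects them. The variable names (form, get_req, post_req) are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {'word': 'hello'}                 # illustrative form fields
data = urlencode(form).encode('utf-8')   # str -> bytes, as urlopen() requires

get_req = Request('http://httpbin.org/get')                # no data
post_req = Request('http://httpbin.org/post', data=data)   # with data

print(data)                   # b'word=hello'
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Nothing is sent here; get_method() simply reports which method urlopen() would use for each request.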

timeout: sets the timeout in seconds. If no response arrives within that time, an exception is raised. It works for HTTP, HTTPS and FTP requests.

import urllib.request

# A timeout of 0.1 seconds is too short here, so an exception is raised
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())

#output
urllib.error.URLError: <urlopen error timed out>

# The exception can be caught with try/except
import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check whether the reason is a timeout
        print(e.reason)  # print the error message

#output
timed out

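Other parameters: context must be an ssl.SSLContext object and specifies the SSL settings. cafile and capath specify a CA certificate file and a directory of CA certificates, and are used for HTTPS requests. Both have been deprecated since Python 3.6 and were removed in Python 3.13, so new code should use context instead.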
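As a hedged illustration of the context parameter, the sketch below creates a default SSL context and passes it to urlopen(). The fetch_https helper is hypothetical and is only defined here, not called, so nothing is requested over the network.

```python
import ssl
from urllib.request import urlopen

# Default context: verifies certificates and host names against the system CA store
context = ssl.create_default_context()

def fetch_https(url, timeout=10):
    """Hypothetical helper: open an HTTPS URL with an explicit SSL context."""
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.read()

print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

To trust a private CA, ssl.create_default_context() also accepts cafile and capath arguments, which covers what the urlopen() parameters of the same name used to do.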


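2. urllib.request.Request()

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Parameters:

url: the URL to request. This is the only required parameter; all the others are optional

data: the data to upload. It must be bytes; if the data is a dict, encode it first with urlencode() from urllib.parse

headers: a dict of request headers. Headers can be set through this parameter or added later by calling add_header() on the request instance

For example, the User-Agent header can be changed so that the request looks like it comes from a browser. The value below is an Internet Explorer style string:

{'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}

origin_req_host: the host name or IP address of the requester

unverifiable: whether the request is unverifiable; the default is False. For example, if the request fetches an image in an HTML document and the user had no option to approve fetching it, the value should be True

method: a string indicating the HTTP method to use, such as GET, POST or PUT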

#!/usr/bin/env python
#coding:utf8
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}  # request headers
form = {'name': 'germey'}  # form fields
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
# Headers can also be added with the request's add_header() method:
#req.add_header('User-Agent', 'Mozilla/5.0 (compatible; MSIE 8.4; Windows NT)')
response = request.urlopen(req)
print(response.read())
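A Request object can be inspected before it is sent, which is a convenient way to check the headers and method. The sketch below reuses the httpbin URL and form data from the example above and sends nothing. One detail worth knowing is that Request stores header names capitalized, so 'User-Agent' is stored as 'User-agent'.

```python
from urllib import parse, request

form = {'name': 'germey'}
data = parse.urlencode(form).encode('utf-8')

req = request.Request('http://httpbin.org/post', data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)')

print(req.full_url)                   # http://httpbin.org/post
print(req.get_method())               # POST
print(req.has_header('User-agent'))   # True: names are stored capitalized
print(req.has_header('User-Agent'))   # False: the lookup is case-sensitive
print(req.get_header('User-agent'))   # Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)
```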


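3. Advanced classes in urllib.request

BaseHandler in the urllib.request module is the base class of all other handlers. Handlers are the processors that deal with things such as login authentication, cookies, proxy settings and redirects.

It provides the following members, for direct use and for use by derived classes:

add_parent(director): adds a director (an OpenerDirector) as the parent

close(): removes any parents

parent: a valid OpenerDirector, which can be used to open a URL with a different protocol or to handle errors

default_open(req): not defined in BaseHandler itself; subclasses define it to catch all URLs, and it is called before any protocol-specific open method

Handler subclasses and related helper classes include:

HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles cookies

ProxyHandler: sets proxies; if none are given, the proxy settings are read from the environment

HTTPPasswordMgr: manages passwords by maintaining a table of user names and passwords (a helper used by the authentication handlers, not a handler itself)

HTTPBasicAuthHandler: handles authentication; if opening a URL requires authentication, it can be used to supply the credentials

The OpenerDirector class is the higher-level class that opens URLs through a chain of handlers. It opens a URL in three stages, and within each stage the order in which handler methods are called is determined by sorting the handler instances: first every handler that defines protocol_request() is called to pre-process the request, then handlers that define protocol_open() are called to handle the request, and finally every handler that defines protocol_response() is called to post-process the response.

The urlopen() function used earlier is backed by an Opener that urllib provides. By building an Opener from handlers you can add cookie handling, proxy settings, password handling and so on.

The methods of an Opener include:

add_handler(handler): adds a handler to the chain

open(url, data=None[, timeout]): opens the given URL, in the same way as urlopen()

error(proto, *args): handles an error of the given protocol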
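The three stages can be observed with a small custom handler. The sketch below is illustrative only: the "demo" URL scheme and the DemoHandler class are made up for this example. Because the handler's methods are named demo_request, demo_open and demo_response, an OpenerDirector calls them in that order for any demo:// URL, and no network access is involved.

```python
import io
from email.message import Message
from urllib.request import BaseHandler, OpenerDirector
from urllib.response import addinfourl

class DemoHandler(BaseHandler):
    """Hypothetical handler for a made-up 'demo' scheme that records each stage."""

    def __init__(self):
        self.calls = []

    def demo_request(self, req):              # stage 1: pre-process the request
        self.calls.append('request')
        req.add_header('X-Demo', '1')
        return req

    def demo_open(self, req):                 # stage 2: produce a response
        self.calls.append('open')
        body = ('hello from ' + req.full_url).encode('utf-8')
        return addinfourl(io.BytesIO(body), Message(), req.full_url, code=200)

    def demo_response(self, req, response):   # stage 3: post-process the response
        self.calls.append('response')
        return response

handler = DemoHandler()
opener = OpenerDirector()
opener.add_handler(handler)

response = opener.open('demo://example/hello')
body = response.read().decode('utf-8')
print(handler.calls)  # ['request', 'open', 'response']
print(body)           # hello from demo://example/hello
```

A bare OpenerDirector is used instead of build_opener() so that only the demo handler is registered; build_opener() would also add the default HTTP, proxy and error handlers.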


Password authentication:

#!/usr/bin/env python
#coding:utf8
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost'
p = HTTPPasswordMgrWithDefaultRealm()           # create a password manager
p.add_password(None, url, username, password)   # register the user name and password
auth_handler = HTTPBasicAuthHandler(p)          # build an authentication handler from the password manager
opener = build_opener(auth_handler)             # build an Opener
try:
    result = opener.open(url)  # open the URL; the result is the page returned after authentication
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
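How the password manager matches URLs can be checked offline. In this minimal sketch the realm name and the example.org URL are made up: credentials registered with realm None act as the default for any realm, and they apply to every URL underneath the registered one.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, 'http://localhost', 'username', 'password')

# Any realm falls back to the default (None) entry for URLs under http://localhost
match = manager.find_user_password('Some Realm', 'http://localhost/private/page')
# A different host has no credentials registered
miss = manager.find_user_password('Some Realm', 'http://example.org/')

print(match)  # ('username', 'password')
print(miss)   # (None, None)
```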

Proxy settings:

#!/usr/bin/env python
#coding:utf8
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:9999'
})
opener = build_opener(proxy_handler)  # build an Opener
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Cookies:

Getting a site's cookies

#!/usr/bin/env python
#coding:utf8
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()                   # create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)  # build a handler
opener = urllib.request.build_opener(handler)         # build an Opener
response = opener.open('http://www.baidu.com')        # send the request
print(cookie)
for item in cookie:
    print(item.name + "=" + item.value)

Saving cookies to a file in the Mozilla browser format:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename=filename)  # cookie jar that saves in the Mozilla cookie format
#cookie = http.cookiejar.CookieJar()                         # plain in-memory CookieJar
handler = urllib.request.HTTPCookieProcessor(cookie)         # build a handler
opener = urllib.request.build_opener(handler)                # build an Opener
response = opener.open('http://www.baidu.com')               # send the request
cookie.save(ignore_discard=True, ignore_expires=True)

Cookies can also be saved in the libwww-perl (LWP) format:

cookie = http.cookiejar.LWPCookieJar(filename=filename)

Reading cookies from a file:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar, urllib.request

# The file was saved in LWP format, so it is loaded with LWPCookieJar;
# use MozillaCookieJar instead for a file saved in the Mozilla format
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookiesLWP.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)  # build a handler
opener = urllib.request.build_opener(handler)         # build an Opener
response = opener.open('http://www.baidu.com')        # send the request
print(response.read().decode('utf-8'))
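The save and load round trip can be tried without contacting a site. The sketch below rests on stated assumptions: the cookie is built by hand with a hypothetical make_cookie helper (normally the server sets it), the cookie name, value and domain are invented, and the file goes into a temporary directory.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie by hand."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
# Session cookies are marked discard, so ignore_discard=True is needed to save them
jar.save(ignore_discard=True, ignore_expires=True)

loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in loaded]
print(pairs)  # [('session', 'abc123')]
```

Without ignore_discard=True the session cookie would be skipped on both save and load, which is why the examples above pass it in both places.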


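4. Exception handling

The error module of urllib defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined in the error module.

1) URLError

The URLError class comes from the error module of urllib. It inherits from OSError and is the base class of the error module, so any exception raised by the request module can be handled by catching it.

It has one attribute, reason, which gives the cause of the error.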

#!/usr/bin/env python
#coding:utf8
from urllib import request, error

try:
    response = request.urlopen('https://hehe,com/index')
except error.URLError as e:
    # The exception is caught instead of crashing the program,
    # and its reason is printed (for example, Not Found)
    print(e.reason)

reason is not always a string. On a timeout, for example, it is a socket.timeout object:

#!/usr/bin/env python
#coding:utf8
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.001)
except urllib.error.URLError as e:
    print(e.reason)
    if isinstance(e.reason, socket.timeout):
        print('time out')
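Because reason can be either a string or an exception object, it helps to handle both cases in one place. The following is a small sketch with a hypothetical describe() helper; the URLError instances are constructed by hand so that the example runs offline.

```python
import socket
from urllib.error import URLError

def describe(exc):
    """Hypothetical helper: turn a URLError into a short message."""
    if isinstance(exc.reason, socket.timeout):   # reason is an exception object
        return 'timed out'
    return 'failed: %s' % exc.reason             # reason is a string or another exception

timeout_message = describe(URLError(socket.timeout('timed out')))
other_message = describe(URLError('unknown host'))

print(timeout_message)                # timed out
print(other_message)                  # failed: unknown host
print(issubclass(URLError, OSError))  # True
```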

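2) HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication. It has three attributes:

code: the HTTP status code, such as 404 (page not found) or 500 (server error)

reason: as in the parent class, the cause of the error

headers: the response headers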


#!/usr/bin/env python
#coding:utf8
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:  # catch the subclass first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:   # then catch the parent class
    print(e.reason)
else:
    print('request successfully')
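The same pattern can be exercised against a throwaway local server, so the result does not depend on an external site. This is a sketch under stated assumptions: the AlwaysNotFound handler is made up for the example, the server listens on a free port on 127.0.0.1, and an empty ProxyHandler is used so that any proxy configured in the environment is bypassed.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import ProxyHandler, build_opener

class AlwaysNotFound(BaseHTTPRequestHandler):
    """Hypothetical test handler: answer every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet

server = HTTPServer(('127.0.0.1', 0), AlwaysNotFound)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/index.htm' % server.server_address[1]

opener = build_opener(ProxyHandler({}))  # empty dict: do not use any proxy
try:
    opener.open(url, timeout=5)
    outcome = ('ok', None)
except HTTPError as e:                   # subclass first
    outcome = ('HTTPError', e.code)
except URLError as e:                    # then the parent class
    outcome = ('URLError', None)
finally:
    server.shutdown()
    server.server_close()

print(outcome)  # ('HTTPError', 404)
```

If the two except clauses were swapped, the URLError branch would catch the 404 as well and the status code would never be reported.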


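5. Parsing URLs

urllib provides the parse module, which defines a standard interface for working with URLs: extracting the parts of a URL, joining them back together and converting links. It supports URLs with the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, wais

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

As the signature shows, urlparse() takes three parameters:

urlstring: the URL to parse, a string

scheme: the default scheme, such as http or https. It is used only when the URL itself has no scheme; if the URL specifies one, the scheme in the URL takes precedence

allow_fragments: whether fragments (anchors) are recognized. If set to False, the fragment is not split out; it is left as part of the path, params or query, and the fragment field is empty

1) urlparse()

This function identifies and splits a URL into six parts: scheme (protocol), netloc (domain), path, params (parameters), query (query string) and fragment (anchor)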

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')  # the result is a named tuple
print(result.scheme, result[0])        # values can be read by attribute or by index
print(result.netloc, result[1])
print(result.path, result[2])
print(result.params, result[3])
print(result.query, result[4])
print(result.fragment, result[5])

#output
# The result is a ParseResult object with six parts:
# scheme (protocol), netloc (domain), path, params (parameters), query (query string), fragment (anchor)
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
http http
www.baidu.com www.baidu.com
/index.html /index.html
user user
id=5 id=5
comment comment

Specifying a default scheme and ignoring the fragment with allow_fragments:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https', allow_fragments=False)
print(result)

#output
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5#comment', fragment='')
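The empty netloc in that output is easy to miss: urlparse() only recognizes a network location when it is introduced by "//". The short sketch below, using the same host, compares three inputs (the variable names are illustrative).

```python
from urllib.parse import urlparse

bare = urlparse('www.baidu.com/index.html', scheme='https')
prefixed = urlparse('//www.baidu.com/index.html', scheme='https')
explicit = urlparse('http://www.baidu.com/index.html', scheme='https')

print(repr(bare.netloc), bare.path)          # '' www.baidu.com/index.html
print(repr(prefixed.netloc), prefixed.path)  # 'www.baidu.com' /index.html
print(explicit.scheme)                       # http: the scheme in the URL wins
```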

2) urlunparse()

The opposite of urlparse(): it accepts an iterable of six items, such as a list or tuple, and builds a URL from them

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))  # build a complete URL

#output
http://www.baidu.com/index.html;user?a=6#comment
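Since urlparse() returns a named tuple, a common pattern is to parse a URL, change one part and put it back together. The sketch below is illustrative; it relies on the tuple's _replace() method and on geturl(), which rebuilds the URL from the parts.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

rebuilt = urlunparse(parts)                                   # round trip
changed = parts._replace(query='id=6', fragment='').geturl()  # edit two parts

print(rebuilt == url)  # True
print(changed)         # http://www.baidu.com/index.html;user?id=6
```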

3) urlsplit()

Similar to urlparse(), but it returns five parts: params is not split out and stays in path

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

#output
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
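Besides the five tuple fields, the result also exposes the pieces of netloc as attributes. A brief sketch follows; the URL with credentials and port is made up for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit('https://user:secret@www.example.com:8443/a/b.html?id=5#top')

print(len(parts))      # 5
print(parts.hostname)  # www.example.com
print(parts.port)      # 8443 (an int)
print(parts.username)  # user
print(parts.password)  # secret
```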

4) urlunsplit()

Similar to urlunparse(): it combines the parts of a link into a complete URL and also takes an iterable such as a list or tuple. The only difference is that the iterable must have exactly five items, because params is omitted

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit, urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=5', 'comment']
result = urlunsplit(data)
print(result)

#output
http://www.baidu.com/index.html?a=5#comment

5) urljoin()

Builds a complete URL by combining a base URL (base) with another URL (url). It uses components of the base URL, namely the scheme, netloc and path, to fill in whatever is missing from the new URL, and returns the result

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'index.html'))
print(urljoin('http://www.baidu.com', 'http://cdblogs.com/index.html'))
print(urljoin('http://www.baidu.com/home.html', 'https://cnblog.com/index.html'))
print(urljoin('http://www.baidu.com?id=3', 'https://cnblog.com/index.html?id=6'))
print(urljoin('http://www.baidu.com', '?id=2#comment'))
print(urljoin('www.baidu.com', 'https://cnblog.com/index.html?id=6'))

#output
http://www.baidu.com/index.html
http://cdblogs.com/index.html
https://cnblog.com/index.html
https://cnblog.com/index.html?id=6
http://www.baidu.com?id=2#comment
https://cnblog.com/index.html?id=6

The base URL contributes three things: scheme, netloc and path. If any of them is missing from the new link it is filled in from the base; if the new link has it, the new link's value is used. The params, query and fragment of the base URL have no effect. With urljoin() you can resolve, join and generate links
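Relative references are where urljoin() is most useful when following links found in a page. The sketch below uses an illustrative base URL with a deeper path and shows how different kinds of references are resolved against it; note that the query and fragment of the base are dropped each time.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b/c.html?id=3#top'
refs = ['d.html', '../d.html', '/d.html', '//cdn.example.com/d.js']

joined = [urljoin(base, ref) for ref in refs]
for ref, full in zip(refs, joined):
    print(ref, '->', full)

# d.html -> http://www.baidu.com/a/b/d.html
# ../d.html -> http://www.baidu.com/a/d.html
# /d.html -> http://www.baidu.com/d.html
# //cdn.example.com/d.js -> http://cdn.example.com/d.js
```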

6) urlencode()

urlencode() is useful when building GET request parameters: it converts a dict into a query string

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode

params = {'username': 'zs', 'password': '123'}
base_url = 'http://www.baidu.com'
url = base_url + '?' + urlencode(params)  # convert the dict into GET parameters
print(url)

#output (parameters follow the dict's insertion order on Python 3.7+)
http://www.baidu.com?username=zs&password=123
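A few details of urlencode() are worth seeing in action: spaces become "+", non-ASCII text is percent-encoded as UTF-8, and repeated parameters can be produced either with doseq=True or with a list of pairs. The parameter names below are made up for illustration.

```python
from urllib.parse import urlencode

mixed = urlencode({'q': 'hello world', 'key': '中文'})
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)
pairs = urlencode([('page', 1), ('page', 2)])

print(mixed)     # q=hello+world&key=%E4%B8%AD%E6%96%87
print(repeated)  # tag=a&tag=b
print(pairs)     # page=1&page=2
```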

7) parse_qs()

parse_qs() is the opposite of urlencode(): it deserializes a query string, for example turning GET parameters back into a dict

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode, parse_qs, urlsplit

params = {'username': 'zs', 'password': '123'}
base_url = 'http://www.baidu.com'
url = base_url + '?' + urlencode(params)  # convert the dict into GET parameters
query = urlsplit(url).query               # get the query part of the URL
print(parse_qs(query))                    # convert the GET parameters back into a dict

#output
{'username': ['zs'], 'password': ['123']}

8) parse_qsl(): converts the parameters into a list of tuples

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode, urlsplit, parse_qsl

params = {'username': 'zs', 'password': '123'}
base_url = 'http://www.baidu.com'
url = base_url + '?' + urlencode(params)  # convert the dict into GET parameters
query = urlsplit(url).query               # get the query part of the URL
print(parse_qsl(query))                   # convert into a list of (name, value) tuples

#output
[('username', 'zs'), ('password', '123')]
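The values in the parse_qs() result are lists because a parameter may appear more than once. The sketch below uses a made-up query string to show repeated keys, automatic percent-decoding, and the keep_blank_values flag, which controls whether empty values are kept.

```python
from urllib.parse import parse_qs, parse_qsl

query = 'tag=a&tag=b&empty=&name=%E4%B8%AD%E6%96%87'

as_dict = parse_qs(query)
with_blanks = parse_qs(query, keep_blank_values=True)
as_pairs = parse_qsl(query)

print(as_dict)      # {'tag': ['a', 'b'], 'name': ['中文']}
print(with_blanks)  # {'tag': ['a', 'b'], 'empty': [''], 'name': ['中文']}
print(as_pairs)     # [('tag', 'a'), ('tag', 'b'), ('name', '中文')]
```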

9) quote(): converts content into URL-encoded (percent-encoded) form. Parameters that contain Chinese characters can otherwise end up garbled, and this function encodes them safely

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote

key = '中文'
url = 'https://www.baidu.com/s?key=' + quote(key)
print(url)

#output
https://www.baidu.com/s?key=%E4%B8%AD%E6%96%87

10) unquote(): the opposite of quote(); it decodes URL-encoded text

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote, urlsplit, unquote

key = '中文'
url = 'https://www.baidu.com/s?key=' + quote(key)
print(url)
unq = urlsplit(url).query.split('=')[1]  # get the parameter value
print(unquote(unq))                      # decode the parameter
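quote() leaves "/" unescaped by default and encodes a space as %20, while quote_plus() follows the form-encoding convention and uses "+" for spaces. The sketch below compares them on an illustrative string; unquote_plus() is the matching decoder for the "+" form.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c'

default_quoted = quote(text)            # '/' is kept by default
fully_quoted = quote(text, safe='')     # escape '/' as well
plus_quoted = quote_plus(text)          # space becomes '+'

print(default_quoted)             # a%20b/c
print(fully_quoted)               # a%20b%2Fc
print(plus_quoted)                # a+b%2Fc
print(unquote(fully_quoted))      # a b/c
print(unquote_plus(plus_quoted))  # a b/c
print(unquote('a+b'))             # a+b  (unquote does not treat '+' as a space)
```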


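6. Analyzing the Robots protocol

With the robotparser module of urllib we can analyze a site's Robots protocol

1) The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It is usually a text file named robots.txt placed in the root directory of the site.

When a search crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler fetches pages according to the scope defined there; if it does not, the crawler visits every page it can reach directly

Here is a sample robots.txt:

User-agent: *

Disallow: /

Allow: /public/

This lets all crawlers fetch only the public directory. Save the content above as robots.txt in the site's root directory, alongside the site's entry file (index.html)

User-agent names the crawler the rules apply to. Setting it to * makes the rules apply to every crawler; setting it to Baiduspider makes them apply to the Baidu crawler. Several User-agent lines can be given to cover several crawlers, but at least one is required

Some common crawler names:

BaiduSpider: Baidu crawler, www.baidu.com

Googlebot: Google crawler, www.google.com

360Spider: 360 crawler, www.so.com

YodaoBot: Youdao crawler, www.youdao.com

ia_archiver: Alexa crawler, www.alexa.cn

Scooter: AltaVista crawler, www.altavista.com

Disallow specifies a directory that must not be fetched. In the sample above it is set to /, which means no page may be fetched

Allow is normally used together with Disallow to carve out exceptions. In the sample above it is set to /public/, which means that no page may be fetched except those in the public directory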

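Example settings:

# Disallow all crawlers
User-agent: *
Disallow: /

# Allow all crawlers to access every directory (leaving the file empty also works)
User-agent: *
Disallow:

# Disallow all crawlers from certain directories
User-agent: *
Disallow: /home/
Disallow: /tmp/

# Allow only one crawler
User-agent: BaiduSpider
Disallow:

User-agent: *
Disallow: /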

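2) robotparser

The robotparser module parses robots.txt. It provides one class, RobotFileParser, which uses a site's robots.txt file to decide whether a given crawler has permission to fetch a given page

urllib.robotparser.RobotFileParser(url='')

Commonly used methods of RobotFileParser:

set_url(): sets the URL of the robots.txt file. If the URL was passed when the RobotFileParser object was created, this method does not need to be called

read(): reads the robots.txt file and parses it. It returns nothing, but it performs the fetch and the analysis

parse(): parses robots.txt content. Its argument is a list of lines from robots.txt, which it analyzes according to the syntax rules

can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns True or False depending on whether that crawler may fetch the URL

mtime(): returns the time robots.txt was last fetched and analyzed

modified(): sets the time robots.txt was last fetched and analyzed to the current time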

#!/usr/bin/env python
#coding:utf8
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()                               # create the object
rp.set_url('https://www.cnblogs.com/robots.txt')     # set the robots.txt URL (it can also be passed to the constructor)
rp.read()                                            # fetch and parse the file
# check whether the link may be fetched
print(rp.can_fetch('*', 'https://i.cnblogs.com/EditPosts.aspx?postid=9170312&update=1'))
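parse() makes it possible to test rules without fetching anything. The sketch below feeds RobotFileParser a rule set similar to the sample shown earlier; the example.com URLs are made up. Python's parser applies the first rule that matches a path, so the Allow line is listed before the broader Disallow line here.

```python
from urllib.robotparser import RobotFileParser

rules = [
    'User-agent: *',
    'Allow: /public/',   # listed first: Python's parser uses the first matching rule
    'Disallow: /',
]

rp = RobotFileParser()
rp.parse(rules)          # parse lines directly instead of calling read()

public_ok = rp.can_fetch('*', 'http://www.example.com/public/index.html')
private_ok = rp.can_fetch('*', 'http://www.example.com/private/index.html')

print(public_ok)   # True
print(private_ok)  # False
```

With the lines in the original order (Disallow: / before Allow: /public/), this parser would match Disallow: / first and report the public page as not fetchable, even though crawlers that use longest-match rules would allow it.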
