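Retrieving a network resource
Once PycURL is installed, we can perform network operations. The simplest is retrieving a resource by its URL. Issuing a network request with PycURL takes the following steps:
1. Create an instance of pycurl.Curl.
2. Use setopt to set the request options.
3. Call perform to execute the request.
In Python 2, we retrieve a network resource as follows: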
import pycurl
from StringIO import StringIO
buffer = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.sourceforge.net/')
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
body = buffer.getvalue()
# Body is a string in some encoding.
# In Python 2, we can print it without knowing what the encoding is.
print(body)
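PycURL does not provide storage for the network response. It is therefore up to us to provide a buffer (here a StringIO object) and instruct PycURL to write the content to it.
Most existing PycURL code uses WRITEFUNCTION instead of WRITEDATA:
c.setopt(c.WRITEFUNCTION, buffer.write)
WRITEFUNCTION still works, but it is no longer necessary: as of PycURL 7.19.3, WRITEDATA accepts any Python object with a write method.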
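Because WRITEDATA only needs an object with a write method, the sink does not have to be a buffer at all. The sketch below is an illustration under assumptions, not part of PycURL: HashingWriter and fetch_digest are hypothetical names, and the class hashes the body chunk by chunk instead of storing it.

```python
import hashlib


class HashingWriter(object):
    """Hypothetical sink that hashes the response body instead of storing it."""

    def __init__(self):
        self._digest = hashlib.sha256()
        self.size = 0

    def write(self, data):
        # PycURL hands over each chunk of the body as a byte string.
        # Returning None tells PycURL that the whole chunk was consumed.
        self._digest.update(data)
        self.size += len(data)

    def hexdigest(self):
        return self._digest.hexdigest()


def fetch_digest(url):
    """Download url and return (body size, SHA-256 hex digest)."""
    # Imported here so that HashingWriter can be tried without PycURL.
    import pycurl
    sink = HashingWriter()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEDATA, sink)
    c.perform()
    c.close()
    return sink.size, sink.hexdigest()
```

Calling fetch_digest('http://pycurl.sourceforge.net/') would then return the body size and digest without keeping the whole body in memory.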
The Python 3 version is slightly more complicated:
import pycurl
from io import BytesIO
buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.sourceforge.net/')
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()
body = buffer.getvalue()
# Body is a byte string.
# We have to know the encoding in order to print it to a text file
# such as standard output.
print(body.decode('iso-8859-1'))
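In Python 3, PycURL returns the response body as a byte string. Byte strings are convenient for downloading binary files, but text content must be decoded before we can work with it.
The example above assumes that the body is encoded in iso-8859-1.
Examining response headers
In practice, we want to decode the response using the encoding specified by the server rather than an assumed one. To do so, we need to examine the response headers and extract the encoding the server declared: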
import pycurl
import re
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO

headers = {}

def header_function(header_line):
    # HTTP standard specifies that headers are encoded in iso-8859-1.
    # On Python 2, decoding step can be skipped.
    # On Python 3, decoding step is required.
    header_line = header_line.decode('iso-8859-1')

    # Header lines include the first status line (HTTP/1.x ...).
    # We are going to ignore all lines that don't have a colon in them.
    # This will botch headers that are split on multiple lines...
    if ':' not in header_line:
        return

    # Break the header line into header name and value.
    name, value = header_line.split(':', 1)

    # Remove whitespace that may be present.
    # Header lines include the trailing newline, and there may be whitespace
    # around the colon.
    name = name.strip()
    value = value.strip()

    # Header names are case insensitive.
    # Lowercase name here.
    name = name.lower()

    # Now we can actually record the header name and value.
    headers[name] = value

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.sourceforge.net')
c.setopt(c.WRITEFUNCTION, buffer.write)
# Set our header function.
c.setopt(c.HEADERFUNCTION, header_function)
c.perform()
c.close()

# Figure out what encoding was sent with the response, if any.
# Check against lowercased header name.
encoding = None
if 'content-type' in headers:
    content_type = headers['content-type'].lower()
    # Raw string so that \S reaches the regex engine unchanged.
    match = re.search(r'charset=(\S+)', content_type)
    if match:
        encoding = match.group(1)
        print('Decoding using %s' % encoding)
if encoding is None:
    # Default encoding for HTML is iso-8859-1.
    # Other content types may have different default encoding,
    # or in case of binary data, may have no encoding at all.
    encoding = 'iso-8859-1'
    print('Assuming encoding is %s' % encoding)

body = buffer.getvalue()
# Decode using the encoding we figured out.
print(body.decode(encoding))
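Admittedly, that is a lot of code for the simple task of extracting an encoding. Unfortunately, because libcurl refrains from allocating memory for response data, it is up to our application to do this grunt work.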
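One detail of the charset extraction is worth a closer look: a pattern such as charset=(\S+) keeps any quotes around the value and any semicolon that follows it, so a header like text/html; charset="utf-8" yields "utf-8" with the quotes included. As a possible alternative, the sketch below lets the standard library's email.message.Message parse the header value. The function name charset_from_content_type is an assumption made for illustration and is not part of PycURL.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the lowercased charset parameter of a Content-Type value.

    Returns default when the header carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset unquotes the parameter and lowercases it.
    return msg.get_content_charset(default)


print(charset_from_content_type('text/html; charset="UTF-8"'))
print(charset_from_content_type('image/png', 'iso-8859-1'))
```

With the headers dictionary built by header_function, the call would look like charset_from_content_type(headers.get('content-type', ''), 'iso-8859-1').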
Writing to a file
To save the response data to a file, only a small change is needed:
import pycurl
# As long as the file is opened in binary mode, both Python 2 and Python 3
# can write response body to it without decoding.
with open('out.html', 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, 'http://pycurl.sourceforge.net/')
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
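The important part is opening the file in binary mode: the response body can then be written to the file without encoding or decoding.
Following redirects
By default, libcurl, and therefore PycURL, does not follow redirects. We can enable this with setopt: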
import pycurl
c = pycurl.Curl()
# Redirects to https://www.python.org/.
c.setopt(c.URL, 'http://www.python.org/')
# Follow redirect.
c.setopt(c.FOLLOWLOCATION, True)
c.perform()
c.close()
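Since we did not set a write callback, libcurl and PycURL fall back to their default behavior of writing the response body to standard output.
(With Python 3.4.1 this fails with: pycurl.error: (23, 'Failed writing body (0 != 7219)'))
Setting options
Following redirects is just one of the options libcurl provides. There are many more, documented on libcurl's curl_easy_setopt page. With a few exceptions, PycURL option names are derived from the libcurl names by dropping the CURLOPT_ prefix, so CURLOPT_URL becomes simply URL.
Examining the response
We have already covered examining the response headers. Other response information can be obtained with getinfo, as follows: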
import pycurl
try:
    from io import BytesIO
except ImportError:
    from StringIO import StringIO as BytesIO
buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.sourceforge.net/')
c.setopt(c.WRITEDATA, buffer)
c.perform()
# HTTP response code, e.g. 200.
print('Status: %d' % c.getinfo(c.RESPONSE_CODE))
# Elapsed time for the transfer.
print('Time: %f' % c.getinfo(c.TOTAL_TIME))
# getinfo must be called before close.
c.close()
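Here we write the response body to a buffer to keep uninteresting output off standard output.
The available response information is documented on libcurl's curl_easy_getinfo page. With a few exceptions, PycURL constant names are derived from the libcurl names by dropping the CURLINFO_ prefix, so CURLINFO_RESPONSE_CODE becomes RESPONSE_CODE.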
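The naming rule for CURLOPT_ and CURLINFO_ constants can be written down as a small helper, which may be handy when working from the libcurl documentation. This is only a sketch: pycurl_name is a hypothetical function, and it applies the prefix rule mechanically without knowing about the exceptions mentioned above.

```python
def pycurl_name(libcurl_name):
    """Guess the PycURL constant name for a libcurl constant name."""
    for prefix in ('CURLOPT_', 'CURLINFO_'):
        if libcurl_name.startswith(prefix):
            # Drop the prefix: CURLOPT_URL -> URL.
            return libcurl_name[len(prefix):]
    raise ValueError('not a CURLOPT_ or CURLINFO_ name: %r' % libcurl_name)


print(pycurl_name('CURLOPT_FOLLOWLOCATION'))
print(pycurl_name('CURLINFO_RESPONSE_CODE'))
```

Since the rule has exceptions, a guessed name is best checked before use, for example with getattr(pycurl, pycurl_name('CURLOPT_URL'), None).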
Sending form data
Use the POSTFIELDS option to send form data. The data must be URL-encoded beforehand:
import pycurl
try:
    # python 3
    from urllib.parse import urlencode
except ImportError:
    # python 2
    from urllib import urlencode
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.sourceforge.net/tests/testpostvars.php')
post_data = {'field': 'value'}
# Form data must be provided already urlencoded.
postfields = urlencode(post_data)
# Sets request method to POST,
# Content-Type header to application/x-www-form-urlencoded
# and data to send in request body.
c.setopt(c.POSTFIELDS, postfields)
c.perform()
c.close()
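To see what the URL-encoding step actually produces, it helps to run urlencode on values that contain reserved characters. The field names and values below are made up for illustration. A list of pairs is used instead of a dictionary so that the field order is the same on Python 2 and Python 3.

```python
try:
    # python 3
    from urllib.parse import urlencode
except ImportError:
    # python 2
    from urllib import urlencode

# Hypothetical form fields containing a space, '&' and '='.
post_data = [('name', 'Ada Lovelace'), ('query', 'a&b=c')]
postfields = urlencode(post_data)
print(postfields)
# prints: name=Ada+Lovelace&query=a%26b%3Dc
```

The space becomes a plus sign and the reserved characters are percent-escaped, so the server can split the body back into fields unambiguously.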
POSTFIELDS automatically sets the HTTP request method to POST. Other request methods can be specified with the CUSTOMREQUEST option:
c.setopt(c.CUSTOMREQUEST, 'PATCH')