A Detailed Walkthrough of Scraping Images with a Python Crawler

2017-11-13 23:50:00

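Next I will work through three examples in turn. Mastering every point takes roughly a month; by "mastering" I mean being able to write the code yourself without looking anything up. The material below has not been tidied up yet: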

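import threading  # multithreading: running and coordinating threads

import requests
from bs4 import BeautifulSoup
from lxml import etree

# Fetch the page source
def get_html(url):
    # Request the given address. For now the address is effectively hard-coded,
    # because we have not handled multiple pages yet.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
    # The header above mimics a real browser; the format is fixed, so it is worth memorizing.
    response = requests.get(url=url, headers=headers)  # send a GET request to the URL
    html = response.content  # page source as raw bytes; slightly better than .text
    # print(html)
    return html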
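Why `.content` rather than `.text`? As far as I know, when a server sends a `text/*` page without declaring a charset, requests falls back to ISO-8859-1 for `.text`, which garbles Chinese pages. The sketch below imitates that offline with plain bytes; the sample markup and the GBK encoding are assumptions chosen for illustration, not taken from the site being scraped.

```python
# Pretend this is response.content from a server that sends GBK-encoded HTML
# (sample data made up for illustration).
raw = '<title>图片</title>'.encode('gbk')

# Decoding with the wrong codec succeeds but produces mojibake,
# which is what a wrongly guessed .text would look like.
garbled = raw.decode('iso-8859-1')

# Keeping the bytes lets us (or the HTML parser) pick the right codec later.
correct = raw.decode('gbk')

print(repr(garbled))
print(correct)
```

Passing the raw bytes on to BeautifulSoup leaves the encoding detection to the parser, which is usually the safer default.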

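# Next, fetch the outer pages, i.e. the source of the pages that hold the images themselves
def get_img_html(html):
    soup = BeautifulSoup(html, 'lxml')  # choose the parser; the one bundled with Python is 'html.parser'
    # class is a Python keyword, so BeautifulSoup expects class_ with a trailing underscore
    all_a = soup.find_all('a', class_='list-group-item randomlist')
    for i in all_a:
        img_html = get_html(i['href'])  # fetch the source of the page each link points to
        print(img_html)

# start_url is the listing-page URL template; its value is not shown in this post
a = get_html(start_url.format(1))
get_img_html(a)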
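To see what the link extraction does without hitting the network, here is a small offline sketch. The sample markup, the `extract_links` helper, and the `example.com` addresses are all made up for illustration; the real page may be structured differently. It also runs each `href` through `urljoin`, on the assumption that some links might be relative, in which case passing them straight to `requests.get` would fail.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up listing markup (an assumption, not the real site's HTML).
sample_html = b'''
<div>
  <a class="list-group-item randomlist" href="/article/detail/1">one</a>
  <a class="list-group-item randomlist" href="https://example.com/article/detail/2">two</a>
  <a class="other" href="/ignored">skip</a>
</div>
'''

def extract_links(html, base_url):
    """Hypothetical helper: return absolute URLs of the matching <a> tags."""
    # 'html.parser' ships with Python, so this sketch does not need lxml.
    soup = BeautifulSoup(html, 'html.parser')
    # A string with a space matches the exact value of the class attribute.
    anchors = soup.find_all('a', class_='list-group-item randomlist')
    # urljoin leaves absolute hrefs alone and resolves relative ones.
    return [urljoin(base_url, a['href']) for a in anchors]

links = extract_links(sample_html, 'https://example.com/list?page=1')
print(links)
```

Note that matching on the full class string is order-sensitive; if the site ever swaps the two class names, a CSS selector via `soup.select` would be more forgiving.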

Good, we can now fetch part of the page source. The next job is to handle multiple pages.

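def main():
    for i in range(1, 11):  # pages 1 through 10 (range excludes its end value)
        start_html = get_html(start_url.format(i))  # pass in each page number to fetch the source of the first ten pages
        get_img_html(start_html)  # then fetch the source of the pages the image links point to

main()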
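The script imports `threading` at the top but the loop above fetches pages one after another. One possible way to put the import to use is a thread per page, sketched below. Everything here is an assumption for illustration: `start_url` is a placeholder (the post never shows the real address), and `crawl_page` only records the URL where the real code would call `get_html` and `get_img_html`.

```python
import threading

# Placeholder template; replace with the real listing URL.
start_url = 'https://example.com/list?page={}'

def crawl_page(page, results, lock):
    """Stand-in for the real work; it only records which URL would be fetched."""
    url = start_url.format(page)
    with lock:  # guard the shared list while appending
        results.append(url)

def crawl_pages(first=1, last=10):
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_page, args=(page, results, lock))
        for page in range(first, last + 1)  # last + 1 so that `last` is included
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every page has been handled
    return sorted(results)

pages = crawl_pages()
print(len(pages))
```

Ten threads is harmless for a sketch, but for a real crawl it is probably kinder to the server to cap the number of concurrent requests.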

Finally, here is the complete source:

from lxml import etree  # parsing library; locates the content inside the page directly

get_img(img_html)
# print(img_html)

# Get the image URLs:

if __name__ == '__main__':
    main()
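The listing calls `get_img(img_html)` but the body of that function is not shown. Since `lxml.etree` is imported for locating content directly, one plausible shape is an XPath query over the page, sketched below. This is a guess at the idea, not the author's code: the function body, the sample markup, and the `example.com` addresses are all assumptions, and the real page would likely need a narrower XPath than "every `<img>`".

```python
from lxml import etree

# Made-up detail-page markup (an assumption, not the real site's HTML).
sample_html = b'''
<html><body>
  <div class="content">
    <img src="https://example.com/a.jpg" alt="first">
    <img src="https://example.com/b.png" alt="second">
  </div>
</body></html>
'''

def get_img(html):
    """Hypothetical body: return the src attribute of every <img> in the page."""
    tree = etree.HTML(html)  # accepts the raw bytes returned by get_html
    # The XPath selects the src attribute values; str() gives plain strings.
    return [str(src) for src in tree.xpath('//img/@src')]

urls = get_img(sample_html)
print(urls)
```

Each returned URL could then be fetched with `get_html` and the bytes written to a file opened in `'wb'` mode.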

To be continued; improvements will follow later.

This article is reposted from 眉間雪's 51CTO blog. Original link: http://blog.51cto.com/13348847/2044442. To republish it, please contact the original author.
