Day 21: Python Web Scraping with the requests Library

requests is a third-party Python library that makes fetching URL resources very convenient.

Open a terminal and run the following command to install it:

pip install requests
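
To confirm the install worked, one option is to look up the installed version. The sketch below uses the standard library's importlib.metadata (Python 3.8+); the helper name installed_version is only illustrative and is not part of requests.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if requests is installed, otherwise None
print(installed_version("requests"))
```

If this prints None, the install most likely went into a different Python environment than the one running the script.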

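Google Chrome

Install the Google Chrome browser.

Chrome's developer tools help us quickly locate the data we want inside a page.

Right-click anywhere on a page in Chrome and you can choose either View page source or Inspect.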


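The Inspect panel can also be opened quickly with the F12 key.

With the Inspect panel open, click the element-picker icon in the panel's toolbar (marked with a red box in the screenshot) and hover over the part of the page you want. The matching position in the page source is highlighted automatically.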


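Getting the headers information

The request headers include the User-Agent field, which identifies the browser. Without it, some sites refuse to serve data to a scraper.

To find it: open the site -> press F12 to open Inspect -> refresh the page -> click Network -> select All -> click an early segment of the timeline -> select one of the listed files -> click Headers -> scroll to the bottom, where the User-Agent field appears.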
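
To see what the User-Agent header changes, a request can be built without being sent. This is a minimal sketch: prepare_get is an illustrative helper name, and the shortened User-Agent string is a placeholder for the full value copied from Chrome.

```python
import requests


def prepare_get(url, user_agent):
    """Build a GET request with a custom User-Agent without sending it."""
    request = requests.Request("GET", url, headers={"User-Agent": user_agent})
    return request.prepare()


# Placeholder User-Agent; paste the full string copied from Chrome instead
prepared = prepare_get("https://movie.douban.com/top250", "Mozilla/5.0 (example)")

# The value the site would see for this request
print(prepared.headers["User-Agent"])
# The value requests sends when no User-Agent is set
print(requests.utils.default_user_agent())
```

The default value starts with python-requests/, which is easy for a site to recognise and reject.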


Using requests

Here we use the Douban Movie Top 250 page.

import requests
# Target URL
URL = 'https://movie.douban.com/top250'
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
# Request the page
resp = requests.get(url=URL, headers=headers)
# Print the page source
print(resp.text)
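
The call above waits indefinitely if the server never answers, and it prints whatever comes back even when the site returns an error page. One possible hardening is sketched below. The function name fetch_html, the optional session parameter, and the 10-second timeout are assumptions for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, headers, session=None, timeout=10):
    """Fetch a page and return its text, raising on HTTP error statuses."""
    http = session if session is not None else requests
    resp = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx instead of returning an error page
    resp.raise_for_status()
    return resp.text


# Usage (performs a real network request):
# html = fetch_html('https://movie.douban.com/top250', {'User-Agent': 'your browser UA'})
```

Passing a requests.Session as session reuses the connection across several pages; leaving it out falls back to the module-level requests.get.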

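Once requests has fetched the page, we can use regular expressions to extract the data we want.

Use Chrome's Inspect panel to find where that data sits in the page source.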


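# Regular expressions require the re module
import re
# Pattern: wrap each piece of data to capture in parentheses ()
re_str = r'<img width="100" alt="(.+?)" src="([a-z]{5}.{3}[a-z\d]{4}\.[a-z]{8}\.[a-z]{3}/view/photo/s_ratio_poster/public/p\d{9}\.jpg)" class="">'
# Search the page source (resp.text from the request above) for the first poster and its title
result = re.search(re_str, resp.text)
print(result)
# span(): the start and end positions of the matched string
print(result.span())
# group(n): the content of capture group n
# group(0) returns the entire matched text
print(result.group(1))
print(result.group(2))
# groups(): all capture groups combined into a tuple
print(result.groups())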

The output is:

<re.Match object; span=(9883, 10002), match='<img width="100" alt="肖申克的救贖" src="https://img2.d>
(9883, 10002)
肖申克的救贖
https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg
('肖申克的救贖', 'https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg')

Process finished with exit code 0
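
re.search returns only the first match, while the page lists many movies. A sketch of collecting every match with re.finditer and named groups follows. SAMPLE_HTML is a made-up snippet standing in for resp.text, and the simplified pattern and the name extract_posters are illustrative assumptions.

```python
import re

# Made-up snippet standing in for resp.text
SAMPLE_HTML = '''
<img width="100" alt="Movie A" src="https://img.example.com/a.jpg" class="">
<img width="100" alt="Movie B" src="https://img.example.com/b.jpg" class="">
'''

# Named groups make the captured fields self-describing
POSTER = re.compile(r'<img width="100" alt="(?P<title>.+?)" src="(?P<src>[^"]+)" class="">')


def extract_posters(html):
    """Return a list of (title, image URL) tuples for every poster tag found."""
    return [(m.group("title"), m.group("src")) for m in POSTER.finditer(html)]


for title, src in extract_posters(SAMPLE_HTML):
    print(title, src)
```

When nothing matches, the function returns an empty list, so the caller never has to handle a None result the way it would with re.search.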


Exercise:

Use regular expressions to extract Lianjia second-hand housing listings: title, location, total price, and unit price.

import requests
import re

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}

# One <li class="clear ..."> element per listing
li_str = re.compile(r'<li class="clear(.+?)</li>')
# Listing title
title = re.compile(
    r'data-log_index="\d{1,2}"  data-el="ershoufang" data-housecode="\d{12}" data-is_focus="" data-sl="">(.+?)</a>')
# Location
area = re.compile(
    r'data-el="region">([\u4e00-\u9fa5]+([\u4e00-\u9fa5]+|\d+|[A-Z]+)[\u4e00-\u9fa5]+)[^\u4e00-\u9fa5]+([\u4e00-\u9fa5]+)</a>')
# Unit price
unit_price = re.compile(r'<span>(单价\d+元/平米)</span>')
# Total price, in units of 万 (10,000 CNY)
total_price = re.compile(r'<span>(\d+\.?\d+)</span>万')

# Loop over pages 1 to 100
for j in range(1, 101):
    URL = f'https://cd.lianjia.com/ershoufang/pg{j}/'
    # Request the page
    resp = requests.get(url=URL, headers=headers)
    # findall returns a list with one string per listing
    content = li_str.findall(resp.text)

    # Loop over the listings on this page
    for i in content:
        title_content = title.search(i)
        title_1 = title_content.group(1)
        print(title_1)

        area_content = area.search(i)
        x, y, z = area_content.groups()
        print(f'{x}-{z}')

        unit_price_1 = unit_price.search(i)
        print(unit_price_1.group(1))

        total_price_1 = total_price.search(i)
        print(total_price_1.group(1) + '万')

        print('*' * 10)
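
The exercise code calls .group() on every search result, so one listing with an unexpected layout raises AttributeError and stops the whole run. A defensive variant for the two price fields might look like the sketch below. The sample fragment is invented, parse_prices is an illustrative name, and the total-price pattern is a slightly different one that also accepts single-digit values.

```python
import re

UNIT_PRICE = re.compile(r'<span>单价(\d+)元/平米</span>')
TOTAL_PRICE = re.compile(r'<span>(\d+(?:\.\d+)?)</span>万')


def parse_prices(fragment):
    """Return (total price in 万, unit price in CNY per square metre); None for missing fields."""
    unit = UNIT_PRICE.search(fragment)
    total = TOTAL_PRICE.search(fragment)
    return (
        float(total.group(1)) if total else None,
        int(unit.group(1)) if unit else None,
    )


# Invented fragment shaped like one listing
sample = '<div><span>123.5</span>万</div><div><span>单价15000元/平米</span></div>'
print(parse_prices(sample))                       # (123.5, 15000)
print(parse_prices('<div>no prices here</div>'))  # (None, None)
```

Returning numbers instead of strings also makes later steps such as sorting or averaging straightforward.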


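Common requests operations

import requests

URL = 'https://www.baidu.com/'

# User-Agent: makes the scraper look like a browser
# Cookie: carries the user's login and session information
headers = {
    # Paste your browser's User-Agent string here
    'User-Agent': ''
}

resp = requests.get(url=URL, headers=headers)
# Status code
# 200: the request succeeded and the scraper can proceed
# 403: the site has blocked the scraper
# 404: page not found
# 500: server error
print(resp.status_code)

# Print the URL that was requested
print(resp.url)

# Print the response headers; the one to remember is 'Content-Type'
print(resp.headers)

# Print the encoding declared in the response headers
# If a text response declares no charset, requests falls back to ISO-8859-1, which cannot decode Chinese
print(resp.encoding)

# Print the encoding guessed from the response body
print(resp.apparent_encoding)
# resp.encoding = resp.apparent_encoding

# Print the page source as decoded text
print(resp.text)

# Print the page source as raw bytes
print(resp.content)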
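
resp.text is resp.content decoded with resp.encoding, which is why a wrong encoding produces unreadable Chinese. The effect can be reproduced with plain bytes and no network access. The sample string below is an arbitrary choice for illustration.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text
body = '豆瓣电影'.encode('utf-8')

# Decoding with ISO-8859-1 never fails, but yields mojibake
garbled = body.decode('iso-8859-1')
# Decoding with the correct encoding recovers the text
correct = body.decode('utf-8')

print(repr(garbled))
print(correct)
# The garbled text still holds the original bytes, so it can be repaired
print(garbled.encode('iso-8859-1').decode('utf-8') == correct)  # True
```

Setting resp.encoding = resp.apparent_encoding before reading resp.text avoids the garbled case in the same way.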
