Crawler Notes 3: XPath Syntax and Multiprocess Crawling

@蘭博怎麼玩兒

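Summary: This post introduces XPath syntax, compares the scraping speed of BeautifulSoup and XPath on a real example, and finally shows how to run a crawler across multiple processes.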

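1. Using XPath

1.1 Basic usage

Using XPath in Python requires the third-party library lxml. The syntax is:

Extracting text:

//tag1[@attr1="value1"]/tag2[@attr2="value2"]/.../text()

Extracting an attribute value:

//tag1[@attr1="value1"]/tag2[@attr2="value2"]/.../@attrN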
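
As a small illustration of the two patterns above, the sketch below runs them against an inline HTML string instead of a live page. The HTML, the class name `author` and the links are made up for this example; only `etree.HTML` and `xpath` come from lxml.

```python
from lxml import etree

# Hypothetical HTML used only for illustration (not taken from any real site).
html = '''
<div class="author"><a href="/users/1/">user_one</a></div>
<div class="author"><a href="/users/2/">user_two</a></div>
'''

selector = etree.HTML(html)

# Pattern 1: end the path with text() to get the text of the matched nodes.
names = selector.xpath('//div[@class="author"]/a/text()')

# Pattern 2: end the path with @attribute to get an attribute value.
links = selector.xpath('//div[@class="author"]/a/@href')

print(names)  # ['user_one', 'user_two']
print(links)  # ['/users/1/', '/users/2/']
```

In both cases `xpath` returns a list, even when only one node matches.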

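An XPath can be obtained as follows:

1. On the web page, hover over the data you want to scrape (here, a Qiushibaike user id), right-click, and choose Inspect.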

2. In the developer tools, right-click the selected element and choose Copy > Copy XPath.

The copied XPath can then be pasted into code:

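import requests
from lxml import etree
# url and headers are assumed to be defined elsewhere (see the full example in section 2)
res = requests.get(url, headers=headers)
selector = etree.HTML(res.text)
# xpath() returns a list; the trailing [0] takes its first match
url_infos = selector.xpath('//div[@class="article block untagged mb15"]/div[1]/a[2]/h2/text()')[0]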

Note that XPath positions start at 1, so the first div tag is div[1], not div[0].
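
The difference between XPath positions and Python list indexes is easy to mix up, so here is a minimal sketch of both, using a made-up list as input. In XPath 1.0 a numeric predicate such as `[1]` is compared against a position that starts at 1, so `[0]` never matches anything.

```python
from lxml import etree

# Hypothetical HTML used only for illustration.
html = '<ul><li>first</li><li>second</li><li>third</li></ul>'
selector = etree.HTML(html)

# XPath positions are 1-based: li[1] is the first li.
first_by_xpath = selector.xpath('//li[1]/text()')

# li[0] is valid syntax but matches nothing, so the result is an empty list.
zero_by_xpath = selector.xpath('//li[0]/text()')

# The list returned by xpath() is a normal Python list, so it is 0-based.
all_items = selector.xpath('//li/text()')
first_by_python = all_items[0]

print(first_by_xpath)   # ['first']
print(zero_by_xpath)    # []
print(first_by_python)  # first
```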

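1.2 Special cases

Attribute values that start with the same string

<body>
	<div id="test-1">Wanted content 1</div>
	<div id="testuseful">Wanted content 2</div>
	<div id="useless">Unwanted content</div>
</body>

Given the HTML above, to scrape "Wanted content 1" and "Wanted content 2" you can use starts-with(@attribute, "common prefix"):

//div[starts-with(@id,"test")]/text()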
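
A runnable sketch of this case with lxml follows. Note that the XPath function is spelled `starts-with`; a misspelling such as `start-with` is not a registered function, and lxml raises an XPath error for it. The variable names are arbitrary.

```python
from lxml import etree

html = '''
<body>
    <div id="test-1">Wanted content 1</div>
    <div id="testuseful">Wanted content 2</div>
    <div id="useless">Unwanted content</div>
</body>
'''
selector = etree.HTML(html)

# Select every div whose id begins with "test".
wanted = selector.xpath('//div[starts-with(@id,"test")]/text()')
print(wanted)  # ['Wanted content 1', 'Wanted content 2']

# The misspelled function name is rejected when the expression is evaluated.
try:
    selector.xpath('//div[start-with(@id,"test")]/text()')
    misspelled_failed = False
except etree.XPathError:
    misspelled_failed = True
print(misspelled_failed)  # True
```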

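Attribute values that contain the same string

<body>
	<div id="abc-key">Wanted content 1</div>
	<div id="ab-keycd">Wanted content 2</div>
	<div id="useless">Unwanted content</div>
</body>

Given the HTML above, to scrape "Wanted content 1" and "Wanted content 2" you can use contains(@attribute, "common part"):

//div[contains(@id,"-key")]/text()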
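
The same check can be run with lxml. This is only a sketch on the sample HTML; it shows that `contains` matches the substring anywhere in the attribute value, at the end ("abc-key") or in the middle ("ab-keycd").

```python
from lxml import etree

html = '''
<body>
    <div id="abc-key">Wanted content 1</div>
    <div id="ab-keycd">Wanted content 2</div>
    <div id="useless">Unwanted content</div>
</body>
'''
selector = etree.HTML(html)

# Text of every div whose id contains "-key" at any position.
wanted = selector.xpath('//div[contains(@id,"-key")]/text()')

# The matching id values themselves, to see what was selected.
matched_ids = selector.xpath('//div[contains(@id,"-key")]/@id')

print(wanted)       # ['Wanted content 1', 'Wanted content 2']
print(matched_ids)  # ['abc-key', 'ab-keycd']
```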

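Selecting a larger element first, then a smaller one

XPath lets you select a tag first and then run a further XPath on that tag. For example:

//div[@id="useful"]/li/ul[2]/text()

can be split into two steps. Because xpath() returns a list, take the element out of the list before querying it again:

useful = selector.xpath('//div[@id="useful"]')[0]
info = useful.xpath('li/ul[2]/text()')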
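
Here is a self-contained sketch of the two-step approach. The HTML is hypothetical and uses the more common `ul/li` nesting, so the relative path differs from the one in the text. It also shows one pitfall: a path starting with `//` searches the whole document even when called on a sub-element, while `.//` stays inside it.

```python
from lxml import etree

# Hypothetical HTML used only for illustration.
html = '''
<div id="useful">
  <ul><li>item 1</li><li>item 2</li></ul>
</div>
<div id="useless">
  <ul><li>ignored</li></ul>
</div>
'''
selector = etree.HTML(html)

# Step 1: select the larger element; xpath() returns a list of elements.
useful = selector.xpath('//div[@id="useful"]')[0]

# Step 2: run a relative path (no leading slash) on that element.
info = useful.xpath('ul/li[2]/text()')

# A leading // ignores the element and searches the whole document.
whole_doc = useful.xpath('//li/text()')

# A leading .// searches only below the element.
scoped = useful.xpath('.//li/text()')

print(info)       # ['item 2']
print(whole_doc)  # ['item 1', 'item 2', 'ignored']
print(scoped)     # ['item 1', 'item 2']
```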

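2. Hands-on: speed comparison

Here we compare the scraping speed of BeautifulSoup and XPath by scraping the text section of Qiushibaike (https://www.qiushibaike.com/text/). The fields scraped are the user id, the joke content, the number of likes, and the number of comments.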

The code is as follows:

import requests
from bs4 import BeautifulSoup
from lxml import etree
import time

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
}

urls = ['http://www.qiushibaike.com/text/page/{}/'.format(str(i)) for i in range(1,36)]

def bs_scraper(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text,'lxml')
    ids = soup.select('a > h2')
    contents = soup.select('div > span')
    laughs = soup.select('span.stats-vote > i')
    comments = soup.select('i.number')
    infos = []
    for user_id,content,laugh,comment in zip(ids,contents,laughs,comments):
        info = {
            'id':user_id.get_text(),
            'content':content.get_text(),
            'laugh':laugh.get_text(),
            'comment':comment.get_text()
        }
        infos.append(info)
    return infos        # every item on the page, not only the first one

def lxml_scraper(url):
    res = requests.get(url, headers=headers)
    selector = etree.HTML(res.text)
    url_infos = selector.xpath('//div[@class="article block untagged mb15"]')
    infos = []
    for url_info in url_infos:
        try:
            user_id = url_info.xpath('div[1]/a[2]/h2/text()')[0]
            content = url_info.xpath('a[1]/div/span/text()')[0]
            laugh = url_info.xpath('div[2]/span[1]/i/text()')[0]
            comment = url_info.xpath('div[2]/span[2]/a/i/text()')[0]
        except IndexError:      # skip items with missing fields (no comments or likes)
            continue
        infos.append({
            'id':user_id,
            'content':content,
            'laugh':laugh,
            'comment':comment
        })
    return infos        # every item on the page, not only the first one

if __name__ == '__main__':
    for name,scraper in [('BeautifulSoup',bs_scraper),('Lxml',lxml_scraper)]:
        start = time.time()        # record the start time
        for url in urls:
            scraper(url)
        end = time.time()          # record the end time
        print(name,end-start)      # elapsed time

Results (scraping 35 pages):

[Screenshot: printed run times for BeautifulSoup and Lxml]

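The speed difference is obvious, so we recommend XPath for scraping data from large numbers of pages. Combined with crawling in multiple processes, the speedup becomes even more significant.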
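
The timing loop in the main block can also be factored into a small helper so that any scraper is measured the same way. The sketch below is one possible form, not the code used for the measurement above: `time_scraper` and `fake_scraper` are hypothetical names, the urls are placeholders, and `time.perf_counter` is swapped in for `time.time` because it is the clock intended for measuring short durations.

```python
import time

def time_scraper(scraper, urls):
    """Call scraper(url) for every url and return the elapsed seconds."""
    start = time.perf_counter()
    for url in urls:
        scraper(url)
    return time.perf_counter() - start

def fake_scraper(url):
    # Stand-in for bs_scraper / lxml_scraper so the sketch runs offline.
    time.sleep(0.01)
    return {'url': url}

# Placeholder urls; no request is actually sent.
urls = ['http://example.com/page/{}/'.format(i) for i in range(1, 6)]

elapsed = time_scraper(fake_scraper, urls)
print('fake_scraper', elapsed)
```

With the real scrapers, the call would be `time_scraper(bs_scraper, urls)` and `time_scraper(lxml_scraper, urls)`.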

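3. Multiprocess crawling

Here we introduce a fairly simple approach, suited to cases where each worker executes relatively little code: crawling in multiple processes with the process pool from the multiprocessing library. It works as follows: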

from multiprocessing import Pool
pool = Pool(processes=5)
pool.map(funs, iterable)

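The second line creates a pool of 5 worker processes. In the third line, funs is the function to run and iterable supplies its arguments; in a crawler this can be the collection of urls.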
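
A minimal runnable version of that pattern is sketched below. The worker `page_url` is a hypothetical stand-in for a real scraping function; it only builds a url string so that nothing is fetched. `pool.map` calls the worker once per item and returns the results in the same order as the input.

```python
from multiprocessing import Pool

def page_url(page):
    # Hypothetical worker: builds the url for one page number.
    return 'http://www.qiushibaike.com/text/page/{}/'.format(page)

if __name__ == '__main__':
    # Create a pool of 5 worker processes and map the worker over pages 1-3.
    with Pool(processes=5) as pool:
        results = pool.map(page_url, range(1, 4))
    # One result per input item, in input order.
    print(results)
```

The worker must be defined at module level so that it can be sent to the worker processes, and the pool should be created under the `if __name__ == '__main__':` guard.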

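Using the same example as in section 2, we run the XPath scraper in a pool of 7 processes and compare its run time with the single-process run.

The main block is changed to: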

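from multiprocessing import Pool    # in addition to the imports from section 2

if __name__ == '__main__':
    start1 = time.time()
    for url in urls:
        lxml_scraper(url)
    end1 = time.time()
    print('1 process:',end1-start1)

    start2 = time.time()
    with Pool(processes=7) as pool:     # the pool is shut down when the block exits
        pool.map(lxml_scraper, urls)
    end2 = time.time()
    print('7 processes:',end2-start2)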

Results:

[Screenshot: printed run times for 1 process and 7 processes]

As the output shows, the run time with 7 processes is roughly one seventh of the single-process run time.
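
To try this comparison without hitting a real site, the sketch below replaces the request with `time.sleep`. Everything in it is an assumption made for illustration: `fake_fetch`, `run_serial`, `run_pool`, the example.com urls and the 0.1 second delay. A ratio close to the number of processes is only expected when each task spends most of its time waiting and the cost of starting the pool is small by comparison.

```python
import time
from multiprocessing import Pool

def fake_fetch(url):
    # Simulates a network-bound request; no real HTTP call is made.
    time.sleep(0.1)
    return len(url)

def run_serial(urls):
    # Process the urls one after another in the current process.
    start = time.perf_counter()
    results = [fake_fetch(u) for u in urls]
    return results, time.perf_counter() - start

def run_pool(urls, processes):
    # Process the urls in a pool of worker processes.
    start = time.perf_counter()
    with Pool(processes=processes) as pool:
        results = pool.map(fake_fetch, urls)
    return results, time.perf_counter() - start

if __name__ == '__main__':
    urls = ['http://example.com/page/{}/'.format(i) for i in range(1, 15)]
    serial_results, serial_time = run_serial(urls)
    pool_results, pool_time = run_pool(urls, processes=7)
    print('1 process:', serial_time)
    print('7 processes:', pool_time)
    print('same results:', serial_results == pool_results)
```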
