
Speed Up Your Web Scraper?

Use async modules to make your scraper faster~

Project Introduction

This article shows how to use Python's async modules to improve a scraper's efficiency.

Our target is the wealth-management product listings on the rong360 website (https://www.rong360.com/licai-bank/list/p1). The page looks like this:

(Screenshot: the rong360 wealth-management product list page)

We need to scrape information on 86,394 wealth-management products, 10 per page, which comes to 8,640 pages.

In the article Python爬蟲(16)利用Scrapy爬取銀行理财産品資訊(共12多萬條), we implemented this scraper with the Scrapy framework, collected 127,130 records, and stored them in MongoDB; the whole run took 3 hours. Scrapy is arguably the better choice for building such a scraper, but can we do better on speed? This article shows how to use Python's async modules (aiohttp and asyncio) to make the scraper faster.
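To make the idea concrete before walking through the project, here is a minimal sketch of the pattern this article relies on: a semaphore caps the number of requests in flight while asyncio.gather runs the fetches concurrently. The URLs and the concurrency limit below are placeholders, not part of the actual project:

import asyncio
import aiohttp

async def fetch(sem, session, url):
    # The semaphore caps how many requests are in flight at once
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def main(urls):
    sem = asyncio.Semaphore(10)  # placeholder concurrency limit
    async with aiohttp.ClientSession() as session:
        # Schedule all fetches at once and wait for them together
        return await asyncio.gather(*(fetch(sem, session, u) for u in urls))

# Placeholder URLs, just to illustrate the call pattern
pages = asyncio.get_event_loop().run_until_complete(
    main(["https://example.com/page/%d" % i for i in range(1, 4)]))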

The Scraper Project

Our scraper works in two steps:

1. Scrape the wealth-management product information from the rong360 pages and save it to a CSV file;
2. Read the CSV file and load it into a MySQL database.

First, we scrape the product information from the rong360 pages and save it to a CSV file, using aiohttp and asyncio to speed up the scraper. The complete Python code is as follows:

import re
import time
import aiohttp
import asyncio
import pandas as pd
import logging

# Configure the logging format
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# DataFrame that collects one row per wealth-management product
df = pd.DataFrame(columns=['name', 'bank', 'currency', 'startDate',
                           'endDate', 'period', 'proType', 'profit', 'amount'])

# Asynchronous HTTP request
async def fetch(sem, session, url):
    async with sem:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
        async with session.get(url, headers=headers) as response:
            return await response.text()

# Parse one page of HTML
async def parser(html):
    # Extract the product table with regular expressions
    tbody = re.findall(r"<tbody>[\s\S]*?</tbody>", html)[0]
    trs = re.findall(r"<tr [\s\S]*?</tr>", tbody)
    for tr in trs:
        tds = re.findall(r"<td[\s\S]*?</td>", tr)
        name, bank = re.findall(r'title="(.+?)"', ''.join(tds))
        name = name.replace('&amp;', '').replace('quot;', '')
        currency, startDate, endDate, amount = re.findall(r'<td>(.+?)</td>', ''.join(tds))
        period = ''.join(re.findall(r'<td class="td7">(.+?)</td>', tds[5]))
        proType = ''.join(re.findall(r'<td class="td7">(.+?)</td>', tds[6]))
        profit = ''.join(re.findall(r'<td class="td8">(.+?)</td>', tds[7]))
        df.loc[df.shape[0] + 1] = [name, bank, currency, startDate, endDate,
                                   period, proType, profit, amount]

    logger.info(str(df.shape[0]) + '\t' + name)

# Download and parse one page
async def download(sem, url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            await parser(html)
        except Exception as err:
            print(err)

# All page URLs
urls = ["https://www.rong360.com/licai-bank/list/p%d" % i for i in range(1, 8641)]

# Time the scraper
print('*' * 50)
t3 = time.time()

# Run the downloads concurrently with asyncio
loop = asyncio.get_event_loop()
sem = asyncio.Semaphore(100)  # allow at most 100 concurrent requests
tasks = [asyncio.ensure_future(download(sem, url)) for url in urls]
tasks = asyncio.gather(*tasks)
loop.run_until_complete(tasks)

df.to_csv('E://rong360.csv')

t4 = time.time()
print('Total time elapsed: %s' % (t4 - t3))
print('*' * 50)

The output is as follows (the middle portion is omitted and replaced by ......):

**************************************************
2018-10-17 13:33:50,717 - INFO: 10	金百合第245期
2018-10-17 13:33:50,749 - INFO: 20	金荷恒升2018年第26期
......
2018-10-17 14:03:34,906 - INFO: 86381	翠竹同益1M22期FGAB15015A
2018-10-17 14:03:35,257 - INFO: 86391	潤鑫月月盈2号
Total time elapsed: 1787.4312353134155
**************************************************
           

As you can see, this scraper collected 86,391 records in 1787.4 seconds, i.e. under 30 minutes. The result is 3 records short of the expected count, but that loss is negligible. Let's take a look at the data in the CSV file:
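Incidentally, the few missing records come from pages whose download or parsing raised an exception, which the script simply prints and skips. If they mattered, one option (not part of the original script) would be to remember which URLs failed and give them a second pass; a rough sketch:

# A rough sketch, not in the original script: remember pages that fail
# instead of silently dropping their records, then retry them once.
failed = []

async def download_tracked(sem, url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            await parser(html)
        except Exception:
            failed.append(url)  # keep the URL for a second pass

# First pass over all URLs, then one retry pass over whatever failed:
# loop.run_until_complete(asyncio.gather(*(download_tracked(sem, u) for u in urls)))
# retry, failed = failed, []
# loop.run_until_complete(asyncio.gather(*(download_tracked(sem, u) for u in retry)))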

(Screenshot: the scraped data in the CSV file)

OK, we are one step away from our goal: loading this CSV file into MySQL. For the details, see the article Python之使用Pandas庫實作MySQL資料庫的讀寫: https://www.jianshu.com/p/238a13995b2b . The complete Python code is as follows:

# -*- coding: utf-8 -*-

# Import the required modules
import pandas as pd
from sqlalchemy import create_engine

# Initialize the database connection, using the pymysql driver
engine = create_engine('mysql+pymysql://root:******@localhost:33061/test', echo=True)

print("Read CSV file...")
# Read the local CSV file
df = pd.read_csv("E://rong360.csv", sep=',', encoding='gb18030')

# Save the DataFrame as a MySQL table, without the index column
df.to_sql('rong360',
          con=engine,
          index=False,
          index_label='name'
          )

print("Write to MySQL successfully!")

The output is as follows (it takes a dozen or so seconds):

Read CSV file...
2018-10-17 15:07:02,447 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'sql_mode'
2018-10-17 15:07:02,447 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,452 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2018-10-17 15:07:02,452 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,454 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = 'utf8mb4' and `Collation` = 'utf8mb4_bin'
2018-10-17 15:07:02,454 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,455 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR(60)) AS anon_1
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR(60)) AS anon_1
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,457 INFO sqlalchemy.engine.base.Engine SELECT CAST('test collated returns' AS CHAR CHARACTER SET utf8mb4) COLLATE utf8mb4_bin AS anon_1
2018-10-17 15:07:02,457 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,458 INFO sqlalchemy.engine.base.Engine DESCRIBE `rong360`
2018-10-17 15:07:02,458 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,459 INFO sqlalchemy.engine.base.Engine ROLLBACK
2018-10-17 15:07:02,462 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE rong360 (
	`Unnamed: 0` BIGINT, 
	name TEXT, 
	bank TEXT, 
	currency TEXT, 
	`startDate` TEXT, 
	`endDate` TEXT, 
	enduration TEXT, 
	`proType` TEXT, 
	profit TEXT, 
	amount TEXT
)


2018-10-17 15:07:02,462 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,867 INFO sqlalchemy.engine.base.Engine COMMIT
2018-10-17 15:07:02,909 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2018-10-17 15:07:03,973 INFO sqlalchemy.engine.base.Engine INSERT INTO rong360 (`Unnamed: 0`, name, bank, currency, `startDate`, `endDate`, enduration, `proType`, profit, amount) VALUES (%(Unnamed: 0)s, %(name)s, %(bank)s, %(currency)s, %(startDate)s, %(endDate)s, %(enduration)s, %(proType)s, %(profit)s, %(amount)s)
2018-10-17 15:07:03,974 INFO sqlalchemy.engine.base.Engine ({'Unnamed: 0': 1, 'name': '龍信20183773', 'bank': '龍江銀行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-14', 'enduration': '99天', 'proType': '不保本', 'profit': '4.8%', 'amount': '5萬'}, {'Unnamed: 0': 2, 'name': '福瀛家NDHLCS20180055B', 'bank': '甯波東海銀行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-17', 'enduration': '179天', 'proType': '保證收益', 'profit': '4.8%', 'amount': '5萬'}, {'Unnamed: 0': 3, 'name': '薪鑫樂2018年第6期', 'bank': '無為農商行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-21', 'enduration': '212天', 'proType': '不保本', 'profit': '4.8%', 'amount': '5萬'}, {'Unnamed: 0': 4, 'name': '安鑫MTLC18165', 'bank': '民泰商行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-15', 'enduration': '49天', 'proType': '不保本', 'profit': '4.75%', 'amount': '5萬'}, {'Unnamed: 0': 5, 'name': '農銀私行·如意ADRY181115A', 'bank': '農業銀行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-16', 'enduration': '90天', 'proType': '不保本', 'profit': '4.75%', 'amount': '100萬'}, {'Unnamed: 0': 6, 'name': '穩健成長(2018)176期', 'bank': '威海市商業銀行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-15', 'enduration': '91天', 'proType': '不保本', 'profit': '4.75%', 'amount': '5萬'}, {'Unnamed: 0': 7, 'name': '季季紅J18071', 'bank': '溫州銀行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-16', 'enduration': '96天', 'proType': '不保本', 'profit': '4.75%', 'amount': '1萬'}, {'Unnamed: 0': 8, 'name': '私人銀行客戶84618042', 'bank': '興業銀行', 'currency': '人民币', 'startDate': '2018-10-12', 'endDate': '2018-10-17', 'enduration': '99天', 'proType': '不保本', 'profit': '4.75%', 'amount': '50萬'}  ... displaying 10 of 86391 total bound parameter sets ...  {'Unnamed: 0': 86390, 'name': '潤鑫月月盈3号RX1M003', 'bank': '珠海華潤銀行', 'currency': '人民币', 'startDate': '2015-06-24', 'endDate': '2015-06-30', 'enduration': '35天', 'proType': '不保本', 'profit': '4.5%', 'amount': '5萬'}, {'Unnamed: 0': 86391, 'name': '潤鑫月月盈2号', 'bank': '珠海華潤銀行', 'currency': '人民币', 'startDate': '2015-06-17', 'endDate': '2015-06-23', 'enduration': '35天', 'proType': '不保本', 'profit': '4.4%', 'amount': '5萬'})
2018-10-17 15:07:14,106 INFO sqlalchemy.engine.base.Engine COMMIT
Write to MySQL successfully!
           

If you're still not convinced, we can take a look at the data in MySQL:

(Screenshot: the imported data in MySQL)
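If you would rather verify it from Python than from the MySQL client, a quick sanity check might look like this (a sketch reusing the connection string, with its masked password, from the import script above):

import pandas as pd
from sqlalchemy import create_engine

# Same connection string as above; the password is masked here as well
engine = create_engine('mysql+pymysql://root:******@localhost:33061/test')

# How many rows actually landed in the table?
print(pd.read_sql('SELECT COUNT(*) AS n FROM rong360', con=engine))
# Peek at the first few rows
print(pd.read_sql('SELECT * FROM rong360 LIMIT 5', con=engine))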

Summary

Let's compare this scraper with the Scrapy version. The Scrapy scraper collected 127,130 records in 3 hours, while this scraper collected 86,391 records in half an hour. For the same amount of data, Scrapy would need roughly 2 hours to collect 86,391 records, so this scraper finished the job in about a quarter of the time the Scrapy scraper would have needed.
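A quick back-of-the-envelope check of that comparison:

# Scrapy: 127,130 records in 3 hours -> estimated time for 86,391 records
scrapy_hours = 3 * 86391 / 127130   # about 2.04 hours
async_hours = 1787.4 / 3600         # about 0.50 hours
print(async_hours / scrapy_hours)   # about 0.24, i.e. roughly a quarter of the time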

Finally, let's look at the top ten banks by number of wealth-management products (in descending order of product count), using the following MySQL statements:

use test;
SELECT bank, count(*) as product_num 
FROM rong360
GROUP BY bank
ORDER BY product_num DESC
LIMIT 10;
           

The output is as follows:

(Screenshot: the top ten banks by number of products)
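For reference, the same top-ten query can also be run from Python with pandas; a small sketch, reusing `pd` and `engine` from the sanity-check snippet above:

query = """
SELECT bank, COUNT(*) AS product_num
FROM rong360
GROUP BY bank
ORDER BY product_num DESC
LIMIT 10
"""
# Let MySQL do the aggregation and load the result into a DataFrame
top10 = pd.read_sql(query, con=engine)
print(top10)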