Speeding Up a Crawler with Async Modules~
Project Introduction
This article shows how to use Python's asynchronous modules to make a crawler more efficient.
Our target is the listing of bank wealth-management products on the Rong360 website (https://www.rong360.com/licai-bank/list/p1). The page looks like this:

We need to scrape 86,394 product records, 10 per page, which works out to 8,640 pages.
In the article Python爬虫(16)利用Scrapy爬取银行理财产品信息(共12多万条), we implemented this crawler with the Scrapy framework, scraped 127,130 records, and stored them in MongoDB; the whole run took 3 hours. Scrapy is normally a sound choice for this kind of job, but can we do better on speed? This article shows how to use Python's asynchronous modules (aiohttp and asyncio) to make the crawler faster.
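Before walking through the full crawler, it may help to see the core pattern in isolation. The sketch below is only an illustration of the idea, not the final code: a batch of GET requests is started concurrently with aiohttp and collected with asyncio.gather, so the program waits on the network for many pages at once rather than one page at a time. The single URL here is just the first listing page.
import asyncio
import aiohttp

# Minimal sketch of the aiohttp + asyncio pattern used by the crawler below:
# start several GET requests concurrently and gather their responses.
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ["https://www.rong360.com/licai-bank/list/p1"]
pages = asyncio.get_event_loop().run_until_complete(main(urls))
print(len(pages))  # number of pages fetched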
The Crawler Project
Our crawler works in two steps:
- scrape the wealth-management product data from the Rong360 pages and save it to a CSV file;
- read the CSV file and load it into a MySQL database.
First, we scrape the product data from the Rong360 pages and save it to a CSV file, using aiohttp and asyncio to speed the crawler up. The complete Python code is as follows:
import re
import time
import aiohttp
import asyncio
import pandas as pd
import logging

# Configure the logging format
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

df = pd.DataFrame(columns=['name', 'bank', 'currency', 'startDate',
                           'endDate', 'period', 'proType', 'profit', 'amount'])

# Asynchronous HTTP request, throttled by the semaphore
async def fetch(sem, session, url):
    async with sem:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
        async with session.get(url, headers=headers) as response:
            return await response.text()

# Parse one page
async def parser(html):
    # Extract the product table with regular expressions
    tbody = re.findall(r"<tbody>[\s\S]*?</tbody>", html)[0]
    trs = re.findall(r"<tr [\s\S]*?</tr>", tbody)
    for tr in trs:
        tds = re.findall(r"<td[\s\S]*?</td>", tr)
        name, bank = re.findall(r'title="(.+?)"', ''.join(tds))
        name = name.replace('&', '').replace('quot;', '')
        currency, startDate, endDate, amount = re.findall(r'<td>(.+?)</td>', ''.join(tds))
        period = ''.join(re.findall(r'<td class="td7">(.+?)</td>', tds[5]))
        proType = ''.join(re.findall(r'<td class="td7">(.+?)</td>', tds[6]))
        profit = ''.join(re.findall(r'<td class="td8">(.+?)</td>', tds[7]))
        df.loc[df.shape[0] + 1] = [name, bank, currency, startDate, endDate,
                                   period, proType, profit, amount]
        logger.info(str(df.shape[0]) + '\t' + name)

# Download and parse one page
async def download(sem, url):
    async with aiohttp.ClientSession() as session:
        try:
            html = await fetch(sem, session, url)
            await parser(html)
        except Exception as err:
            print(err)

# All the page URLs
urls = ["https://www.rong360.com/licai-bank/list/p%d" % i for i in range(1, 8641)]

# Time the crawler
print('*' * 50)
t3 = time.time()

# Run the asynchronous I/O with asyncio, capping concurrency at 100 via the semaphore
loop = asyncio.get_event_loop()
sem = asyncio.Semaphore(100)
tasks = [asyncio.ensure_future(download(sem, url)) for url in urls]
tasks = asyncio.gather(*tasks)
loop.run_until_complete(tasks)

df.to_csv('E://rong360.csv')
t4 = time.time()
print('Total time: %s' % (t4 - t3))
print('*' * 50)
The output is as follows (the middle portion is omitted and replaced with ......):
**************************************************
2018-10-17 13:33:50,717 - INFO: 10 金百合第245期
2018-10-17 13:33:50,749 - INFO: 20 金荷恒升2018年第26期
......
2018-10-17 14:03:34,906 - INFO: 86381 翠竹同益1M22期FGAB15015A
2018-10-17 14:03:35,257 - INFO: 86391 润鑫月月盈2号
Total time: 1787.4312353134155
**************************************************
As you can see, this crawler scraped 86,391 records in 1,787.4 seconds, i.e. just under 30 minutes. That is 3 records fewer than expected, but such a small loss hardly matters. Let's take a quick look at the data in the CSV file.
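One simple way to do that is to load the file back with pandas; the snippet below is just a quick sketch, using the same path as above and the gb18030 encoding assumed by the loading script in the next step:
import pandas as pd

# Preview the scraped CSV: row/column counts and the first few records
df = pd.read_csv('E://rong360.csv', encoding='gb18030')
print(df.shape)
print(df.head())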
OK, we are one step away from our goal: loading this CSV file into MySQL. For the details of this approach, see the article Python之使用Pandas库实现MySQL数据库的读写: https://www.jianshu.com/p/238a13995b2b . The complete Python code is as follows:
# -*- coding: utf-8 -*-
# Import the required modules
import pandas as pd
from sqlalchemy import create_engine

# Initialize the database connection, using the pymysql driver
engine = create_engine('mysql+pymysql://root:******@localhost:33061/test', echo=True)

print("Read CSV file...")
# Read the local CSV file
df = pd.read_csv("E://rong360.csv", sep=',', encoding='gb18030')

# Store the DataFrame as a MySQL table, without writing the index column
df.to_sql('rong360',
          con=engine,
          index=False,
          index_label='name'
          )
print("Write to MySQL successfully!")
Running the script produces the following output (it took a little over ten seconds):
Read CSV file...
2018-10-17 15:07:02,447 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE \'sql_mode\'
2018-10-17 15:07:02,447 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,452 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2018-10-17 15:07:02,452 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,454 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = \'utf8mb4\' and `Collation` = \'utf8mb4_bin\'
2018-10-17 15:07:02,454 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,455 INFO sqlalchemy.engine.base.Engine SELECT CAST(\'test plain returns\' AS CHAR(60)) AS anon_1
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine SELECT CAST(\'test unicode returns\' AS CHAR(60)) AS anon_1
2018-10-17 15:07:02,456 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,457 INFO sqlalchemy.engine.base.Engine SELECT CAST(\'test collated returns\' AS CHAR CHARACTER SET utf8mb4) COLLATE utf8mb4_bin AS anon_1
2018-10-17 15:07:02,457 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,458 INFO sqlalchemy.engine.base.Engine DESCRIBE `rong360`
2018-10-17 15:07:02,458 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,459 INFO sqlalchemy.engine.base.Engine ROLLBACK
2018-10-17 15:07:02,462 INFO sqlalchemy.engine.base.Engine
CREATE TABLE rong360 (
`Unnamed: 0` BIGINT,
name TEXT,
bank TEXT,
currency TEXT,
`startDate` TEXT,
`endDate` TEXT,
enduration TEXT,
`proType` TEXT,
profit TEXT,
amount TEXT
)
2018-10-17 15:07:02,462 INFO sqlalchemy.engine.base.Engine {}
2018-10-17 15:07:02,867 INFO sqlalchemy.engine.base.Engine COMMIT
2018-10-17 15:07:02,909 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2018-10-17 15:07:03,973 INFO sqlalchemy.engine.base.Engine INSERT INTO rong360 (`Unnamed: 0`, name, bank, currency, `startDate`, `endDate`, enduration, `proType`, profit, amount) VALUES (%(Unnamed: 0)s, %(name)s, %(bank)s, %(currency)s, %(startDate)s, %(endDate)s, %(enduration)s, %(proType)s, %(profit)s, %(amount)s)
2018-10-17 15:07:03,974 INFO sqlalchemy.engine.base.Engine ({\'Unnamed: 0\': 1, \'name\': \'龙信20183773\', \'bank\': \'龙江银行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-14\', \'enduration\': \'99天\', \'proType\': \'不保本\', \'profit\': \'4.8%\', \'amount\': \'5万\'}, {\'Unnamed: 0\': 2, \'name\': \'福瀛家NDHLCS20180055B\', \'bank\': \'宁波东海银行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-17\', \'enduration\': \'179天\', \'proType\': \'保证收益\', \'profit\': \'4.8%\', \'amount\': \'5万\'}, {\'Unnamed: 0\': 3, \'name\': \'薪鑫乐2018年第6期\', \'bank\': \'无为农商行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-21\', \'enduration\': \'212天\', \'proType\': \'不保本\', \'profit\': \'4.8%\', \'amount\': \'5万\'}, {\'Unnamed: 0\': 4, \'name\': \'安鑫MTLC18165\', \'bank\': \'民泰商行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-15\', \'enduration\': \'49天\', \'proType\': \'不保本\', \'profit\': \'4.75%\', \'amount\': \'5万\'}, {\'Unnamed: 0\': 5, \'name\': \'农银私行·如意ADRY181115A\', \'bank\': \'农业银行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-16\', \'enduration\': \'90天\', \'proType\': \'不保本\', \'profit\': \'4.75%\', \'amount\': \'100万\'}, {\'Unnamed: 0\': 6, \'name\': \'稳健成长(2018)176期\', \'bank\': \'威海市商业银行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-15\', \'enduration\': \'91天\', \'proType\': \'不保本\', \'profit\': \'4.75%\', \'amount\': \'5万\'}, {\'Unnamed: 0\': 7, \'name\': \'季季红J18071\', \'bank\': \'温州银行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-16\', \'enduration\': \'96天\', \'proType\': \'不保本\', \'profit\': \'4.75%\', \'amount\': \'1万\'}, {\'Unnamed: 0\': 8, \'name\': \'私人银行客户84618042\', \'bank\': \'兴业银行\', \'currency\': \'人民币\', \'startDate\': \'2018-10-12\', \'endDate\': \'2018-10-17\', \'enduration\': \'99天\', \'proType\': \'不保本\', \'profit\': \'4.75%\', \'amount\': \'50万\'} ... displaying 10 of 86391 total bound parameter sets ... {\'Unnamed: 0\': 86390, \'name\': \'润鑫月月盈3号RX1M003\', \'bank\': \'珠海华润银行\', \'currency\': \'人民币\', \'startDate\': \'2015-06-24\', \'endDate\': \'2015-06-30\', \'enduration\': \'35天\', \'proType\': \'不保本\', \'profit\': \'4.5%\', \'amount\': \'5万\'}, {\'Unnamed: 0\': 86391, \'name\': \'润鑫月月盈2号\', \'bank\': \'珠海华润银行\', \'currency\': \'人民币\', \'startDate\': \'2015-06-17\', \'endDate\': \'2015-06-23\', \'enduration\': \'35天\', \'proType\': \'不保本\', \'profit\': \'4.4%\', \'amount\': \'5万\'})
2018-10-17 15:07:14,106 INFO sqlalchemy.engine.base.Engine COMMIT
Write to MySQL successfully!
If you are still not convinced, we can take a look at the data in MySQL.
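A quick sanity check is to read a few rows back with pandas through the same connection; this is only a sketch, reusing the connection string from the loading script (password masked as above):
import pandas as pd
from sqlalchemy import create_engine

# Read a few rows back from the rong360 table as a sanity check
engine = create_engine('mysql+pymysql://root:******@localhost:33061/test')
print(pd.read_sql('SELECT * FROM rong360 LIMIT 5', con=engine))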
Summary
Let's compare this crawler with the Scrapy one. The Scrapy crawler scraped 127,130 records in 3 hours; this crawler scraped 86,391 records in about half an hour. Scaling Scrapy's figure to the same volume, 86,391 records would have taken it roughly 3 × 86391 / 127130 ≈ 2 hours, so the async crawler finished the same job in about a quarter of the time.
Finally, let's look at the top ten banks by number of wealth-management products (sorted in descending order of product count). Run the following MySQL commands:
use test;
SELECT bank, count(*) as product_num
FROM rong360
GROUP BY bank
ORDER BY product_num DESC
LIMIT 10;
The output is as follows: