Downloading millions to tens of millions of MySQL rows with Python Django: fixing MemoryError and nginx timeouts

Introduction

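Django projects often need to offer file downloads, and Django conveniently provides several response classes for this: FileResponse, StreamingHttpResponse, and HttpResponse. FileResponse and StreamingHttpResponse both produce their content through an iterator, so they suit large transfers. HttpResponse, by contrast, needs the complete content before it can be returned, so fetching and returning too much data at once easily leads to a MemoryError (the process runs out of memory) or an nginx timeout (nothing reaches nginx until all the data is ready). Django's official documentation explains all three clearly: https://docs.djangoproject.com/en/1.11/ref/request-response/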

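In practice we normally use FileResponse or StreamingHttpResponse. Because they stream through an iterator, the data goes back to the client piece by piece and in order, the transfer can be cut off at any point, and the server never has to hold the whole payload in memory.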

FileResponse and StreamingHttpResponse

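As the name suggests, FileResponse opens a file and transmits it, sending the content in chunks of a configurable size. That makes it a good fit for returning a large file that already exists on the server. Its drawback is that it cannot pull content from the database on the fly and send it to the client. For example: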

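from django.http import FileResponse

def download(request):
    # Open in binary mode; FileResponse streams the file in chunks and closes it afterwards
    file = open('path/demo.py', 'rb')
    response = FileResponse(file)
    response['Content-Type'] = 'application/octet-stream'
    response['Content-Disposition'] = 'attachment;filename="demo.py"'
    return response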
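To make the idea of chunked transfer concrete, here is a rough sketch of reading a file in fixed-size pieces with a generator. It is not FileResponse's actual implementation; `iter_chunks` and `chunk_size` are hypothetical names chosen for illustration.

```python
import io

def iter_chunks(fileobj, chunk_size=4096):
    """Yield successive chunks from a binary file object until it is exhausted."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # read() returns b"" at end of file
            break
        yield chunk

# Only one chunk is held in memory at a time, however large the file is
pieces = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
```

Each chunk can be sent to the client as soon as it is read, which is why memory use stays flat regardless of file size.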

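As the example shows, the opened file is passed to FileResponse and then the response headers are set. This is clearly awkward for sending database content, though, so StreamingHttpResponse is the recommended approach here.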

Here PyMySQL fetches the data, which is then returned as CSV. The code is as follows:

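# Fetch the data with PyMySQL
import pymysql

# Maps MySQL type codes to type names for the column metadata below (most entries omitted)
field_types = {
    1: 'tinyint',
    2: 'smallint',
    3: 'int'}
conn = pymysql.connect(host='127.0.0.1', port=3306, database='demo', user='root', password='root')
cursor = conn.cursor(cursor=pymysql.cursors.DictCursor)
cursor.execute(sql)  # sql is the SELECT statement to export
# Fetch all rows at once
data = cursor.fetchall()
cols = {}
# Collect every column name and its type
for i, row in enumerate(cursor.description):
    if row[0] in cols:
        # Duplicate column name: prefix the index so the key stays unique.
        # The row dicts may not contain this key, so aliasing duplicates in the SQL is safer.
        cols[str(i) + row[0]] = field_types.get(row[1], str(row[1]))
    else:
        cols[row[0]] = field_types.get(row[1], str(row[1]))
cursor.close()
conn.close()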

# Return the data as CSV through StreamingHttpResponse
# (requires: from django.http import StreamingHttpResponse)
response = StreamingHttpResponse(get_result_fromat(data, cols))
response['Content-Type'] = 'application/octet-stream'
response['Content-Disposition'] = 'attachment;filename="{0}"'.format(out_file_name)
return response

# Loop over every row and emit it under the column names; note that this must be a generator
def get_result_fromat(data, cols):
    # First line of the file: the column names
    tmp_str = ""
    for col in cols:
        tmp_str += '"%s",' % (col)
    yield tmp_str.strip(",") + "\n"
    # Then one line per row
    for row in data:
        tmp_str = ""
        for col in cols:
            tmp_str += '"%s",' % (str(row[col]))
        yield tmp_str.strip(',') + "\n"

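That is the whole code. It falls into three parts: fetching the data from MySQL; formatting it into the desired output (Excel, CSV, TXT, and so on; CSV is used here, and if you are interested in other formats, leave a comment); and finally returning it through StreamingHttpResponse with the appropriate headers.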
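The hand-built strings above do not escape quotes inside values, so a value containing a double quote would produce a broken CSV line. One possible alternative, sketched here as an assumption rather than as part of the original code, lets the standard-library `csv` module do the quoting while still yielding one line at a time. `Echo` and `iter_csv` are hypothetical names; the pseudo-buffer trick follows the pattern Django's documentation describes for streaming large CSV files.

```python
import csv

class Echo:
    """A file-like object whose write() returns the value instead of storing it."""
    def write(self, value):
        return value

def iter_csv(rows, cols):
    """Yield a header line, then one properly quoted CSV line per row dict."""
    writer = csv.writer(Echo(), quoting=csv.QUOTE_ALL)
    # writerow() returns whatever Echo.write() returns, i.e. the formatted line
    yield writer.writerow(list(cols))
    for row in rows:
        yield writer.writerow([row[col] for col in cols])

lines = list(iter_csv([{"id": 1, "name": 'say "hi", ok'}], ["id", "name"]))
```

The generator can be passed to StreamingHttpResponse in place of `get_result_fromat(data, cols)`. Note that `csv.writer` ends lines with `\r\n` by default.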

Downloading millions of rows

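The code above can handle downloads of tens of thousands of rows, even somewhat more than a hundred thousand, but beyond 200,000 rows it struggles. My machine had roughly 1 GB of free memory, and at around 150,000 rows it raised MemoryError. The culprit is fetchall: StreamingHttpResponse does return the data row by row, but the data itself is fetched from MySQL in a single batch!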

How can this be solved? Here is my approach and the reasoning behind it:

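Replace fetchall with fetchone and generate the rows by calling fetchone repeatedly.
This still raised MemoryError, because with the default cursor, execute loads the entire result set into client memory in one go. I later found that a streaming (unbuffered) cursor can replace the ordinary one, that is, SSDictCursor instead of DictCursor.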
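The fetchone idea above can be sketched as a small generator. As an assumption for illustration, it reads in batches with `fetchmany`, which belongs to the same DB-API cursor interface as `fetchone`, and it uses the standard-library `sqlite3` module as a stand-in so that it runs without a MySQL server. `iter_rows` and `batch_size` are hypothetical names. With PyMySQL, the cursor would be the SSDictCursor mentioned above.

```python
import sqlite3

def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty batch means the result set is exhausted
            break
        yield from batch

# Stand-in database (assumption: sqlite3 replaces MySQL for this demo only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(i,) for i in range(2500)])
cursor = conn.cursor()
cursor.execute("SELECT id FROM demo ORDER BY id")
total = sum(1 for _ in iter_rows(cursor, batch_size=1000))
conn.close()
```

At most `batch_size` rows are held by the generator at any moment, so memory use no longer grows with the size of the result set.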

So the places in the code that need to change are as follows:

cursor = conn.cursor(cursor=pymysql.cursors.DictCursor) ===>
cursor = conn.cursor(cursor=pymysql.cursors.SSDictCursor)

data = cursor.fetchall()   ===>
row = cursor.fetchone()

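# Before: iterate over rows that are already in memory
def get_result_fromat(data, cols):
    tmp_str = ""
    # Column names for the first line of the file
    for col in cols:
        tmp_str += '"%s",' % (col)
    yield tmp_str.strip(",") + "\n"
    for row in data:
        tmp_str = ""
        for col in cols:
            tmp_str += '"%s",' % (str(row[col]))
        yield tmp_str.strip(',') + "\n"

=====>

# After: row is the first row from cursor.fetchone(); the rest are fetched on demand.
# Call it as get_result_fromat(row, cursor, cols), and close the cursor and connection
# only after streaming finishes, not before returning the response.
def get_result_fromat(row, cursor, cols):
    tmp_str = ""
    for col in cols:
        tmp_str += '"%s",' % (col)
    yield tmp_str.strip(",") + "\n"
    while row is not None:
        tmp_str = ""
        for col in cols:
            tmp_str += '"%s",' % (str(row[col]))
        yield tmp_str.strip(',') + "\n"
        row = cursor.fetchone()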

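As you can see, the while loop keeps fetching rows as the download proceeds, which avoids both running out of memory by pulling everything out of MySQL at once and an nginx timeout caused by the fetch taking too long before any data is sent.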
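One detail worth spelling out: with a streaming cursor, the cursor and connection must stay open until the last row has been sent, so they cannot be closed before the response is returned. A minimal sketch of one way to handle this, assuming the generator itself owns the cleanup, is shown below. `stream_query` is a hypothetical name, and `sqlite3` again stands in for PyMySQL so the example runs on its own.

```python
import sqlite3

def stream_query(conn, sql):
    """Yield the column names first, then each row; release resources when iteration ends."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        yield [desc[0] for desc in cursor.description]
        # iter(callable, sentinel) keeps calling fetchone() until it returns None
        for row in iter(cursor.fetchone, None):
            yield list(row)
    finally:
        # Runs when the rows are exhausted and also when the consumer
        # stops early and the generator is closed
        cursor.close()
        conn.close()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "a"), (2, "b")])
result = list(stream_query(conn, "SELECT id, name FROM demo ORDER BY id"))
```

In a Django view, each yielded list would still need to be formatted as a CSV line before being handed to StreamingHttpResponse.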

Summary

That's all for this post on downloads. It's fairly simple in the end; thanks for reading~
