Pydantic: a powerful data validation tool, 12 times faster than DRF's validators

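Pydantic is a module that uses Python type annotations for data validation and management. Installation is simple. The examples in this article use the Pydantic 1.x API (the benchmark at the end was run on 1.7.3), so pin the major version when you install it from a terminal:

pip install "pydantic<2"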

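It is similar to the data validation provided by Django REST framework (DRF) serializers. The difference is that fields in Django are restricted to the Field classes it ships with; if you want to use a Field of your own, you have to subclass the base class and override it: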

# A Django example (a model whose fields a DRF serializer would validate);
# compare it with the Pydantic code below
from django.db import models


class Book(models.Model):
    id = models.AutoField(primary_key=True)
    name = models.CharField(max_length=32)
    price = models.DecimalField(max_digits=5, decimal_places=2)
    author = models.CharField(max_length=32)
    publish = models.CharField(max_length=32)

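Pydantic, by contrast, builds on the type annotations available in Python 3.7 and later, so any class that inherits from BaseModel is validated directly from its annotations: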

# Pydantic data validation
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name = 'John Doe'  # type inferred as str from the default value
    signup_ts: Optional[datetime] = None
    friends: List[int] = []


external_data = {
    'id': '123',
    'signup_ts': '2019-06-01 12:22',
    'friends': [1, 2, '3'],
}
user = User(**external_data)
print(user.id)
print(type(user.id))
#> 123
#> <class 'int'>
print(repr(user.signup_ts))
#> datetime.datetime(2019, 6, 1, 12, 22)
print(user.friends)
#> [1, 2, 3]
print(user.dict())
"""
{
    'id': 123,
    'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
    'friends': [1, 2, 3],
    'name': 'John Doe',
}
"""

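As the basic usage above shows, Pydantic even converts data types for you automatically. For example, user.id is a string in the input dictionary, but after validation it is an int, because the annotation in the User class is int.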

When the data does not match the annotated types and cannot be converted, Pydantic raises an error like this:

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []


external_data = {
    'id': '123',
    'signup_ts': '2019-06-01 12:222',  # malformed on purpose
    'friends': [1, 2, '3'],
}
user = User(**external_data)
"""
Traceback (most recent call last):
  File "1.py", line 18, in <module>
    user = User(**external_data)
  File "pydantic\main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for User
signup_ts
  invalid datetime format (type=value_error.datetime)
"""
The message is "invalid datetime format", because the signup_ts value I passed in is not a valid datetime string (it has an extra 2).
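
In application code you would normally catch the exception rather than let the traceback escape. The following is a minimal sketch, assuming the Pydantic 1.x API used in the rest of this article; the Order model and the failed_fields helper are hypothetical names made up for illustration. It relies on ValidationError.errors(), which returns one dict per failure with a 'loc' tuple pointing at the offending value.

```python
from typing import Dict, List

from pydantic import BaseModel, ValidationError


class Order(BaseModel):
    # Hypothetical model, used only for this illustration.
    id: int
    quantities: List[int]


def failed_fields(data: Dict[str, object]) -> List[str]:
    """Return the dotted location of every value that failed validation."""
    try:
        Order(**data)
    except ValidationError as exc:
        # Each entry of exc.errors() is a dict; 'loc' is the path to the bad value.
        return ['.'.join(str(part) for part in err['loc']) for err in exc.errors()]
    return []


print(failed_fields({'id': 'abc', 'quantities': [1, 'x']}))
#> ['id', 'quantities.1']
print(failed_fields({'id': '7', 'quantities': ['1', 2]}))
#> []
```

The second call returns an empty list because both '7' and '1' can be converted to int, which is the same conversion behaviour shown earlier.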

1. Exporting data from a Pydantic model

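With the model's built-in dict() and json() methods, validated data can be converted to a dictionary or a JSON string in a single call. Using the user object from earlier: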
print(user.dict())  # convert to a dict
"""
{
    'id': 123,
    'signup_ts': datetime.datetime(2019, 6, 1, 12, 22),
    'friends': [1, 2, 3],
    'name': 'John Doe',
}
"""
print(user.json())  # convert to a JSON string
"""
{"id": 123, "signup_ts": "2019-06-01T12:22:00", "friends": [1, 2, 3], "name": "John Doe"}
"""
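
Both export methods also accept keyword arguments that control which fields are written out. The sketch below assumes the Pydantic 1.x API and redefines a User model with every field annotated so that it runs on its own; it shows the exclude and exclude_unset arguments and a round trip through JSON.

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None
    friends: List[int] = []


user = User(id=123, friends=[1, 2, 3])

# Leave out specific fields.
print(user.dict(exclude={'signup_ts'}))
#> {'id': 123, 'name': 'John Doe', 'friends': [1, 2, 3]}

# Keep only the fields that were passed explicitly, dropping untouched defaults.
print(user.dict(exclude_unset=True))
#> {'id': 123, 'friends': [1, 2, 3]}

# Serialise to JSON and parse it back into a model.
restored = User.parse_raw(user.json())
print(restored.dict() == user.dict())
#> True
```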
Beyond exporting values, the model can also export its whole structure as a JSON Schema, which fully describes the data types of the object:
print(user.schema_json(indent=2))
"""
{
  "title": "User",
  "type": "object",
  "properties": {
    "id": {
      "title": "Id",
      "type": "integer"
    },
    "signup_ts": {
      "title": "Signup Ts",
      "type": "string",
      "format": "date-time"
    },
    "friends": {
      "title": "Friends",
      "default": [],
      "type": "array",
      "items": {
        "type": "integer"
      }
    },
    "name": {
      "title": "Name",
      "default": "John Doe",
      "type": "string"
    }
  },
  "required": [
    "id"
  ]
}
"""

2. Importing data

Besides constructing a model directly, you can also load data into a validation model from an ORM object, a string, or a file.

For example, from a raw string:

from datetime import datetime
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name = 'John Doe'
    signup_ts: datetime = None

m = User.parse_raw('{"id": 123, "name": "James"}')
print(m)
#> id=123 signup_ts=None name='James'
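
When the payload has already been decoded into a dictionary (for instance by a web framework), parse_obj does the same job as parse_raw without the JSON step. A minimal sketch, assuming the Pydantic 1.x API; the payload string is made up for illustration.

```python
import json
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = 'John Doe'
    signup_ts: Optional[datetime] = None


payload = '{"id": 123, "name": "James"}'

from_text = User.parse_raw(payload)              # JSON string in
from_dict = User.parse_obj(json.loads(payload))  # already-decoded dict in

print(from_dict.name)
#> James
print(from_text.dict() == from_dict.dict())
#> True
```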
It can also take an ORM object directly and convert it into a Pydantic object, for example from the SQLAlchemy ORM:
from typing import List
from sqlalchemy import Column, Integer, String
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.ext.declarative import declarative_base
from pydantic import BaseModel, constr

Base = declarative_base()


class CompanyOrm(Base):
    __tablename__ = 'companies'
    id = Column(Integer, primary_key=True, nullable=False)
    public_key = Column(String(20), index=True, nullable=False, unique=True)
    name = Column(String(63), unique=True)
    domains = Column(ARRAY(String(255)))


class CompanyModel(BaseModel):
    id: int
    public_key: constr(max_length=20)
    name: constr(max_length=63)
    domains: List[constr(max_length=255)]

    class Config:
        # Allow the model to be built from object attributes, not only dicts
        orm_mode = True


co_orm = CompanyOrm(
    id=123,
    public_key='foobar',
    name='Testing',
    domains=['example.com', 'foobar.com'],
)
print(co_orm)
#> <models_orm_mode.CompanyOrm object at 0x7f0bdac44850>
co_model = CompanyModel.from_orm(co_orm)
print(co_model)
#> id=123 public_key='foobar' name='Testing' domains=['example.com',
#> 'foobar.com']

Importing from a JSON file:

from datetime import datetime
from pathlib import Path
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name = 'John Doe'
    signup_ts: datetime = None

path = Path('data.json')
path.write_text('{"id": 123, "name": "James"}')
m = User.parse_file(path)
print(m)

Importing from pickle:

import pickle
from datetime import datetime
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name = 'John Doe'
    signup_ts: datetime = None

pickle_data = pickle.dumps({
    'id': 123,
    'name': 'James',
    'signup_ts': datetime(2017, 7, 14)
})
# Unpickling can execute arbitrary code, so only do this with trusted data
m = User.parse_raw(
    pickle_data, content_type='application/pickle', allow_pickle=True
)
print(m)
#> id=123 signup_ts=datetime.datetime(2017, 7, 14, 0, 0) name='James'

3. Custom validation

You can also attach methods with the validator decorator to add whatever validation logic you need:

from pydantic import BaseModel, ValidationError, validator


class UserModel(BaseModel):
    name: str
    username: str
    password1: str
    password2: str

    @validator('name')
    def name_must_contain_space(cls, v):
        if ' ' not in v:
            raise ValueError('must contain a space')
        # The returned value replaces the input, so the name is title-cased
        return v.title()

    @validator('password2')
    def passwords_match(cls, v, values, **kwargs):
        # values holds the fields that have already passed validation
        if 'password1' in values and v != values['password1']:
            raise ValueError('passwords do not match')
        return v

    @validator('username')
    def username_letters_only(cls, v):
        if not v.isalpha():
            raise ValueError('must contain only letters')
        return v

Above, we added three custom validation rules:

1. name must contain a space

2. password2 must match password1

3. username must contain only letters

Let's check whether these three rules work:

user = UserModel(
    name='samuel colvin',
    username='scolvin',
    password1='zxcvbn',
    password2='zxcvbn',
)
print(user)
#> name='Samuel Colvin' username='scolvin' password1='zxcvbn' password2='zxcvbn'

try:
    UserModel(
        name='samuel',
        username='scolvin',
        password1='zxcvbn',
        password2='zxcvbn2',
    )
except ValidationError as e:
    print(e)
"""
2 validation errors for UserModel
name
  must contain a space (type=value_error)
password2
  passwords do not match (type=value_error)
"""

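As you can see, the data in the first UserModel has no problems and passes validation.

The data in the second UserModel fails validation because name does not contain a space and password2 does not match password1. So the custom validators we defined work as intended.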
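
Validators can also run before Pydantic's own type check, which is useful for cleaning raw input. The following is a minimal sketch, assuming the Pydantic 1.x validator decorator and its pre=True argument; the Signup model and its rules are hypothetical and only illustrate the ordering.

```python
from pydantic import BaseModel, ValidationError, validator


class Signup(BaseModel):
    # Hypothetical model, used only for this illustration.
    email: str

    @validator('email', pre=True)
    def normalise(cls, v):
        # pre=True: runs on the raw input, before the built-in str validation.
        return v.strip().lower() if isinstance(v, str) else v

    @validator('email')
    def must_contain_at(cls, v):
        # Runs after the type check, on the already normalised value.
        if '@' not in v:
            raise ValueError('must contain @')
        return v


print(Signup(email='  Alice@Example.COM ').email)
#> alice@example.com

try:
    Signup(email='not-an-address')
except ValidationError as e:
    print(e.errors()[0]['loc'])
#> ('email',)
```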

4. Performance

This is the part that surprised me most: in the benchmark below, Pydantic is 12.3 times faster than django-rest-framework's validators:

Package                  Version   Relative performance   Mean time
pydantic                 1.7.3                            93.7μs
attrs + cattrs           20.3      1.5x slower            143.6μs
valideer                 0.4.2     1.9x slower            175.9μs
marshmallow              3.10      2.4x slower            227.6μs
voluptuous               0.12      2.7x slower            257.5μs
trafaret                 2.1.0     3.2x slower            296.7μs
schematics               2.1.0     10.2x slower           955.5μs
django-rest-framework    3.12      12.3x slower           1148.4μs
cerberus                 1.3.2     25.9x slower           2427.6μs

All of the benchmark code is open source as well; you can find it at the following GitHub link:

https://github.com/samuelcolvin/pydantic/tree/master/benchmarks
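
The figures in the table depend on the machine, the library versions, and the shape of the data, so it is worth timing your own models. The following is a rough sketch using the standard library's timeit; it is not the project's benchmark code, and the Order model, the payload, and the run count are all made up for illustration.

```python
import timeit
from typing import List

from pydantic import BaseModel


class Order(BaseModel):
    # Hypothetical model, used only for this illustration.
    id: int
    quantities: List[int]


payload = {'id': '123', 'quantities': [1, 2, '3']}
runs = 10_000

# Time repeated construction, which includes validation and type conversion.
total_seconds = timeit.timeit(lambda: Order(**payload), number=runs)
per_call_us = total_seconds / runs * 1e6
print(f'{per_call_us:.1f} µs per validation')
```

A single timeit run is noisy; repeating the measurement and taking the minimum gives a steadier number.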