Preface
I suspect many readers' first reaction to this title is: can machine learning really be done with SQL? Some of you may have recently looked into similar projects, such as Byzer and Alibaba's SQLFlow, both of which implement machine learning algorithms on top of the SQL language. But people who have actually used those tools are still a minority; the typical scenario is simply calling sklearn from Python for some straightforward machine learning. Few practitioners, apart from those who enjoy research, implement an algorithm from scratch from its underlying mathematics; almost every colleague I know calls sklearn directly. Admittedly, importing a library and tuning its parameters does get features working quickly, and it has greatly lowered the barrier to entry for machine learning.
Once a machine learning interface exists, a great many capabilities can be integrated on top of it. If we can parse the incoming SQL statement, extracting both the data it refers to and the machine learning algorithm and parameters it wants to run, then we can call the corresponding sklearn functionality. That is my initial idea. The sqlparse-based work I did earlier already carries a large share of this load, so it is time to make some first attempts toward this goal.
1. Overall Architecture
My initial design is fairly simple and, in short, breaks down into four steps. If we want to implement machine-learning analysis on top of the SQL language, then, in my view, the system should look much like any other platform-style system. From the user's perspective, I pass in a single SQL statement that names the database table, the fields and columns to use, and any filter conditions, while the machine learning algorithm to invoke is expressed as a function. For example:
SELECT KNN_result
FROM (
    SELECT
        KNN_select(features1, features2, features3),
        KNN_parameter(n_neighbors=5, radius='auto', leaf_size=30)
    FROM Table1
) t
The syntax above is only my current idea; the final form will not necessarily look like this. If we can parse arbitrary SQL statements, we will not need to dictate how the SQL must be written, so data analysts and big-data engineers who already know SQL can use this feature without spending effort learning yet another language. It is easy to see, then, that the core of the system is the SQL parsing: the end goal is to pass parameters into sklearn, so parsing the whole SQL statement is the key.
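Since the whole design hinges on pulling the feature list and the algorithm's parameters out of the statement, here is a minimal sketch of that extraction, assuming the hypothetical `KNN_select`/`KNN_parameter` syntax above; regular expressions stand in for a full sqlparse-based parser:

```python
import re

def parse_ml_sql(sql):
    """Extract feature columns and algorithm parameters from the
    hypothetical KNN_select(...)/KNN_parameter(...) syntax."""
    # Feature columns: everything between the parentheses of KNN_select.
    select_match = re.search(r"KNN_select\(([^)]*)\)", sql)
    features = [f.strip() for f in select_match.group(1).split(",")] if select_match else []

    # Parameters: key=value pairs inside KNN_parameter(...).
    params = {}
    param_match = re.search(r"KNN_parameter\(([^)]*)\)", sql)
    if param_match:
        for pair in param_match.group(1).split(","):
            key, _, value = pair.partition("=")
            params[key.strip()] = value.strip().strip("'\"")
    return features, params

sql = """
SELECT KNN_result
FROM (
    SELECT
        KNN_select(features1, features2, features3),
        KNN_parameter(n_neighbors=5, radius='auto', leaf_size=30)
    FROM Table1
) t
"""
features, params = parse_ml_sql(sql)
print(features)  # ['features1', 'features2', 'features3']
print(params)    # {'n_neighbors': '5', 'radius': 'auto', 'leaf_size': '30'}
```

A real parser would of course also recover the table name and filter conditions, but this is enough to show the shape of the problem.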
Step 2
Step 2 is where the crux lies: how to parse the whole SQL statement, ideally directly in Python. That way the extracted parameters can be passed straight into the Python script that reads SQL, which then connects to the production database. Ideally this would be integrated as a platform or data-middleware capability, though connecting to a database and interacting with it directly is easy enough with pymysql or pyhive. After a simple extraction from the parsed SQL, we establish a database connection, send the requested features, database, and table to the database as a SQL query, and save the result as a DataFrame via read_sql.
pandas.read_sql(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    columns=None,
    chunksize=None)
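To make the read_sql step concrete, here is a small runnable sketch; an in-memory SQLite database stands in for the real MySQL/Hive connection (with pymysql or pyhive the read_sql call looks the same), and the table and column names are the ones from the earlier example:

```python
import sqlite3
import pandas as pd

# Stand-in for the production database: an in-memory SQLite instance
# seeded with a few illustrative rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (features1 REAL, features2 REAL, features3 REAL)")
con.executemany("INSERT INTO Table1 VALUES (?, ?, ?)",
                [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])

# The query built from the parsed feature list goes straight into read_sql,
# which hands back a DataFrame ready for the sklearn stage.
df = pd.read_sql("SELECT features1, features2, features3 FROM Table1", con)
print(df.shape)  # (2, 3)
```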
I have previously written a series of articles on SQL parsing, based mainly on the Python SqlParse library. That project already implements the rough intended functionality: it can parse fairly complex SQL statements and retrieve the corresponding fields.
With that as a foundation, we can extract the key fields describing the machine learning algorithm and tie them to sklearn, completing the machine-learning invocation.
In fact, handled this way, we only need to read and save the key fields, while the original SQL statement is sent to the database and the data it returns is passed along as the dataset. As covered in my earlier series of articles, connecting to the database with pyhive or pymysql makes this easy to implement.
# Connect to the database (credentials here are placeholders)
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='xxx',
                             database='db',
                             port=3306,
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
Step 3
Step 3 is to connect the extracted key fields with sklearn's algorithms. Many sklearn methods can be matched, through their function interfaces, against the function-name fields extracted from the SQL to implement the different features. This part is not hard and should be fairly straightforward to build; the real difficulty is documentation, namely reconciling sklearn's calling conventions with a reference document for the SQL machine-learning functions. Deciding what form the inputs take, and how each algorithm and its parameters are invoked, will still take real effort to write up.
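One simple way to do that matching is a lookup table from the algorithm name extracted from the SQL to the sklearn class, forwarding the parsed key=value pairs as keyword arguments. The registry keys and the string-to-int conversion below are illustrative assumptions, not a fixed design:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Map the algorithm name parsed out of the SQL onto a sklearn class.
ALGORITHM_REGISTRY = {
    'KNN': KNeighborsClassifier,
    'LR': LogisticRegression,
    'DT': DecisionTreeClassifier,
}

def build_estimator(name, params):
    """Instantiate the estimator named in the SQL with its parsed parameters.

    The parser hands over string values, so numeric-looking ones are
    converted to int before being passed as keyword arguments.
    """
    typed = {}
    for key, value in params.items():
        try:
            typed[key] = int(value)
        except ValueError:
            typed[key] = value
    return ALGORITHM_REGISTRY[name](**typed)

model = build_estimator('KNN', {'n_neighbors': '5', 'leaf_size': '30'})
print(model.n_neighbors)  # 5
```

A production version would need fuller type handling (floats, booleans) and validation against each estimator's accepted parameters, which is exactly where the documentation effort mentioned above comes in.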
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
The DataFrame obtained earlier is then fed into the machine learning algorithm to produce a result.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Fit on the training split of the DataFrame, then evaluate on the test split.
LR = LogisticRegression()
LR.fit(X=X_train, y=Y_train)
predictions = LR.predict(X_test)
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
Step 4
The output is generally a DataFrame, and at this point two kinds of tables are usually produced. One is the direct output table containing the computed or predicted results; the other holds metrics obtained on the held-out test split, such as ROC/AUC and accuracy, and needs a new table created for it. In other words, the returned results are two tables, and both can be written to the database as DataFrames via the to_sql method.
DataFrame.to_sql(name, con, schema=None, if_exists='fail',
                 index=True, index_label=None, chunksize=None, dtype=None)
from sqlalchemy import create_engine
from sqlalchemy.types import INT, FLOAT, DATETIME, BIGINT
import pandas as pd
import datetime

# Connection string is a placeholder -- substitute real credentials.
engine = create_engine('mysql+pymysql://user:xxx@localhost:3306/db')

date_now = datetime.datetime.now()
data = {'id': [888, 889],
        'code': [1003, 1004],
        'value': [2000, 2001],
        'time': [20220609, 20220610],
        'create_time': [date_now, date_now],
        'update_time': [date_now, date_now],
        'source': ['python', 'python']}
insert_df = pd.DataFrame(data)

# Explicit column types so the generated table schema matches expectations.
schema_sql = {'id': INT,
              'code': INT,
              'value': FLOAT(20),
              'time': BIGINT,
              'create_time': DATETIME,
              'update_time': DATETIME}

insert_df.to_sql('create_two', engine, if_exists='replace', index=False, dtype=schema_sql)
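For the second, metrics-style table, a minimal sketch could look like the following; the metric values are purely illustrative, and an in-memory SQLite connection stands in for the real MySQL engine:

```python
import sqlite3
import pandas as pd

# Stand-in for the real engine created with create_engine above.
con = sqlite3.connect(":memory:")

# Collect the evaluation metrics from the test split into a one-row table.
metrics_df = pd.DataFrame({
    'model': ['KNN'],
    'accuracy': [0.93],   # illustrative values, not real results
    'auc': [0.95],
    'run_time': [pd.Timestamp.now()],
})

# A second to_sql call writes the metrics table alongside the result table.
metrics_df.to_sql('model_metrics', con, if_exists='replace', index=False)
print(pd.read_sql("SELECT model, accuracy FROM model_metrics", con))
```

Keeping the result table and the metrics table separate means downstream SQL users can query predictions and model quality independently.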