TF2.0: A Modeling Workflow Example for Structured Data

The following article comes from Python與算法之美, written by 梁雲1991.

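TensorFlow is flexible enough by design to handle all kinds of complex numerical computation, but people mostly use it to implement machine learning models, and neural network models in particular.

In principle, you can define a neural network by building a computation graph from tensors and train it through the automatic differentiation mechanism. For simplicity, however, the general recommendation is to implement neural network models with TensorFlow's high-level Keras interface.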
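
As a rough illustration of the lower-level route, the sketch below fits a single weight with tf.GradientTape instead of Keras. The toy data, the variable name w, the learning rate, and the step count are all assumptions chosen for this illustration and do not come from the titanic example.

```python
import tensorflow as tf

# Toy data for y = 2 * x (illustrative values, not from the article).
x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = tf.constant([2.0, 4.0, 6.0, 8.0])

# A single trainable weight, deliberately initialised away from 2.
w = tf.Variable(0.0)
learning_rate = 0.05

for step in range(20):
    with tf.GradientTape() as tape:
        # Forward pass written directly with tensor operations.
        loss = tf.reduce_mean(tf.square(w * x - y))
    # Automatic differentiation: d(loss)/d(w).
    grad = tape.gradient(loss, w)
    # Plain gradient-descent update.
    w.assign_sub(learning_rate * grad)

print("learned w:", float(w.numpy()))
```

Even for one weight, the forward pass, gradient, and update all have to be written by hand, which is the bookkeeping the Keras interface takes over.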

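The general workflow for implementing a neural network model with TensorFlow is:

1. Prepare the data

2. Define the model

3. Train the model

4. Evaluate the model

5. Use the model

6. Save the model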

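For beginners, the hardest part is actually preparing the data.

The data types we usually meet in practice are structured data, image data, and text data.

We will take the titanic survival prediction problem, the cifar2 image classification problem, and the imdb movie review classification problem as examples to demonstrate how to model these three types of data with TensorFlow.

This article uses the titanic survival prediction problem to demonstrate how to model structured data with TensorFlow.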

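I. Prepare the data

The goal of the titanic dataset is to predict, from passenger information, whether a passenger survived after the Titanic struck an iceberg and sank.

Structured data is usually preprocessed with the DataFrame from Pandas.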

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import tensorflow as tf 
from tensorflow.keras import models,layers

dftrain_raw = pd.read_csv('./data/titanic/train.csv')
dftest_raw = pd.read_csv('./data/titanic/test.csv')
dftrain_raw.head(10)


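Field descriptions:

Survived: 0 means died, 1 means survived [y label]
Pclass: ticket class held by the passenger, three values (1,2,3) [convert to one-hot encoding]
Name: passenger name [dropped]
Sex: passenger sex [convert to a bool feature]
Age: passenger age (has missing values) [numeric feature; add "is age missing" as an auxiliary feature]
SibSp: number of siblings/spouses of the passenger (integer) [numeric feature]
Parch: number of parents/children of the passenger (integer) [numeric feature]
Ticket: ticket number (string) [dropped]
Fare: price of the passenger's ticket (float, ranging from 0 to 500) [numeric feature]
Cabin: cabin of the passenger (has missing values) [add "is cabin missing" as an auxiliary feature]
Embarked: port of embarkation: S, C, Q (has missing values) [convert to one-hot encoding, four dimensions: S, C, Q, nan]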
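
The two encodings named in the field list, an "is missing" flag for a numeric field and a one-hot encoding with its own column for missing categories, can be tried on a small table first. The four-row DataFrame below is hypothetical and its values are invented; only the pandas calls mirror the approach described above.

```python
import numpy as np
import pandas as pd

# Hypothetical four-passenger table; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Embarked": ["S", np.nan, "C", "Q"],
})

# Numeric feature with a companion "is missing" flag.
features = pd.DataFrame()
features["Age"] = toy["Age"].fillna(0)
features["Age_null"] = toy["Age"].isna().astype("int32")

# One-hot encoding; dummy_na=True adds a separate column for missing values.
embarked = pd.get_dummies(toy["Embarked"], dummy_na=True, dtype="int32")
embarked.columns = ["Embarked_" + str(c) for c in embarked.columns]
features = pd.concat([features, embarked], axis=1)

print(features)
```

Filling the missing age with 0 would be ambiguous on its own; the Age_null flag is what lets a model tell a filled value from a real one.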

With the data visualization features of Pandas we can easily carry out exploratory data analysis (EDA).

Label distribution

%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw['Survived'].value_counts().plot(kind = 'bar',
     figsize = (12,8),fontsize=15,rot = 0)
ax.set_ylabel('Counts',fontsize = 15)
ax.set_xlabel('Survived',fontsize = 15)
plt.show()


Age distribution

%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw['Age'].plot(kind = 'hist',bins = 20,color= 'purple',
                    figsize = (12,8),fontsize=15)

ax.set_ylabel('Frequency',fontsize = 15)
ax.set_xlabel('Age',fontsize = 15)
plt.show()


Correlation between age and the label

%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw.query('Survived == 0')['Age'].plot(kind = 'density',
                      figsize = (12,8),fontsize=15)
dftrain_raw.query('Survived == 1')['Age'].plot(kind = 'density',
                      figsize = (12,8),fontsize=15)
ax.legend(['Survived==0','Survived==1'],fontsize = 12)
ax.set_ylabel('Density',fontsize = 15)
ax.set_xlabel('Age',fontsize = 15)
plt.show()


The actual data preprocessing follows.

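def preprocessing(dfdata):
    """Build a purely numeric feature table from the raw titanic DataFrame."""

    dfresult= pd.DataFrame()

    # Pclass: one-hot encode the ticket class.
    # dtype='int32' keeps the dummy columns numeric (recent pandas versions default to bool).
    dfPclass = pd.get_dummies(dfdata['Pclass'],dtype='int32')
    dfPclass.columns = ['Pclass_' +str(x) for x in dfPclass.columns ]
    dfresult = pd.concat([dfresult,dfPclass],axis = 1)

    # Sex: one indicator column per value
    dfSex = pd.get_dummies(dfdata['Sex'],dtype='int32')
    dfresult = pd.concat([dfresult,dfSex],axis = 1)

    # Age: fill missing values with 0 and flag them in Age_null
    dfresult['Age'] = dfdata['Age'].fillna(0)
    dfresult['Age_null'] = pd.isna(dfdata['Age']).astype('int32')

    # SibSp, Parch, Fare: numeric features used as they are
    dfresult['SibSp'] = dfdata['SibSp']
    dfresult['Parch'] = dfdata['Parch']
    dfresult['Fare'] = dfdata['Fare']

    # Cabin: keep only whether the value is missing
    dfresult['Cabin_null'] =  pd.isna(dfdata['Cabin']).astype('int32')

    # Embarked: one-hot encode, with an extra column for missing values
    dfEmbarked = pd.get_dummies(dfdata['Embarked'],dummy_na=True,dtype='int32')
    dfEmbarked.columns = ['Embarked_' + str(x) for x in dfEmbarked.columns]
    dfresult = pd.concat([dfresult,dfEmbarked],axis = 1)

    return dfresult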

x_train = preprocessing(dftrain_raw)
y_train = dftrain_raw['Survived'].values

x_test = preprocessing(dftest_raw)
y_test = dftest_raw['Survived'].values

print("x_train.shape =", x_train.shape )
print("x_test.shape =", x_test.shape )

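
Printing both shapes is a useful check because the train and test feature tables need identical columns. One-hot encoding can quietly produce different columns when a category is missing from one of the files. Whether that happens with the titanic files depends on the data; the sketch below uses invented values and hypothetical variable names to show one possible guard, reindexing the test table to the training columns.

```python
import pandas as pd

# Hypothetical stand-ins for one train and one test column; values are invented.
train_col = pd.Series(["S", "C", "Q", "S"], name="Embarked")
test_col = pd.Series(["S", "S", "C"], name="Embarked")  # no "Q" here

x_train_toy = pd.get_dummies(train_col, prefix="Embarked", dtype="int32")
x_test_toy = pd.get_dummies(test_col, prefix="Embarked", dtype="int32")
print(list(x_test_toy.columns))  # the "Embarked_Q" column is absent

# Reuse the training columns, filling categories unseen in the test data with 0.
x_test_toy = x_test_toy.reindex(columns=x_train_toy.columns, fill_value=0)
print(list(x_test_toy.columns))
```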

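II. Define the model

The Keras interface offers three ways to build a model: using Sequential to stack layers in order, using the functional API to build a model of arbitrary structure, and subclassing the Model base class to build a custom model.

Here we choose the simplest one, the layer-by-layer Sequential model.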

tf.keras.backend.clear_session()

model = models.Sequential()
model.add(layers.Dense(20,activation = 'relu',input_shape=(15,)))
model.add(layers.Dense(10,activation = 'relu' ))
model.add(layers.Dense(1,activation = 'sigmoid' ))

model.summary()

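
For comparison, the other two construction styles might look as follows for the same 15-20-10-1 network. This is a sketch rather than the article's own code; the class name TitanicNet and the variable names are hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Functional API: wire layers together as function calls on tensors.
inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
hidden = layers.Dense(10, activation="relu")(hidden)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
functional_model = models.Model(inputs=inputs, outputs=outputs)


# Subclassing: create layers in __init__ and define the forward pass in call.
class TitanicNet(models.Model):  # hypothetical class name
    def __init__(self):
        super().__init__()
        self.dense1 = layers.Dense(20, activation="relu")
        self.dense2 = layers.Dense(10, activation="relu")
        self.dense3 = layers.Dense(1, activation="sigmoid")

    def call(self, x):
        return self.dense3(self.dense2(self.dense1(x)))


subclassed_model = TitanicNet()
# A subclassed model creates its weights on first call, so feed a dummy batch.
_ = subclassed_model(tf.zeros((1, 15)))

print(functional_model.count_params(), subclassed_model.count_params())
```

Both variants hold the same number of parameters as the Sequential model; they differ only in how much control they give over the wiring.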

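III. Train the model

There are usually three ways to train a model: the built-in fit method, the built-in train_on_batch method, and a custom training loop. Here we choose the built-in fit method, which is both the most common and the simplest.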

# For a binary classification problem, use the binary cross-entropy loss
model.compile(optimizer='adam',
            loss='binary_crossentropy',
            metrics=['AUC'])

history = model.fit(x_train,y_train,
                    batch_size= 64,
                    epochs= 30,
                    validation_split=0.2 # hold out part of the training data for validation
                   )

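
As a sketch of the second option, train_on_batch, the loop below slices the data into batches by hand and performs one update per batch. The synthetic features, the made-up label rule, the smaller network, and the variable names are assumptions for illustration, not the titanic data.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Synthetic stand-in for the preprocessed features (15 columns, like x_train).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 15)).astype("float32")
y = (x[:, :1] > 0).astype("float32")  # made-up label rule, shape (256, 1)

inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

batch_size = 64
batch_losses = []
for epoch in range(3):
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        # One gradient update on this batch; the return value starts with the loss.
        result = model.train_on_batch(xb, yb)
        batch_losses.append(float(np.ravel(result)[0]))
    print("epoch", epoch + 1, "last batch loss", batch_losses[-1])
```

Compared with fit, this gives control over what happens between batches, but shuffling, validation, and logging have to be written by hand.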

IV. Evaluate the model

We first evaluate how the model performs on the training set and the validation set.

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib.pyplot as plt

def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation '+ metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_"+metric, 'val_'+metric])
    plt.show()

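
plot_metric reads history.history[metric] and history.history['val_'+metric], so the metric argument has to match a recorded key. The key for a metric depends on how the metric was named and may vary between TensorFlow versions, so it can be worth printing the keys before plotting. The sketch below does that on synthetic data; the data, the small network, and the explicit metric name "auc" are assumptions made for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Synthetic stand-in data with 15 feature columns.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 15)).astype("float32")
y = (x[:, :1] > 0).astype("float32")

inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inputs, outputs)

# Naming the metric explicitly fixes the key that appears in history.history.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
history = model.fit(x, y, batch_size=64, epochs=2,
                    validation_split=0.2, verbose=0)

# Every training key has a "val_" twin because validation_split was set.
recorded = sorted(history.history.keys())
print(recorded)
```

With these keys, calls such as plot_metric(history, "loss") and plot_metric(history, "auc") would draw the training and validation curves.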

Next we look at how the model performs on the test set.
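
A minimal sketch of a test-set evaluation with the built-in evaluate method, which returns the loss followed by the compiled metrics. Synthetic data stands in for x_test and y_test here; the split sizes, the small network, and the variable names are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Synthetic stand-in data, split into a training part and a held-out part.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 15)).astype("float32")
y = (x[:, :1] > 0).astype("float32")
x_train_toy, y_train_toy = x[:200], y[:200]
x_test_toy, y_test_toy = x[200:], y[200:]

inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
model.fit(x_train_toy, y_train_toy, batch_size=64, epochs=5, verbose=0)

# evaluate runs a forward pass only; no weights are updated.
results = model.evaluate(x_test_toy, y_test_toy, verbose=0)
test_loss, test_auc = results
print("test loss:", float(test_loss), "test AUC:", float(test_auc))
```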

V. Use the model
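
One plausible way to use a trained model is to call predict for survival probabilities and threshold them to get class labels. The sketch below assumes a 0.5 threshold and uses an untrained stand-in network with random inputs, so the numbers themselves carry no meaning.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Untrained stand-in with the same input width as the titanic features.
inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inputs, outputs)

# Five hypothetical passengers, already preprocessed into 15 numeric columns.
x_new = np.random.default_rng(0).normal(size=(5, 15)).astype("float32")

# The sigmoid output is a probability of the positive class (Survived == 1).
probs = model.predict(x_new, verbose=0)
# Assumed decision threshold of 0.5.
labels = (probs > 0.5).astype("int32")

print(probs.ravel())
print(labels.ravel())
```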

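VI. Save the model

You can save the model the Keras way or the native TensorFlow way. The former is only suitable for restoring the model in a Python environment, while the latter allows cross-platform model deployment.

The latter is the recommended way to save.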

1. Saving the Keras way
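
A minimal sketch of Keras-style saving, assuming the HDF5 file format: model.save writes the architecture and weights to one file, and load_model rebuilds the model from it. The file name, the temporary directory, and the untrained stand-in network are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Untrained stand-in network with the titanic input width.
inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inputs, outputs)

x_new = np.random.default_rng(0).normal(size=(4, 15)).astype("float32")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keras_model.h5")  # hypothetical file name
    # Save architecture and weights together in a single HDF5 file.
    model.save(path)
    # Rebuild the model from that file as a fresh Python object.
    restored = models.load_model(path)
    same_output = np.allclose(model.predict(x_new, verbose=0),
                              restored.predict(x_new, verbose=0))

print("restored model reproduces the predictions:", same_output)
```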

2. Saving the native TensorFlow way
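
A minimal sketch of native saving, assuming the SavedModel format written by tf.saved_model.save. This is one of several ways to produce a SavedModel and is not necessarily the call the original walkthrough used. The directory name and the untrained stand-in network are hypothetical.

```python
import os
import tempfile

import tensorflow as tf
from tensorflow.keras import layers, models

# Untrained stand-in network with the titanic input width.
inputs = tf.keras.Input(shape=(15,))
hidden = layers.Dense(20, activation="relu")(inputs)
outputs = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inputs, outputs)

with tempfile.TemporaryDirectory() as tmp_dir:
    export_dir = os.path.join(tmp_dir, "tf_model_savedmodel")  # hypothetical name
    # Write the graph and the variables as a SavedModel directory.
    tf.saved_model.save(model, export_dir)
    saved_files = sorted(os.listdir(export_dir))
    # Load it back without needing the Python code that defined the model.
    restored = tf.saved_model.load(export_dir)
    signature_names = list(restored.signatures.keys())

print(saved_files)
print(signature_names)
```

The resulting directory holds a serialized graph (saved_model.pb) plus the variables, which is what lets serving tools outside Python consume it.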