Training a DNN on Libsvm Data: From Keras to Estimator

Background

Have a dataset in Libsvm format, already tried LR and GBDT on it, and want a quick look at how a DNN performs? Then this article is for you.

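Although the boom in deep learning research and applications has lasted for years and TensorFlow has long been familiar to algorithm engineers, not everyone is fluent with the tool. Getting a simple DNN model running on your own dataset is not a matter of minutes, especially when the dataset is in Libsvm format. Libsvm is a common format in machine learning, supported by many tools including Liblinear, XGBoost, LightGBM, ytk-learn and xlearn, yet neither the official TensorFlow project nor the community seems to offer an elegant solution for it. This causes a lot of inconvenience for newcomers and is a pity for such a widely used tool. To fill the gap, this article provides a well-tested solution (some code), which should save newcomers some time.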

Introduction

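The code in this article can be used to:

Quickly check how a Libsvm dataset performs with a DNN, for comparison with linear or tree models and to explore the limits of the model.
Reduce the dimensionality of high-dimensional features: the output of the first hidden layer can be taken as an embedding and added to other training processes.
Get started as a beginner and learn how to use TensorFlow Keras, Estimator and Dataset.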

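The code was written according to the following principles:

Avoid reinventing the wheel: use official code, or other code widely recognized as best-performing, unless there is no alternative.
Keep the code as compact as possible.
Aim for the best possible time and space complexity.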

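This article only covers the most basic DNN multi-class training and evaluation code. For more advanced and complex models, see excellent open-source projects such as DeepCTR. A later article will share how those complex models were applied in actual surveys.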

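Below are four progressively refined versions of TensorFlow code for training a DNN on Libsvm data, together with the reasoning behind each. The last two are recommended.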

Keras generator

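Three choices come up here:

TensorFlow API: building a DNN with TensorFlow is easy for experienced users, and the low-level API gets one built right away, but the code is somewhat messy. The high-level Keras API is much friendlier: the code is extremely compact and clear at a glance.
Libsvm data reading: hand-writing a reader that parses the sparse Libsvm encoding and converts it to dense is easy, but sklearn already provides load_svmlight_file, so why not use it? The function reads the entire file into memory, which is not a problem for small datasets.
fit vs. fit_generator: Keras model training only accepts dense input, whereas Libsvm is a sparse encoding. If the dataset is not too large, reading it all into memory with load_svmlight_file is acceptable, but converting everything to dense before passing it to fit may exhaust memory. The ideal approach is to read only as much as is needed and convert after reading. To save effort, the code here first loads everything with load_svmlight_file and keeps it in sparse form, then converts it batch by batch and feeds it to fit_generator.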
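
The "read only as much as is needed" option from the last item can be sketched with a plain Python generator. This is a minimal illustration, not the code used in the survey: the names parse_line and iter_libsvm_batches are hypothetical, the file is assumed to hold one "label id:value ..." sample per line with 1-based feature ids, and plain lists stand in for NumPy arrays.

```python
import itertools


def parse_line(line, feature_len):
    """Turn one 'label id:value ...' line into (dense_row, label)."""
    fields = line.split()
    label = int(float(fields[0]))
    row = [0.0] * feature_len
    for item in fields[1:]:
        feat_id, value = item.split(':')
        row[int(feat_id) - 1] = float(value)  # ids are assumed to be 1-based
    return row, label


def iter_libsvm_batches(path, feature_len, batch_size):
    """Yield (X_batch, y_batch) lazily; only one batch is dense at a time."""
    with open(path) as f:
        lines = (ln for ln in f if ln.strip())  # skip blank lines
        while True:
            chunk = list(itertools.islice(lines, batch_size))
            if not chunk:
                return  # end of file reached
            parsed = [parse_line(ln, feature_len) for ln in chunk]
            X_batch = [row for row, _ in parsed]
            y_batch = [label for _, label in parsed]
            yield X_batch, y_batch
```

Unlike a generator meant for fit_generator, this sketch stops at the end of the file; wrapping the call in an outer endless loop would make it cycle over the data.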

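The code for the approach actually used (load everything with load_svmlight_file, then feed batches to fit_generator) is as follows: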

import numpy as np
from sklearn.datasets import load_svmlight_file
from tensorflow import keras
import tensorflow as tf

feature_len = 100000 # feature dimension; can be replaced with X_train.shape[1] when used below
n_epochs = 1
batch_size = 256
train_file_path = './data/train_libsvm.txt'
test_file_path = './data/test_libsvm.txt'

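def batch_generator(X_data, y_data, batch_size):
    # Number of batches per pass, counting a trailing partial batch
    number_of_batches = int(np.ceil(X_data.shape[0] / float(batch_size)))
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size*counter:batch_size*(counter+1)]
        # Densify only the current batch to keep memory usage low
        X_batch = X_data[index_batch,:].toarray()
        y_batch = y_data[index_batch]
        counter += 1
        yield X_batch, y_batch
        # Start over once the whole dataset has been yielded
        # (>= avoids an empty batch when the size divides evenly)
        if counter >= number_of_batches:
            counter = 0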

def create_keras_model(feature_len):
    model = keras.Sequential([
        # hidden layers can be added here
        keras.layers.Dense(64, input_shape=[feature_len], activation=tf.nn.tanh),
        keras.layers.Dense(6, activation=tf.nn.softmax)
    ])
    model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
    return model

if __name__ == "__main__":
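    X_train, y_train = load_svmlight_file(train_file_path)
    # Force the test matrix to the training feature dimension so the shapes match
    X_test, y_test = load_svmlight_file(test_file_path, n_features=X_train.shape[1])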

    keras_model = create_keras_model(X_train.shape[1])

    # Round the step counts up so the trailing partial batch is not skipped
    keras_model.fit_generator(generator=batch_generator(X_train, y_train, batch_size = batch_size),
                    steps_per_epoch=int(np.ceil(X_train.shape[0] / float(batch_size))),
                    epochs=n_epochs)
    
    test_loss, test_acc = keras_model.evaluate_generator(generator=batch_generator(X_test, y_test, batch_size = batch_size),
                    steps=int(np.ceil(X_test.shape[0] / float(batch_size))))
    print('Test accuracy:', test_acc)

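The above is the code used in an earlier survey, and it was enough for the training task at the time. Its shortcomings are obvious, though. First, its space complexity is poor: a large dataset kept resident in memory affects other processes, and a truly large dataset cannot be handled at all. Second, its usability is poor: batching has to be hand-coded in batch_generator, which takes time to debug and is error-prone.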
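
Manual batching is easy to get wrong at the boundaries, for instance when the sample count is an exact multiple of the batch size. The hypothetical helper batch_bounds below is a small sketch of the index arithmetic, using ceiling division so that a trailing partial batch is kept and no empty batch is produced. It is an illustration only, not part of the original code.

```python
def batch_bounds(n_samples, batch_size):
    """Return the (start, stop) row range of each batch in one pass."""
    n_batches = (n_samples + batch_size - 1) // batch_size  # ceiling division
    return [(i * batch_size, min((i + 1) * batch_size, n_samples))
            for i in range(n_batches)]


print(batch_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
print(batch_bounds(8, 4))   # [(0, 4), (4, 8)]
```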

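TensorFlow Dataset is the ideal solution. However, I was not familiar with Dataset at the time, nor did I know how to parse Libsvm with low-level TF APIs and convert a SparseTensor into a dense Tensor, so with limited time the idea was shelved. The problem was solved only later; the key is the decode_libsvm function in the code below.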

Once Libsvm data is converted into a Dataset, the DNN is unlocked and can run on datasets of any size.

The following sections show Dataset applied in turn to a Keras model, Keras to Estimator, and DNNClassifier.

Before that, here is the embedding code, which takes the output of the first hidden layer as the embedding:

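from sklearn.datasets import load_svmlight_file
from tensorflow.keras.models import Model, load_model

def save_output_file(output_array, filename):
    # Write one comma-separated row per sample
    result = list()
    for row_data in output_array:
        line = ','.join([str(x) for x in row_data.tolist()])
        result.append(line)
    with open(filename,'w') as fw:
        fw.write('%s' % '\n'.join(result))

model = load_model('./dnn_onelayer_tanh.model')
# Fix the feature dimension to the model's input size so it matches training
X_test, y_test = load_svmlight_file("./data/test_libsvm.txt", n_features=model.input_shape[1])
# Sub-model that stops at the first hidden layer
dense1_layer_model = Model(inputs=model.input, outputs=model.layers[0].output)
# Convert the sparse matrix to dense before predicting (acceptable for a small test set)
dense1_output = dense1_layer_model.predict(X_test.toarray())
save_output_file(dense1_output, './hidden_output/hidden_output_test.txt')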
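
To add the saved embedding to another training process, the file has to be read back. A minimal sketch, assuming the comma-separated layout written by save_output_file above; load_output_file is a hypothetical helper name, not part of the original code.

```python
def load_output_file(filename):
    """Read rows of comma-separated floats back into a list of lists."""
    rows = []
    with open(filename) as fr:
        for line in fr:
            line = line.strip()
            if line:  # ignore blank lines
                rows.append([float(x) for x in line.split(',')])
    return rows
```

Each returned row is the hidden-layer output for one sample, in the same order as the samples in the source Libsvm file.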

Keras Dataset

Here the Libsvm data reading is switched from load_svmlight_file to a Dataset with decode_libsvm.

import numpy as np
from sklearn.datasets import load_svmlight_file
from tensorflow import keras
import tensorflow as tf

feature_len = 138830
n_epochs = 1
batch_size = 256
train_file_path = './data/train_libsvm.txt'
test_file_path = './data/test_libsvm.txt'

def decode_libsvm(line):
    columns = tf.string_split([line], ' ')
    labels = tf.string_to_number(columns.values[0], out_type=tf.int32)
    labels = tf.reshape(labels,[-1])
    splits = tf.string_split(columns.values[1:], ':')
    id_vals = tf.reshape(splits.values,splits.dense_shape)
    feat_ids, feat_vals = tf.split(id_vals,num_or_size_splits=2,axis=1)
    feat_ids = tf.string_to_number(feat_ids, out_type=tf.int64)
    feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
    # Libsvm feature ids start at 1, so subtract 1 from feat_ids here
    sparse_feature = tf.SparseTensor(feat_ids-1, tf.reshape(feat_vals,[-1]), [feature_len])
    dense_feature = tf.sparse.to_dense(sparse_feature)
    return dense_feature, labels

def create_keras_model():
    model = keras.Sequential([
        keras.layers.Dense(64, input_shape=[feature_len], activation=tf.nn.tanh),
        keras.layers.Dense(6, activation=tf.nn.softmax)
    ])
    model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
    return model

if __name__ == "__main__":
    dataset_train = tf.data.TextLineDataset([train_file_path]).map(decode_libsvm).batch(batch_size).repeat()
    dataset_test = tf.data.TextLineDataset([test_file_path]).map(decode_libsvm).batch(batch_size).repeat()

    keras_model = create_keras_model()

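    sample_size = 10000 # fit requires steps_per_epoch here, so the sample count must be known in advance
    keras_model.fit(dataset_train, steps_per_epoch=int(sample_size/batch_size), epochs=n_epochs)
    
    # NOTE: the same sample_size is reused for the test set; use the test file's own count if it differs
    test_loss, test_acc = keras_model.evaluate(dataset_test, steps=int(sample_size/batch_size))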
    print('Test accuracy:', test_acc)

This solves the high space complexity: data streams in and out without occupying much memory.
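
For readers new to the TF string ops, the steps inside decode_libsvm can be mirrored in plain Python. The sketch below only illustrates the logic (the name decode_libsvm_py is hypothetical and TensorFlow is not involved): split the line on spaces, split each remaining field on ':', shift the 1-based ids to 0-based, and scatter the values into a zero vector of length feature_len.

```python
def decode_libsvm_py(line, feature_len):
    columns = line.split(' ')                    # like tf.string_split([line], ' ')
    label = int(columns[0])                      # first field is the class label
    id_vals = [field.split(':') for field in columns[1:]]  # [[id, value], ...]
    feat_ids = [int(i) - 1 for i, _ in id_vals]  # Libsvm ids start at 1
    feat_vals = [float(v) for _, v in id_vals]
    dense_feature = [0.0] * feature_len          # plays the role of tf.sparse.to_dense
    for idx, val in zip(feat_ids, feat_vals):
        dense_feature[idx] = val
    return dense_feature, [label]


print(decode_libsvm_py('3 2:0.5 5:1.5', 6))
# ([0.0, 0.5, 0.0, 0.0, 1.5, 0.0], [3])
```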

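However, two usability inconveniences remain:

Keras fit requires steps_per_epoch. To make sure each epoch covers the whole dataset, the sample size has to be computed in advance, which is unreasonable, since the Dataset's repeat could already guarantee this. Estimator has no such requirement to specify steps_per_epoch.
The feature dimension feature_len has to be computed in advance. Because Libsvm is a sparse encoding, the feature dimension cannot be inferred from one or a few lines. It can be obtained offline with load_svmlight_file as feature_len=X_train.shape[1] and then hard-coded. This is inherent to Libsvm and there is no way around it.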
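
Both numbers can also be gathered in one streaming pass over the file, without loading it into memory. The sketch below is an illustration under stated assumptions: scan_libsvm is a hypothetical helper, every non-blank line is taken to be one sample, and feature ids are assumed to be 1-based so that the largest id equals the feature dimension.

```python
def scan_libsvm(path):
    """Return (sample_size, feature_len) from a single pass over the file."""
    sample_size = 0
    max_feat_id = 0
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            sample_size += 1
            for item in fields[1:]:
                # each item looks like 'id:value'; only the id matters here
                max_feat_id = max(max_feat_id, int(item.split(':')[0]))
    return sample_size, max_feat_id
```

If the training and test files are scanned separately, the larger of the two feature dimensions should be used as feature_len for both.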

Keras model to Estimator

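Another high-level TensorFlow API is Estimator. It is more flexible, reportedly lets single-machine and distributed code stay the same without concern for the underlying hardware, and combines fairly easily with distributed scheduling frameworks (e.g. xlearning). In practice, Estimator was also found to get more complete platform support than Keras.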

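Estimator is a high-level API independent of Keras. If existing code uses Keras and cannot be fully refactored into Estimator right away, TF also provides the model_to_estimator interface for Keras, so the benefits of Estimator are still available.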

from tensorflow import keras
import tensorflow as tf
from tensorflow.python.platform import tf_logging
# Turn on Estimator logging so that progress is printed during training
tf_logging.set_verbosity('INFO')

feature_len = 100000
n_epochs = 1
batch_size = 256
train_file_path = './data/train_libsvm.txt'
test_file_path = './data/test_libsvm.txt'

# Note the extra parameter input_name here; the return value also differs from the version above
def decode_libsvm(line, input_name):
    columns = tf.string_split([line], ' ')
    labels = tf.string_to_number(columns.values[0], out_type=tf.int32)
    labels = tf.reshape(labels,[-1])
    splits = tf.string_split(columns.values[1:], ':')
    id_vals = tf.reshape(splits.values,splits.dense_shape)
    feat_ids, feat_vals = tf.split(id_vals,num_or_size_splits=2,axis=1)
    feat_ids = tf.string_to_number(feat_ids, out_type=tf.int64)
    feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
    sparse_feature = tf.SparseTensor(feat_ids-1, tf.reshape(feat_vals,[-1]),[feature_len])
    dense_feature = tf.sparse.to_dense(sparse_feature)
    return {input_name: dense_feature}, labels

def input_train(input_name):
    # A lambda is used so that map can pass decode_libsvm arguments other than line
    return tf.data.TextLineDataset([train_file_path]).map(lambda line: decode_libsvm(line, input_name)).batch(batch_size).repeat(n_epochs).make_one_shot_iterator().get_next()

def input_test(input_name):
    # Read the test file here, not the training file
    return tf.data.TextLineDataset([test_file_path]).map(lambda line: decode_libsvm(line, input_name)).batch(batch_size).make_one_shot_iterator().get_next()

def create_keras_model(feature_len):
    model = keras.Sequential([
        # hidden layers can be added here
        keras.layers.Dense(64, input_shape=[feature_len], activation=tf.nn.tanh),
        keras.layers.Dense(6, activation=tf.nn.softmax)
    ])
    model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
    return model

def create_keras_estimator(feature_len):
    model = create_keras_model(feature_len)
    # The Estimator's input_fn must use this name as its feature dict key
    input_name = model.input_names[0]
    estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
    return estimator, input_name

if __name__ == "__main__":
    keras_estimator, input_name = create_keras_estimator(feature_len)
    keras_estimator.train(input_fn=lambda:input_train(input_name))
    # Evaluate on the test set rather than the training set
    eval_result = keras_estimator.evaluate(input_fn=lambda:input_test(input_name))
    print(eval_result)

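sample_size is no longer needed here, but feature_len still has to be computed in advance. Note that the dict keys returned by the Estimator's input_fn must match the model's input names; the value is passed through input_name here.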

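Many people use Keras, and many open-source projects build complex models with it. Because the Keras model format is special, some platforms do not support saving it but do support saving Estimator models. In that case model_to_estimator is a convenient way to save a Keras model.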

DNNClassifier

Finally, use TensorFlow's pre-made Estimator directly: DNNClassifier.

import tensorflow as tf
from tensorflow.python.platform import tf_logging
# Turn on Estimator logging so that progress is printed during training
tf_logging.set_verbosity('INFO')

feature_len = 100000
n_epochs = 1
batch_size = 256
train_file_path = './data/train_libsvm.txt'
test_file_path = './data/test_libsvm.txt'

def decode_libsvm(line, input_name):
    columns = tf.string_split([line], ' ')
    labels = tf.string_to_number(columns.values[0], out_type=tf.int32)
    labels = tf.reshape(labels,[-1])
    splits = tf.string_split(columns.values[1:], ':')
    id_vals = tf.reshape(splits.values,splits.dense_shape)
    feat_ids, feat_vals = tf.split(id_vals,num_or_size_splits=2,axis=1)
    feat_ids = tf.string_to_number(feat_ids, out_type=tf.int64)
    feat_vals = tf.string_to_number(feat_vals, out_type=tf.float32)
    sparse_feature = tf.SparseTensor(feat_ids-1,tf.reshape(feat_vals,[-1]),[feature_len])
    dense_feature = tf.sparse.to_dense(sparse_feature)
    return {input_name: dense_feature}, labels

def input_train(input_name):
    return tf.data.TextLineDataset([train_file_path]).map(lambda line: decode_libsvm(line, input_name)).batch(batch_size).repeat(n_epochs).make_one_shot_iterator().get_next()

def input_test(input_name):
    # Read the test file here, not the training file
    return tf.data.TextLineDataset([test_file_path]).map(lambda line: decode_libsvm(line, input_name)).batch(batch_size).make_one_shot_iterator().get_next()

def create_dnn_estimator():
    input_name = "dense_input"
    feature_columns = tf.feature_column.numeric_column(input_name, shape=[feature_len])
    estimator = tf.estimator.DNNClassifier(hidden_units=[64],
                                           n_classes=6,
                                           feature_columns=[feature_columns])
    return estimator, input_name

if __name__ == "__main__":
    dnn_estimator, input_name = create_dnn_estimator()
    dnn_estimator.train(input_fn=lambda:input_train(input_name))

    eval_result = dnn_estimator.evaluate(input_fn=lambda:input_test(input_name))
    print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

Estimator code is logically clear, easy to use and powerful. For more on Estimator, see the official documentation; it is not repeated here.

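Of the approaches above, all except the first, which does not handle large data well, can run on a single machine. The network structure, objective function and so on can be modified as needed.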

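The code in this article came out of a survey and took several hours of debugging; once the survey was finished the code sat idle. It is shared here, rough as it is, in the hope that it helps others.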
