
What are training set, validation set and test set?

    These three terms appear constantly in the machine learning literature, yet many people are not entirely clear about what they mean, and the latter two in particular are often used interchangeably. Ripley, B.D. (1996) defines all three in his classic monograph Pattern Recognition and Neural Networks.

Training set: A set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier.

Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.

Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.

Clearly, the training set is used to train the model, i.e., to determine its parameters, such as the weights of an ANN; the validation set is used for model selection, i.e., for the final tuning and specification of the model, such as the architecture of an ANN; and the test set serves purely to measure the generalization ability of the fully trained model. Of course, the test set cannot guarantee that the model is correct; it only indicates that similar data will produce similar results under this model.
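As a concrete illustration (not from the original text), here is a minimal sketch of a three-way split in Python with NumPy; the 60/20/20 ratio and the name three_way_split are illustrative choices, not anything the post prescribes.

```python
import numpy as np

def three_way_split(X, y, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the sample, then split it into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order so each subset is representative
    n_train = int(ratios[0] * len(X))
    n_val = int(ratios[1] * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```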

但實際應用中,一般隻将資料集分成兩類,即training set 和test set,大多數文章并不涉及validation set。 Ripley還談到了Why separate test and validation sets?

1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.

2. After assessing the final model with the test set, YOU MUST NOT tune the model any further.
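Point 1 can be seen in a toy simulation (my illustration, not from Ripley): twenty "models" that all guess at random have a true error rate of exactly 0.5, yet the validation error of the model selected by minimum validation error comes out below 0.5, while a fresh test set gives an unbiased estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_val, n_test = 20, 100, 100

# twenty 'models' that only guess: the true error rate of each is exactly 0.5
val_errors = rng.binomial(n_val, 0.5, size=n_models) / n_val
best = np.argmin(val_errors)                     # model selection on the validation set
test_error = rng.binomial(n_test, 0.5) / n_test  # fresh data for the chosen model

print(f"validation error of the chosen model: {val_errors[best]:.2f}")  # biased low
print(f"test error of the chosen model:       {test_error:.2f}")        # close to 0.50
```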

It is rarely useful to have a NN simply memorize a set of data, since memorization can be done much more efficiently by numerous algorithms for table look-up. Typically, you want the NN to be able to perform accurately on new data, that is, to generalize.

There seems to be no term in the NN literature for the set of all cases that you want to be able to generalize to. Statisticians call this set the "population". Tsypkin (1971) called it the "grand truth distribution," but this term has never caught on. Neither is there a consistent term in the NN literature for the set of cases that are available for training and evaluating an NN. Statisticians call this set the "sample". The sample is usually a subset of the population. (Neurobiologists mean something entirely different by "population," apparently some collection of neurons, but I have never found out the exact meaning. I am going to continue to use "population" in the statistical sense until NN researchers reach a consensus on some other terms for "population" and "sample"; I suspect this will never happen.)

In NN methodology, the sample is often subdivided into "training", "validation", and "test" sets.

The distinctions among these subsets are crucial, but the terms "validation" and "test" sets are often confused. Bishop (1995), an indispensable reference on neural networks, provides the following explanation (p. 372):

Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.
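Here is a minimal sketch of the hold-out procedure Bishop describes, assuming scikit-learn is available; MLPClassifier stands in for the "various networks", and the candidate hidden-layer sizes are arbitrary illustrative values.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def hold_out_selection(X_tr, y_tr, X_val, y_val, X_te, y_te,
                       candidate_sizes=(5, 10, 20)):
    """Train one network per candidate architecture, select on the
    validation set, and report the chosen network's error on the test set."""
    best_model, best_val_err = None, float("inf")
    for h in candidate_sizes:                              # candidate architectures
        model = MLPClassifier(hidden_layer_sizes=(h,), max_iter=500,
                              random_state=0).fit(X_tr, y_tr)
        val_err = 1.0 - accuracy_score(y_val, model.predict(X_val))
        if val_err < best_val_err:                         # hold-out model selection
            best_model, best_val_err = model, val_err
    # the test set is touched exactly once, only to assess the chosen network
    test_err = 1.0 - accuracy_score(y_te, best_model.predict(X_te))
    return best_model, best_val_err, test_err
```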

And there is no book in the NN literature more authoritative than Ripley (1996), from which the following definitions are taken (p. 354):

Training set: A set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier.

Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.

Test set: A set of examples used only to assess the performance [generalization] of a fully-specified classifier.

The literature on machine learning often reverses the meaning of "validation" and "test" sets. This is the most blatant example of the terminological confusion that pervades artificial intelligence research. The crucial point is that a test set, by the standard definition in the NN literature, is never used to choose among two or more networks, so that the error on the test set provides an unbiased estimate of the generalization error (assuming that the test set is representative of the population, etc.). Any data set that is used to choose the best of two or more networks is, by definition, a validation set, and the error of the chosen network on the validation set is optimistically biased.

There is a problem with the usual distinction between training and validation sets. Some training approaches, such as early stopping, require a validation set, so in a sense, the validation set is used for training. Other approaches, such as maximum likelihood, do not inherently require a validation set. So the "training" set for maximum likelihood might encompass both the "training" and "validation" sets for early stopping.

Greg Heath has suggested the term "design" set be used for cases that are used solely to adjust the weights in a network, while "training" set be used to encompass both design and validation sets. There is considerable merit to this suggestion, but it has not yet been widely adopted.
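To make the early-stopping point concrete, here is a rough sketch (an assumed setup, not from the FAQ) in which the validation set directly steers training; a full implementation would also snapshot and restore the weights from the best epoch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              max_epochs=200, patience=10):
    """The validation set steers training itself: stop once validation
    accuracy has not improved for `patience` epochs."""
    model = MLPClassifier(hidden_layer_sizes=(10,), random_state=0)
    classes = np.unique(y_tr)
    best_score, best_epoch = -np.inf, 0
    for epoch in range(max_epochs):
        model.partial_fit(X_tr, y_tr, classes=classes)  # one pass over the training data
        score = model.score(X_val, y_val)               # monitored on held-out data
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:            # no recent improvement: stop
            break
    return model, best_score
```

In this sense the validation set is part of the training procedure, which is exactly why the FAQ notes that the usual training/validation distinction breaks down for early stopping.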

References:

http://www.cppblog.com/guijie/archive/2008/07/29/57407.html

http://www.faqs.org/faqs/ai-faq/neural-nets/part1/
