8:58 2014-10-09
start Caltech machine learning, video 13:
validation
9:57 2014-10-09
outline:
* validation set
* model selection
* cross validation
10:03 2014-10-09
Validation vs. regularization
Eout(h) = Ein(h) + overfit penalty
regularization estimates "overfit penalty"
validation estimates "Eout(h)"
10:08 2014-10-09
Eval(h) // validation error
// this will be a good estimate of the out-of-sample performance
10:13 2014-10-09
K points are taken out of N
// the validation set is disjoint from the training set
10:18 2014-10-09
K points => validation
N-K points => training
10:18 2014-10-09
Dval, Dtrain
10:22 2014-10-09
small K => g- is trained on almost all N points, so Eout(g) ≈ Eout(g-)
large K => Eval(g-) is a reliable estimate of Eout(g-)
10:26 2014-10-09
why not put the K points back into the original N?
10:26 2014-10-09
we call it validation because we use it to make choices
10:34 2014-10-09
Dval is used to make learning choices
if an estimate of Eout affects learning, the set providing it
is no longer a test set; it is a validation set
10:36 2014-10-09
early stopping
10:36 2014-10-09
this is going up; I'd better stop here
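my own sketch of the idea (not from the lecture); train_step and
eval_on_val are hypothetical callables standing in for one pass of
training on Dtrain and for computing Eval:

def train_with_early_stopping(train_step, eval_on_val, max_epochs=1000, patience=5):
    # keep the epoch with the best validation error; stop once Eval
    # has been going up for `patience` epochs
    best_val, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()               # one pass of training on Dtrain
        e_val = eval_on_val()      # validation error Eval after this pass
        if e_val < best_val:
            best_val, best_epoch = e_val, epoch
        elif epoch - best_epoch >= patience:
            break                  # "this is going up, I'd better stop here"
    return best_epoch, best_val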
10:37 2014-10-09
What is the difference?
* Test set is unbiased;
* validation set has optimistic bias
10:39 2014-10-09
e1 is an unbiased estimate of the out-of-sample error
10:42 2014-10-09
unbiased means the expected value is what it should be
10:42 2014-10-09
Error estimates e1 & e2
Pick h ∈ {h1, h2} with e = min(e1, e2)
what is the expectation of e: E(e)?
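a quick simulation of the answer (my sketch; taking e1, e2 as
independent Uniform[0,1] estimates is my toy assumption, so each
has expectation 0.5):

import random

trials = 100_000
e = [min(random.random(), random.random()) for _ in range(trials)]
print(sum(e) / trials)   # ≈ 1/3 < 0.5, so E(e) < min(E(e1), E(e2))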
10:45 2014-10-09
now we realize that this is an optimistic bias
10:46 2014-10-09
fortunately for us, the bias is slight enough that we're
going to swallow it, given how useful validation is in
machine learning
10:47 2014-10-09
so with this understanding, let's use validation for
model selection, which is what validation sets do
10:48 2014-10-09
the choice of the regularization parameter λ is one manifestation of this
10:48 2014-10-09
Using Dval more than once
10:49 2014-10-09
that's a choice between models
10:50 2014-10-09
they carry a minus (gm-) because I'm training on Dtrain only
10:53 2014-10-09
so these are obtained without any validation, just training
on a reduced set.
10:53 2014-10-09
once I get them, I'm going to evaluate their performance on Dval
10:54 2014-10-09
these are "validation errors"
10:54 2014-10-09
model selection means looking at these errors, which are
supposed to reflect the out-of-sample performance if you
use that model as your final product
10:57 2014-10-09
you pick the smallest of them; now you have a bias
10:57 2014-10-09
now we realize it has an optimistic bias
10:58 2014-10-09
we're now going back to our full data set
10:58 2014-10-09
restore your D as we did before
10:59 2014-10-09
so this is the algorithm for model selection
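the whole procedure as a sketch (my paraphrase; the fit/error
interface is a hypothetical stand-in, not from the lecture):

def select_model(models, X, y, K):
    # split D: N-K points => Dtrain, K points => Dval
    X_train, y_train = X[:-K], y[:-K]
    X_val,   y_val   = X[-K:], y[-K:]
    # train every candidate on Dtrain only, giving the finalists gm-
    finalists = [m.fit(X_train, y_train) for m in models]
    # validation errors Em = Eval(gm-): the model-selection criterion
    e_val = [g.error(X_val, y_val) for g in finalists]
    best = min(range(len(models)), key=lambda i: e_val[i])
    # restore the full D and retrain the winning model on all N points
    return models[best].fit(X, y)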
10:59 2014-10-09
so I'm going to run an experiment to show you the bias
11:00 2014-10-09
not because it has inherently good performance, but
because you looked for the one with good performance
11:01 2014-10-09
validation set size
11:02 2014-10-09
and after that, I look at the actual out-of-sample error
11:03 2014-10-09
I'd like to ask you 2 questions:
* why do the curves go up?
* why do the 2 curves get closer together?
11:06 2014-10-09
because when I use more points for validation, I use fewer
for training, so the trained hypothesis gets worse
11:07 2014-10-09
how much bias there is depends on several factors, but the bias is there
11:11 2014-10-09
I'm using the validation set to estimate Eout
11:12 2014-10-09
the validation set (Dval) is used for "training" the
"finalist" models
11:16 2014-10-09
if you have a decent validation set (size K), then your estimate
will not be that far from Eout (the out-of-sample error);
the deviation shrinks like O(1/√K)
11:25 2014-10-09
so I'm choosing when to stop
11:25 2014-10-09
the training of the network tries to choose the
weights of the network
11:27 2014-10-09
validation error is a reasonable estimate of the
out-of-sample error that we can rely on
11:28 2014-10-09
data contamination:
if you use the data for making choices, you're
contaminating it as far as its ability to estimate
the real performance goes
11:31 2014-10-09
contamination: optimistic (deceptive) bias
11:32 2014-10-09
you're trying to measure the level of contamination
11:33 2014-10-09
we have a great Ein, and we know Ein is no indication
of Eout; the training set has been contaminated to death
11:34 2014-10-09
when you go to the 'test set', this is totally clean,
there is no bias here
11:35 2014-10-09
Ein   // in-sample error: totally contaminated
Etest // test error: totally clean
Eval  // validation error: slightly contaminated
11:36 2014-10-09
the validation set is in between, it's slightly
contaminated.
11:36 2014-10-09
now we go to 'cross validation', a very sweet regime
11:38 2014-10-09
the dilemma about K
11:40 2014-10-09
the fluctuation of the estimate around the quantity we want
11:39 2014-10-09
Eout(g) // g is the hypothesis we're going to report
11:42 2014-10-09
Eout(g-)
// this is the proper out-of-sample error, but for the
// hypothesis trained on a reduced set
11:42 2014-10-09
Eout(g) ≈ Eout(g-) ≈ Eval(g-)
Eout(g) // this is what we want
Eout(g-) // this is unknown to me
Eval(g-) // this is what I'm working with
11:43 2014-10-09
I want K to be small so that: Eout(g) ≈ Eout(g-)
11:45 2014-10-09
but I also want K to be large, so that Eout(g-) ≈ Eval(g-)
11:45 2014-10-09
can we have K both small & large?
11:46 2014-10-09
leave one out, leave more out
11:46 2014-10-09
I'm going to use N-1 points for training,
and 1 point for validation
11:47 2014-10-09
I'm going to create a reduced set from D, called Dn
11:48 2014-10-09
this one (the taken-out point) will be the one I use for validation
11:48 2014-10-09
let's look at the validation error
11:49 2014-10-09
in this case, the validation error is based on just 1 point: en = e(gn-(xn), yn)
11:49 2014-10-09
what happens if I repeat this exercise for each
point n?
11:50 2014-10-09
so even though these are different hypotheses, each of
them comes from the same number of points (N-1), so they
behave similarly
11:53 2014-10-09
I'm going to define the cross-validation error: Ecv = (1/N) Σn en
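a leave-one-out sketch of this (the fit/predict interface and the
squared-error measure are my illustrative assumptions):

def e_cv(model, X, y):
    # Ecv = (1/N) * sum of the en, each measured on the left-out point
    N = len(X)
    total = 0.0
    for n in range(N):
        X_minus = X[:n] + X[n+1:]           # Dn: D with point n taken out
        y_minus = y[:n] + y[n+1:]
        g_n = model.fit(X_minus, y_minus)   # gn- trained on N-1 points
        total += (g_n.predict(X[n]) - y[n]) ** 2   # en = e(gn-(xn), yn)
    return total / N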
11:53 2014-10-09
the catch is that these are not independent;
each of them is affected by the others
11:55 2014-10-09
it's remarkably good at getting it, nonetheless
11:56 2014-10-09
let's just estimate the out-of-sample error
using the cross validation method
11:57 2014-10-09
and we take an average performance of these
as an indication of what will happen out of sample
12:01 2014-10-09
we're training on only 2 points here; when we're done,
we use all 3 points
12:02 2014-10-09
but think of 99 out of 100 points: who cares about the one left out?
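a toy usage of the e_cv sketch above (the 3 data points and the two
models are my own made-up illustration, echoing the lecture's
constant-vs-line comparison):

class ConstantModel:                        # h(x) = b
    def fit(self, X, y):
        m = ConstantModel()
        m.b = sum(y) / len(y)
        return m
    def predict(self, x):
        return self.b

class LinearModel:                          # h(x) = a*x + b (least squares)
    def fit(self, X, y):
        m = LinearModel()
        n = len(X)
        xbar, ybar = sum(X) / n, sum(y) / n
        num = sum((X[i] - xbar) * (y[i] - ybar) for i in range(n))
        den = sum((x - xbar) ** 2 for x in X)
        m.a = num / den
        m.b = ybar - m.a * xbar
        return m
    def predict(self, x):
        return self.a * x + self.b

X, y = [0.0, 1.0, 2.0], [0.5, 1.8, 1.2]     # 3 made-up points
print(e_cv(ConstantModel(), X, y))          # Ecv for the constant model
print(e_cv(LinearModel(), X, y))            # Ecv for the line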
12:02 2014-10-09
so let's use this for model selection
12:02 2014-10-09
model selection using CV // CV == Cross Validation
12:03 2014-10-09
we'd like to find a separating surface
12:07 2014-10-09
Ecv tracks Eout very nicely
12:09 2014-10-09
if I use it as a criterion for model choice
12:10 2014-10-09
let me cut off at 6, and see what the performance is like
// early stopping
12:10 2014-10-09
without validation, I'm using the full model
12:11 2014-10-09
with validation, you stop at 6, because the
cross validation tells you to do so; you get a nice
smooth surface
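so the "cut off at 6" choice is itself CV-based model selection; a
sketch using the e_cv function above (make_model(d) is a hypothetical
constructor for the model with d features/terms):

def choose_complexity(make_model, X, y, d_max=20):
    errors = {d: e_cv(make_model(d), X, y) for d in range(1, d_max + 1)}
    return min(errors, key=errors.get)      # e.g. 6, where Ecv bottoms out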
12:12 2014-10-09
I don't care about the in-sample error going to zero;
that's harmful in some cases
12:12 2014-10-09
so now you can see why validation is seen in this
context as similar to regularization: it does the
same thing, preventing overfitting, but it prevents
overfitting by estimating the out-of-sample error (Eout)
rather than estimating something else
12:16 2014-10-09
we seldom use leave-one-out in real problems
12:18 2014-10-09
take more points for validation
12:18 2014-10-09
Leave more than one out
12:18 2014-10-09
what you do is take your data set
and break it into several folds
12:18 2014-10-09
exactly the same as leave-one-out, except that here
I'm taking out a chunk at a time
12:20 2014-10-09
this is what I recommend to you:
10-fold cross validation
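a 10-fold sketch along the same lines (same hypothetical fit/error
interface as before; leftover points when N isn't divisible by the
fold count are ignored for simplicity):

def e_cv_10fold(model, X, y, folds=10):
    # break D into `folds` chunks; validate on each chunk, train on the rest
    size = len(X) // folds
    total = 0.0
    for i in range(folds):
        lo, hi = i * size, (i + 1) * size
        X_tr, y_tr = X[:lo] + X[hi:], y[:lo] + y[hi:]
        g = model.fit(X_tr, y_tr)
        total += g.error(X[lo:hi], y[lo:hi])    # Eval on the held-out chunk
    return total / folds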
-----------------------------------------------
13:29 2014-10-09
both validation & cross validation have bias
for the same reason