10:05 2014-10-08
start Caltech machine learning, video 11
regularization
10:05 2014-10-08
overfitting: we're fitting the data all too well,
at the expense of the out-of-sample error
10:06 2014-10-08
if you think of what the VC analysis told us:
given the data resources & the complexity of the
hypothesis set, with nothing said about the target,
we can predict the level of generalization as a bound.
10:08 2014-10-08
data resource + VC dimension => level of generalization
10:09 2014-10-08
the source of overfitting is fitting the noise
11:15 2014-10-08
stochastic noise/deterministic noise
11:16 2014-10-08
deterministic noise is a function of the limitations of your model
11:17 2014-10-08
1st cure for overfitting
11:18 2014-10-08
outline:
* Regularization - informal
* Regularization - formal
* Weight decay
* Choosing a regularizer
11:19 2014-10-08
unconstrained solution: minimize Ein //in-sample error
11:37 2014-10-08
let's look at the constrained version: what happens
if we constrain the weights
11:40 2014-10-08
so here is the constraint we're going to work with
11:41 2014-10-08
I have a smaller hypothesis set, so the VC dimension is going
in the direction of being smaller, so I'm in a better position
for generalization.
11:44 2014-10-08
constraining the weights:
* Hard constraint
* Soft-order constraint
11:45 2014-10-08
Wreg instead of Wlin
11:46 2014-10-08
you minimize this subject to the constraint
11:47 2014-10-08
KKT (Karush-Kuhn-Tucker)
11:47 2014-10-08
I have 2 things here: the error surface I'm trying
to minimize, and the constraint.
11:49 2014-10-08
I'm going to put contours where the in-sample error is constant
11:50 2014-10-08
let's take a point on the surface
11:52 2014-10-08
let's look at the gradient of the objective function
11:53 2014-10-08
gradient of the objective function will give me a good
idea about the direction to move in order to minimize the
objective function
11:54 2014-10-08
moving along the circle will change the value of Ein
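in equation form (a sketch of the standard Lagrange/KKT conclusion, not the exact notation on the slides): at the constrained minimum, the gradient of Ein must be normal to the constraint surface and point opposite to w, which is the same as the stationarity condition of an unconstrained "augmented" error:

```latex
% at the constrained minimum (w^T w <= C, constraint active):
\nabla E_{in}(w_{reg}) \propto -\, w_{reg}
\;\;\Longleftrightarrow\;\;
\nabla E_{in}(w_{reg}) + \tfrac{2\lambda}{N}\, w_{reg} = 0
% which is \nabla E_{aug}(w_{reg}) = 0 for the augmented error
E_{aug}(w) = E_{in}(w) + \tfrac{\lambda}{N}\, w^{\mathsf T} w
```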
11:58 2014-10-08
Augmented error: Eaug(w)
12:05 2014-10-08
regularization term
12:06 2014-10-08
I use a subset of the hypothesis set and I expect good
generalization.
12:07 2014-10-08
one-step learning including regularization
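a minimal numpy sketch of that one-step solution with weight decay, assuming linear regression on a feature matrix Z with targets y (function name and setup are illustrative, not from the slides):

```python
import numpy as np

def solve_weight_decay(Z, y, lam):
    """One-step linear regression with weight decay:
    minimizes Ein(w) + (lam/N) * w^T w, which gives
    w_reg = (Z^T Z + lam * I)^(-1) Z^T y."""
    N, d = Z.shape
    A = Z.T @ Z + lam * np.eye(d)
    return np.linalg.solve(A, Z.T @ y)

# lam = 0 recovers the unconstrained solution; a small lam is the "small dose"
# Z = np.random.randn(20, 5); y = np.random.randn(20)
# w_reg = solve_weight_decay(Z, y, lam=0.1)
```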
12:13 2014-10-08
let's apply it and see the results in the real case.
12:14 2014-10-08
so the medicine is working, a small dose of medicine
did the job
12:16 2014-10-08
I think we're overdosing here.
12:16 2014-10-08
if you keep increasing λ => overdose !!!
12:17 2014-10-08
the choice of λ is extremely critical
12:18 2014-10-08
the good news is that this will not be a heuristic choice
12:18 2014-10-08
the choice of λ will be extremely principled, based on validation
12:19 2014-10-08
we went to another extreme: now we're "underfitting"
12:20 2014-10-08
overfitting => underfitting
12:20 2014-10-08
the proper choice of λ is important
12:21 2014-10-08
the most famous regularizer is "weight decay"
12:21 2014-10-08
we know that in neural networks you don't have a neat
closed-form solution; you use gradient descent
12:22 2014-10-08
batch gradient descent => stochastic gradient descent(SGD)
12:23 2014-10-08
I'm in the weight space & this is my weight, and
here is the direction that backpropagation suggests moving in.
12:27 2014-10-08
it used to be that, without regularization, I move from here to here
12:27 2014-10-08
shrinking & moving
12:28 2014-10-08
the weights decay from one step to the next (hence "weight decay")
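a sketch of the corresponding "shrink then move" gradient step (grad_Ein, eta, lam, N are assumed to come from the surrounding training loop):

```python
def weight_decay_step(w, grad_Ein, eta, lam, N):
    """One gradient step with weight decay:
    first shrink the weights, then move along the negative gradient.
    w(t+1) = (1 - 2*eta*lam/N) * w(t) - eta * grad_Ein(w(t))"""
    return (1 - 2 * eta * lam / N) * w - eta * grad_Ein
```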
12:29 2014-10-08
weight space
12:31 2014-10-08
some weights are more important than others
12:31 2014-10-08
low-order fit
12:33 2014-10-08
Tikhonov regularizer
12:34 2014-10-08
regularization parameter λ
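for reference, the general Tikhonov form, of which plain weight decay is the special case Γ = I (a standard result written here from memory, with Z and y as in the linear-regression sketch above):

```latex
\Omega(w) = w^{\mathsf T}\, \Gamma^{\mathsf T}\Gamma\, w
\qquad\Longrightarrow\qquad
w_{reg} = \bigl(Z^{\mathsf T} Z + \lambda\, \Gamma^{\mathsf T}\Gamma\bigr)^{-1} Z^{\mathsf T} y
```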
12:36 2014-10-08
you have to use the regularizer, because without
the regularizer, you're going to get overfitting
12:38 2014-10-08
but there are guidelines to choose the regularizer
12:38 2014-10-08
after you choose the regularizer, there is still the check
provided by the choice of λ
12:39 2014-10-08
practical rule:
stochastic noise is 'high-frequency'
deterministic noise is also non-smooth
12:41 2014-10-08
because of this, here is the guideline for
choosing a regularizer:
=> constrain learning towards smoother hypotheses
12:42 2014-10-08
regularization is a cure, and the cure has a side-effect
12:42 2014-10-08
it's a cure for fitting the noise
12:43 2014-10-08
punishing the noise more than you punish the signal
12:43 2014-10-08
in most parameterizations, small weights correspond
to smoother hypotheses; that's why small weights, i.e. "weight decay",
work well in those cases.
12:45 2014-10-08
general form of augmented error
calling the regularizer Ω = Ω(h)
12:46 2014-10-08
we minimize
Eaug(h) = Ein(h) + (λ/N) * Ω(h)  // this is what we minimize
12:47 2014-10-08
Eaug is better than Ein as a proxy for Eout
12:50 2014-10-08
augmented error(Eaug) is better than Ein for approximating Eout
12:51 2014-10-08
we found a better proxy for the out-of-sample error (Eout)
12:51 2014-10-08
how do we choose the regularizer?
mainly a heuristic choice
12:52 2014-10-08
perfect hypothesis set
12:52 2014-10-08
the perfect regularizer Ω:
constraint in the 'direction' of the target function
12:52 2014-10-08
regularization is an attempt to reduce overfitting
12:55 2014-10-08
harms the overfitting (the noise) more than the fitting (the signal)
12:56 2014-10-08
guidelines:
move in the direction of smoother
12:56 2014-10-08
we have the error function for the movie-rating example
12:57 2014-10-08
the notion of simple here is very interesting
12:59 2014-10-08
now you're regularizing toward the simpler solution
13:04 2014-10-08
what happens if you choose a bad Ω? // Ω = regularizer
we don't worry too much, because we have the saving grace
of λ; we're going to use validation
13:06 2014-10-08
validation will tell us it's harmful, and we'll factor
the regularizer out of the game altogether.
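a sketch of what "factoring the regularizer out via validation" could look like: try several λ candidates (including 0) and keep the one with the lowest validation error; the split fraction and grid are illustrative assumptions, and solve_weight_decay is the earlier sketch:

```python
import numpy as np

def choose_lambda(Z, y, lambdas, val_fraction=0.3, seed=0):
    """Pick lambda by validation error on a held-out split.
    Including 0.0 in the candidates lets validation turn the
    regularizer off entirely if it turns out to be harmful."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, train = idx[:n_val], idx[n_val:]

    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = solve_weight_decay(Z[train], y[train], lam)  # earlier sketch
        err = np.mean((Z[val] @ w - y[val]) ** 2)        # validation error
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# e.g. choose_lambda(Z, y, lambdas=[0.0, 0.01, 0.1, 1.0])
```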
13:08 2014-10-08
neural network regularizer
13:09 2014-10-08
weight decay
13:09 2014-10-08
so we have this big network, layer upon layer upon layer...
13:11 2014-10-08
I'm looking at the functionalities that I'm implementing
13:12 2014-10-08
as you increase the weights, you're going to enter the more
interesting nonlinearity here.
13:12 2014-10-08
you're going from the most simple to the most complex
13:13 2014-10-08
weight decay: from linear (small weights) to logical (large weights)
13:13 2014-10-08
weight elimination:
fewer weights => smaller VC dimension
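a small sketch of the weight-elimination regularizer as it is usually written, Ω(w) = Σ wᵢ²/(β² + wᵢ²); β is an assumed scale parameter:

```python
import numpy as np

def weight_elimination(w, beta=1.0):
    """Weight-elimination regularizer: sum_i w_i^2 / (beta^2 + w_i^2).
    Acts like weight decay for |w_i| << beta (pushes them toward 0,
    i.e. eliminates them) and like a roughly constant penalty for
    |w_i| >> beta (large weights are barely penalized further)."""
    w = np.asarray(w)
    return np.sum(w**2 / (beta**2 + w**2))
```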
13:15 2014-10-08
early stopping as a regularizer
13:17 2014-10-08
regularization through the optimizer
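a minimal sketch of early stopping: track a held-out validation error during training and keep the best weights, instead of letting the optimizer minimize Ein all the way; update_step and val_error are placeholder names:

```python
import copy

def train_with_early_stopping(w, update_step, val_error, n_epochs):
    """Return the weights at the epoch with the lowest validation
    error ("stop early"), rather than the fully trained weights."""
    best_w, best_err = copy.deepcopy(w), val_error(w)
    for _ in range(n_epochs):
        w = update_step(w)    # e.g. one pass of SGD / backpropagation
        err = val_error(w)    # error on a held-out validation set
        if err < best_err:
            best_w, best_err = copy.deepcopy(w), err
    return best_w
```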
13:18 2014-10-08
the optimal λ:
as you increase the noise, you need more regularization
-----------------------------------------------------
13:38 2014-10-08
there are regularizers that have stood the test of time
13:38 2014-10-08
machine learning is somewhere between theory & practice
13:39 2014-10-08