
CalTech machine learning, video 12 note (regularization)

10:05 2014-10-08

start CalTech machine learning, video 12

regularization

10:05 2014-10-08

overfitting: we're fitting the data all too well

at the expense of the out-of-sample error (Eout)

10:06 2014-10-08

if you think of what the VC analysis told us: given the data resources & the complexity of the hypothesis set, with nothing said about the target, we can predict the level of generalization as a bound.

10:08 2014-10-08

data resource + VC dimension => level of generalization

10:09 2014-10-08

the source of overfitting is fitting the noise

11:15 2014-10-08

stochastic noise/deterministic noise

11:16 2014-10-08

deterministic noise is a function of the limitation of your model

11:17 2014-10-08

1st cure for overfitting

11:18 2014-10-08

outline:

* Regularization - informal

* Regularization - formal

* Weight decay

* Choosing a regularizer

11:19 2014-10-08

unconstrained solution: minimize Ein //in-sample error

11:37 2014-10-08
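for reference, in the linear regression setting of the earlier lectures (feature matrix Z, targets y; this notation is mine, not from the slide), the unconstrained minimizer has the familiar one-step form, assuming Z^T Z is invertible:

E_{\mathrm{in}}(w) = \frac{1}{N}\,\lVert Zw - y \rVert^2, \qquad w_{\mathrm{lin}} = (Z^{\mathsf T} Z)^{-1} Z^{\mathsf T} y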

let's look at the constrained version: what happens if we constrain the weights

11:40 2014-10-08

so here is the constraint we're going to work with

11:41 2014-10-08

I have a smaller hypothesis set, so the VC dimension is going in the direction of being smaller, so I'm in a better position as far as generalization goes.

11:44 2014-10-08

constraining the weights:

* Hard constraint

* Soft-order constraint (spelled out in symbols below)

11:45 2014-10-08
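in symbols (my sketch, for the polynomial hypothesis set H_Q used in the lecture):

hard-order constraint: w_q = 0 \;\text{for } q > 2 (e.g. forcing H_{10} down to H_2)

soft-order constraint: \sum_{q=0}^{Q} w_q^2 \le C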

w_reg instead of w_lin

11:46 2014-10-08

you minimize this subject to the constraint

11:47 2014-10-08

KKT (Karush-Kuhn-Tucker conditions)

11:47 2014-10-08

I have 2 things here: the error surface I'm trying to minimize, and the constraint.

11:49 2014-10-08

I'm going to put contours where the in-sample error is constant

11:50 2014-10-08

let's take a point on the surface

11:52 2014-10-08

let's look at the gradient of the objective function

11:53 2014-10-08

gradient of the objective function will give me a good

idea about the direction to move in order to minimize the

objective function

11:54 2014-10-08

moving along the circle will change the value of Ein

11:58 2014-10-08
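putting that picture into equations (my paraphrase of the argument): at the constrained minimum, the gradient of Ein must be normal to the circle w^T w = C, i.e. parallel to -w_reg, which is exactly the stationarity condition of an unconstrained problem:

\nabla E_{\mathrm{in}}(w_{\mathrm{reg}}) = -\frac{2\lambda}{N}\, w_{\mathrm{reg}} \quad\Longleftrightarrow\quad \nabla\Big( E_{\mathrm{in}}(w) + \frac{\lambda}{N}\, w^{\mathsf T} w \Big)\Big|_{w = w_{\mathrm{reg}}} = 0, \qquad \lambda \ge 0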

Augmented error: Eaug(w)

12:05 2014-10-08

regularization term

12:06 2014-10-08

I use a subset of the hypothesis set and I expect good generalization.

12:07 2014-10-08

one step learning including regularization

12:13 2014-10-08
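a minimal sketch of that one-step solution with weight decay, assuming a NumPy setup with a feature matrix Z and targets y (function and variable names here are illustrative, not from the lecture):

import numpy as np

def weight_decay_fit(Z, y, lam):
    # minimize (1/N)*||Z w - y||^2 + (lam/N)*w.T w
    # closed form: w_reg = (Z^T Z + lam*I)^{-1} Z^T y
    N, d = Z.shape
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 5))                              # 20 points, 5 features
y = Z @ np.array([1.0, 0.0, 0.0, 0.5, -2.0]) + 0.1 * rng.normal(size=20)
w_lin = weight_decay_fit(Z, y, lam=0.0)                   # unconstrained fit
w_reg = weight_decay_fit(Z, y, lam=0.1)                   # small "dose" of regularization

with λ = 0 this recovers the unconstrained w_lin; as λ grows, the weights shrink toward zero (the "overdose" / underfitting regime noted below).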

let's apply it and see the results in the real case.

12:14 2014-10-08

so the medicine is working, a small dose of medicine

did the job

12:16 2014-10-08

I think we're overdosing here.

12:16 2014-10-08

if you keep increasing λ => overdose !!!

12:17 2014-10-08

the choice of λ is extremely critical

12:18 2014-10-08

the good news is that this will not be left as a heuristic choice

12:18 2014-10-08

the choice of λ will be extremely principled, based on validation

12:19 2014-10-08

we went to another extreme: now we're "underfitting"

12:20 2014-10-08

overfitting => underfitting

12:20 2014-10-08

the proper choice of λ is important

12:21 2014-10-08

the most famous regularizer is "weight decay"

12:21 2014-10-08

we know that in neural networks you don't have a neat closed-form solution; you use gradient descent

12:22 2014-10-08

batch gradient descent => stochastic gradient descent (SGD)

12:23 2014-10-08

I'm in the weight space & this is my weight, and here is the direction that backpropagation suggests moving in.

12:27 2014-10-08

it used to be that, without regularization, I move from here to here

12:27 2014-10-08

shrinking & moving

12:28 2014-10-08

the weights decay from one step to the next (hence "weight decay")

12:29 2014-10-08
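the "shrink, then move" step written out (the weight-decay gradient-descent update, with learning rate η):

w(t+1) = w(t) - \eta\,\nabla E_{\mathrm{aug}}(w(t)) = \Big(1 - \frac{2\eta\lambda}{N}\Big)\, w(t) - \eta\,\nabla E_{\mathrm{in}}(w(t))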

weight space

12:31 2014-10-08

some weights are more important than others

12:31 2014-10-08

low-order fit

12:33 2014-10-08

Tikhonov regularizer

12:34 2014-10-08
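the general Tikhonov form (Γ = I recovers plain weight decay; other choices of Γ emphasize some weights more than others, as in the low-order fit above):

\Omega(w) = w^{\mathsf T}\, \Gamma^{\mathsf T} \Gamma\, w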

regularization parameter λ

12:36 2014-10-08

you have to use the regularizer, because without 

the regularizer, you're going to get overfitting

12:38 2014-10-08

but there are guidelines to choose the regularizer

12:38 2014-10-08

after you choose the regularizer, there is still a check through the choice of λ

12:39 2014-10-08

practical rule:

stochastic noise is 'high-frequency'

deterministic noise is also non-smooth

12:41 2014-10-08

because of this, here is the guideline for choosing a regularizer:

=> constrain learning towards smoother hypotheses

12:42 2014-10-08

regularization is a cure, and the cure has a side-effect

12:42 2014-10-08

it's a cure for fitting the noise

12:43 2014-10-08

punishing the noise more than you punish the signal

12:43 2014-10-08

in most parameterizations, small weights correspond to smoother hypotheses; that's why small weights, i.e. "weight decay", work well in those cases.

12:45 2014-10-08

general form of augmented error

calling the regularizer Ω = Ω(h)

12:46 2014-10-08

we minimize the augmented error:

Eaug(h) = Ein(h) + (λ/N) * Ω(h)

12:47 2014-10-08

Eaug is better than Ein as a proxy for Eout

12:50 2014-10-08

augmented error (Eaug) is better than Ein for approximating Eout

12:51 2014-10-08

we found a better proxy for the out-of-sample error (Eout)

12:51 2014-10-08

how do we choose the regularizer?

mainly a heuristic choice

12:52 2014-10-08

perfect hypothesis set

12:52 2014-10-08

the perfect regularizer Ω:

constraint in the 'direction' of the target function

12:52 2014-10-08

regularization is an attempt to reduce overfitting

12:55 2014-10-08

it harms the overfitting (the noise) more than the fitting

12:56 2014-10-08

guidelines:

move in the direction of smoother (or simpler) hypotheses

12:56 2014-10-08

we have the error function for the movie rating

12:57 2014-10-08

the notion of simple here is very interesting

12:59 2014-10-08

now you regularize toward the simpler solution

13:04 2014-10-08

what happens if you choose a bad Ω? // Ω: the regularizer

we don't worry too much, because we have the saving grace of λ; we're going to use validation

13:06 2014-10-08

if validation tells us it's harmful, we'll take the regularizer out of the game altogether.

13:08 2014-10-08

neural network regularizer

13:09 2014-10-08

weight decay

13:09 2014-10-08

so we have this big network, layer upon layer upon layer...

13:11 2014-10-08

I'm looking at the functionalities that I'm implementing

13:12 2014-10-08

as you increase the weights, you're going to enter the more interesting part of the nonlinearity here.

13:12 2014-10-08

you're going from the most simple to the most complex

13:13 2014-10-08

weight decay: from linear (small weights) to logical, i.e. binary-like (large weights)

13:13 2014-10-08

weight elimination:

fewer weights => smaller VC dimension

13:15 2014-10-08
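the soft version of weight elimination, with a scale parameter β: weights much smaller than β get ordinary decay (and are effectively eliminated), weights much larger than β are barely penalized:

\Omega(w) = \sum_{q} \frac{w_q^2}{\beta^2 + w_q^2}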

early stopping as a regularizer

13:17 2014-10-08

regularization through the optimizer

13:18 2014-10-08

the optimal λ:

as you increase the noise, you need more regularization

-----------------------------------------------------

13:38 2014-10-08

there are regularizers that have stood the test of time

13:38 2014-10-08

machine learning is somewhere between theory & practice

13:39 2014-10-08
