The likelihood function, as described by Wikipedia:
<a href="https://en.wikipedia.org/wiki/likelihood_function">https://en.wikipedia.org/wiki/likelihood_function</a>
plays one of the key roles in statistical inference, especially in methods that estimate parameters from observed data. In this article, we will make full use of it.
Pattern recognition works by learning the posterior probability $p(y\mid x)$ of a pattern $x$ belonging to class $y$. Given a pattern $x$, when the posterior probability of one of the classes $y$ attains the maximum, we can assign $x$ to that class, i.e.

$$\hat{y}=\mathop{\arg\max}_{y=1,\dots,c}p(y\mid x)$$

The posterior probability can be seen as the confidence that pattern $x$ belongs to class $y$.
In the logistic regression algorithm, we use a log-linear model to express the posterior probability:

$$q(y\mid x,\theta)=\frac{\exp\left(\sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x)\right)}{\sum_{y'=1}^{c}\exp\left(\sum_{j=1}^{b}\theta_j^{(y')}\phi_j(x)\right)}$$
Note that the denominator is a normalization term, which makes the probabilities sum to one over the classes. Logistic regression is then defined by the following optimization problem:

$$\max_{\theta}\sum_{i=1}^{m}\log q(y_i\mid x_i,\theta)$$
We can solve it by stochastic gradient ascent:
1. Initialize $\theta$.
2. Pick a training sample $(x_i,y_i)$ at random.
3. Update $\theta=(\theta^{(1)T},\dots,\theta^{(c)T})^T$ along the gradient-ascent direction:

$$\theta^{(y)}\leftarrow\theta^{(y)}+\epsilon\nabla_y J_i(\theta),\quad y=1,\dots,c$$

where

$$\nabla_y J_i(\theta)=-\frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right)}+\begin{cases}\phi(x_i)&(y=y_i)\\0&(y\neq y_i)\end{cases}$$

4. Repeat steps 2 and 3 until $\theta$ reaches the desired precision.
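As a sanity check on the update rule, the per-sample gradient $\nabla_y J_i(\theta)$ can be compared against a numerical derivative of $\log q(y_i\mid x_i,\theta)$. A small Python sketch (the dimensions and the random sample below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

# check the per-sample gradient formula against a numerical derivative
# (dimensions b, c and the random sample are arbitrary illustrations)
b, c = 4, 3
phi = rng.normal(size=b)          # phi(x_i)
yi = 1                            # true label of the sample
theta = rng.normal(size=(c, b))

def log_q(t):
    s = t @ phi
    return s[yi] - np.log(np.sum(np.exp(s)))   # log q(y_i | x_i, theta)

# analytic gradient from the update rule
s = theta @ phi
post = np.exp(s) / np.sum(np.exp(s))
grad = -np.outer(post, phi)       # -exp(theta^(y)T phi) phi / sum_y' exp(theta^(y')T phi)
grad[yi] += phi                   # + phi(x_i) when y = y_i

# numerical gradient via central differences
num = np.zeros_like(theta)
eps = 1e-6
for j in range(c):
    for k in range(b):
        tp = theta.copy(); tp[j, k] += eps
        tm = theta.copy(); tm[j, k] -= eps
        num[j, k] = (log_q(tp) - log_q(tm)) / (2 * eps)

max_err = np.abs(grad - num).max()
```

The two gradients agree to numerical precision, confirming the formula above.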
Take the Gaussian kernel model as an example:

$$q(y\mid x,\theta)\propto\exp\left(\sum_{j=1}^{n}\theta_j K(x,x_j)\right)$$
Not familiar with the Gaussian kernel model? Refer to this article:
<a href="http://blog.csdn.net/philthinker/article/details/65628280">http://blog.csdn.net/philthinker/article/details/65628280</a>
Here is a sketch of the corresponding code:
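A minimal Python sketch of the Gaussian kernel logistic model trained with the stochastic updates above (the toy 1-D data, kernel width `h`, step size and iteration count are illustrative choices, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy 1-D data, three classes (sizes and kernel width h are illustrative)
n, c, h = 90, 3, 1.0
x = np.concatenate([rng.normal(m, 0.5, n // c) for m in (-3.0, 0.0, 3.0)])
y = np.repeat(np.arange(c), n // c)

# Gaussian kernel basis: phi_j(x) = K(x, x_j) = exp(-(x - x_j)^2 / (2 h^2))
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h ** 2))  # n x n design

theta = np.zeros((c, n))   # theta^(y) has one weight per training point
eps = 0.05                 # step size

for _ in range(3000):
    i = rng.integers(n)                     # pick a sample at random
    phi = K[i]                              # phi(x_i) = (K(x_i, x_1), ..., K(x_i, x_n))
    s = theta @ phi
    post = np.exp(s - s.max())
    post /= post.sum()                      # q(y | x_i, theta), softmax
    grad = -np.outer(post, phi)             # -q(y | x_i) phi(x_i) for every class
    grad[y[i]] += phi                       # + phi(x_i) for the true class
    theta += eps * grad                     # gradient-ascent update

pred = np.argmax(K @ theta.T, axis=1)
accuracy = (pred == y).mean()
```

On this well-separated toy set the training accuracy approaches 1; in practice the kernel width and step size need tuning.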
In least squares (LS) probability classifiers, a linearly parameterized model is used to express the posterior probability:

$$q(y\mid x,\theta^{(y)})=\sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x)=\theta^{(y)T}\phi(x),\quad y=1,\dots,c$$
These models depend on class-wise parameters $\theta^{(y)}=(\theta_1^{(y)},\dots,\theta_b^{(y)})^T$ for each class $y$, which differ from the parameters used by logistic classifiers. Learning these models means minimizing the following squared error:

$$\begin{aligned}J_y(\theta^{(y)})&=\frac{1}{2}\int\left(q(y\mid x,\theta^{(y)})-p(y\mid x)\right)^2p(x)\,dx\\&=\frac{1}{2}\int q(y\mid x,\theta^{(y)})^2p(x)\,dx-\int q(y\mid x,\theta^{(y)})p(y\mid x)p(x)\,dx+\frac{1}{2}\int p(y\mid x)^2p(x)\,dx\end{aligned}$$
where $p(x)$ represents the probability density of the training samples $\{x_i\}_{i=1}^{n}$.
By Bayes' formula,

$$p(y\mid x)p(x)=p(x,y)=p(x\mid y)p(y)$$
Hence $J_y$ can be reformulated as

$$J_y(\theta^{(y)})=\frac{1}{2}\int q(y\mid x,\theta^{(y)})^2p(x)\,dx-\int q(y\mid x,\theta^{(y)})p(x\mid y)p(y)\,dx+\frac{1}{2}\int p(y\mid x)^2p(x)\,dx$$
Note that the first and second terms in the equation above are expectations with respect to $p(x)$ and $p(x\mid y)$ respectively, which are often impossible to compute directly. The last term is independent of $\theta$ and can therefore be omitted.
Since $p(x\mid y)$ is the probability density of samples $x$ belonging to class $y$, we can estimate the first and second terms by the following sample averages, where $n_y$ denotes the number of training samples of class $y$:

$$\frac{1}{n}\sum_{i=1}^{n}q(y\mid x_i,\theta^{(y)})^2,\qquad\frac{p(y)}{n_y}\sum_{i:y_i=y}q(y\mid x_i,\theta^{(y)})$$
Next, approximating the class prior by $p(y)\approx n_y/n$ and introducing a regularization term, we obtain the following training criterion:

$$\hat{J}_y(\theta^{(y)})=\frac{1}{2n}\sum_{i=1}^{n}q(y\mid x_i,\theta^{(y)})^2-\frac{1}{n}\sum_{i:y_i=y}q(y\mid x_i,\theta^{(y)})+\frac{\lambda}{2n}\|\theta^{(y)}\|^2$$
Let $\pi^{(y)}=(\pi_1^{(y)},\dots,\pi_n^{(y)})^T$ with $\pi_i^{(y)}=\begin{cases}1&(y_i=y)\\0&(y_i\neq y)\end{cases}$, then
$$\hat{J}_y(\theta^{(y)})=\frac{1}{2n}\theta^{(y)T}\Phi^T\Phi\theta^{(y)}-\frac{1}{n}\theta^{(y)T}\Phi^T\pi^{(y)}+\frac{\lambda}{2n}\|\theta^{(y)}\|^2$$

where $\Phi$ is the $n\times b$ design matrix with entries $\Phi_{ij}=\phi_j(x_i)$.
This is evidently a convex optimization problem, and we can obtain the analytic solution by setting the derivative with respect to $\theta^{(y)}$ to zero:
$$\hat{\theta}^{(y)}=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\pi^{(y)}$$
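A quick numerical check that this closed form indeed minimizes the regularized criterion: the gradient of $\hat{J}_y$ should vanish at $\hat{\theta}^{(y)}$. The sizes, random design matrix and $\lambda$ below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)

# verify that (Phi^T Phi + lam I)^{-1} Phi^T pi zeroes the gradient of J_hat
# (n, b, lam and the random design are illustrative)
n, b, lam = 50, 5, 0.1
Phi = rng.normal(size=(n, b))
pi = (rng.random(n) < 0.3).astype(float)   # class-indicator vector pi^(y)

theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(b), Phi.T @ pi)

# gradient of J_hat = (1/2n) theta^T Phi^T Phi theta - (1/n) theta^T Phi^T pi + (lam/2n)||theta||^2
grad = (Phi.T @ (Phi @ theta) - Phi.T @ pi + lam * theta) / n
grad_norm = np.abs(grad).max()
```

The gradient norm is zero up to floating-point error, as expected for the minimizer of a convex quadratic.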
To avoid negative estimates of the posterior probability, we clip negative outputs at zero and renormalize:

$$\hat{p}(y\mid x)=\frac{\max(0,\hat{\theta}^{(y)T}\phi(x))}{\sum_{y'=1}^{c}\max(0,\hat{\theta}^{(y')T}\phi(x))}$$
We again take the Gaussian kernel model as an example:
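A Python sketch of the least squares probability classifier with Gaussian kernel basis functions $\phi_j(x)=K(x,x_j)$ (the toy data, kernel width `h` and regularization $\lambda$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# least squares probability classifier with Gaussian kernel basis
# (toy data, kernel width h and regularization lam are illustrative)
n, c, h, lam = 90, 3, 1.0, 0.1
x = np.concatenate([rng.normal(m, 0.5, n // c) for m in (-3.0, 0.0, 3.0)])
y = np.repeat(np.arange(c), n // c)

# design matrix: Phi_ij = phi_j(x_i) = K(x_i, x_j)
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h ** 2))

# analytic solution theta^(y) = (Phi^T Phi + lam I)^{-1} Phi^T pi^(y), per class
A = Phi.T @ Phi + lam * np.eye(n)
Theta = np.column_stack([np.linalg.solve(A, Phi.T @ (y == k).astype(float))
                         for k in range(c)])

# posterior estimate: clip negative outputs at zero, then normalize over classes
raw = np.maximum(0.0, Phi @ Theta)    # max(0, theta^(y)T phi(x)) per class
post = raw / np.maximum(raw.sum(axis=1, keepdims=True), 1e-12)

pred = post.argmax(axis=1)
accuracy = (pred == y).mean()
```

Unlike the logistic model, no iterative optimization is needed: each class is solved by one linear system.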
Logistic regression is good at dealing with small sample sets since it works in a simple way. However, when the number of samples grows fairly large, it is better to turn to the least squares probability classifier, whose solution is obtained analytically.