
A Brief Introduction to Logistic Regression and Least-Squares Probabilistic Classification, with Examples

The likelihood function, as described by Wikipedia:

<a href="https://en.wikipedia.org/wiki/likelihood_function">https://en.wikipedia.org/wiki/likelihood_function</a>

plays one of the key roles in statistical inference, especially in methods that estimate a parameter from a set of statistics. In this article, we will make full use of it.

Pattern recognition works by learning the posterior probability p(y|x) of a pattern x belonging to class y. Given a pattern x, we assign it to the class y whose posterior probability is largest, i.e.

$$\hat{y} = \arg\max_{y=1,\dots,c} p(y|x)$$

The posterior probability can be interpreted as the confidence that pattern x belongs to class y.

In the logistic regression algorithm, we use a log-linear model to express the posterior probability:

$$q(y|x,\theta) = \frac{\exp\left(\sum_{j=1}^{b} \theta_j^{(y)} \phi_j(x)\right)}{\sum_{y'=1}^{c} \exp\left(\sum_{j=1}^{b} \theta_j^{(y')} \phi_j(x)\right)}$$

Note that the denominator is a normalization term that makes the posterior probabilities sum to one over the classes. Logistic regression is then defined by the following optimization problem:

$$\max_{\theta} \sum_{i=1}^{m} \log q(y_i|x_i,\theta)$$

We can solve it by stochastic gradient ascent:

1. Initialize θ.

2. Pick a training sample (x_i, y_i) at random.

3. Update θ = (θ^(1)T, …, θ^(c)T)^T along the gradient ascent direction:
$$\theta^{(y)} \leftarrow \theta^{(y)} + \epsilon\,\nabla_y J_i(\theta), \quad y = 1,\dots,c$$

where
$$\nabla_y J_i(\theta) = -\frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c} \exp\left(\theta^{(y')T}\phi(x_i)\right)} + \begin{cases}\phi(x_i) & (y = y_i)\\ 0 & (y \neq y_i)\end{cases}$$

4. Go back to steps 2-3 until θ reaches the desired precision.

Take the Gaussian kernel model as an example:

$$q(y|x,\theta^{(y)}) \propto \exp\left(\sum_{j=1}^{n} \theta_j^{(y)} K(x, x_j)\right)$$

Not familiar with the Gaussian kernel model? Refer to this article:

<a href="http://blog.csdn.net/philthinker/article/details/65628280">http://blog.csdn.net/philthinker/article/details/65628280</a>

Here is the corresponding MATLAB code:

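(The sketch below is a minimal re-implementation rather than the original listing: the toy data of three 1-D Gaussian clusters, the kernel bandwidth h, the step size, and the iteration count are illustrative assumptions.)

```matlab
% Kernel logistic regression trained by stochastic gradient ascent.
rng(1);                                  % reproducibility
n = 90; c = 3;                           % 90 samples, 3 classes
y = reshape(repmat(1:c, n/c, 1), n, 1);  % labels 1,...,3 (30 each)
x = randn(n, 1) + 3*(y - 2);             % 1-D clusters centred at -3, 0, 3

h     = 1;                               % Gaussian kernel bandwidth (illustrative)
eps0  = 0.1;                             % step size (illustrative)
theta = zeros(n, c);                     % one parameter vector per class

K = exp(-(x - x').^2 / (2*h^2));         % K(i,j) = k(x_i, x_j)

for t = 1:5000
    i  = randi(n);                       % pick a training sample at random
    ki = K(:, i);                        % phi(x_i) in the kernel model
    p  = exp(theta' * ki);
    p  = p / sum(p);                     % q(y|x_i, theta) for all classes
    g  = -ki * p';                       % gradient of log q(y_i|x_i) w.r.t. each theta^(y)
    g(:, y(i)) = g(:, y(i)) + ki;        % extra phi(x_i) for the true class
    theta = theta + eps0 * g;            % gradient-ascent update
end

% Posterior probabilities on a test grid
X  = linspace(-5, 5, 200)';
Kt = exp(-(X - x').^2 / (2*h^2));
P  = exp(Kt * theta);
P  = P ./ sum(P, 2);                     % normalised q(y|x)
plot(X, P); legend('class 1', 'class 2', 'class 3');
```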

In the least-squares (LS) probabilistic classifier, a linearly parameterized model is used to express the posterior probability:

$$q(y|x,\theta^{(y)}) = \sum_{j=1}^{b} \theta_j^{(y)} \phi_j(x) = \theta^{(y)T}\phi(x), \quad y = 1,\dots,c$$

These models depend on parameter vectors θ^(y) = (θ_1^(y), …, θ_b^(y))^T, one per class y, which is a different parameterization from the one used by the logistic classifier. Learning these models means minimizing the following quadratic error:
$$
\begin{aligned}
J_y(\theta^{(y)}) &= \frac{1}{2}\int \left(q(y|x,\theta^{(y)}) - p(y|x)\right)^2 p(x)\,dx \\
&= \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\, p(y|x)\, p(x)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx
\end{aligned}
$$

where p(x) denotes the probability density from which the training samples {x_i}_{i=1}^n are drawn.

By Bayes' rule,
$$p(y|x)\,p(x) = p(x,y) = p(x|y)\,p(y)$$

hence J_y can be reformulated as

$$J_y(\theta^{(y)}) = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\, p(x|y)\, p(y)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$

Note that the first and second terms above are expectations with respect to p(x) and p(x|y) respectively, which are usually impossible to compute directly. The last term does not depend on θ^(y) and can therefore be omitted.

Since p(x|y) is the probability density of samples x belonging to class y, we can estimate the first and second terms by the following sample averages, where n_y denotes the number of training samples of class y:
$$\frac{1}{n}\sum_{i=1}^{n} q(y|x_i,\theta^{(y)})^2, \qquad \frac{p(y)}{n_y}\sum_{i:\,y_i=y} q(y|x_i,\theta^{(y)})$$

Next, estimating p(y) by n_y/n and introducing a regularization term, we obtain the following criterion:
$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\sum_{i=1}^{n} q(y|x_i,\theta^{(y)})^2 - \frac{1}{n}\sum_{i:\,y_i=y} q(y|x_i,\theta^{(y)}) + \frac{\lambda}{2n}\left\|\theta^{(y)}\right\|^2$$

Let π^(y) = (π_1^(y), …, π_n^(y))^T with
$$\pi_i^{(y)} = \begin{cases}1 & (y_i = y)\\ 0 & (y_i \neq y)\end{cases}$$
and let Φ be the n×b design matrix with Φ_{ij} = φ_j(x_i); then

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\,\theta^{(y)T}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\,\theta^{(y)T}\Phi^T\pi^{(y)} + \frac{\lambda}{2n}\left\|\theta^{(y)}\right\|^2.$$

This is a convex optimization problem, so we can obtain the analytic solution by setting the gradient with respect to θ^(y) to zero:

$$\hat{\theta}^{(y)} = \left(\Phi^T\Phi + \lambda I\right)^{-1}\Phi^T\pi^{(y)}$$
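Explicitly, the gradient condition reads
$$\nabla_{\theta^{(y)}}\hat{J}_y(\theta^{(y)}) = \frac{1}{n}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\Phi^T\pi^{(y)} + \frac{\lambda}{n}\,\theta^{(y)} = 0,$$
which rearranges to the solution above.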

To avoid negative estimates of the posterior probability, we clip negative outputs at zero and renormalize:
$$\hat{p}(y|x) = \frac{\max\left(0,\; \hat{\theta}^{(y)T}\phi(x)\right)}{\sum_{y'=1}^{c}\max\left(0,\; \hat{\theta}^{(y')T}\phi(x)\right)}$$

We again take the Gaussian kernel model as an example:

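(As before, the sketch below is a minimal re-implementation rather than the original listing; the toy data, kernel bandwidth h, and regularization parameter lambda are illustrative assumptions.)

```matlab
% Least-squares probabilistic classification with a Gaussian kernel model.
rng(1);
n = 90; c = 3;
y = reshape(repmat(1:c, n/c, 1), n, 1);   % labels 1,...,3
x = randn(n, 1) + 3*(y - 2);              % 1-D clusters centred at -3, 0, 3

h      = 1;                               % kernel bandwidth (illustrative)
lambda = 0.1;                             % regularization parameter (illustrative)

Phi = exp(-(x - x').^2 / (2*h^2));        % design matrix, Phi(i,j) = K(x_i, x_j)

theta = zeros(n, c);
for cls = 1:c
    piy = double(y == cls);               % indicator vector pi^(y)
    % analytic solution: theta^(y) = (Phi'Phi + lambda*I)^(-1) Phi' pi^(y)
    theta(:, cls) = (Phi' * Phi + lambda * eye(n)) \ (Phi' * piy);
end

% Posterior estimates on a test grid, clipped at zero and renormalised
X  = linspace(-5, 5, 200)';
Kt = exp(-(X - x').^2 / (2*h^2));
P  = max(0, Kt * theta);
P  = P ./ max(sum(P, 2), eps);            % avoid division by zero if all classes clip
plot(X, P); legend('class 1', 'class 2', 'class 3');
```

Each θ^(y) is obtained by a single linear solve, so training cost does not grow with the number of optimization iterations, in contrast to the gradient-based training of logistic regression above.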

Logistic regression works well on small sample sets because of its simple form; however, once the number of samples becomes large, it is better to turn to the least-squares probabilistic classifier.