Given local outlier factors, we can detect outliers that lie far away from most of the samples. To outline the algorithm, a few concepts come first:
Reachability distance:

$$\mathrm{rd}_k(x, x') = \max\left(\|x - x^{(k)}\|,\; \|x - x'\|\right)$$

where $x^{(k)}$ denotes the $k$-th nearest point to $x$ in the training set $\{x_i\}_{i=1}^{n}$. Note that $k$ is selected manually.
Local reachability density:

$$\mathrm{lrd}_k(x) = \left(\frac{1}{k} \sum_{i=1}^{k} \mathrm{rd}_k(x^{(i)}, x)\right)^{-1}$$
Local outlier factor:

$$\mathrm{lof}_k(x) = \frac{\frac{1}{k} \sum_{i=1}^{k} \mathrm{lrd}_k(x^{(i)})}{\mathrm{lrd}_k(x)}$$
Evidently, the larger the LOF of $x$ is, the more likely $x$ is an outlier. In theory this is a simple algorithm with an intuitive principle; however, when $n$ is very large, it requires a tremendous amount of computation.
Here is a simple example.
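A minimal sketch of the definitions above, assuming a small toy data set and the arbitrary choice $k = 5$ (the helper `lof_scores` is not from the original text):

```python
import numpy as np

def lof_scores(X, k):
    """Local outlier factors of every row of X, following the definitions above."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    knn = np.argsort(D, axis=1)[:, 1:k + 1]                     # k nearest neighbours (self excluded)
    kdist = D[np.arange(n), knn[:, -1]]                         # distance to the k-th neighbour

    # lrd_k(x) = 1 / mean of rd_k(x^(i), x) over the k neighbours,
    # with rd_k(x^(i), x) = max(||x^(i) - x^(i)(k)||, ||x^(i) - x||)
    lrd = np.empty(n)
    for i in range(n):
        rd = np.maximum(kdist[knn[i]], D[i, knn[i]])
        lrd[i] = 1.0 / np.mean(rd)

    # lof_k(x) = mean lrd of the k neighbours divided by lrd of x
    return np.array([np.mean(lrd[knn[i]]) / lrd[i] for i in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[6.0, 6.0]]])   # last row is an obvious outlier
print(lof_scores(X, k=5)[-1])                                   # noticeably larger than 1
```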
In unsupervised learning problems, there is usually little information about the outliers. However, when a set of known normal samples $\{x'_{i'}\}_{i'=1}^{n'}$ is given, we can identify the outliers in the test set $\{x_i\}_{i=1}^{n}$ with some confidence.
Kullback-Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the probability density ratio of normal samples to test samples,

$$w(x) = \frac{p'(x)}{p(x)}$$

where $p'(x)$ is the probability density of the normal samples and $p(x)$ is that of the test samples, without computing either density directly. The ratio is close to 1 for a normal sample but far from 1 for an outlier.
To begin with, let us model the density ratio with a parameterized linear model:

$$w_{\alpha}(x) = \sum_{j=1}^{b} \alpha_j \psi_j(x) = \alpha^T \psi(x)$$

where $\alpha = (\alpha_1, \cdots, \alpha_b)^T$ is the parameter vector and $\psi(x) = (\psi_1(x), \cdots, \psi_b(x))^T$ is a vector of non-negative basis functions. Then $w_\alpha(x) p(x)$ can be seen as an estimate of $p'(x)$. We measure the discrepancy between $w_\alpha(x) p(x)$ and $p'(x)$ by the KL divergence, i.e.

$$\mathrm{KL}\bigl(p' \,\|\, w_\alpha p\bigr) = \int p'(x) \log \frac{p'(x)}{w_\alpha(x) p(x)} \, \mathrm{d}x$$
In general, the KL divergence is non-negative and equals zero only if $w_\alpha p = p'$. When the KL divergence is sufficiently small, $w_\alpha p$ can be regarded as close to $p'$. To guarantee that $w_\alpha p$ is a well-defined probability density, we impose the following constraints:

$$\int w_\alpha(x) p(x)\, \mathrm{d}x = 1, \qquad w_\alpha(x) p(x) \ge 0 \ \text{ for all } x$$
Approximating the expectations by sample averages, the estimation becomes the following optimization problem:

$$\max_{\alpha} \; \frac{1}{n'} \sum_{i'=1}^{n'} \log w_\alpha(x'_{i'}) \quad \text{s.t.} \quad \frac{1}{n} \sum_{i=1}^{n} w_\alpha(x_i) = 1, \quad \alpha_1, \ldots, \alpha_b \ge 0$$
We briefly summarize the estimation process:
Initialize $\alpha$.
Repeat the following updates until $\alpha$ reaches a suitable precision:
$$\begin{aligned}
\alpha &\leftarrow \alpha + \epsilon A^T \bigl(1 ./ (A\alpha)\bigr) \\
\alpha &\leftarrow \alpha + \frac{(1 - b^T \alpha)\, b}{b^T b} \\
\alpha &\leftarrow \max(0, \alpha) \\
\alpha &\leftarrow \alpha / (b^T \alpha)
\end{aligned}$$
where $A$ is the matrix whose $(i', j)$-th element is $\psi_j(x'_{i'})$, and $b$ is the vector whose $j$-th element is $\frac{1}{n} \sum_{i=1}^{n} \psi_j(x_i)$.
Here is an example (Gaussian kernel model).
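A minimal sketch of the iteration above, assuming a Gaussian kernel model with kernels centred at the normal samples; the bandwidth `h`, step size `eps`, fixed iteration count, and toy data are arbitrary choices, and `kliep` is a hypothetical helper name:

```python
import numpy as np

def kliep(X_test, X_normal, h=1.0, eps=0.01, n_iter=1000):
    """Estimate the density ratio w(x) = p'(x)/p(x) with a Gaussian kernel model."""
    def psi(X, centers):
        # Gaussian kernel basis functions centred at the normal samples
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * h ** 2))

    A = psi(X_normal, X_normal)              # A[i', j] = psi_j(x'_{i'})
    b = psi(X_test, X_normal).mean(axis=0)   # b[j] = (1/n) * sum_i psi_j(x_i)
    alpha = np.ones(A.shape[1])

    for _ in range(n_iter):                  # fixed iteration count instead of a convergence test
        alpha = alpha + eps * A.T @ (1.0 / (A @ alpha))
        alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)
        alpha = np.maximum(0.0, alpha)
        alpha = alpha / (b @ alpha)
    return psi(X_test, X_normal) @ alpha     # w_alpha evaluated on the test points

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(100, 1))
X_test = np.vstack([rng.normal(0, 1, size=(50, 1)), [[5.0]]])   # last test point is an outlier
print(kliep(X_test, X_normal)[-1])                              # density ratio far below 1
```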
Furthermore, outlier detection can also be done with support vector machine techniques. Due to the time limit, we only outline the main structure of that algorithm.
A typical SVM outlier detector finds a hyper-ball that contains nearly all the sample points; a point lying outside the hyper-ball can then be regarded as an outlier. Concretely speaking, we obtain the center $c$ and radius $r$ by solving the following optimization problem:

$$\min_{c, r, \xi} \left( r^2 + C \sum_{i=1}^{n} \xi_i \right) \quad \text{s.t.} \quad \|x_i - c\|^2 \le r^2 + \xi_i, \quad \xi_i \ge 0, \quad \forall i = 1, 2, \ldots, n$$

where $C > 0$ is a regularization parameter and the $\xi_i$ are slack variables.
It can be solved using Lagrange multipliers:

$$L(c, r, \xi, \alpha, \beta) = r^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left(r^2 + \xi_i - \|x_i - c\|^2\right) - \sum_{i=1}^{n} \beta_i \xi_i$$
Then its dual problem can be formulated as:

$$\max_{\alpha, \beta} \; \inf_{c, r, \xi} L(c, r, \xi, \alpha, \beta) \quad \text{s.t.} \quad \alpha \ge 0, \; \beta \ge 0$$
The KKT conditions give:

$$\begin{aligned}
\frac{\partial L}{\partial c} = 0 &\;\Rightarrow\; c = \frac{\sum_{i=1}^{n} \alpha_i x_i}{\sum_{i=1}^{n} \alpha_i} \\
\frac{\partial L}{\partial r} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \alpha_i = 1 \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; \alpha_i + \beta_i = C, \quad \forall i = 1, 2, \ldots, n
\end{aligned}$$
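Substituting these conditions back into the Lagrangian, the $r$- and $\xi$-terms vanish, and with $\sum_{i=1}^{n} \alpha_i = 1$ and $c = \sum_{j=1}^{n} \alpha_j x_j$ the remaining objective depends only on $\alpha$:

$$\inf_{c, r, \xi} L = \sum_{i=1}^{n} \alpha_i \|x_i - c\|^2 = \sum_{i=1}^{n} \alpha_i x_i^T x_i - \sum_{i,j=1}^{n} \alpha_i \alpha_j x_i^T x_j$$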
Therefore, the dual problem reduces to

$$\hat{\alpha} = \arg\max_{\alpha} \left( \sum_{i=1}^{n} \alpha_i x_i^T x_i - \sum_{i,j=1}^{n} \alpha_i \alpha_j x_i^T x_j \right) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i = 1, \quad 0 \le \alpha_i \le C, \; \forall i = 1, 2, \ldots, n$$
This is a typical quadratic programming problem. After solving it, we can recover $c$ and $r$:

$$\hat{r}^2 = \left\| x_i - \sum_{j=1}^{n} \hat{\alpha}_j x_j \right\|^2, \qquad \hat{c} = \sum_{i=1}^{n} \hat{\alpha}_i x_i$$

where $x_i$ is any support vector satisfying $\|x_i - \hat{c}\|^2 = \hat{r}^2$, i.e. one with $0 < \hat{\alpha}_i < C$.
Hence, when a sample point $x$ satisfies

$$\|x - \hat{c}\|^2 > \hat{r}^2,$$

it can be viewed as an outlier.
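A minimal sketch of the whole procedure with the linear kernel, assuming `scipy.optimize.minimize` (SLSQP) as a generic quadratic-programming solver; the value of `C`, the toy data, and the tolerances for picking boundary support vectors are arbitrary choices, and the squared radius is averaged over those support vectors for numerical stability:

```python
import numpy as np
from scipy.optimize import minimize

def svdd_fit(X, C=0.1):
    """Solve the dual QP above (linear kernel) and return the centre and squared radius."""
    n = len(X)
    K = X @ X.T                                                  # K[i, j] = x_i^T x_j

    def neg_dual(a):                                             # negated dual objective (to minimise)
        return -(a @ np.diag(K) - a @ K @ a)

    cons = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)     # sum_i alpha_i = 1
    bounds = [(0.0, C)] * n                                      # 0 <= alpha_i <= C
    res = minimize(neg_dual, np.full(n, 1.0 / n), bounds=bounds,
                   constraints=cons, options={'maxiter': 500})
    alpha = res.x

    c_hat = alpha @ X                                            # c_hat = sum_i alpha_i x_i
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]        # boundary support vectors
    r2_hat = np.mean(np.sum((X[sv] - c_hat) ** 2, axis=1))       # averaged over boundary SVs
    return c_hat, r2_hat

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(60, 2)), [[5.0, 5.0]]])    # last row is an obvious outlier
c_hat, r2_hat = svdd_fit(X)
print(np.where(np.sum((X - c_hat) ** 2, axis=1) > r2_hat)[0])    # ||x - c||^2 > r^2 flags outliers
```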