Given local outlier factors, we can detect outliers that lie far away from most of the samples. To outline the algorithm, a few concepts come first:
Reachability distance:

$$\mathrm{rd}_k(x, x') = \max\left(\|x - x^{(k)}\|,\; \|x - x'\|\right)$$

where $x^{(k)}$ denotes the $k$-th nearest point to $x$ in the training set $\{x_i\}_{i=1}^{n}$. Note that $k$ is selected manually.
Local reachability density:

$$\mathrm{lrd}_k(x) = \left(\frac{1}{k} \sum_{i=1}^{k} \mathrm{rd}_k(x^{(i)}, x)\right)^{-1}$$
Local outlier factor:

$$\mathrm{lof}_k(x) = \frac{\frac{1}{k} \sum_{i=1}^{k} \mathrm{lrd}_k(x^{(i)})}{\mathrm{lrd}_k(x)}$$
Evidently, the larger the LOF of $x$ is, the more likely $x$ is an outlier. In theory this is a simple algorithm with an intuitive principle; however, when $n$ is very large, it requires a tremendous amount of computation.
Here is a simple example.
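A minimal sketch of the definitions above, assuming a small toy data set and the arbitrary choice $k = 5$ (the helper `lof_scores` is not from the original text):

```python
import numpy as np

def lof_scores(X, k):
    """Local outlier factors of every row of X, following the definitions above."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    knn = np.argsort(D, axis=1)[:, 1:k + 1]                     # k nearest neighbours (self excluded)
    kdist = D[np.arange(n), knn[:, -1]]                         # distance to the k-th neighbour

    # lrd_k(x) = 1 / mean of rd_k(x^(i), x) over the k neighbours,
    # with rd_k(x^(i), x) = max(||x^(i) - x^(i)(k)||, ||x^(i) - x||)
    lrd = np.empty(n)
    for i in range(n):
        rd = np.maximum(kdist[knn[i]], D[i, knn[i]])
        lrd[i] = 1.0 / np.mean(rd)

    # lof_k(x) = mean lrd of the k neighbours divided by lrd of x
    return np.array([np.mean(lrd[knn[i]]) / lrd[i] for i in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[6.0, 6.0]]])   # last row is an obvious outlier
print(lof_scores(X, k=5)[-1])                                   # noticeably larger than 1
```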
In unsupervised learning problems, there is usually little information about the outliers. However, when a set of known normal samples $\{x'_{i'}\}_{i'=1}^{n'}$ is given, we can identify the outliers in the test set $\{x_i\}_{i=1}^{n}$ with some confidence.
Kullback-Leibler (KL) divergence, also known as relative entropy, is a powerful tool for estimating the probability density ratio of normal samples to test samples,

$$w(x) = \frac{p'(x)}{p(x)}$$

where $p'(x)$ is the probability density of the normal samples and $p(x)$ is that of the test samples, without computing either density directly. The ratio is close to 1 for a normal sample but far from 1 for an outlier.
To begin with, let us model the density ratio with a parameterized linear model:

$$w_{\alpha}(x) = \sum_{j=1}^{b} \alpha_j \psi_j(x) = \alpha^T \psi(x)$$

where $\alpha = (\alpha_1, \cdots, \alpha_b)^T$ is the parameter vector and $\psi(x) = (\psi_1(x), \cdots, \psi_b(x))^T$ is a vector of non-negative basis functions. Then $w_\alpha(x) p(x)$ can be seen as an estimate of $p'(x)$. We measure the discrepancy between $w_\alpha(x) p(x)$ and $p'(x)$ by the KL divergence, i.e.

$$\mathrm{KL}\bigl(p' \,\|\, w_\alpha p\bigr) = \int p'(x) \log \frac{p'(x)}{w_\alpha(x) p(x)} \, \mathrm{d}x$$
In general, the KL divergence is non-negative and equals zero only if $w_\alpha p = p'$. When the KL divergence is sufficiently small, $w_\alpha p$ can be regarded as close to $p'$. To guarantee that $w_\alpha p$ is a well-defined probability density, we impose the following constraints:

$$\int w_\alpha(x) p(x)\, \mathrm{d}x = 1, \qquad w_\alpha(x) p(x) \ge 0 \ \text{ for all } x$$
Approximating the expectations by sample averages, the estimation becomes the following optimization problem:

$$\max_{\alpha} \; \frac{1}{n'} \sum_{i'=1}^{n'} \log w_\alpha(x'_{i'}) \quad \text{s.t.} \quad \frac{1}{n} \sum_{i=1}^{n} w_\alpha(x_i) = 1, \quad \alpha_1, \ldots, \alpha_b \ge 0$$
We briefly summarize the estimation process:
Initialize $\alpha$.
Repeat the following updates until $\alpha$ reaches a suitable precision:
$$\begin{aligned}
\alpha &\leftarrow \alpha + \epsilon A^T \bigl(1 ./ (A\alpha)\bigr) \\
\alpha &\leftarrow \alpha + \frac{(1 - b^T \alpha)\, b}{b^T b} \\
\alpha &\leftarrow \max(0, \alpha) \\
\alpha &\leftarrow \alpha / (b^T \alpha)
\end{aligned}$$
where $A$ is the matrix whose $(i', j)$-th element is $\psi_j(x'_{i'})$, and $b$ is the vector whose $j$-th element is $\frac{1}{n} \sum_{i=1}^{n} \psi_j(x_i)$.
Here is an example (Gaussian kernel model).
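A minimal sketch of the iteration above, assuming a Gaussian kernel model with kernels centred at the normal samples; the bandwidth `h`, step size `eps`, fixed iteration count, and toy data are arbitrary choices, and `kliep` is a hypothetical helper name:

```python
import numpy as np

def kliep(X_test, X_normal, h=1.0, eps=0.01, n_iter=1000):
    """Estimate the density ratio w(x) = p'(x)/p(x) with a Gaussian kernel model."""
    def psi(X, centers):
        # Gaussian kernel basis functions centred at the normal samples
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * h ** 2))

    A = psi(X_normal, X_normal)              # A[i', j] = psi_j(x'_{i'})
    b = psi(X_test, X_normal).mean(axis=0)   # b[j] = (1/n) * sum_i psi_j(x_i)
    alpha = np.ones(A.shape[1])

    for _ in range(n_iter):                  # fixed iteration count instead of a convergence test
        alpha = alpha + eps * A.T @ (1.0 / (A @ alpha))
        alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)
        alpha = np.maximum(0.0, alpha)
        alpha = alpha / (b @ alpha)
    return psi(X_test, X_normal) @ alpha     # w_alpha evaluated on the test points

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(100, 1))
X_test = np.vstack([rng.normal(0, 1, size=(50, 1)), [[5.0]]])   # last test point is an outlier
print(kliep(X_test, X_normal)[-1])                              # density ratio far below 1
```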
Furthermore, outlier detection can also be done with support vector machine techniques. Due to the time limit, we only outline the main structure of that algorithm.
A typical SVM outlier detector finds a hyper-ball that contains nearly all the sample points; a point lying outside the hyper-ball can then be regarded as an outlier. Concretely speaking, we obtain the center $c$ and radius $r$ by solving the following optimization problem:

$$\min_{c, r, \xi} \left( r^2 + C \sum_{i=1}^{n} \xi_i \right) \quad \text{s.t.} \quad \|x_i - c\|^2 \le r^2 + \xi_i, \quad \xi_i \ge 0, \quad \forall i = 1, 2, \ldots, n$$

where $C > 0$ is a regularization parameter and the $\xi_i$ are slack variables.
It can be solved using Lagrange multipliers:

$$L(c, r, \xi, \alpha, \beta) = r^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left(r^2 + \xi_i - \|x_i - c\|^2\right) - \sum_{i=1}^{n} \beta_i \xi_i$$
Then its dual problem can be formulated as:

$$\max_{\alpha, \beta} \; \inf_{c, r, \xi} L(c, r, \xi, \alpha, \beta) \quad \text{s.t.} \quad \alpha \ge 0, \; \beta \ge 0$$
The KKT conditions give:

$$\begin{aligned}
\frac{\partial L}{\partial c} = 0 &\;\Rightarrow\; c = \frac{\sum_{i=1}^{n} \alpha_i x_i}{\sum_{i=1}^{n} \alpha_i} \\
\frac{\partial L}{\partial r} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \alpha_i = 1 \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; \alpha_i + \beta_i = C, \quad \forall i = 1, 2, \ldots, n
\end{aligned}$$
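Substituting these conditions back into the Lagrangian, the $r$- and $\xi$-terms vanish, and with $\sum_{i=1}^{n} \alpha_i = 1$ and $c = \sum_{j=1}^{n} \alpha_j x_j$ the remaining objective depends only on $\alpha$:

$$\inf_{c, r, \xi} L = \sum_{i=1}^{n} \alpha_i \|x_i - c\|^2 = \sum_{i=1}^{n} \alpha_i x_i^T x_i - \sum_{i,j=1}^{n} \alpha_i \alpha_j x_i^T x_j$$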
Therefore, the dual problem reduces to

$$\hat{\alpha} = \arg\max_{\alpha} \left( \sum_{i=1}^{n} \alpha_i x_i^T x_i - \sum_{i,j=1}^{n} \alpha_i \alpha_j x_i^T x_j \right) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i = 1, \quad 0 \le \alpha_i \le C, \; \forall i = 1, 2, \ldots, n$$
This is a typical quadratic programming problem. After solving it, we can recover $c$ and $r$:

$$\hat{r}^2 = \left\| x_i - \sum_{j=1}^{n} \hat{\alpha}_j x_j \right\|^2, \qquad \hat{c} = \sum_{i=1}^{n} \hat{\alpha}_i x_i$$

where $x_i$ is any support vector satisfying $\|x_i - \hat{c}\|^2 = \hat{r}^2$, i.e. one with $0 < \hat{\alpha}_i < C$.
Hence, when a sample point $x$ satisfies

$$\|x - \hat{c}\|^2 > \hat{r}^2,$$

it can be viewed as an outlier.
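A minimal sketch of the whole procedure with the linear kernel, assuming `scipy.optimize.minimize` (SLSQP) as a generic quadratic-programming solver; the value of `C`, the toy data, and the tolerances for picking boundary support vectors are arbitrary choices, and the squared radius is averaged over those support vectors for numerical stability:

```python
import numpy as np
from scipy.optimize import minimize

def svdd_fit(X, C=0.1):
    """Solve the dual QP above (linear kernel) and return the centre and squared radius."""
    n = len(X)
    K = X @ X.T                                                  # K[i, j] = x_i^T x_j

    def neg_dual(a):                                             # negated dual objective (to minimise)
        return -(a @ np.diag(K) - a @ K @ a)

    cons = ({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},)     # sum_i alpha_i = 1
    bounds = [(0.0, C)] * n                                      # 0 <= alpha_i <= C
    res = minimize(neg_dual, np.full(n, 1.0 / n), bounds=bounds,
                   constraints=cons, options={'maxiter': 500})
    alpha = res.x

    c_hat = alpha @ X                                            # c_hat = sum_i alpha_i x_i
    sv = np.where((alpha > 1e-6) & (alpha < C - 1e-6))[0]        # boundary support vectors
    r2_hat = np.mean(np.sum((X[sv] - c_hat) ** 2, axis=1))       # averaged over boundary SVs
    return c_hat, r2_hat

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(60, 2)), [[5.0, 5.0]]])    # last row is an obvious outlier
c_hat, r2_hat = svdd_fit(X)
print(np.where(np.sum((X - c_hat) ** 2, axis=1) > r2_hat)[0])    # ||x - c||^2 > r^2 flags outliers
```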