
Machine Learning - XV. Anomaly Detection (Week 9)

http://blog.csdn.net/pipisorry/article/details/44783647

Notes on Andrew Ng's Machine Learning course

Anomaly Detection

Problem Motivation

Anomaly detection example


Applications of anomaly detection


Note: for fraud detection, the features of a user's activity on a website might be: x1 = how often the user logs in, x2 = the number of pages visited or the number of transactions, x3 = the number of posts the user makes on the forum, and x4 = the user's typing speed. You can then model p(x) from this kind of data.

Gaussian (Normal) Distribution

The Gaussian distribution


Note:

1. The width of this bell-shaped curve, σ, is also called one standard deviation; σ controls how wide the bell is.

2. p(x; μ, σ²) denotes that the probability of x is parametrized by the two parameters μ and σ².

3. p(x) is plotted as a function of x for fixed values of μ and σ²; σ² is called the variance. (the density is written out below)
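
Written out, the density these notes describe is the univariate Gaussian:

$$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$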

Parameter estimation


Note:

1. Suspect that each of these examples was distributed according to a normal (Gaussian) distribution with some parameters μ and σ².

2. The estimate of μ is just the average of the examples, so μ is the mean parameter.

3. These estimates are actually the maximum likelihood estimates of the parameters μ and σ². (see the formulas below)

4. In statistics the first term is sometimes written 1/(m−1) instead of 1/m. In machine learning people tend to use the 1/m formula, but in practice whether it is 1/m or 1/(m−1) makes essentially no difference, assuming m is a reasonably large training set size.
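
The maximum likelihood estimates referred to in note 3 are:

$$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\qquad\sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$$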

Algorithm (the anomaly detection algorithm)

Density estimation


Note:

1. Model p(x) from the data set; we try to figure out which kinds of feature values have high probability and which have low probability.

2. This equation actually corresponds to an independence assumption on the values of the features x1 through xn, but in practice the algorithm works just fine whether or not the features are anywhere close to independent, even if the independence assumption doesn't hold. (the product is written out below)

3. The problem of estimating this distribution p(x) is sometimes called the problem of density estimation.

4. Each feature j has its own μ_j and σ_j².
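
The per-feature product model in note 2 is:

$$p(x)=\prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^2)$$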

The anomaly detection algorithm


Note:

1. Choose features that describe general properties of the things you are collecting data on.

2. μ_j is just the mean of the values of feature j over the training set.

3. An example is anomalous when the overall probability p(x) of its features is very low; if such an example shows up, it is flagged as an anomaly. (see the sketch below)
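
A minimal sketch of the algorithm in Python/NumPy, under the assumptions above (names like fit_gaussian and epsilon are mine, not the course's):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the per-feature mean and variance (1/m convention)
    from an unlabeled training set X of shape (m, n)."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var uses the 1/m convention by default
    return mu, sigma2

def p(X, mu, sigma2):
    """p(x) as the product of independent univariate Gaussian densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    dens = coef * np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return dens.prod(axis=1)

# Usage sketch:
# mu, sigma2 = fit_gaussian(X_train)
# anomalies = p(X_new, mu, sigma2) < epsilon  # epsilon chosen on a CV set
```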

Anomaly detection example


{Why not use supervised learning here, e.g. an SVM, instead of anomaly detection? This is explained two sections below.}

Developing and Evaluating an Anomaly Detection System


Note:

1. The training set is unlabeled; the cross-validation and test sets are labeled.

2. For an anomaly detection problem, what we want to detect is the anomalous examples, so anomalous corresponds to y = 1.


Note:

1. We call the training set an unlabeled training set, but all of these examples really correspond to y = 0; it is a training set of all, or the vast majority of, good examples.

2. We use these 6,000 engines to fit p(x), i.e. these 6,000 examples estimate the parameters μ1, σ1² up to μn, σn².

3. Some people put the same 4,000 examples in both the cross-validation set and the test set, but we like to think of the cross-validation set and the test set as completely different data sets; reusing them is not considered good machine learning practice.

Algorithm evaluation


Note:

1. These labels will be very skewed, because y = 0 (normal examples) is usually much more common than y = 1 (anomalous examples). Therefore evaluate with precision/recall (e.g. the F1 score), not classification accuracy.

2. To set ε, evaluate the algorithm on the cross-validation set; then, once the features and the value of ε have been picked, do the final evaluation of the algorithm on the test set. (a sketch follows)
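
A sketch of the ε-selection loop, assuming the p() function from the earlier sketch and labeled cross-validation arrays p_cv (densities) and y_cv (hypothetical names):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds and keep the one with the best F1 score
    on the cross-validation set (y = 1 marks anomalies)."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = p_cv < eps                       # predicted anomalies
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```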

Anomaly Detection vs. Supervised Learning

{If we have this labeled data, why don't we just use a supervised learning algorithm, logistic regression or a neural network, to learn directly from the labeled data to predict whether y equals 1 or 0?}

When to use which: the properties of a learning problem that lead you to treat it as anomaly detection versus supervised learning


Note:

1. Anomaly detection: when estimating p(x), i.e. fitting all those Gaussian parameters, we need only negative examples. So if you have a lot of negative data, you can still fit p(x) pretty well.

2. Anomaly detection: in anomaly detection applications there are often many different types of anomalies that could break, say, an aircraft engine, and it can be difficult for an algorithm to learn from a small set of positive examples what the anomalies look like. In particular, future anomalies may look nothing like the ones seen so far; if there is a new way for an aircraft engine to break that you have never seen before, it is more promising to model the negative examples with a Gaussian model p(x) than to try too hard to model the positive examples.

3. For the spam problem, we usually have enough examples of spam email to see most of the different types of spam. Because we have a large set of spam examples, spam is usually treated as a supervised learning setting, even though there may be many different types of spam.

Some applications of anomaly detection versus supervised learning


Note: if you are a major online retailer and have had a lot of people try to commit fraud on your website, fraud detection can sometimes shift over to the supervised learning column. Similarly, for some manufacturing processes, if you manufacture very large volumes and have seen a lot of bad examples, manufacturing can shift to the supervised learning column as well.

Choosing What Features to Use

Transforming non-Gaussian features into Gaussian features


Note:

1. Even if your data looks non-Gaussian, the algorithm will often work just fine.

2. Play with different transformations of the data in order to make it look more Gaussian.

3. More generally, try transformations such as log(x + c) for some constant c, or powers x^c, choosing the constant to make the result look as Gaussian as possible.

4. The new feature x_new = x^0.05 looks more Gaussian than the previous one, so you might use this new feature as the input to the anomaly detection algorithm instead.

5. You could also plot hist(log(x)); that is another transformation you can use, and if it also looks pretty Gaussian you can define x_new = log(x). (see the sketch after this list)
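
A sketch of this histogram-driven workflow (the exponential sample is just a stand-in for a skewed feature):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.exponential(size=1000)  # a skewed, non-Gaussian feature

# Candidate transformations; keep whichever histogram looks most Gaussian.
candidates = {
    "x": x,
    "log(x + 1)": np.log(x + 1),
    "x^0.5": x ** 0.5,
    "x^0.05": x ** 0.05,
}
fig, axes = plt.subplots(1, len(candidates), figsize=(12, 3))
for ax, (name, values) in zip(axes, candidates.items()):
    ax.hist(values, bins=50)
    ax.set_title(name)
plt.show()
```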

Coming up with features for an anomaly detection algorithm


Note:

1. Look at the anomalies that the algorithm is failing to flag, and see if that inspires you to create some new feature, so that with this new feature it becomes easier to distinguish the anomalies from the good examples.

2. The green × is an anomalous example; with only one feature it is misclassified, but adding a second feature x2 makes it possible to separate it correctly.


Note:

1. Suppose a machine has a very high CPU load but not correspondingly high network traffic; you suspect the failure case is that one of the computers has a job stuck in an infinite loop, so the CPU load grows but the network traffic doesn't, because the machine is just spinning its wheels doing CPU work. Create a new feature x5 = CPU load divided by network traffic.

2. By creating features like these, you can start to capture anomalies that correspond to unusual combinations of values of the features. (a small example follows)
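
A tiny illustration of such a derived ratio feature (all names and the synthetic data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
cpu = rng.gamma(2.0, 1.0, size=1000)  # hypothetical CPU-load feature
net = rng.gamma(2.0, 1.0, size=1000)  # hypothetical network-traffic feature
X = np.column_stack([cpu, net])

x5 = cpu / (net + 1e-6)               # guard against division by zero
X_aug = np.column_stack([X, x5])      # augmented features, fed to p(x) as before
```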

Multivariate Gaussian Distribution (Optional)

{sometimes catches anomalies that the earlier algorithm doesn't}


Note:

1. Most of the data lies in this region (the blue region), so the green cross is pretty far away from any of the data seen; it looks like it should be flagged as an anomaly.

2. But for the green cross, p(x1) and p(x2) are each individually fairly normal (it falls within the magenta contours), so the original model would not judge it to be an anomaly.

Multivariate Gaussian (normal) distribution


Multivariate Gaussian (normal) examples


Note:

1. Σ is a covariance matrix and measures the variance, or variability, of the features x1, x2. The diagonal of Σ holds the variances; the off-diagonal entries are covariances, which measure the degree of linear correlation between pairs of features. (the density is written out below)

2. If x1 and x2 tend to be highly correlated with each other, change the off-diagonal entries of this covariance matrix, and the contours become tilted ellipses; increasing the off-diagonal entries from 0.5 to 0.8 makes the distribution more and more thinly peaked along the x1 = x2 line.
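
The density shown on the slides is the standard multivariate Gaussian:

$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$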

Anomaly Detection using the Multivariate Gaussian Distribution (Optional)


Note: you set Σ equal to this expression, which is just like the Σ we computed when using PCA. (the fit is written out below)
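
The parameter fit being referred to is:

$$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)},\qquad\Sigma=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{\top}$$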


Relationship to the original model (how the univariate and multivariate models are connected and how they differ)


Note:

1. The original model actually corresponds to a special case of the multivariate Gaussian distribution: the special case in which the contours of the probability density function p(x) are axis-aligned.

2. The multivariate Gaussian distribution corresponds exactly to the old model when the covariance matrix Σ has only zero elements off the diagonal. (see the identity below)
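
Concretely, a diagonal Σ makes the multivariate density factor into the per-feature product of the original model:

$$\Sigma=\begin{pmatrix}\sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2\end{pmatrix}\quad\Longrightarrow\quad p(x;\mu,\Sigma)=\prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^2)$$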

Machine Learning - XV. Anomaly Detection異常檢測 (Week 9)

Note:

1. The multivariate Gaussian model has a lot of parameters: the covariance matrix Σ is an n×n matrix with roughly n² parameters (because it is symmetric, actually closer to n²/2), so you need a fairly large m, i.e. enough data, to fit all of these parameters.

2. m ≥ 10n is a reasonable rule of thumb to make sure you can estimate the covariance matrix Σ reasonably well.

3. In problems where the training set is very large (m is very large) and n is not too large, the multivariate Gaussian model is well worth considering and may work better, and it can save you from having to spend your time manually creating extra features in case the anomalies turn out to be captured by unusual combinations of values of the features.

4. There are usually two cases in which the covariance matrix Σ is non-invertible. One is failing to satisfy the m > n condition; the second is having redundant features, e.g. two features that are identical (x1 equal to x2), or a feature that is a linear combination of others (say x3 = x4 + x5, so x3 contains no extra information). (a code sketch of the multivariate model follows)
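
A minimal sketch of the multivariate model; the use of scipy.stats.multivariate_normal is my choice of convenience, and the density could equally be coded directly from the formula above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_multivariate(X):
    """Maximum likelihood mu and Sigma (1/m convention) for X of shape (m, n)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / X.shape[0]
    return mu, Sigma

# Usage sketch:
# mu, Sigma = fit_multivariate(X_train)
# p_new = multivariate_normal(mu, Sigma).pdf(X_new)  # raises if Sigma is singular
# anomalies = p_new < epsilon  # epsilon chosen on the CV set as before
```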


from:http://blog.csdn.net/pipisorry/article/details/44783647

