随機抽樣一緻性算法（RANSAC）

本文翻譯自維基百科，英文原文位址是：http://en.wikipedia.org/wiki/ransac，如果您英語不錯，建議您直接檢視原文。

RANSAC是“RANdom SAmple Consensus（随機抽樣一緻）”的縮寫。它可以從一組包含“局外點”的觀測資料集中，通過疊代方式估計數學模型的參數。它是一種不确定的算法——它有一定的機率得出一個合理的結果；為了提高機率必須提高疊代次數。該算法最早由Fischler和Bolles于1981年提出。

RANSAC的基本假設是：

（1）資料由“局内點”組成，例如：資料的分布可以用一些模型參數來解釋；

（2）“局外點”是不能适應該模型的資料；

（3）除此之外的資料屬于噪聲。

局外點産生的原因有：噪聲的極值；錯誤的測量方法；對資料的錯誤假設。

RANSAC也做了以下假設：給定一組（通常很小的）局内點，存在一個可以估計模型參數的過程；而該模型能夠解釋或者适用于局内點。

本文内容

1 示例

2 概述

3 算法

4 參數

5 優點與缺點

6 應用

7 參考文獻

一、示例

一個簡單的例子是從一組觀測資料中找出合适的2維直線。假設觀測資料中包含局内點和局外點，其中局内點近似的被直線所通過，而局外點遠離于直線。簡單的最小二乘法不能找到适應于局内點的直線，原因是最小二乘法盡量去适應包括局外點在内的所有點。相反，RANSAC能得出一個僅僅用局内點計算出模型，并且機率還足夠高。但是，RANSAC并不能保證結果一定正确，為了保證算法有足夠高的合理機率，我們必須小心的選擇算法的參數。

左圖：包含很多局外點的資料集　　　　右圖：RANSAC找到的直線（局外點并不影響結果）

二、概述

RANSAC算法的輸入是一組觀測資料，一個可以解釋或者适應于觀測資料的參數化模型，一些可信的參數。

RANSAC通過反複選擇資料中的一組随機子集來達成目标。被選取的子集被假設為局内點，并用下述方法進行驗證：

有一個模型适應于假設的局内點，即所有的未知參數都能從假設的局内點計算得出。
用1中得到的模型去測試所有的其它資料，如果某個點适用于估計的模型，認為它也是局内點。
如果有足夠多的點被歸類為假設的局内點，那麼估計的模型就足夠合理。
然後，用所有假設的局内點去重新估計模型，因為它僅僅被初始的假設局内點估計過。
最後，通過估計局内點與模型的錯誤率來評估模型。

這個過程被重複執行固定的次數，每次産生的模型要麼因為局内點太少而被舍棄，要麼因為比現有的模型更好而被選用。

三、算法

僞碼形式的算法如下所示：

輸入：

data —— 一組觀測資料

model —— 适應于資料的模型

n —— 适用于模型的最少資料個數

k —— 算法的疊代次數

t —— 用于決定資料是否适應于模型的閥值

d —— 判定模型是否适用于資料集的資料數目

輸出：

best_model —— 跟資料最比對的模型參數（如果沒有找到好的模型，傳回null）

best_consensus_set —— 估計出模型的資料點

best_error —— 跟資料相關的估計出的模型錯誤

iterations = 0
best_model = null
best_consensus_set = null
best_error = 無窮大
while ( iterations < k )
    maybe_inliers = 從資料集中随機選擇n個點
    maybe_model = 适合于maybe_inliers的模型參數
    consensus_set = maybe_inliers

    for ( 每個資料集中不屬于maybe_inliers的點 ）
        if ( 如果點适合于maybe_model，且錯誤小于t ）
            将點添加到consensus_set
    if （ consensus_set中的元素數目大于d ）
        已經找到了好的模型，現在測試該模型到底有多好
        better_model = 适合于consensus_set中所有點的模型參數
        this_error = better_model究竟如何适合這些點的度量
        if ( this_error < best_error )
            我們發現了比以前好的模型，儲存該模型直到更好的模型出現
            best_model =  better_model
            best_consensus_set = consensus_set
            best_error =  this_error
    增加疊代次數
傳回 best_model, best_consensus_set, best_error

RANSAC算法的可能變化包括以下幾種：

（1）如果發現了一種足夠好的模型（該模型有足夠小的錯誤率），則跳出主循環。這樣可能會節約計算額外參數的時間。

（2）直接從maybe_model計算this_error，而不從consensus_set重新估計模型。這樣可能會節約比較兩種模型錯誤的時間，但可能會對噪聲更敏感。

The generic RANSAC algorithm works as follows:

Given:
    data – a set of observed data points
    model – a model that can be fitted to data points
    n – the minimum number of data values required to fit the model
    k – the maximum number of iterations allowed in the algorithm
    t – a threshold value for determining when a data point fits a model
    d – the number of close data values required to assert that a model fits well to data

Return:
    bestfit – model parameters which best fit the data (or nul if no good model is found)

iterations = 0
bestfit = nul
besterr = something really large
while iterations < k {
    maybeinliers = n randomly selected values from data
    maybemodel = model parameters fitted to maybeinliers
    alsoinliers = empty set
    for every point in data not in maybeinliers {
        if point fits maybemodel with an error smaller than t
             add point to alsoinliers
    }
    if the number of elements in alsoinliers is > d {
        % this implies that we may have found a good model
        % now test how good it is
        bettermodel = model parameters fitted to all points in maybeinliers and alsoinliers
        thiserr = a measure of how well model fits these points
        if thiserr < besterr {
            bestfit = bettermodel
            besterr = thiserr
        }
    }
    increment iterations
}
return bestfit

The matlab implementation of 2D line fitting using RANSAC algorithm:

function [bestParameter1,bestParameter2] = ransac_demo(data,num,iter,threshDist,inlierRatio)
 % data: a 2xn dataset with #n data points
 % num: the minimum number of points. For line fitting problem, num=2
 % iter: the number of iterations
 % threshDist: the threshold of the distances between points and the fitting line
 % inlierRatio: the threshold of the number of inliers 
 
 %% Plot the data points
 figure;plot(data(1,:),data(2,:),\'o\');hold on;
 number = size(data,2); % Total number of points
 bestInNum = 0; % Best fitting line with largest number of inliers
 bestParameter1=0;bestParameter2=0; % parameters for best fitting line
 for i=1:iter
 %% Randomly select 2 points
     idx = randperm(number,num); sample = data(:,idx);   
 %% Compute the distances between all points with the fitting line 
     kLine = sample(:,2)-sample(:,1);% two points relative distance
     kLineNorm = kLine/norm(kLine);
     normVector = [-kLineNorm(2),kLineNorm(1)];%Ax+By+C=0 A=-kLineNorm(2),B=kLineNorm(1)
     distance = normVector*(data - repmat(sample(:,1),1,number));
 %% Compute the inliers with distances smaller than the threshold
     inlierIdx = find(abs(distance)<=threshDist);
     inlierNum = length(inlierIdx);
 %% Update the number of inliers and fitting model if better model is found     
     if inlierNum>=round(inlierRatio*number) && inlierNum>bestInNum
         bestInNum = inlierNum;
         parameter1 = (sample(2,2)-sample(2,1))/(sample(1,2)-sample(1,1));
         parameter2 = sample(2,1)-parameter1*sample(1,1);
         bestParameter1=parameter1; bestParameter2=parameter2;
     end
 end
 
 %% Plot the best fitting line
 xAxis = -number/2:number/2; 
 yAxis = bestParameter1*xAxis + bestParameter2;
 plot(xAxis,yAxis,\'r-\',\'LineWidth\',2);

四、參數

我們不得不根據特定的問題和資料集通過實驗來确定參數t和d。然而參數k（疊代次數）可以從理論結果推斷。當我們從估計模型參數時，用p表示一些疊代過程中從資料集内随機選取出的點均為局内點的機率；此時，結果模型很可能有用，是以p也表征了算法産生有用結果的機率。用w表示每次從資料集中選取一個局内點的機率，如下式所示：

w = 局内點的數目 / 資料集的數目

通常情況下，我們事先并不知道w的值，但是可以給出一些魯棒的值。假設估計模型需要標明n個點，wn是所有n個點均為局内點的機率；1 − wn是n個點中至少有一個點為局外點的機率，此時表明我們從資料集中估計出了一個不好的模型。 (1 − wn)k表示算法永遠都不會選擇到n個點均為局内點的機率，它和1-p相同。是以，

1 − p = (1 − wn)k

我們對上式的兩邊取對數，得出

值得注意的是，這個結果假設n個點都是獨立選擇的；也就是說，某個點被標明之後，它可能會被後續的疊代過程重複標明到。這種方法通常都不合理，由此推導出的k值被看作是選取不重複點的上限。例如，要從上圖中的資料集尋找适合的直線，RANSAC算法通常在每次疊代時選取2個點，計算通過這兩點的直線maybe_model，要求這兩點必須唯一。

為了得到更可信的參數，标準偏差或它的乘積可以被加到k上。k的标準偏差定義為：

五、優點與缺點

RANSAC的優點是它能魯棒的估計模型參數。例如，它能從包含大量局外點的資料集中估計出高精度的參數。RANSAC的缺點是它計算參數的疊代次數沒有上限；如果設定疊代次數的上限，得到的結果可能不是最優的結果，甚至可能得到錯誤的結果。RANSAC隻有一定的機率得到可信的模型，機率與疊代次數成正比。RANSAC的另一個缺點是它要求設定跟問題相關的閥值。

RANSAC隻能從特定的資料集中估計出一個模型，如果存在兩個（或多個）模型，RANSAC不能找到别的模型。

六、應用

RANSAC算法經常用于計算機視覺，例如同時求解相關問題與估計立體錄影機的基礎矩陣。

七、參考文獻

Martin A. Fischler and Robert C. Bolles (June 1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography". Comm. of the ACM 24: 381–395. doi:10.1145/358669.358692.
David A. Forsyth and Jean Ponce (2003). Computer Vision, a modern approach. Prentice Hall. ISBN 0-13-085198-1.
Richard Hartley and Andrew Zisserman (2003). Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press.
P.H.S. Torr and D.W. Murray (1997). "The Development and Comparison of Robust Methods for Estimating the Fundamental Matrix". International Journal of Computer Vision 24: 271–300. doi:10.1023/A:1007927408552.
Ondrej Chum (2005). "Two-View Geometry Estimation by Random Sample and Consensus". PhD Thesis. http://cmp.felk.cvut.cz/~chum/Teze/Chum-PhD.pdf
Sunglok Choi, Taemin Kim, and Wonpil Yu (2009). "Performance Evaluation of RANSAC Family". In Proceedings of the British Machine Vision Conference (BMVC). http://www.bmva.org/bmvc/2009/Papers/Paper355/Paper355.pdf.

From: http://www.cnblogs.com/xrwang/archive/2011/03/09/ransac-1.html