Please credit the source when reposting this article: polobymulberry - cnblogs
0x00 - Preface
At the end of 【AR實驗室】mulberryAR : ORBSLAM2+VVSION I mentioned the on-device test results on an iPhone 5s, where the ExtractORB function, i.e. extracting the ORB features of an image, took a very noticeable share of the time. That makes it the current top priority for optimization. Here I use the continuous image sequence added in 【AR實驗室】mulberryAR : 添加連續圖像作為輸入 as input. This has two benefits: first, the input is identical every run, so the comparison between single-threaded and parallel feature extraction is trustworthy; second, the program can run on the iOS simulator, since no camera is needed, which makes testing very convenient and offers plenty of device models to choose from.
For now I only have two ideas for optimizing the feature extraction step:
- Parallelize the feature extraction process.
- Reduce the number of extracted feature points.
The second method is trivial: just change the number of feature points to extract in the configuration file, so I won't dwell on it. This post focuses on the first method, a first attempt at parallelizing feature extraction.
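(For reference, in a stock ORB-SLAM2 settings file the relevant entry is ORBextractor.nFeatures; the values below are the defaults from the TUM example configuration, and mulberryAR's own settings file may differ.)
# ORB Extractor: Number of features per image
ORBextractor.nFeatures: 1000
# ORB Extractor: Scale factor between levels in the scale pyramid
ORBextractor.scaleFactor: 1.2
# ORB Extractor: Number of levels in the scale pyramid
ORBextractor.nLevels: 8
# ORB Extractor: FAST thresholds
ORBextractor.iniThFAST: 20
ORBextractor.minThFAST: 7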
0x01 - Timing Analysis of the ORB-SLAM2 Feature Extraction Process
In ORB-SLAM2 the feature extraction function is called ExtractORB, a member function of the Frame class; it extracts the ORB feature points of the current Frame.
// flag is for the stereo camera; for a monocular camera flag is always 0
// Extracts the ORB feature points of im
void Frame::ExtractORB(int flag, const cv::Mat &im)
{
    if(flag==0)
        // mpORBextractorLeft is an ORBextractor object; ORBextractor overloads operator(),
        // which is why it can be called like this
        (*mpORBextractorLeft)(im,cv::Mat(),mvKeys,mDescriptors);
    else
        (*mpORBextractorRight)(im,cv::Mat(),mvKeysRight,mDescriptorsRight);
}
As the code above shows, ORB-SLAM2's feature extraction mainly goes through the operator() that ORBextractor overloads. Let's instrument the important parts of that function and measure how long each part takes.
Important note - timing a piece of code:
There are many ways to measure how long a piece of code takes to execute, for example:
clock_t begin = clock();
//...
clock_t end = clock();
cout << "execute time = " << (double)(end - begin) / CLOCKS_PER_SEC << "s" << endl;
However, when I used this approach to time the multi-threaded summation in 【原】C++11并行計算 — 數組求和, I found it gives wrong results for multi-threaded code (clock() accumulates CPU time across all threads rather than measuring wall-clock time). Since I am currently on the iOS platform, I use the iOS way of measuring time here instead; and because Foundation cannot be used directly in a C++ file, I use the corresponding CoreFoundation API.
CFAbsoluteTime beginTime = CFAbsoluteTimeGetCurrent();
CFDateRef beginDate = CFDateCreate(kCFAllocatorDefault, beginTime);
// ...
CFAbsoluteTime endTime = CFAbsoluteTimeGetCurrent();
CFDateRef endDate = CFDateCreate(kCFAllocatorDefault, endTime);
CFTimeInterval timeInterval = CFDateGetTimeIntervalSinceDate(endDate, beginDate);
cout << "execute time = " << (double)(timeInterval) * 1000.0 << "ms" << endl;
将上述計時代碼插入到operator()函數中,目前函數整體看起來如下,主要是對三個部分進行計時,分别為ComputePyramid、ComputeKeyPointsOctTree和ComputeDescriptors:
void ORBextractor::operator()( InputArray _image, InputArray _mask, vector<KeyPoint>& _keypoints,
OutputArray _descriptors)
{
if(_image.empty())
return;
Mat image = _image.getMat();
assert(image.type() == CV_8UC1 );
// 1. Time ComputePyramid (building the image pyramid)
CFAbsoluteTime beginComputePyramidTime = CFAbsoluteTimeGetCurrent();
CFDateRef computePyramidBeginDate = CFDateCreate(kCFAllocatorDefault, beginComputePyramidTime);
// Pre-compute the scale pyramid
ComputePyramid(image);
CFAbsoluteTime endComputePyramidTime = CFAbsoluteTimeGetCurrent();
CFDateRef computePyramidEndDate = CFDateCreate(kCFAllocatorDefault, endComputePyramidTime);
CFTimeInterval computePyramidTimeInterval = CFDateGetTimeIntervalSinceDate(computePyramidEndDate, computePyramidBeginDate);
cout << "ComputePyramid time = " << (double)(computePyramidTimeInterval) * 1000.0 << endl;
vector < vector<KeyPoint> > allKeypoints;
// 2. Time ComputeKeyPointsOctTree (detecting the KeyPoints)
CFAbsoluteTime beginComputeKeyPointsTime = CFAbsoluteTimeGetCurrent();
CFDateRef computeKeyPointsBeginDate = CFDateCreate(kCFAllocatorDefault, beginComputeKeyPointsTime);
ComputeKeyPointsOctTree(allKeypoints);
//ComputeKeyPointsOld(allKeypoints);
CFAbsoluteTime endComputeKeyPointsTime = CFAbsoluteTimeGetCurrent();
CFDateRef computeKeyPointsEndDate = CFDateCreate(kCFAllocatorDefault, endComputeKeyPointsTime);
CFTimeInterval computeKeyPointsTimeInterval = CFDateGetTimeIntervalSinceDate(computeKeyPointsEndDate, computeKeyPointsBeginDate);
cout << "ComputeKeyPointsOctTree time = " << (double)(computeKeyPointsTimeInterval) * 1000.0 << endl;
Mat descriptors;
int nkeypoints = 0;
for (int level = 0; level < nlevels; ++level)
nkeypoints += (int)allKeypoints[level].size();
if( nkeypoints == 0 )
_descriptors.release();
else
{
_descriptors.create(nkeypoints, 32, CV_8U);
descriptors = _descriptors.getMat();
}
_keypoints.clear();
_keypoints.reserve(nkeypoints);
int offset = 0;
// 3. Time computing the descriptors
CFAbsoluteTime beginComputeDescriptorsTime = CFAbsoluteTimeGetCurrent();
CFDateRef computeDescriptorsBeginDate = CFDateCreate(kCFAllocatorDefault, beginComputeDescriptorsTime);
for (int level = 0; level < nlevels; ++level)
{
vector<KeyPoint>& keypoints = allKeypoints[level];
int nkeypointsLevel = (int)keypoints.size();
if(nkeypointsLevel==0)
continue;
// preprocess the resized image
Mat workingMat = mvImagePyramid[level].clone();
GaussianBlur(workingMat, workingMat, cv::Size(7, 7), 2, 2, BORDER_REFLECT_101);
// Compute the descriptors
Mat desc = descriptors.rowRange(offset, offset + nkeypointsLevel);
computeDescriptors(workingMat, keypoints, desc, pattern);
offset += nkeypointsLevel;
// Scale keypoint coordinates
if (level != 0)
{
float scale = mvScaleFactor[level]; //getScale(level, firstLevel, scaleFactor);
for (vector<KeyPoint>::iterator keypoint = keypoints.begin(),
keypointEnd = keypoints.end(); keypoint != keypointEnd; ++keypoint)
keypoint->pt *= scale;
}
// And add the keypoints to the output
_keypoints.insert(_keypoints.end(), keypoints.begin(), keypoints.end());
}
CFAbsoluteTime endComputeDescriptorsTime = CFAbsoluteTimeGetCurrent();
CFDateRef computeDescriptorsEndDate = CFDateCreate(kCFAllocatorDefault, endComputeDescriptorsTime);
CFTimeInterval computeDescriptorsTimeInterval = CFDateGetTimeIntervalSinceDate(computeDescriptorsEndDate, computeDescriptorsBeginDate);
cout << "ComputeDescriptors time = " << (double)(computeDescriptorsTimeInterval) * 1000.0 << endl;
}
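As a side note, the three copies of CFDate boilerplate above are repetitive, and the CFDateRef objects are never CFRelease'd. A small RAII helper (just a refactoring sketch, not part of mulberryAR) would keep the timing in one place and avoid the leak, since CFAbsoluteTime values are plain doubles in seconds and can simply be subtracted:
#include <CoreFoundation/CoreFoundation.h>
#include <iostream>
#include <string>

// Prints the elapsed wall-clock time of the enclosing scope when it goes out of scope.
struct ScopeTimer
{
    std::string name;
    CFAbsoluteTime begin;
    explicit ScopeTimer(const std::string& n) : name(n), begin(CFAbsoluteTimeGetCurrent()) {}
    ~ScopeTimer()
    {
        CFAbsoluteTime end = CFAbsoluteTimeGetCurrent();
        std::cout << name << " time = " << (end - begin) * 1000.0 << std::endl;
    }
};

// Usage inside operator():
// {
//     ScopeTimer timer("ComputePyramid");
//     ComputePyramid(image);
// } // elapsed time printed here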
Now run mulberryAR on the iPhone 7 simulator against the continuous image sequence I recorded earlier. The results are as follows (only the first three frames are shown):

It is clear that the optimization should focus on ComputeKeyPointsOctTree and ComputeDescriptors.
0x02 - Approach to Optimizing ORB-SLAM2 Feature Extraction
ComputePyramid, ComputeKeyPointsOctTree and ComputeDescriptors all repeat the same operation for every level of the image pyramid, so the work on the different pyramid levels can be parallelized. Following this idea, I went through the three parts of the code in turn.
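Both of the parallelized parts below (2 and 3) follow the same fan-out/join pattern: spawn one std::thread per pyramid level, let each thread touch only its own level's data, then join them all. With the default nlevels of 8 this is an acceptable number of threads even on a phone; for larger counts a thread pool or std::async would be safer. A generic sketch of the pattern (parallelForLevels is a hypothetical helper, not part of ORB-SLAM2 or mulberryAR):
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs work(level) on its own thread for every pyramid level
// and waits for all of them. It assumes work(level) only writes per-level data,
// so no locking is needed.
static void parallelForLevels(int nlevels, const std::function<void(int)>& work)
{
    std::vector<std::thread> threads;
    threads.reserve(nlevels);
    for (int level = 0; level < nlevels; ++level)
        threads.emplace_back(work, level);
    for (auto& t : threads)
        t.join();
}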
1. Parallelizing the ComputePyramid function
This function cannot be parallelized for now, because computing level n of the image pyramid depends on the image of level n-1. Besides, it accounts for only a small share of the total feature extraction time, so parallelizing it would not gain much anyway.
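The dependency is easy to see in the original loop, which (roughly, simplified from ORB-SLAM2's ORBextractor::ComputePyramid, ignoring the border handling) builds level n by resizing level n-1:
// Simplified sketch of ComputePyramid: each level is produced by resizing the
// previous one, so the iterations cannot run in parallel.
for (int level = 0; level < nlevels; ++level)
{
    float scale = mvInvScaleFactor[level];
    cv::Size sz(cvRound((float)image.cols*scale), cvRound((float)image.rows*scale));
    if (level == 0)
        mvImagePyramid[level] = image;  // level 0 is the input image itself
    else
        cv::resize(mvImagePyramid[level-1], mvImagePyramid[level], sz, 0, 0, cv::INTER_LINEAR);
}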
2. Parallelizing the ComputeKeyPointsOctTree function
Parallelizing this function is easy: just move the body of its for(int i = 0; i < nlevels; ++i) loop into a separate function and run each iteration on its own thread. Without further ado, here is the code:
void ORBextractor::ComputeKeyPointsOctTree(vector<vector<KeyPoint> >& allKeypoints)
{
allKeypoints.resize(nlevels);
vector<thread> computeKeyPointsThreads;
for (int i = 0; i < nlevels; ++i) {
computeKeyPointsThreads.push_back(thread(&ORBextractor::ComputeKeyPointsOctTreeEveryLevel, this, i, std::ref(allKeypoints)));
}
for (int i = 0; i < nlevels; ++i) {
computeKeyPointsThreads[i].join();
}
// compute orientations
vector<thread> computeOriThreads;
for (int level = 0; level < nlevels; ++level) {
computeOriThreads.push_back(thread(computeOrientation, mvImagePyramid[level], std::ref(allKeypoints[level]), umax));
}
for (int level = 0; level < nlevels; ++level) {
computeOriThreads[level].join();
}
}
The ComputeKeyPointsOctTreeEveryLevel function is as follows:
void ORBextractor::ComputeKeyPointsOctTreeEveryLevel(int level, vector<vector<KeyPoint> >& allKeypoints)
{
const float W = 30;
const int minBorderX = EDGE_THRESHOLD-3;
const int minBorderY = minBorderX;
const int maxBorderX = mvImagePyramid[level].cols-EDGE_THRESHOLD+3;
const int maxBorderY = mvImagePyramid[level].rows-EDGE_THRESHOLD+3;
vector<cv::KeyPoint> vToDistributeKeys;
vToDistributeKeys.reserve(nfeatures*10);
const float width = (maxBorderX-minBorderX);
const float height = (maxBorderY-minBorderY);
const int nCols = width/W;
const int nRows = height/W;
const int wCell = ceil(width/nCols);
const int hCell = ceil(height/nRows);
for(int i=0; i<nRows; i++)
{
const float iniY =minBorderY+i*hCell;
float maxY = iniY+hCell+6;
if(iniY>=maxBorderY-3)
continue;
if(maxY>maxBorderY)
maxY = maxBorderY;
for(int j=0; j<nCols; j++)
{
const float iniX =minBorderX+j*wCell;
float maxX = iniX+wCell+6;
if(iniX>=maxBorderX-6)
continue;
if(maxX>maxBorderX)
maxX = maxBorderX;
vector<cv::KeyPoint> vKeysCell;
FAST(mvImagePyramid[level].rowRange(iniY,maxY).colRange(iniX,maxX),
vKeysCell,iniThFAST,true);
if(vKeysCell.empty())
{
FAST(mvImagePyramid[level].rowRange(iniY,maxY).colRange(iniX,maxX),
vKeysCell,minThFAST,true);
}
if(!vKeysCell.empty())
{
for(vector<cv::KeyPoint>::iterator vit=vKeysCell.begin(); vit!=vKeysCell.end();vit++)
{
(*vit).pt.x+=j*wCell;
(*vit).pt.y+=i*hCell;
vToDistributeKeys.push_back(*vit);
}
}
}
}
vector<KeyPoint> & keypoints = allKeypoints[level];
keypoints.reserve(nfeatures);
keypoints = DistributeOctTree(vToDistributeKeys, minBorderX, maxBorderX,
minBorderY, maxBorderY,mnFeaturesPerLevel[level], level);
const int scaledPatchSize = PATCH_SIZE*mvScaleFactor[level];
// Add border to coordinates and scale information
const int nkps = keypoints.size();
for(int i=0; i<nkps ; i++)
{
keypoints[i].pt.x+=minBorderX;
keypoints[i].pt.y+=minBorderY;
keypoints[i].octave=level;
keypoints[i].size = scaledPatchSize;
}
}
Testing on the iPhone 7 simulator gives the following results (measured over the first 5 frames):
As you can see, with parallel processing ComputeKeyPointsOctTree gets a 2-3x speedup.
3. Parallelizing the ComputeDescriptors part
I call this a "part" rather than a "function" because, compared with ComputeKeyPointsOctTree, the code involved here is more complex and touches more variables; only after sorting out how they relate to each other can it be parallelized safely.
Without going through all of that here, this is the modified, parallelized code:
vector<thread> computeDescThreads;
vector<vector<KeyPoint> > keypointsEveryLevel;
keypointsEveryLevel.resize(nlevels);
// Each pyramid level's offset depends on the offsets of all previous levels, so it cannot be computed inside ComputeDescriptorsEveryLevel
for (int level = 0; level < nlevels; ++level) {
computeDescThreads.push_back(thread(&ORBextractor::ComputeDescriptorsEveryLevel, this, level, std::ref(allKeypoints), descriptors, offset, std::ref(keypointsEveryLevel[level])));
int keypointsNum = (int)allKeypoints[level].size();
offset += keypointsNum;
}
for (int level = 0; level < nlevels; ++level) {
computeDescThreads[level].join();
}
// _keypoints must be inserted in level order, so this cannot be done inside ComputeDescriptorsEveryLevel either
for (int level = 0; level < nlevels; ++level) {
_keypoints.insert(_keypoints.end(), keypointsEveryLevel[level].begin(), keypointsEveryLevel[level].end());
}
The ComputeDescriptorsEveryLevel function is as follows:
void ORBextractor::ComputeDescriptorsEveryLevel(int level, std::vector<std::vector<KeyPoint> > &allKeypoints, const Mat& descriptors, int offset, vector<KeyPoint>& _keypoints)
{
vector<KeyPoint>& keypoints = allKeypoints[level];
int nkeypointsLevel = (int)keypoints.size();
if(nkeypointsLevel==0)
return;
// preprocess the resized image
Mat workingMat = mvImagePyramid[level].clone();
GaussianBlur(workingMat, workingMat, cv::Size(7, 7), 2, 2, BORDER_REFLECT_101);
// Compute the descriptors
Mat desc = descriptors.rowRange(offset, offset + nkeypointsLevel);
computeDescriptors(workingMat, keypoints, desc, pattern);
// offset += nkeypointsLevel;
// Scale keypoint coordinates
if (level != 0)
{
float scale = mvScaleFactor[level]; //getScale(level, firstLevel, scaleFactor);
for (vector<KeyPoint>::iterator keypoint = keypoints.begin(),
keypointEnd = keypoints.end(); keypoint != keypointEnd; ++keypoint)
keypoint->pt *= scale;
}
// And add the keypoints to the output
// _keypoints.insert(_keypoints.end(), keypoints.begin(), keypoints.end());
_keypoints = keypoints;
}
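Since FAST, DistributeOctTree, computeOrientation and computeDescriptors are all deterministic, the parallel path should produce exactly the same keypoints and descriptors as the original serial path. A quick sanity check worth running once (a sketch only; serialExtractor, parallelExtractor and grayFrame are hypothetical names):
// Run the original and the threaded extractor on the same grayscale frame and
// verify that the outputs match bit for bit.
std::vector<cv::KeyPoint> kpsSerial, kpsParallel;
cv::Mat descSerial, descParallel;
(*serialExtractor)(grayFrame, cv::Mat(), kpsSerial, descSerial);       // original single-threaded build
(*parallelExtractor)(grayFrame, cv::Mat(), kpsParallel, descParallel); // parallelized build
assert(kpsSerial.size() == kpsParallel.size());
assert(cv::countNonZero(descSerial != descParallel) == 0);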
As you can see, with parallel processing ComputeDescriptors also gets a 2-3x speedup.
0x03 - Analysis of the Parallelization Results
Section 0x02 already compared the results of each optimization step, so here I only look at the overall picture. Comparison over the first 5 frames on the iPhone 7 simulator:
The results show that ORB feature extraction is now 2-3x faster and its share of TrackMonocular has dropped considerably, so for the time being ORB feature extraction no longer needs to be the focus of performance optimization. Later posts will optimize ORB-SLAM2 from other angles.