
Understanding feature-point-based visual global localization in autonomous driving


In autonomous driving, perception, localization, planning and decision-making, and control are the four basic system modules. Since current algorithms cannot achieve absolute intelligence, a great deal of prior knowledge is still needed to improve module performance and robustness and to achieve safe autonomous driving. High-precision maps integrate this prior knowledge about roads and the surrounding environment, and accurate localization against the map is an important basis for judging the driving situation, providing strong support for subsequent perception, planning, and decision-making.

The main data sources for localization are currently GPS, lidar, vision, and millimeter-wave radar. For vision, although the industry has not yet settled on a sufficiently reliable localization scheme, exploration in this direction has never stopped, mainly for the following reasons:

Safety is the most important indicator of an autonomous driving system, so most functions are implemented by coupling multi-source data and the results of different algorithms. No single sensor solution is perfect. For example, GPS RTK, though widely used, is easily affected by satellite visibility, weather, and data-link transmission conditions, and cannot be used in tunnels, indoors, or among high-rise buildings. Lidar has the advantages of low computational load, direct depth information, and insensitivity to lighting, but its data is sparse, it remains expensive, and it is not yet ready for large-scale vehicle deployment. By comparison, the visual information provided by cameras is affected by lighting and weather, but it is low-cost and rich in content; it is the main data source of current driver-assistance systems and also has great potential in map-based localization.

Since mainstream visual localization algorithms share the same core ideas, this article introduces only the feature-point-based global localization algorithm most commonly used in practice, that is, localization in the map coordinate system, from the perspective of the main components of the algorithm framework. The derivations of the optimization and geometric-constraint formulas involved are omitted; the aim is to give readers a high-level introduction to the localization algorithm, and the details can be found in the relevant literature and textbooks.

Global localization algorithm based on feature points

Visual global localization estimates the 6-degree-of-freedom (DoF) pose of the camera in the map coordinate system from the current image, that is, the (x, y, z) coordinates and the rotation angles (yaw, pitch, roll) around the three coordinate axes. Current methods can be broadly classified into those based on 3D structure, those based on 2D images, those based on image sequences, and those based on deep learning. The deep-learning-based methods are end-to-end, while the other, multi-stage methods differ in their pipelines but mostly follow the idea shown in Fig. 1:


Figure 1: Based on the query image, the 2D-3D transformation matrix is calculated to solve the camera pose

Based on the established map, match the query against the most similar subset of the mapping history (images / point cloud / feature points), compute the transformation matrix between matched point pairs using the ground-truth historical poses and feature-point coordinates provided by that map subset, and solve for the current camera pose.

The core therefore consists of four parts: image description, map query, feature matching, and pose calculation. This is only a high-level classification; an actual algorithm framework is not necessarily implemented in this order, and research has mainly focused on improving these individual technologies. Overall, feature-point-based image description is basically mature and sees little further development, and pose calculation is also relatively fixed, since it is an optimization problem under geometric constraints. In contrast, most improvements target map query and feature matching. Depending on the data source, map query and matching can be 2D-2D, 2D-3D, or 3D-3D. 2D images are obtained by the camera, while 3D point clouds can be generated by stereo cameras or RGB-D cameras that provide depth.

Feature point extraction

A 2D image itself is a matrix of brightness and color, which is sensitive to viewpoint, lighting, and hue changes and is difficult to use directly. Therefore, the relevant calculations are generally performed on representative points. Ideally such points should be invariant to rotation, translation, scale, and illumination. These points are called the feature points of the image and consist of two parts: key points and descriptors. The key point expresses the location of the feature point, while the descriptor describes its visual appearance, usually in vector form. In general, a descriptor is a pattern that statistically summarizes the grayscale/color gradient changes around the key point; a robust descriptor should yield a small distance between descriptors of the same feature point across different images and conditions.

Descriptors are generally hand-crafted features. Classic descriptors include HOG (Histogram of oriented gradients) [1], SIFT (Scale-invariant feature transform) [2], SURF (Speeded up robust features) [3], and AKAZE (Accelerated KAZE) [4].

To meet real-time requirements, some faster-to-compute binary-pattern descriptors have been designed, such as LBP (Local binary patterns) [5], BRIEF (Binary robust independent elementary features), ORB (Oriented FAST and rotated BRIEF) [6], BRISK (Binary robust invariant scalable keypoints) [7], and FREAK (Fast retina keypoint) [8].

Before the rise of deep learning, these hand-crafted features led the entire computer vision field, and they are still widely used today in scenarios that lack labeled data or are subject to tight constraints. Two commonly used descriptors are briefly introduced below.

SIFT

SIFT is arguably one of the most influential descriptors in computer vision. At the key-point detection level, it mainly uses the Difference of Gaussians (DoG) to detect extrema in multi-scale space as key points. Babaud et al. [9] proved that the Gaussian kernel is the only smoothing kernel that can build a multi-scale space, providing theoretical support for this approach.

So why can such an approach find feature key points?

Since Gaussian kernels can blur an image into different scale spaces, smooth regions with small gradient variation show little difference across scales, whereas edges, points, corners, and textured regions change more strongly. By taking differences between images of adjacent scales, extrema of the multi-scale space can finally be detected. However, different image details themselves exist at different scales. For example, in a portrait, the face may be smoothed into a blob after only a small amount of blur, while the corners of the picture frame may require a larger degree of smoothing before they exhibit a local "extremum".

Therefore, as shown in Fig. 2, the images are first grouped into octaves using an image pyramid, and within each octave Gaussian kernels of different scales are applied to form a series of layers. This works better than simply using Gaussian kernels of more scales and detects more feature points. Note that although SIFT uses DoG for key-point detection, other detection methods are also feasible and do not affect the construction of the SIFT descriptor.


Figure 2: The Difference-of-Gaussians (DoG) method

The SIFT descriptor can be understood as a simple statistical version of HOG. As shown in Fig. 3, a 16 × 16 region centered on the detected key point is selected and divided into a 4 × 4 grid of blocks (patches), each 4 × 4 pixels. For each block, an 8-bin histogram of gradients is computed: the gradient direction determines which bin a pixel falls into, and the gradient magnitude determines its contribution. To ensure scale consistency, the gradient magnitudes are normalized; to ensure rotation invariance, a principal orientation is computed from all gradients in the 16 × 16 region and all gradients are rotated into that orientation. The result is a 4 × 4 × 8 = 128-dimensional vector.


Figure 3: SIFT descriptor based on gradient chunk statistics
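As a quick illustration of the key point / descriptor split described above, here is a minimal sketch using OpenCV (assuming OpenCV 4.4 or later, where SIFT is available in the main module; the image path is a placeholder):

```python
import cv2

# Detect DoG keypoints and compute 128-dimensional SIFT descriptors.
img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder image path

sift = cv2.SIFT_create(nfeatures=2000)                 # cap the number of keypoints
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint stores its location, scale and principal orientation;
# each descriptor row is the 4 x 4 x 8 = 128-dimensional gradient histogram.
print(len(keypoints), descriptors.shape)               # e.g. 2000, (2000, 128)
```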

Binary descriptors

Although improved algorithms such as SURF and AKAZE have appeared since SIFT was proposed, even today, in 2019, it is still difficult for them to meet the real-time requirements of some scenarios. Handheld devices, for example, generally have limited computing power, and in autonomous driving, CPU and GPU resources must be shared by several compute-intensive modules at the same time. Efficiency is therefore an important indicator of an algorithm's practicality.

To improve efficiency, a number of binary descriptors have been proposed. In general, these methods sample points around the feature key point, compare the grayscale values of point pairs, and represent each comparison result as 0/1, forming an N-dimensional binary vector that constitutes the binary pattern of the feature point. The main differences between binary descriptors lie in their sampling patterns and in how point pairs are selected.

Figure 4: LBP descriptor sampling pattern

As shown in Fig. 4, the LBP descriptor samples points on a circle around the key point and compares them with the grayscale of the central key point; the comparison result is shown on the ring, with black dots as 0 and white dots as 1. LBP is the simplest form of binary descriptor, while ORB improves on the BRIEF feature and is currently the more commonly used one. As shown in Fig. 5, instead of simply comparing against the center point, ORB selects point pairs randomly, which describes local details more comprehensively. However, the correlation between point pairs then becomes larger, reducing the discriminativeness of the descriptor. ORB addresses this with a greedy, exhaustive search for random point pairs with low correlation.


Figure 5: ORB descriptor point-pair selection pattern

The sampling and point-pair selection of the above binary descriptors follow general intuition, whereas descriptors such as BRISK and FREAK build their binary patterns in a more regular way that also carries scale information. The FREAK descriptor, for example, mimics the sampling pattern of the human retina. As shown in Fig. 6, the value of each sample point is the mean grayscale within the red circle, and the blue lines indicate the point-pair selection scheme.


Figure 6: FREAK descriptor sampling and point-pair selection

The high efficiency of the binary descriptor is mainly reflected in three aspects.

(1) A binary descriptor uses a binary vector as the feature description; it only needs to compare the grayscale of point pairs, without computing gradients.

(2) Two descriptors can be compared with the Hamming distance, which is faster to compute and easier to optimize.

(3) Since each binary vector corresponds to a decimal number, it can serve directly as a lookup-table index, instead of requiring histogram statistics as in SIFT.

The discriminativeness of binary descriptors is generally not as good as that of the SIFT family, but in specific scenarios, and with parallel programming, they can be tens or even hundreds of times faster while maintaining comparable discriminative ability.
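To make the efficiency argument concrete, here is a minimal sketch (again with OpenCV, and with placeholder image paths) that extracts ORB descriptors, each a 256-bit binary pattern stored as 32 bytes, and matches them with the Hamming distance:

```python
import cv2

img1 = cv2.imread("frame_a.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("frame_b.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matching; crossCheck keeps only mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(des1.dtype, des1.shape[1])   # uint8, 32 bytes (256 bits) per descriptor
```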

Database creation and querying

The database can be understood as map + index. The map may consist of pure 2D images, a 3D point cloud, or a combination of 2D images and a 3D point cloud. 3D point cloud maps are mainly generated by SfM (Structure from Motion), which infers 3D structure from 2D images in a time sequence; if stereo or RGB-D cameras provide depth, more accurate 3D points can be obtained. Mapping also involves selection strategies such as key-frames, which are beyond the scope of this article; interested readers can consult the relevant material. The role of the database is:

For an input observation image, query the mapping history (images / point cloud / feature points) through the database, obtain the map subset (images / point cloud / feature points) most likely to match the current image, match the map against the observation, compute the transformation matrix, and obtain the pose of the observing camera.

Indexing is the key to speeding up this process. The database itself tends to be huge. Taking the trial operation of Meituan's delivery robot on the second floor of Beijing Chaoyang Joy City as an example: with three depth cameras installed, nearly 80,000 images of 900 × 600 were used even after screening. Given the real-time requirements of localization, it is impossible to compare the query against all 80,000 images, so indexing techniques are needed to accelerate the whole algorithm. This area overlaps heavily with loop-closure detection in SLAM and with image retrieval and place recognition in vision; only the general methods are described below.

An image contains many feature points, which first need to be encoded, for example with VLAD (Vector of locally aggregated descriptors), aggregating the local descriptors into a global description of the image. Indexes such as kd-trees are then used for image-level queries. Encoding and indexing can also be combined, as in the Bag-of-Words (BoW) model + direct index + inverse index approach.
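As a toy sketch of the image-level query: suppose every database image has already been encoded into a global vector (for example a VLAD code, described next) and stored in a hypothetical db_codes.npy file; a kd-tree then returns the nearest candidate images:

```python
import numpy as np
from scipy.spatial import cKDTree

db_codes = np.load("db_codes.npy")       # (num_images, code_dim), placeholder file
tree = cKDTree(db_codes)

query_code = db_codes[0]                 # stand-in for the encoded query image
dist, idx = tree.query(query_code, k=5)  # indices of the top-5 candidate images
```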

VLAD encoding

VLAD (Vector of locally aggregated descriptors) [10], shown in Fig. 7, is a simple encoding that aggregates local descriptors through a codebook: each descriptor is assigned to its nearest code word, and the residuals (descriptor minus code word) are accumulated to form the global encoding. With a codebook of k code words and D-dimensional descriptors, this yields a k × D-dimensional vector whose entries are the accumulated differences, in each dimension, between the descriptors and their assigned code words. The vector is then normalized to form the final VLAD vector.


Figure 7: VLAD encodes by accumulating the residuals between descriptors and code words
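The encoding itself is only a few lines of NumPy. Below is a minimal sketch, assuming the codebook (a k × D array) has already been learned offline with k-means:

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """descriptors: (N, D) local descriptors of one image; codebook: (k, D)."""
    k, d = codebook.shape
    # Assign each descriptor to its nearest code word.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        if np.any(assign == i):
            # Accumulate residuals (descriptor - code word) per code word.
            vlad[i] = np.sum(descriptors[assign == i] - codebook[i], axis=0)
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)   # final L2 normalization
```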

DenseVLAD [11] and NetVLAD [12] deserve special mention here. Torii et al. showed that densely sampled SIFT (DenseSIFT) outperforms standard SIFT in both query and matching. DenseVLAD extracts SIFT descriptors at four scales on a regular grid with a 2-pixel interval, randomly samples 25M descriptors over the whole dataset, and generates a codebook of 128 code words with the k-means algorithm. The VLAD vectors are normalized and reduced with PCA (Principal component analysis) to form a final 4096-dimensional DenseVLAD vector. As shown in Fig. 8, DenseSIFT yields more inliers (green) after matching.


Figure 8: Comparison of inliers (green) after matching with DenseSIFT vs. standard SIFT feature points

NetVLAD, on the other hand, adds supervision to VLAD to strengthen the discriminativeness of the encoding. As shown in Fig. 9, suppose the red and green descriptors come from two images that should not be matched together. Because their distances (radii) to the VLAD cluster center (×) are similar, their encodings become similar after L2 normalization. By adding supervision indicating that the images containing the red and green descriptors should not match, NetVLAD learns a new cluster center (★) that better separates the two descriptors, increasing the difference in their distances (radii) and hence in their encodings.


Figure 9: VLAD cluster center (×) vs. the cluster center learned by NetVLAD (★)

BoW encoding + index

Feature encoding and design ideas based on the Bag-of-Words (BoW) model [13, 14] have played a pivotal role in the development of computer vision and are not introduced in detail here. Taking the case of matching a 2D query image against a 2D image database as an example, this section introduces a common model that integrates BoW encoding and indexing. As shown in Fig. 10, the vocabulary is generated hierarchically: all descriptors in the dataset are partitioned by a tree structure, with each layer computed by k-means clustering. The final leaf nodes correspond to code words (Fig. 10 has 9 code words).


Figure 10: Hierarchical BoW model with direct index and inverse index

The process of building the tree is in fact the process of encoding the original images. However, encoding by itself does not speed up the search: as with VLAD, the query would still have to be compared with every image in the database one by one. An inverse index is therefore designed, which requires no comparison of encoded vectors. As shown in Fig. 11, for a query image, each extracted descriptor is passed down the BoW tree and eventually falls into a visual word k. Each code word maintains an index that records its weight in every database image (Fig. 10). Here the weights are computed with TF-IDF (Term frequency-inverse document frequency): if a word appears frequently in one image but rarely in others, it discriminates that image well and receives a higher weight. The matching images are finally selected through a voting mechanism. Note that an inverse index does not have to be built on a tree-structured BoW; it is simply a means of providing fast queries.


Figure 11: Querying images directly through the inverse index + voting mechanism

The role of the direct index is mainly to record, while building the BoW, which tree nodes the feature points of each database image fall into. When an image is retrieved, its feature points can then be looked up directly through the index without being recomputed, which speeds up the subsequent matching.
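The following toy sketch (only the idea, not the DBoW2 implementation) shows the inverse index + TF-IDF voting described above; db_words[i] is assumed to be the list of visual-word ids observed in database image i:

```python
import math
from collections import defaultdict

def build_inverse_index(db_words, num_words):
    # idf: words that appear in few database images discriminate better.
    df = [sum(1 for words in db_words if w in words) for w in range(num_words)]
    idf = [math.log(len(db_words) / d) if d > 0 else 0.0 for d in df]
    inv = defaultdict(list)                      # word id -> [(image id, tf-idf weight)]
    for img_id, words in enumerate(db_words):
        for w in set(words):
            tf = words.count(w) / len(words)
            inv[w].append((img_id, tf * idf[w]))
    return inv

def query_image(inv, query_words):
    scores = defaultdict(float)
    for w in set(query_words):                   # vote only over images sharing a word
        for img_id, weight in inv.get(w, []):
            scores[img_id] += weight
    return max(scores, key=scores.get) if scores else None
```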

3D point cloud query

In 2D image query, the query operates at the image level, so the spatial extent of the feature points is constrained by the image. 3D point cloud queries lack such constraints and therefore face many difficulties, for example the need to consider spatial continuity and whether the queried points are within the observable range. Here we only look at Sattler's method published in TPAMI 2016 [15], which has been refined over several years into a relatively concise and well-established framework. Since the dictionary-based encoding and search steps overlap with the previous section, only the Active Search and Visibility Filtering mechanisms are described.

Active Search is mainly designed to make the matched 3D points as spatially close and geometrically meaningful as possible. As shown in Fig. 12, a 2D point (red) is matched to a point in the point cloud through a series of encoding and refinement steps (red line). Then, according to the proposed prioritization framework, the 3D point most likely to correspond is selected from the point cloud and matched in the reverse direction (blue line) to a corresponding 2D point in the query image.

Figure 12: Active Search

Figure 13: Visibility Filtering

Visibility Filtering aims to ensure that matched points can actually be observed by the camera (localization is unsupervised, so it is not known whether a match is correct). The approach is to build a bipartite visibility graph while creating the 3D point cloud map with SfM. As shown in Fig. 13 (left), when a point can be observed by two cameras at the same time, a topological relation is established. In Fig. 13 (center), the blue points are matched points, but they conflict from the camera's viewpoint. By clustering the graph over the existing topology, cameras are grouped in pairs, as in Fig. 13 (right), generating a new graph topology. Then, by judging the overlap between sub-graphs, the points most likely to be invisible are filtered out.

Note that although stereo and RGB-D cameras can obtain depth, and querying 2D images can also yield 3D feature coordinates within a limited range, current technology does not provide reliable depth on certain indoor materials or in large outdoor scenes. Matching 2D image points against a 3D point cloud map therefore remains an important approach.

Feature point matching

Feature point matching can be performed jointly with the database query, which is more common in 3D-structure-based queries, or it can be done separately after the query, which is most common in 2D-image-based queries. The purpose of feature matching is to provide matched point pairs for the subsequent transformation-matrix calculation and thereby enable the pose solution.

Classic RANSAC

Random sample consensus (RANSAC) [16] is a classic data filtering and parameter fitting algorithm. It assumes the inliers follow a certain mathematical model and, through iterative calculation, removes outliers and noise points to obtain the model parameters that are optimal in a probabilistic sense. In global localization, the inliers are the correct matches, the outliers are the wrong matches, and the parametric model is the spatial transformation matrix between the matched point pairs. As shown in Fig. 14, matching becomes much more reasonable after RANSAC optimization. The matching subset RANSAC looks for has to satisfy two criteria: as small an inlier reprojection error as possible, and as many inliers as possible. The basic procedure is as follows:

Sample an initial subset of matches.

Compute the transformation matrix.

Compute the reprojection error of the matched points under this transformation matrix.

Remove points with large errors.

Loop, keeping the matching hypothesis that best satisfies the criteria.


Figure 14: (Top) original feature matching; (bottom) matches after RANSAC optimization

The initial candidate matches are generated from descriptor distances, but the reprojection error depends only on the spatial positions of the key points, not on the descriptors themselves; the projection matrices involved are described in the "Pose calculation" section below. It should be pointed out that RANSAC is affected by the quality of the original matches and by its parameter settings: it only guarantees that, with sufficiently high probability, the result is reasonable, not that it is optimal. The main parameters are the inlier threshold and the number of iterations; the probability of obtaining a trustworthy model increases with the number of iterations, while the threshold controls how many matches are kept. In practice, different parameter settings may have to be tried repeatedly to obtain good results.
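The loop above can be sketched generically as follows; fit_model and reproj_error are problem-specific callables (for example, homography fitting and point reprojection) and are assumptions of this sketch rather than a fixed API:

```python
import numpy as np

def ransac(matches, fit_model, reproj_error, min_samples, threshold=3.0, iterations=1000):
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Sample a minimal subset of candidate matches.
        idx = np.random.choice(len(matches), min_samples, replace=False)
        model = fit_model([matches[i] for i in idx])
        if model is None:
            continue
        # 2./3. Score all matches by their reprojection error under this model.
        inliers = [m for m in matches if reproj_error(model, m) < threshold]
        # 4./5. Keep the hypothesis with the most inliers and refit on all of them.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = fit_model(inliers), inliers
    return best_model, best_inliers
```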

Scholars have made many improvements to the classic RANSAC algorithm. As shown in Fig. 15, Universal RANSAC (USAC) [17] proposes a structural diagram that forms a universal RANSAC architecture covering almost all aspects of these improvements, such as pre-filtering, minimal-subset sampling, generating reliable models from minimal subsets, parameter verification, and model refinement.


Figure 15: Universal RANSAC (USAC) algorithm framework

Differentiable RANSAC

Because hand-crafted descriptors still perform well in localization, some researchers have begun to explore replacing parts of the algorithm framework with deep learning, rather than directly replacing the traditional pipeline with an end-to-end pose estimation model. Differentiable RANSAC (DSAC) [18] replaces deterministic hypothesis selection with probabilistic hypothesis selection so that the RANSAC process becomes differentiable. As shown in Fig. 16, the "scoring" step still uses the reprojection error as its indicator, except that the error is computed over the whole image rather than over feature points, and the original step of screening feature-point matches is replaced by directly screening camera-pose hypotheses h by probability. Although the method is still rather limited, DSAC offers a feasible way to inject prior knowledge into the otherwise unsupervised localization framework.

Figure 16: Differentiable RANSAC (DSAC) algorithm framework

Pose calculation

For the correctly matched point pairs, the corresponding transformation matrix must be computed through geometric constraints. Since the coordinates of the points in the database and the camera poses at mapping time are known, the current camera pose can be obtained from the transformation matrix between the observed points and the map points. Some basic notation is defined here: the camera intrinsic matrix is K, and the homogeneous form of the transformation matrix is

$$T = \begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix}$$

where R is the rotation matrix and t is the translation vector.

2D-2D transformation matrix calculation


Figure 17: Epipolar geometry in 2D-2D transformation matrix calculation

For a pair of matched feature points p1, p2 in two 2D images, with coordinates x1, x2 on the normalized plane, the corresponding transformation matrix is computed through the epipolar constraint. As shown in Fig. 17, its geometric meaning is that the two camera centers O1, O2 and the 3D point P are coplanar; this plane is called the epipolar plane, O1O2 is the baseline, and its intersections with the two image planes define the epipolar lines. The epipolar constraint encodes both translation and rotation and is defined as

$$x_2^{T}\, t^{\wedge} R\, x_1 = 0$$

where x1, x2 are the coordinates on the normalized plane and ∧ is the skew-symmetric (cross-product) operator. Collecting the middle terms gives the essential matrix E = t^∧ R and the fundamental matrix F = K^{-T} E K^{-1}, so that

$$x_2^{T} E\, x_1 = p_2^{T} F\, p_1 = 0$$

Since the essential matrix carries no scale information, the epipolar constraint still holds after E is multiplied by any non-zero constant. E can be solved with the classical eight-point algorithm and then decomposed to obtain R and t. The 2D-2D approach therefore has two drawbacks: monocular vision has scale ambiguity, so scale information must be provided at initialization; correspondingly, monocular initialization cannot be a pure rotation but must involve a sufficient amount of translation, otherwise t (and hence E) will be zero.
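A minimal sketch of this 2D-2D case with OpenCV is shown below; the matched pixel coordinates are synthesized here so the snippet is self-contained, whereas in practice they come from feature matching:

```python
import cv2
import numpy as np

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
pts_3d = np.random.uniform([-1, -1, 4], [1, 1, 8], (50, 3))
rvec_true = np.array([0.05, -0.10, 0.02])      # ground-truth relative motion, used
tvec_true = np.array([0.50, 0.00, 0.10])       # only to synthesize the matches
pts1, _ = cv2.projectPoints(pts_3d, np.zeros(3), np.zeros(3), K, None)
pts2, _ = cv2.projectPoints(pts_3d, rvec_true, tvec_true, K, None)
pts1, pts2 = pts1.reshape(-1, 2), pts2.reshape(-1, 2)

# Estimate E with RANSAC, then decompose it; recoverPose selects the (R, t)
# candidate that places the points in front of both cameras.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
n_inliers, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print(R, t)   # note: t is recovered only up to scale, as discussed above
```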

2D-3D transformation matrix calculation

2D-3D matching is an important method in pose estimation. When 2D-3D matched points are known, the PnP (Perspective-n-Point) method is used to solve for the transformation matrix and obtain the camera pose. A 3D point P = (X, Y, Z) is projected onto the camera imaging plane at pixel p = (u, v, 1)^T:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,(R P + t)$$

where s is the scale (depth) factor. The solution can be reduced to a linear problem, with each matched feature providing two linear constraints.

In this way, a minimum of 6 matched points is required; when more than 6 matches are available, a least-squares solution can be constructed, for example via SVD. The P3P method can be seen as a special solution of PnP: as shown in Fig. 18, by adding constraints from triangle similarity, only 3 point pairs are needed. Other solutions include Direct Linear Transformation (DLT), EPnP (Efficient PnP), and UPnP. Besides these linear methods, nonlinear optimization methods such as Bundle Adjustment (BA) are also widely used. BA is something of a universal tool in visual SLAM: it can optimize multiple variables simultaneously and thus alleviate the effect of local errors to some extent; interested readers can consult the relevant material for a deeper understanding.


Figure 18: P3P method in 2D-3D transformation matrix calculations
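A minimal 2D-3D sketch using OpenCV's solvePnPRansac is shown below; the 3D map points and their pixel observations are synthesized so the snippet runs on its own, whereas in practice they come from the 2D-3D matching stage:

```python
import cv2
import numpy as np

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
pts_3d = np.random.uniform(-1, 1, (20, 3))
rvec_true = np.array([0.10, -0.20, 0.05])        # ground-truth pose, used only
tvec_true = np.array([0.30, 0.10, 4.00])         # to synthesize the observations
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_true, tvec_true, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d.reshape(-1, 2), K, None,
    flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0)
R, _ = cv2.Rodrigues(rvec)                       # rotation vector -> rotation matrix
camera_center = -R.T @ tvec                      # camera position in the map frame
print(rvec.ravel(), tvec.ravel())                # should recover rvec_true, tvec_true
```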

3D-3D transformation matrix calculation

The transformation matrix between 3D points can be solved with the Iterative Closest Point (ICP) algorithm. Assuming the matched point pairs (p_i, p_i') are correct, the resulting transformation matrix should minimize the reprojection error. The least-squares problem can be solved with SVD:

$$\min_{R,\,t}\; \frac{1}{2} \sum_{i=1}^{n} \left\| p_i - (R\, p_i' + t) \right\|^2$$

Alternatively, Bundle Adjustment, a nonlinear optimization method formulated on the Lie algebra, can be used:

$$\min_{\xi}\; \frac{1}{2} \sum_{i=1}^{n} \left\| p_i - \exp(\xi^{\wedge})\, p_i' \right\|^2$$

where ξ represents the camera pose. The optimization objective is similar to the Bundle Adjustment in 2D-3D matching, except that no camera intrinsics are involved, because the 2D points in the original images have already been lifted from the camera imaging plane into the 3D world by the stereo or RGB-D depth camera.

ICP has been shown to have either a unique solution or infinitely many solutions. When a unique solution exists, the optimization function is equivalent to a convex function and its minimum is the global optimum, so this unique solution can be obtained regardless of initialization. This is a great advantage of the ICP approach.
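For the SVD solution mentioned above, a minimal sketch (assuming the point pairs are already matched and stored as (N, 3) arrays) is:

```python
import numpy as np

def align_3d_3d(src, dst):
    """Closed-form R, t minimizing sum ||dst_i - (R @ src_i + t)||^2."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t
```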

This article has introduced the feature-point-based pose estimation algorithm from four aspects: image description, map query, feature matching, and pose calculation. Although traditional visual global localization is still the first choice in practical applications, it rests on the premise that feature points are correctly defined, correctly extracted, correctly matched, and correctly observed, which is a huge challenge for vision itself. Moreover, because the traditional approach is a multi-stage framework rather than end-to-end, every stage and every interaction between stages requires parameter tuning, and each stage is a research direction in its own right. In practice, a large number of tricks tailored to specific scenarios also have to be added, making the engineering quite complex.

Expectations for end-to-end approaches have spawned networks such as PoseNet, VLocNet, and HourglassNet, which have achieved good results on benchmark datasets. The author believes current end-to-end methods still have many problems, mainly: the loss functions lack geometric constraints; the 6-DoF pose space at mapping time is not continuous, making it difficult to form a good mapping with the input space; and corresponding pose regression and refinement mechanisms are lacking. Nevertheless, as the most powerful modeling tool for nonlinear spaces, deep learning will undoubtedly appear more and more in the localization field in the future.

Returning to visual localization itself: the most important advantages of vision are low cost, rich semantics, and few restrictions on usage scenarios. Localization fusion schemes based primarily on vision and supplemented by other low-cost sensors will therefore be an important topic in the future.

Resources

[1] Dalal, N., and B. Triggs. ”Histograms of oriented gradients for human detection.” CVPR, 2005.

[2] Lowe, David G. ”Distinctive Image Features from Scale-Invariant Keypoints.” IJCV, 2004.

[3] Bay, Herbert, T. Tuytelaars, and L. V. Gool. "SURF: Speeded Up Robust Features." ECCV, 2006.

[4] P. F. Alcantarilla, J. Nuevo, and A. Bartoli. "Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces." BMVC, 2013.

[5] Ojala, Timo. ”Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns.” ECCV, 2000.

[6] Rublee, Ethan , et al. ”ORB: An efficient alternative to SIFT or SURF.” ICCV, 2011.

[7] Leutenegger, Stefan , M. Chli , and R. Y. Siegwart . ”BRISK: Binary Robust invariant scalable keypoints.” ICCV, 2011

[8] Alahi, Alexandre , R. Ortiz , and P. Vandergheynst . ”FREAK: Fast retina keypoint.” CVPR, 2012.

[9] Babaud, J., A. P. Witkin, M. Baudin, and R. O. Duda. "Uniqueness of the Gaussian Kernel for Scale-Space Filtering." TPAMI, 1986.

[10] Jegou, Herve , et al. ”Aggregating local descriptors into a compact image representation.” CVPR, 2010.

[11] Torii, Akihiko , et al. ”24/7 place recognition by view synthesis.” CVPR, 2015.

[12] Arandjelovic, Relja, et al. ”NetVLAD: CNN architecture for weakly supervised place recognition.” TPAMI, 2017.

[13] Li, Fei-Fei. "A Bayesian Hierarchical Model for Learning Natural Scene Categories." CVPR, 2005.

[14] Galvez-Lopez, D. , and J. D. Tardos . ”Bags of Binary Words for Fast Place Recognition in Image Sequences.” TRO, 2012.

[15] Sattler, Torsten , B. Leibe , and L. Kobbelt . ”Efficient & Effective Prioritized Matching for Large- Scale Image-Based Localization.” TPAMI, 2016.

[16] Fischler, Martin A., and R. C. Bolles. ”Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.” Communications of the ACM, 1981.

[17] Raguram, Rahul , et al. ”USAC: A Universal Framework for Random Sample Consensus.” TPAMI, 2013.

[18] Brachmann, Eric, et al. ”DSAC —Differentiable RANSAC for Camera Localization.” CVPR, 2017.


