Paper Reading Notes (23): FaceNet: A Unified Embedding for Face Recognition and Clustering

Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.

We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.

In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.

Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
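
A minimal sketch of these three tasks on top of precomputed embeddings (NumPy/scikit-learn; the data and names here are invented for illustration, not part of the paper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins for L2-normalized 128-D embeddings (random data for illustration).
gallery = rng.normal(size=(20, 128))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
labels = rng.integers(0, 4, size=20)
probe = gallery[0] + 0.01 * rng.normal(size=128)
probe /= np.linalg.norm(probe)

# Verification: threshold the squared L2 distance (1.1 echoes Figure 1).
same_person = np.sum((gallery[0] - probe) ** 2) < 1.1

# Recognition: k-NN classification in the embedding space.
knn = KNeighborsClassifier(n_neighbors=3).fit(gallery, labels)
identity = knn.predict(probe[None, :])[0]

# Clustering: off-the-shelf k-means on the embeddings.
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(gallery)
```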

Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottleneck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.

In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail, and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment is performed, other than scale and translation.
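
A minimal PyTorch-style sketch of such a margin-based triplet loss (the paper's exact formulation is given in section 3.1; the margin value here is illustrative):

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared L2 distances: the positive should be closer to
    the anchor than the negative by at least `margin`. Each argument is
    a batch of embeddings of shape (N, 128)."""
    d_pos = torch.sum((anchor - positive) ** 2, dim=1)
    d_neg = torch.sum((anchor - negative) ** 2, dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```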

Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
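
As a rough illustration of online (in-batch) mining, the following sketch selects, for each anchor, a "semi-hard" negative: one that is farther away than the hardest positive but still inside the margin. The exact selection procedure is described in section 3.2, so treat the criterion and the fallback here as assumptions of this sketch:

```python
import torch

def semi_hard_negatives(embeddings, labels, margin=0.2):
    """Return, per anchor, the index of a semi-hard negative in the batch."""
    d = torch.cdist(embeddings, embeddings) ** 2        # squared L2 distances
    same = labels[:, None] == labels[None, :]           # same-identity mask
    pos_d = (d * same.float()).max(dim=1).values        # hardest positive per anchor
    inf = torch.full_like(d, float("inf"))
    # Semi-hard candidates: different identity, pos_d < d < pos_d + margin.
    cand = (~same) & (d > pos_d[:, None]) & (d < pos_d[:, None] + margin)
    semi = torch.where(cand, d, inf)
    hard = torch.where(~same, d, inf)                   # fallback: closest negative
    use_semi = torch.isfinite(semi).any(dim=1)
    return torch.where(use_semi, semi.argmin(dim=1), hard.argmin(dim=1))
```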

As an illustration of the incredible variability that our method can handle, see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.

An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in section 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.

Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.

In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
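
As a rough illustration (the layer sizes below are invented for the example, not the paper's configuration), a Zeiler&Fergus-style stage with an added 1×1×d convolution might look like:

```python
import torch.nn as nn

# Illustrative only: conv -> ReLU -> local response norm -> max pool,
# followed by a 1x1 convolution that mixes channels at each spatial location.
# Assumes a 64-channel input feature map.
block = nn.Sequential(
    nn.Conv2d(64, 192, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(size=5),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    nn.Conv2d(192, 192, kernel_size=1),   # the 1x1xd layer inspired by [9]
    nn.ReLU(inplace=True),
)
```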

There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper so we will only briefly discuss the most relevant recent work.

The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.

Zhenyao et al. [23] employ a deep network to “warp” faces into a canonical frontal view and then learn CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.

Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.

Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these networks, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.

A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.

FaceNet uses a deep convolutional network. We discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.

Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.
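
Written out, the embedding is constrained to the unit hypersphere and the loss enforces a margin α between the positive and negative pair of every triplet. This is the standard triplet-loss formulation; the paper's own definition appears in section 3.1:

```latex
\|f(x)\|_2 = 1, \qquad
\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2
\quad \forall\, (x_i^a, x_i^p, x_i^n) \in \mathcal{T},
```

with the loss being minimized given by

```latex
L = \sum_{i}^{N} \Big[\, \|f(x_i^a) - f(x_i^p)\|_2^2
    - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \,\Big]_{+}.
```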

Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.

The following section describes this triplet loss and how it can be learned efficiently at scale.

These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1, while still being compatible. Additionally it would be interesting to train small networks that can run on a mobile phone and are compatible to a larger server side model.

We provide a method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17] that use the CNN bottleneck layer, or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.

Another strength of our model is that it requires only minimal alignment (tight crops around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity transform alignment and noticed that this can actually improve performance slightly. It is not clear whether it is worth the extra complexity.

Future work will focus on better understanding the error cases, further improving the model, and also reducing the model size and CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining.

Figure 1. Illumination and Pose invariance. Pose and illumination have been a long standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.

Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.
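
The L2 normalization constrains each embedding to the unit hypersphere, so distances between faces live on a common scale. A minimal sketch (random tensors as stand-ins for real CNN outputs):

```python
import torch
import torch.nn.functional as F

cnn_output = torch.randn(32, 128)                # stand-in for the deep CNN output
embedding = F.normalize(cnn_output, p=2, dim=1)  # L2-normalize: ||f(x)||_2 = 1
```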

Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
