
Face recognition accuracy improvement | Transformer-based face recognition (with source code)

Author: Computer Vision Research Institute




Source code download | Reply "FT" to get the source code

Paper: https://arxiv.org/pdf/2103.14803.pdf


Face detection and recognition technology is by now quite mature, with well-established applications in many fields, such as unmanned supermarkets, station security checks, and suspect tracking. However, most of these applications rest on large amounts of training data, which is still very expensive to collect. Face recognition accuracy therefore still needs to improve, and we must keep looking for better face recognition frameworks.


1. Technical Review: Transformer

Compared with convolution, how does the Transformer differ, and what are its advantages?

  1. Convolution has strong inductive biases (e.g., local connectivity and translation invariance). While these are undoubtedly helpful on relatively small training sets, they limit the model's expressiveness once a sufficiently large dataset is available. Compared with CNNs, Transformers carry weaker inductive biases, which lets them represent a wider range of functions and makes them better suited to very large datasets;
  2. Convolutional kernels are designed to capture local spatiotemporal information and cannot model dependencies beyond their receptive field. Although stacking convolutions and deepening the network enlarge the receptive field, these strategies still aggregate information over short ranges and so limit long-range modeling. In contrast, the self-attention mechanism captures both local and global long-range dependencies by directly comparing features at all spatiotemporal locations (a minimal sketch of self-attention follows this list);
  3. When applied to long high-definition videos, training deep CNNs is very computationally intensive. Studies on still images have found that Transformer training and inference are faster than CNNs', so the same computing budget can be used to train better-fitting networks.
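
To make the self-attention comparison concrete, here is a minimal single-head sketch in PyTorch. It is illustrative only, not the paper's code (reply "FT" for that); the shapes and names are my own:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.

    x: (N, D) sequence of N token embeddings of dimension D.
    w_q, w_k, w_v: (D, D_h) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project tokens to queries/keys/values
    scores = q @ k.t() / k.shape[-1] ** 0.5  # (N, N): every token scores every other token
    attn = F.softmax(scores, dim=-1)         # attention weights sum to 1 per query
    return attn @ v                          # weighted sum of values: global receptive field

# Every output token mixes information from all N input tokens in one step,
# unlike a convolution, whose output only sees a local neighborhood.
x = torch.randn(196, 512)                    # e.g. 196 patch tokens, D = 512
w = [torch.randn(512, 64) for _ in range(3)]
out = self_attention(x, *w)                  # (196, 64)
```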

2. Overview

Recently, interest in the Transformer has been growing not only in NLP but also in computer vision. A natural question is whether the Transformer can be used for face recognition and whether it is better than CNNs.


Therefore, the researchers studied the performance of Transformer models on face recognition. Considering that the original Transformer may ignore inter-patch information, they modified the patch generation process so that tokens are produced by overlapping sliding patches. The models are trained on the CASIA-WebFace and MS-Celeb-1M databases and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB, and IJB-C databases. The researchers demonstrated that a Face Transformer model trained on the large-scale MS-Celeb-1M database achieves performance comparable to CNNs with a similar number of parameters and MACs.

3. Face Transformer

3.1 Network Framework

The Face Transformer model uses the ViT architecture [A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929], i.e., the original Transformer. The only difference is that the researchers modified ViT's token generation step to produce tokens with a sliding window, so that image patches overlap and inter-patch information is better preserved, as shown in the figure below.

[Figure: token generation from overlapping sliding patches]

Specifically, sliding patches with patch size P and stride S are extracted from the image X (with implicit zero-padding on both sides of the input), finally yielding a sequence of flattened 2-D patches Xp. Here (W, W) is the resolution of the original image and (P, P) is the resolution of each patch.
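
A minimal sketch of this overlapping tokenization in PyTorch, using torch.nn.Unfold. The 112×112 input size and the concrete P and S values are illustrative assumptions, not taken from the released code:

```python
import torch

def extract_overlapping_patches(img, patch_size=10, stride=8, padding=0):
    """Slide a P x P window with stride S over the image and flatten each window.

    img: (B, C, W, W) batch of face images.
    Returns: (B, N, P*P*C) sequence of flattened patches, where
    N = ((W + 2*padding - P) // S + 1) ** 2.
    """
    unfold = torch.nn.Unfold(kernel_size=patch_size, stride=stride, padding=padding)
    patches = unfold(img)            # (B, P*P*C, N)
    return patches.transpose(1, 2)   # (B, N, P*P*C)

img = torch.randn(2, 3, 112, 112)                  # assumed input resolution
tokens = extract_overlapping_patches(img, 10, 8)   # overlapping: P=10 > S=8
print(tokens.shape)                                # torch.Size([2, 169, 300])
```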

As in ViT, a trainable linear projection maps the flattened patches Xp to the model dimension D, producing the patch embeddings XpE. A class token, i.e., a learnable embedding x_class = z_0^0, is prepended to the patch embeddings; its state at the output of the Transformer encoder, z_L^0, serves as the final face image embedding, as shown below.

$$z_0 = [x_{\mathrm{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E]$$

Then, position embeddings are added to the patch embeddings to preserve positional information:

$$z_0 = [x_{\mathrm{class}};\ x_p^1 E;\ \dots;\ x_p^N E] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N+1) \times D}$$
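
Putting these pieces together, here is a minimal sketch of the embedding stage. The module and variable names are mine, not from the released code; the dimensions follow the equations above:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flattened patches -> linear projection -> prepend class token -> add position embedding."""

    def __init__(self, num_patches, patch_dim, dim=512):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                  # trainable projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos

    def forward(self, patches):                  # patches: (B, N, P*P*C)
        b = patches.shape[0]
        x = self.proj(patches)                   # (B, N, D): x_p E
        cls = self.cls_token.expand(b, -1, -1)   # (B, 1, D)
        x = torch.cat([cls, x], dim=1)           # z_0 = [x_class; x_p^1 E; ...; x_p^N E]
        return x + self.pos_embed                # + E_pos
```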

The Transformer's key module, MSA, consists of k parallel self-attention (SA) heads:

$$[q, k, v] = z\, U_{qkv}, \qquad A = \mathrm{softmax}\!\left( q k^{\top} / \sqrt{D_h} \right), \qquad \mathrm{SA}(z) = A v$$

The output of the MSA is the concatenation of the k attention-head outputs:

$$\mathrm{MSA}(z) = [\mathrm{SA}_1(z);\ \mathrm{SA}_2(z);\ \dots;\ \mathrm{SA}_k(z)]\, U_{msa}$$
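
PyTorch's built-in nn.MultiheadAttention implements exactly this k-head projection and concatenation, so one encoder block can be sketched as follows. This is a minimal pre-norm block in ViT style, not the authors' code; the sizes match those listed in the experiments section below:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: MSA + MLP, each with pre-LayerNorm and a residual."""

    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # k parallel SA heads, concatenated
        return z + self.mlp(self.norm2(z))
```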

3.2 Loss Function

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j} e^{W_j^{\top} x_i + b_j}}$$

Softmax-based loss functions for face recognition remove the bias term and transform W_j^T x into s·cos θ_j; on top of this, [J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition] adds a large margin to the cos θ_{y_i} term. The margin-based Softmax loss can therefore be expressed as:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cdot f(\theta_{y_i})}}{e^{s \cdot f(\theta_{y_i})} + \sum_{j \neq y_i} e^{s \cos\theta_j}}$$

where f(θ_{y_i}) is cos(θ_{y_i} + m) for the additive angular margin (ArcFace) or cos θ_{y_i} − m for the additive cosine margin.

The additive cosine margin corresponds to CosFace [H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large margin cosine loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition].
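
A minimal sketch of the cosine-margin (CosFace-style) variant in PyTorch; the scale s=64 and margin m=0.35 are common illustrative defaults, not necessarily the paper's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmax(nn.Module):
    """Cosine-margin Softmax: the target-class logit becomes s * (cos(theta) - m)."""

    def __init__(self, dim, num_classes, s=64.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # cos(theta_j) = normalized class weight . normalized feature (bias term removed)
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        margin = F.one_hot(labels, cos.shape[1]) * self.m  # subtract m only at the target class
        return F.cross_entropy(self.s * (cos - margin), labels)
```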

4. Experiments and Visualization


For the ViT models, the number of layers is 20, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. For the Token-to-Token part of the T2T-ViT model, the depth is 2, the hidden size is 64, and the MLP size is 512; for its backbone network, the number of layers is 24, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. Note that "ViT-P10S8" denotes a ViT model with 10×10 patches and stride S=8, while "ViT-P8S8" indicates no overlap between tokens.
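
For reference, the token count implied by a "P{P}S{S}" name can be checked with a short helper. The 112×112 input crop is an assumption (a common choice in face recognition), as the text above does not spell the arithmetic out:

```python
def num_tokens(img_size=112, patch=10, stride=8, pad=0):
    """Tokens per side, squared, plus one class token."""
    per_side = (img_size + 2 * pad - patch) // stride + 1
    return per_side ** 2 + 1

print(num_tokens(patch=10, stride=8))  # ViT-P10S8: overlapping patches -> 170
print(num_tokens(patch=8, stride=8))   # ViT-P8S8: no overlap -> 197
```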


With the help of the Attention Rollout technique, the researchers analyzed how the Transformer model (ViT-P12S8 trained on MS-Celeb-1M) attends to face images and found that the Face Transformer model focuses on the face area, as expected.
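
Attention Rollout recursively multiplies the attention matrices of successive layers, adding the identity to account for residual connections. A minimal sketch follows; the head-averaging choice matches the original rollout formulation and may differ in detail from the paper's setup:

```python
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of (heads, N, N) attention matrices, one per layer.

    Returns an (N, N) matrix of accumulated attention from output to input tokens.
    """
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=0)                 # average over heads
        a = a + torch.eye(a.shape[-1])       # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout
```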

[Figures: (1) visualization of the attention maps at different layers; (2) attention distance of the attended area by head and network depth.]

[Figure: recognition performance of the Face Transformer model and ResNet100 as the occluded area increases]

The figure above compares how the recognition performance of the Face Transformer model and ResNet100 changes as the occluded area increases.
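
A minimal sketch of this kind of occlusion test, zeroing out a random square of the input. The placement policy and side lengths are assumptions, not the paper's exact protocol:

```python
import torch

def occlude(img, side):
    """Zero out a random side x side square in a (C, H, W) image."""
    _, h, w = img.shape
    top = torch.randint(0, h - side + 1, (1,)).item()
    left = torch.randint(0, w - side + 1, (1,)).item()
    out = img.clone()
    out[:, top:top + side, left:left + side] = 0
    return out
```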


ABOUT

Computer Vision Research Institute

The Computer Vision Research Institute works mainly in deep learning, focusing on research directions such as object detection, object tracking, image segmentation, OCR, model quantization, and model deployment. The Institute shares the latest papers and algorithm frameworks daily, provides one-click paper downloads, and shares practical projects. It emphasizes both technical research and practical deployment, sharing hands-on experience across different fields so that readers can move beyond theory and build the habit of programming and thinking for themselves.
