
Face recognition accuracy improvement | Transformer-based face recognition (with source code)

Author: Computer Vision Research Institute




Source code download | Reply "FT" to get the source code

Paper: https://arxiv.org/pdf/2103.14803.pdf


Face detection and recognition technology is by now quite mature, with well-established applications in many fields, such as unmanned supermarkets, station security checks, and suspect tracking. However, most of these applications rest on large amounts of training data, which is still very expensive to collect. Face recognition accuracy therefore still needs to improve, and we must keep looking for better face recognition frameworks.


1. Technical Review: Transformer

Compared with convolution, how does the Transformer differ, and what are its advantages?

  1. Convolution has strong inductive biases (e.g., local connectivity and translation invariance). While these are undoubtedly helpful on relatively small training sets, they limit the model's expressiveness once a sufficiently large dataset is available. Compared with CNNs, Transformers carry weaker inductive biases, which lets them represent a wider range of functions and makes them better suited to very large datasets;
  2. Convolutional kernels are designed to capture local spatiotemporal information and cannot model dependencies beyond their receptive field. Although stacking convolutions and deepening the network enlarge the receptive field, these strategies still aggregate information over short ranges and so limit long-range modeling. In contrast, the self-attention mechanism captures both local and global long-range dependencies by directly comparing features at all spatiotemporal locations (a minimal sketch of self-attention follows this list);
  3. When applied to long high-definition videos, training deep CNNs is very computationally intensive. Studies on still images have found that Transformer training and inference are faster than CNNs', so the same computing budget can be used to train better-fitting networks.
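
To make the self-attention comparison concrete, here is a minimal single-head sketch in PyTorch. It is illustrative only, not the paper's code (reply "FT" for that); the shapes and names are my own:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.

    x: (N, D) sequence of N token embeddings of dimension D.
    w_q, w_k, w_v: (D, D_h) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project tokens to queries/keys/values
    scores = q @ k.t() / k.shape[-1] ** 0.5  # (N, N): every token scores every other token
    attn = F.softmax(scores, dim=-1)         # attention weights sum to 1 per query
    return attn @ v                          # weighted sum of values: global receptive field

# Every output token mixes information from all N input tokens in one step,
# unlike a convolution, whose output only sees a local neighborhood.
x = torch.randn(196, 512)                    # e.g. 196 patch tokens, D = 512
w = [torch.randn(512, 64) for _ in range(3)]
out = self_attention(x, *w)                  # (196, 64)
```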

2. Overview

Recently, interest in the Transformer has been growing not only in NLP but also in computer vision. A natural question is whether the Transformer can be used for face recognition and whether it is better than CNNs.


Therefore, the researchers studied the performance of Transformer models on face recognition. Considering that the original Transformer may ignore inter-patch information, they modified the patch generation process so that tokens are produced by overlapping sliding patches. The models are trained on the CASIA-WebFace and MS-Celeb-1M databases and evaluated on several mainstream benchmarks, including the LFW, SLLFW, CALFW, CPLFW, TALFW, CFP-FP, AgeDB, and IJB-C databases. The researchers demonstrated that a Face Transformer model trained on the large-scale MS-Celeb-1M database achieves performance comparable to CNNs with a similar number of parameters and MACs.

3. Face Transformer

3.1 Network Framework

The Face Transformer model uses the ViT architecture [A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929], i.e., the original Transformer. The only difference is that the researchers modified ViT's token generation step to produce tokens with a sliding window, so that image patches overlap and inter-patch information is better preserved, as shown in the figure below.

[Figure: token generation from overlapping sliding patches]

Specifically, sliding patches with patch size P and stride S are extracted from the image X (with implicit zero-padding on both sides of the input), finally yielding a sequence of flattened 2-D patches Xp. Here (W, W) is the resolution of the original image and (P, P) is the resolution of each patch.
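
A minimal sketch of this overlapping tokenization in PyTorch, using torch.nn.Unfold. The 112×112 input size and the concrete P and S values are illustrative assumptions, not taken from the released code:

```python
import torch

def extract_overlapping_patches(img, patch_size=10, stride=8, padding=0):
    """Slide a P x P window with stride S over the image and flatten each window.

    img: (B, C, W, W) batch of face images.
    Returns: (B, N, P*P*C) sequence of flattened patches, where
    N = ((W + 2*padding - P) // S + 1) ** 2.
    """
    unfold = torch.nn.Unfold(kernel_size=patch_size, stride=stride, padding=padding)
    patches = unfold(img)            # (B, P*P*C, N)
    return patches.transpose(1, 2)   # (B, N, P*P*C)

img = torch.randn(2, 3, 112, 112)                  # assumed input resolution
tokens = extract_overlapping_patches(img, 10, 8)   # overlapping: P=10 > S=8
print(tokens.shape)                                # torch.Size([2, 169, 300])
```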

As in ViT, a trainable linear projection maps the flattened patches Xp to the model dimension D, producing the patch embeddings XpE. A class token, i.e., a learnable embedding x_class = z_0^0, is prepended to the patch embeddings; its state at the output of the Transformer encoder, z_L^0, serves as the final face image embedding, as shown below.

$$z_0 = [x_{\mathrm{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E]$$

Then, position embeddings are added to the patch embeddings to preserve positional information:

$$z_0 = [x_{\mathrm{class}};\ x_p^1 E;\ \dots;\ x_p^N E] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N+1) \times D}$$
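
Putting these pieces together, here is a minimal sketch of the embedding stage. The module and variable names are mine, not from the released code; the dimensions follow the equations above:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flattened patches -> linear projection -> prepend class token -> add position embedding."""

    def __init__(self, num_patches, patch_dim, dim=512):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                  # trainable projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos

    def forward(self, patches):                  # patches: (B, N, P*P*C)
        b = patches.shape[0]
        x = self.proj(patches)                   # (B, N, D): x_p E
        cls = self.cls_token.expand(b, -1, -1)   # (B, 1, D)
        x = torch.cat([cls, x], dim=1)           # z_0 = [x_class; x_p^1 E; ...; x_p^N E]
        return x + self.pos_embed                # + E_pos
```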

The Transformer's key module, MSA, consists of k parallel self-attention (SA) heads:

$$[q, k, v] = z\, U_{qkv}, \qquad A = \mathrm{softmax}\!\left( q k^{\top} / \sqrt{D_h} \right), \qquad \mathrm{SA}(z) = A v$$

The output of the MSA is the concatenation of the k attention-head outputs:

$$\mathrm{MSA}(z) = [\mathrm{SA}_1(z);\ \mathrm{SA}_2(z);\ \dots;\ \mathrm{SA}_k(z)]\, U_{msa}$$
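
PyTorch's built-in nn.MultiheadAttention implements exactly this k-head projection and concatenation, so one encoder block can be sketched as follows. This is a minimal pre-norm block in ViT style, not the authors' code; the sizes match those listed in the experiments section below:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: MSA + MLP, each with pre-LayerNorm and a residual."""

    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # k parallel SA heads, concatenated
        return z + self.mlp(self.norm2(z))
```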

3.2 Loss Function

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j} e^{W_j^{\top} x_i + b_j}}$$

Softmax-based loss functions for face recognition remove the bias term and transform W_j^T x into s·cos θ_j; on top of this, [J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition] adds a large margin to the cos θ_{y_i} term. The margin-based Softmax loss can therefore be expressed as:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cdot f(\theta_{y_i})}}{e^{s \cdot f(\theta_{y_i})} + \sum_{j \neq y_i} e^{s \cos\theta_j}}$$

where f(θ_{y_i}) is cos(θ_{y_i} + m) for the additive angular margin (ArcFace) or cos θ_{y_i} − m for the additive cosine margin.

The additive cosine margin corresponds to CosFace [H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large margin cosine loss for deep face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition].
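
A minimal sketch of the cosine-margin (CosFace-style) variant in PyTorch; the scale s=64 and margin m=0.35 are common illustrative defaults, not necessarily the paper's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginSoftmax(nn.Module):
    """Cosine-margin Softmax: the target-class logit becomes s * (cos(theta) - m)."""

    def __init__(self, dim, num_classes, s=64.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # cos(theta_j) = normalized class weight . normalized feature (bias term removed)
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        margin = F.one_hot(labels, cos.shape[1]) * self.m  # subtract m only at the target class
        return F.cross_entropy(self.s * (cos - margin), labels)
```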

4. Experiments and Visualization


For the ViT models, the number of layers is 20, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. For the Token-to-Token part of the T2T-ViT model, the depth is 2, the hidden size is 64, and the MLP size is 512; for its backbone network, the number of layers is 24, the number of heads is 8, the hidden size is 512, and the MLP size is 2048. Note that "ViT-P10S8" denotes a ViT model with 10×10 patches and stride S=8, while "ViT-P8S8" indicates no overlap between tokens.
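
For reference, the token count implied by a "P{P}S{S}" name can be checked with a short helper. The 112×112 input crop is an assumption (a common choice in face recognition), as the text above does not spell the arithmetic out:

```python
def num_tokens(img_size=112, patch=10, stride=8, pad=0):
    """Tokens per side, squared, plus one class token."""
    per_side = (img_size + 2 * pad - patch) // stride + 1
    return per_side ** 2 + 1

print(num_tokens(patch=10, stride=8))  # ViT-P10S8: overlapping patches -> 170
print(num_tokens(patch=8, stride=8))   # ViT-P8S8: no overlap -> 197
```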


With the help of the Attention Rollout technique, the researchers analyzed how the Transformer model (ViT-P12S8 trained on MS-Celeb-1M) attends to face images and found that the Face Transformer model focuses on the face area, as expected.
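
Attention Rollout recursively multiplies the attention matrices of successive layers, adding the identity to account for residual connections. A minimal sketch follows; the head-averaging choice matches the original rollout formulation and may differ in detail from the paper's setup:

```python
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of (heads, N, N) attention matrices, one per layer.

    Returns an (N, N) matrix of accumulated attention from output to input tokens.
    """
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=0)                 # average over heads
        a = a + torch.eye(a.shape[-1])       # account for the residual connection
        a = a / a.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = a if rollout is None else a @ rollout
    return rollout
```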

[Figures: (1) visualization of the attention maps at different layers; (2) attention distance of the attended area by head and network depth.]

[Figure: recognition performance of the Face Transformer model and ResNet100 as the occluded area increases]

The figure above compares how the recognition performance of the Face Transformer model and ResNet100 changes as the occluded area increases.
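
A minimal sketch of this kind of occlusion test, zeroing out a random square of the input. The placement policy and side lengths are assumptions, not the paper's exact protocol:

```python
import torch

def occlude(img, side):
    """Zero out a random side x side square in a (C, H, W) image."""
    _, h, w = img.shape
    top = torch.randint(0, h - side + 1, (1,)).item()
    left = torch.randint(0, w - side + 1, (1,)).item()
    out = img.clone()
    out[:, top:top + side, left:left + side] = 0
    return out
```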


ABOUT

Computer Vision Research Institute

The Computer Vision Research Institute works mainly in deep learning, focusing on research directions such as object detection, object tracking, image segmentation, OCR, model quantization, and model deployment. The Institute shares the latest papers and algorithm frameworks daily, provides one-click paper downloads, and shares practical projects. It emphasizes both technical research and practical deployment, sharing hands-on experience across different fields so that readers can move beyond theory and build the habit of programming and thinking for themselves.
