Invited Article丨Multimodal Visual Structure Learning

Text / Li Xi

0 Introduction

This article reviews our previous research on multimodal visual structure learning from a new perspective, focusing on the characteristics and applications of spherical panoramic images.

Spherical images are mostly fisheye or 360° panoramic images. They contain a great deal of structural knowledge and are used mainly in applications such as autonomous driving, virtual reality, road monitoring, and interior decoration. Our goal is to model and structure a scene effectively in a very inexpensive and simple way.

However, analyzing and applying images on a spherical structure is very difficult. In studying spherical images, we therefore set applications aside and look only at the mathematical reasoning and related problems, hoping to dissect them and formulate an academic problem. Current image computation methods are best suited to matrix operations, and image segmentation and detection are designed for matrix-shaped (rectangular) images. The latest image generation methods, for example, also rely on knowledge of rectangular images: generation is treated as local spatial perception between points that obey certain physical laws, so that the process can be decoded in reverse. Decoding is a process of propagation and denoising through which a complete result is formed. In practice, a spherical image is not a regular matrix and is hard to analyze directly, so the common practice is to unfold it. Unfolding raises several questions: at what angle to unfold, how the front and back are connected, whether the geometric properties are preserved, and how the density behaves. The density near the equator is relatively large and the density near the poles is small, so the image produced by unfolding the sphere has extremely uneven density, which is very difficult for AI algorithms to handle.
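The uneven density can be made concrete with a small sketch. It assumes an equirectangular unfolding (the article does not name the projection), in which every image row has the same number of pixels but represents a latitude circle whose perimeter shrinks toward the poles. The function names are illustrative only.

```python
import math

def row_latitudes(height):
    """Latitude (radians) of each pixel-row centre, from near +pi/2 (top) to near -pi/2 (bottom)."""
    return [math.pi / 2 - (r + 0.5) * math.pi / height for r in range(height)]

def row_perimeter_fraction(height):
    """Perimeter of each row's latitude circle relative to the equator, i.e. cos(latitude)."""
    return [math.cos(lat) for lat in row_latitudes(height)]

if __name__ == "__main__":
    # Rows near the middle (equator) cover almost a full great circle,
    # rows near the top and bottom (poles) cover only a small circle.
    for r, fraction in enumerate(row_perimeter_fraction(8)):
        print(f"row {r}: {fraction:.3f}")
```

Under this assumption, a row near the pole stretches a very short circle across the full image width, which is one way to read the density imbalance described above.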

Therefore, how to apply deep learning to spherical images, given these applications and prospects, is a very interesting technical problem.

In the course of our research, we find that the images we see are either natural or artificial. For example, when we watch food being prepared, we remember the process, form knowledge from it, and can eventually make a "delicious dish" ourselves. In this process the brain forms a processing chain and a sequence, which finally becomes human cognition. Our perception of images, however, is in practice the perception of flat images. Today's so-called image perception is based on planar images, which is a serious limitation. We may demand 4K or 8K resolution, but high resolution is not true perception: the retina of the human eye is curved, so human perception is certainly not planar.

1 Research work on spherical images

1.1 SGAT4PASS: Spherical Geometry-Aware Transformer for Panoramic Semantic Segmentation

SGAT4PASS is a spherical geometry-aware Transformer for panoramic semantic segmentation, i.e., a Transformer-based segmentation model that uses geometric knowledge of the sphere (Li et al., 2023). The most difficult problem is how to encode the geometry into the deep network. The unfolded spherical image in Figure 1 is a curved, irregular mesh. A Transformer can process it, but the mesh is a spherical structure, not a simple patch grid or a natural AIP structure. As a result, unfolding produces the two parallel points marked by the small blue boxes in Figure 1, although they actually lie on a sphere. This way of unfolding destroys the original geometry, which produces severe distortion and degrades image quality.

Figure 1 Spherical image of the expanded grid

To solve these problems, we propose the framework shown in Figure 2, which works at three levels: the data level, the patch level, and the loss level. First, SGA image projection. A sphere can rotate about three angles, α, β, and γ, so we rotate the image differently along these three dimensions as data augmentation. Segmentation is then trained on the augmented images so that the model can perceive the full range of changes, which indicates that this geometric knowledge has been learned. Second, symmetry constraints. After the sphere is rotated by any angle and cut along a meridian, the two sides obey a symmetrical structure. Symmetry here means observing the change between the left and right halves of the sphere when it is cut along the meridian, i.e., the change Δ in image knowledge. Δ is a symmetry relation that reflects a structural change, so it is desirable to model with symmetry knowledge. Third, pixel density. When the sphere in the lower right of Figure 2 is unfolded along the red and blue lines (the red-line area has the most pixels, and the blue-line area has fewer), the change in pixel density is used to reweight the pixels. Some regions have fewer pixels and others have more, so less knowledge is learned from the former, which creates an imbalance. We correct it by reweighting according to the latitude perimeter.

Figure 2 The Framework
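The data-level rotation and the loss-level reweighting can be sketched in a few lines. This is a minimal illustration, not the SGAT4PASS implementation: it assumes an equirectangular image stored as a list of rows, nearest-neighbour sampling, a yaw/pitch/roll rotation order chosen for convenience, and row weights proportional to cos(latitude), which is one reading of "reweighting according to the latitude perimeter". All function names are hypothetical.

```python
import math

def rotation_matrix(yaw, pitch, roll):
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def rotate_equirect(image, yaw=0.0, pitch=0.0, roll=0.0):
    """Rotate an equirectangular image (list of rows) on the sphere, nearest-neighbour."""
    height, width = len(image), len(image[0])
    rot = rotation_matrix(yaw, pitch, roll)
    out = []
    for r in range(height):
        lat = math.pi / 2 - (r + 0.5) * math.pi / height
        row = []
        for c in range(width):
            lon = (c + 0.5) * 2 * math.pi / width - math.pi
            # Unit direction of this output pixel on the sphere.
            d = (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))
            # The inverse rotation (transpose) gives the source direction to sample.
            s = [sum(rot[j][i] * d[j] for j in range(3)) for i in range(3)]
            src_lat = math.asin(max(-1.0, min(1.0, s[2])))
            src_lon = math.atan2(s[1], s[0])
            src_r = min(height - 1, int((math.pi / 2 - src_lat) / math.pi * height))
            src_c = int((src_lon + math.pi) / (2 * math.pi) * width) % width
            row.append(image[src_r][src_c])
        out.append(row)
    return out

def latitude_weights(height):
    """Row weights proportional to the latitude-circle perimeter, normalised to mean 1."""
    raw = [math.cos(math.pi / 2 - (r + 0.5) * math.pi / height) for r in range(height)]
    mean = sum(raw) / height
    return [w / mean for w in raw]

def weighted_mean_loss(per_pixel_loss):
    """Average a per-pixel loss map, weighting each row by its latitude weight."""
    height, width = len(per_pixel_loss), len(per_pixel_loss[0])
    weights = latitude_weights(height)
    total = sum(weights[r] * v for r in range(height) for v in per_pixel_loss[r])
    return total / (height * width)
```

In this sketch a pure yaw rotation reduces to a horizontal shift of the unfolded image, while pitch and roll move content across rows and change its distortion, which is why the three angles are worth augmenting separately.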

The results in Table 1 and Figure 3 show that the proposed method improves both mIoU and PAcc, as well as the stability of its performance.

Table 1 Comparison of the performance of this method with SOTA

Figure 3 Performance stability

Here we use geometric knowledge to produce some qualitative results. The original image in Figure 4, for example, shows a sofa and a door that, along with the labeled floor, are cut in half by the unfolding. Image segmentation depends on the receptive field, and with an ordinary segmentation algorithm the receptive fields covering the two halves of the door are not connected, so the segmentation is wrong and the result is incomplete and messy. With our method, the sofa is segmented completely and most of the door is recovered, because we take the geometry into account and know that the two parts are connected: we complete the topology and connect the receptive field across the cut. The image can also be rotated arbitrarily. After rotation, the predicted labels remain closely aligned with the original labels, so a geometry-aware result is maintained under rotation. Specifically, when the pitch/roll/yaw rotation is 5°/5°/180°, SGAT4PASS gives better results for the semantic classes "door" and "sofa" (see the red dotted boxes in Figure 4).

Figure 4 Visual comparison of SGAT4PASS and Trans4PASS+

1.2 SphereDiffusion: Spherical Geometry-Aware Distortion-Resilient Diffusion Model

Our work on spherical image segmentation achieved good results, so we went deeper and turned to spherical image generation.

Spherical panoramic images pose two difficulties. First, because of spherical distortion, text-object pre-training knowledge cannot be used effectively and feature extraction is difficult, which results in semantic bias. Second, existing models lack a geometry-aware design, which makes it difficult to learn and use spherical geometric features. We carried out the following research on how to make the model learn and use these features to improve the quality of controllable spherical image generation.

Spherical image generation is the reverse of the segmentation process above (see Figure 5). The generation task is handled by a diffusion model: noise is added to the image, the model learns to remove it in the denoising process, and training, iteration, and inference proceed in this way. Our aim is to add spherical geometry to the model and then, guided by prompts, handle the final boundaries and reuse and integrate the knowledge. The key idea is to build this particular spherical geometry into the generation framework.

Figure 5 The spherical image generation process
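The add-noise and remove-noise steps can be illustrated with the standard DDPM-style forward process. This is a generic sketch, not SphereDiffusion itself: the linear beta schedule, the step count, and the function names are assumptions, and a real model would predict the noise with a network instead of reusing the true sample.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(1, num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * e for v, e in zip(x0, eps)]
    return xt, eps

def estimate_x0(xt, eps_pred, alpha_bar):
    """Invert the forward step given a (predicted) noise sample."""
    scale = math.sqrt(alpha_bar)
    return [(v - math.sqrt(1.0 - alpha_bar) * e) / scale for v, e in zip(xt, eps_pred)]

if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = alpha_bar_schedule(1000)
    clean = [0.5, -1.0, 2.0]
    noisy, eps = add_noise(clean, alpha_bars[100], rng)
    # With the true noise, the clean signal is recovered up to rounding error.
    print(estimate_x0(noisy, eps, alpha_bars[100]))
```

Training amounts to teaching a network to predict `eps` from `xt`; the better that prediction, the closer `estimate_x0` gets to the clean image.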

Figure 6 shows our core denoising framework, which has several basic modules. The first is the Spherical SimSiam Contrastive Learning module, which applies spherical rotation. A ControlNet with shared conditions is added here to keep the results consistent. The second is the Deformable Distortion-aware Block (DDaB), which ensures that this interval is deformable. The third is Spherical Reprojection: at each generation step we deliberately rotate the map by a certain amount and then reproject it, which maintains rotation consistency throughout generation. In this way diffusion and geometric knowledge are fully connected, and very good results are obtained.

Figure 6 The denoising process for spherical image generation
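For readers unfamiliar with SimSiam, the loss it optimises is a symmetric negative cosine similarity between two views. The sketch below shows only that loss. It assumes that the two views are features of the same panorama under two spherical rotations, that `p` denotes predictor outputs and `z` denotes projector outputs, and that gradients through `z` are stopped during training (not modelled here, since the example has no autograd).

```python
import math

def negative_cosine(p, z):
    """D(p, z) = -cos(p, z); in training, z is treated as a constant (stop-gradient)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def simsiam_loss(p1, z1, p2, z2):
    """Symmetric SimSiam loss: 0.5 * D(p1, z2) + 0.5 * D(p2, z1)."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```

The loss reaches its minimum of -1 when the features of the two rotated views point in the same direction, which is the sense in which such a module would encourage rotation-consistent features.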

As the results in Table 2 show, with the same hyperparameter settings and training cycles as the standard module for a fair comparison, our method substantially improves the FID, FIDs, and IS metrics, solves the special problem of panoramic image generation, and is better suited to applications.

Table 2 Comparison with existing methods on the Structured3D dataset

The end result we want is generation from a text prompt, such as "A bedroom with white walls and a pink bed", paired with a segmentation. As shown in Figure 7, the results of our method are very good and controllable, and the panoramic image can be generated directly, that is, generated on the three-dimensional sphere without first producing an ordinary two-dimensional image.

Figure 7 Image Generation Results

2 LayoutDiffusion: A controllable diffusion model for layout-to-image generation

Building on these results, we continued in depth and brought flat image generation into the model. Layout knowledge is a structure familiar from advertising design, and we wanted to carry the work on the sphere over to flat layouts. Layout knowledge means controlling bounding boxes: their size and position labels serve as a controllable map from which images are generated in reverse. For example, we encode the layout, such as objects, locations, and coordinates, and then add a text prompt to generate the desired decoration design. The design is the input, and what we finally generate should satisfy the original requirement; that is what we want to decode. The first step is to combine the position, size, and semantics of each box with structures such as the target background. The most important result is that generation becomes better controlled. Our generation differs from Midjourney in that it is aimed more at editing, since generation itself requires training on a large number of images. In an application, for example, you can simply drag a box to change the size or position of a specific object in the image. We have also built an interface for this, and it has been open sourced.
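A layout of this kind can be represented very simply. The sketch below is not LayoutDiffusion's actual encoder: it assumes normalised box coordinates in [0, 1], an integer class id per object, and a fixed-length token list `[label, cx, cy, w, h]` padded with a hypothetical label of -1. `drag_box` mimics the drag-to-edit interaction described above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    label: int  # class id of the object (hypothetical numbering)
    x0: float   # left edge, normalised to [0, 1]
    y0: float   # top edge
    x1: float   # right edge
    y1: float   # bottom edge

def encode_layout(boxes, max_objects, pad_label=-1):
    """Encode boxes as fixed-length tokens [label, cx, cy, w, h], padded to max_objects."""
    tokens = []
    for box in boxes[:max_objects]:
        w, h = box.x1 - box.x0, box.y1 - box.y0
        tokens.append([box.label, box.x0 + w / 2, box.y0 + h / 2, w, h])
    while len(tokens) < max_objects:
        tokens.append([pad_label, 0.0, 0.0, 0.0, 0.0])
    return tokens

def drag_box(box, dx, dy):
    """Move a box by (dx, dy), clamped so that it stays inside the unit square."""
    dx = max(-box.x0, min(dx, 1.0 - box.x1))
    dy = max(-box.y0, min(dy, 1.0 - box.y1))
    return replace(box, x0=box.x0 + dx, y0=box.y0 + dy, x1=box.x1 + dx, y1=box.y1 + dy)

if __name__ == "__main__":
    bed = Box(label=3, x0=0.25, y0=0.5, x1=0.75, y1=1.0)
    print(encode_layout([bed], max_objects=2))
    print(encode_layout([drag_box(bed, 0.1, -0.2)], max_objects=2))
```

Editing then means re-encoding the modified boxes and regenerating the image under the new condition, with the rest of the layout left unchanged.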

3 Language adaptive inference for referring expression comprehension

We want to take this deeper into the network, which involves cross-modality and requires natural language to resonate with the network's visual processing of structural knowledge. That is, when looking at different pictures, different verbal cues direct attention to different focal points, for example adults or children, who have different visual characteristics. We therefore want language to generate an adaptive visual structure: different sentences and different prompts activate different visual feature pathways, just as neural circuits in the human brain switch according to different prompts while the overall structure of the network stays the same. We want to achieve such a bionic, network-like structure.

In the language adaptive dynamic subnet framework shown in Figure 8, the expression is encoded with BERT to generate on/off vectors for modules such as the backbone. The on/off switch is a sigma filter. The gates are then applied in the Transformer, that is, the visual pathway: the gate variables act as a language-driven feature filter and yield an adaptive subnet, so different expressions select different subnets. Cross-modality is therefore a mapping between the language features and the inference structure of the neural network itself, and this mapping makes adaptive control possible.

Figure 8 The Language Adaptive Dynamic Subnet Framework

Figure 9 shows the technical principle: after the language features are generated, an FC layer is applied, the result is binarized and applied to the feature map to form the gate vector, and Softmax normalizes the features to produce the final features.

Figure 9 Principles of gated network technology
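The gating idea can be sketched as follows. This is an inference-time simplification under stated assumptions, not the published network: a single FC layer maps a language feature to one logit per visual block, a sigmoid followed by a hard threshold gives a 0/1 gate, and gated-off blocks are skipped. The weights, threshold, and function names are all hypothetical, and training such gates would need a differentiable relaxation that is not shown.

```python
import math

def linear(x, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def language_gates(text_feature, weight, bias, threshold=0.5):
    """FC layer -> sigmoid -> hard 0/1 gate for each visual block."""
    logits = linear(text_feature, weight, bias)
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [1 if p > threshold else 0 for p in probs]

def run_gated_blocks(x, blocks, gates):
    """Apply only the blocks whose gate is 1; other blocks are skipped (identity)."""
    executed = []
    for index, (block, gate) in enumerate(zip(blocks, gates)):
        if gate:
            x = block(x)
            executed.append(index)
    return x, executed

if __name__ == "__main__":
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, -3.0]
    blocks = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 5]
    for feature in ([2.0, -2.0], [-2.0, 2.0]):
        gates = language_gates(feature, weight, bias)
        print(gates, run_gated_blocks(3, blocks, gates))
```

Two different language features switch on different blocks of the same network, so the executed path, and therefore the amount of computation, depends on the expression.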

Figure 10 shows the behavior more intuitively. The gray bars are skipped and not executed, so for different inputs the path executed through the network is completely different. Because the amount of computation is closely tied to the model and to the language input, we want different expressions to execute different paths, which gives controllable, dynamic adaptation. This is the core idea.

Figure 10 Dynamic characteristics of REC

4 Language Adaptive Weight Generation for Multi-task Visual Grounding

The section above introduced how language modulates which modules are executed. We further studied generating the feature parameters directly from language, that is, language control: the controlled variables are the W parameters generated from natural language, such as the visual parameters F(l:W,A), F(l:W,A) and F(l:W,A) shown in Figure 11, which are then used for the task or for Cross. That is the core idea.

Figure 11 Technical Principle

The technical principle is to use language features to structure the image: the query, key, and value are computed with externally generated parameters, and these finally produce the desired result.
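One way to picture language-generated weights is a small generator that maps a language feature to the projection matrices of an attention layer. The sketch below is an assumption-laden illustration, not the published model: the linear generators `gen_q`, `gen_k`, `gen_v`, the single attention head, and all function names are hypothetical.

```python
import math

def generate_weight(lang_feature, generator, rows, cols):
    """Map a language feature to a rows x cols weight matrix via a linear generator."""
    flat = [sum(g * v for g, v in zip(gen_row, lang_feature)) for gen_row in generator]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def project(tokens, weight):
    """Apply a weight matrix to every visual token (token @ weight^T)."""
    return [[sum(w * v for w, v in zip(row, token)) for row in weight] for token in tokens]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(s * v[i] for s, v in zip(scores, values)) for i in range(len(values[0]))])
    return out

def language_conditioned_attention(tokens, lang_feature, gen_q, gen_k, gen_v, dim):
    """Self-attention over visual tokens whose Q/K/V weights are generated from language."""
    wq = generate_weight(lang_feature, gen_q, dim, dim)
    wk = generate_weight(lang_feature, gen_k, dim, dim)
    wv = generate_weight(lang_feature, gen_v, dim, dim)
    return attention(project(tokens, wq), project(tokens, wk), project(tokens, wv))
```

With this design the visual tokens never change, but a different expression yields different query, key, and value projections, so the same image is structured differently for each sentence.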

(References omitted)

Li Xi

Vice Dean and Professor of the Shanghai Advanced Research Institute of Zhejiang University, recipient of the National Outstanding Young Scholar Award, IET Fellow, and National Leading Talent. He has led a Major Science and Technology Project of Science and Technology Innovation 2030-"New Generation Artificial Intelligence" of the Ministry of Science and Technology, a Key Project of the Joint Fund of the National Natural Science Foundation of China, and a Key Planning and Research Project of the Ministry of Education. He has published more than 180 papers in authoritative international journals and conferences, many of which are ESI highly cited papers. He has won the SAIL Award of the World Artificial Intelligence Conference, the International Conference Paper Award, the first prize of the Entrepreneurship and Innovation Award of the China Association of Inventions, the first prize of the Science and Technology Progress Award of the Ministry of Education, and the second prize of the CSIG Natural Science Award.

Excerpt from "Newsletter of Chinese Society of Artificial Intelligence"

Vol. 14, No. 2, 2024

Special topics on the frontiers of science and technology
