Invited Article丨Multimodal Visual Structure Learning

Author: Chinese Society of Artificial Intelligence

Text / Li Xi

0 Introduction

This article reviews our previous research on multimodal visual structure learning from a new perspective, focusing on the characteristics and applications of spherical panoramic images.

Spherical images are mostly fisheye or 360° panoramic images. They carry rich structural knowledge and are used mainly in applications such as autonomous driving, virtual reality, road monitoring, and interior decoration. Our goal is to model and structure a scene effectively in a very inexpensive, simple way.

However, spherical images are very difficult to analyze and use directly. In studying them, we therefore set the applications aside and look only at the mathematical reasoning and related problems, hoping to dissect them and formulate a well-defined academic problem. Current image computation methods are best suited to matrix operations, and image segmentation and detection are designed for such matrix (rectangular) images. The latest image generation methods also rely on knowledge of rectangular images: generation is modeled as local spatial perception between points that obey certain physical laws, so the process can be decoded in reverse, and decoding is a process of diffusion and denoising that forms a complete result.

A spherical image, however, is not a regular matrix, so the common practice is to unfold it. Unfolding raises several questions: at what angle to unfold, how the front and back of the sphere relate to each other, whether geometric properties are preserved, and how the density behaves. The density is relatively large near the equator and small at the poles, so the image produced by unfolding the sphere has extremely uneven density, which is very difficult for AI algorithms to handle.
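The uneven density can be made concrete with a small sketch. It assumes the sphere is unfolded with an equirectangular projection (the text does not name the projection), and `row_solid_angle` is a hypothetical helper introduced only for illustration. Every image row holds the same number of pixels, but the latitude circle it represents shrinks toward the poles, so the area of the sphere covered by one pixel changes from row to row.

```python
import numpy as np

def row_solid_angle(height: int, width: int) -> np.ndarray:
    """Solid angle (steradians) covered by one pixel in each row of an
    equirectangular image of size height x width (assumed projection)."""
    # Latitude edges of the rows, from +pi/2 (top) to -pi/2 (bottom).
    edges = np.linspace(np.pi / 2, -np.pi / 2, height + 1)
    # Area of a latitude band on the unit sphere: 2*pi*(sin(top) - sin(bottom)).
    band_area = 2 * np.pi * (np.sin(edges[:-1]) - np.sin(edges[1:]))
    # Each band is shared equally by `width` pixels.
    return band_area / width

omega = row_solid_angle(height=512, width=1024)
total_area = omega.sum() * 1024      # sums to the area of the unit sphere, 4*pi
ratio = omega[256] / omega[0]        # equator pixel area relative to a pole pixel
print(total_area, ratio)
```

A pixel in the equatorial rows covers far more of the sphere than a pixel in the polar rows, which is one way to quantify the imbalance that an algorithm trained on the unfolded image has to cope with.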

How to apply deep learning to spherical images is therefore a very interesting technical problem with broad application prospects.

In the course of our research, we find that the images we see are either natural or artificial. For example, when we watch food being prepared, we remember the process, turn it into knowledge, and eventually make a "delicious dish" ourselves. Along the way the brain forms a processing chain and a sequence, which finally becomes human cognition. Our perception of images, however, is in practice the perception of flat images. Today's image perception is based on planar images, which is a serious limitation. We may demand resolutions of 4K or 8K, but high resolution is not true perception: the retina of the human eye is round, so human perception is certainly not planar.

1 Research work on spherical images

1.1 SGAT4PASS: Spherical Geometry-Aware Transformer for Panoramic Semantic Segmentation

SGAT4PASS is a spherical geometry-aware Transformer for panoramic semantic segmentation, that is, Transformer-based image segmentation that uses geometric knowledge of the sphere (Li et al., 2023). The hardest question is how to encode the geometry into a deep network. The spherical image in Figure 1 is a warped, irregular mesh. A Transformer can process it, but the mesh has an s structure, not a simple patch or a natural AIP structure. When the sphere is unfolded, the two points marked by the small blue boxes in Figure 1 appear as two parallel points, although they actually lie on a sphere. This unfolding destroys the original geometry, which produces severe distortion and degrades image quality.

Figure 1 Unfolded grid of a spherical image
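How unfolding breaks spherical neighborhoods can be checked numerically. The sketch below assumes an equirectangular unfolding and uses two hypothetical helpers, `pixel_to_unit_vector` and `great_circle_angle`. It is an illustration of the geometry only, not part of SGAT4PASS.

```python
import numpy as np

def pixel_to_unit_vector(row, col, height, width):
    """Map the centre of an equirectangular pixel to a point on the unit sphere."""
    lon = (col + 0.5) / width * 2 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2 - (row + 0.5) / height * np.pi     # latitude in (-pi/2, pi/2)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def great_circle_angle(p, q):
    """Angle in radians between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)))

H, W = 512, 1024
left = pixel_to_unit_vector(256, 0, H, W)         # first column
right = pixel_to_unit_vector(256, W - 1, H, W)    # last column
centre = pixel_to_unit_vector(256, W // 2, H, W)  # middle column

edge_angle = great_circle_angle(left, right)      # 1023 columns apart in the image
centre_angle = great_circle_angle(left, centre)   # 512 columns apart in the image
print(edge_angle, centre_angle)
```

The two pixels on opposite image borders are almost the same point on the sphere, while the pixel only half an image width away lies on the far side of it. A patch grid built on the unfolded image has no way of knowing this unless the geometry is encoded explicitly.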

To solve these problems, we propose the framework shown in Figure 2, which works at three levels: the data level, the patch level, and the loss level.

First, SGA image projection. The sphere can rotate about three angles, α, β, and γ, so we rotate it by these angles and use the rotated images as data augmentation. Segmentation is then trained on the augmented images so that the model perceives the whole range of changes, which indicates that this knowledge has been learned.

Second, symmetry constraints. After the sphere is rotated by any angle and cut along a meridian, the two sides obey a symmetric structure. Symmetry here means the change Δ observed between the left and right halves of the sphere when it is cut along the meridian. Δ is a symmetry relation that reflects a structural change, so it is desirable to use symmetry knowledge in the model.

Third, pixel density. When the sphere shown in the lower right of Figure 2 is unfolded along the red and blue lines, the red-line region has the most pixels and the blue-line region has fewer. Regions with fewer pixels contribute less to learning, which creates an imbalance. We correct it by reweighting the pixels according to the latitude circumference.

Figure 2 The Framework
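The data-level and loss-level ideas can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the SGAT4PASS implementation: the image is assumed to be equirectangular, α, β, and γ are taken as rotations about the x, y, and z axes (the text does not say which axis each angle refers to), sampling is nearest-neighbor, and the weights are simply made proportional to the latitude circumference. The names `rotation_matrix`, `rotate_equirect`, and `latitude_weights` are hypothetical.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation about x by alpha, about y by beta, about z by gamma (radians)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rz @ ry @ rx

def rotate_equirect(image, alpha, beta, gamma):
    """Resample an equirectangular image (H, W) or (H, W, C) after rotating
    the sphere. Nearest-neighbour sampling keeps label maps valid."""
    h, w = image.shape[:2]
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lon = (cols + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (rows + 0.5) / h * np.pi
    xyz = np.stack([np.cos(lat) * np.cos(lon),
                    np.cos(lat) * np.sin(lon),
                    np.sin(lat)], axis=-1)              # (H, W, 3)
    # For each output direction v, the source direction is R^T v
    # (written here in row-vector form as v @ R).
    src = xyz @ rotation_matrix(alpha, beta, gamma)
    src_lon = np.arctan2(src[..., 1], src[..., 0])
    src_lat = np.arcsin(np.clip(src[..., 2], -1.0, 1.0))
    src_col = np.floor((src_lon + np.pi) / (2 * np.pi) * w).astype(int) % w
    src_row = np.floor((np.pi / 2 - src_lat) / np.pi * h).astype(int)
    return image[np.clip(src_row, 0, h - 1), src_col]

def latitude_weights(height):
    """Per-row weights proportional to the circumference of each latitude
    circle, normalised to mean 1 (one possible reading of the reweighting)."""
    lat = np.pi / 2 - (np.arange(height) + 0.5) / height * np.pi
    weights = np.cos(lat)
    return weights / weights.mean()

img = np.arange(8 * 16).reshape(8, 16)
same = rotate_equirect(img, 0.0, 0.0, 0.0)       # identity rotation
turned = rotate_equirect(img, 0.0, 0.0, np.pi)   # half turn about the vertical axis
row_w = latitude_weights(8)
```

In training, an image and its label map would be rotated with the same angles, and a per-pixel loss of shape (H, W) could be multiplied by `row_w[:, None]` before averaging.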

Table 1 and Figure 3 show that the proposed method improves both mIoU and PAcc, as well as the stability of its performance.

Table 1 Comparison of the performance of this method with SOTA

Figure 3 Performance stability
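For readers unfamiliar with the two metrics, the sketch below computes them from a confusion matrix on a toy example. It assumes that mIoU is the mean intersection-over-union across classes and that PAcc denotes per-pixel accuracy. The function names and the toy arrays are illustrative and are not taken from the paper's evaluation code.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_and_pacc(conf):
    """Mean IoU over classes that occur, and overall pixel accuracy."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0                    # skip classes absent from both maps
    miou = float((tp[present] / union[present]).mean())
    pacc = float(tp.sum() / conf.sum())
    return miou, pacc

target = np.array([[0, 0, 1], [1, 2, 2]])  # toy ground-truth labels
pred = np.array([[0, 1, 1], [1, 2, 0]])    # toy predictions
miou, pacc = miou_and_pacc(confusion_matrix(pred, target, num_classes=3))
print(miou, pacc)
```

Here the per-class IoU values are 1/3, 2/3, and 1/2, so the mIoU is 0.5, and four of the six pixels are correct, so the pixel accuracy is 2/3.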

Using this geometric knowledge also gives good qualitative results. In the original image in Figure 4, the sofa and the door, which is labeled together with the floor, are each cut in half by the unfolding. Image segmentation depends on the receptive field, so with an ordinary segmentation algorithm the receptive fields of the two halves of the door are not connected and the segmentation goes wrong: the result is incomplete and messy. Our method, in contrast, segments the sofa completely and recovers most of the door, because it takes the geometry into account and knows that the two halves are connected. It completes the topology and extends the receptive field across the cut, which gives a good result. The image can then be rotated arbitrarily, and the labels predicted after rotation remain roughly aligned with the original labels, so a geometry-aware result is obtained. Specifically, when the pitch/roll/yaw rotation is 5°/5°/180°, SGAT4PASS gives better results for the semantic classes "door" and "sofa" (see the red dotted boxes in Figure 4).

Figure 4 Visual comparison of SGAT4PASS and Trans4PASS+
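Why an object split by the image border is hard for a planar receptive field, and why a 180° yaw rotation reunites it, can be shown on a toy label map. The sketch assumes an equirectangular image, where a rotation about the vertical axis is a circular shift of the columns. `yaw_rotate` and `column_runs` are hypothetical helpers.

```python
import numpy as np

def yaw_rotate(label_map, degrees):
    """Rotate an equirectangular map about the vertical axis by shifting
    its columns circularly."""
    width = label_map.shape[1]
    shift = int(round(degrees / 360.0 * width))
    return np.roll(label_map, shift, axis=1)

def column_runs(mask_row):
    """Number of separate runs of True values in a 1-D mask (no wrap-around)."""
    padded = np.concatenate([[False], mask_row])
    return int(np.sum(~padded[:-1] & padded[1:]))

labels = np.zeros((4, 12), dtype=int)
labels[2, :2] = 1     # one half of a "sofa" touching the left border
labels[2, -2:] = 1    # the other half touching the right border

before = column_runs(labels[2] == 1)                   # pieces seen in the flat image
after = column_runs(yaw_rotate(labels, 180)[2] == 1)   # pieces after a 180 degree yaw
print(before, after)
```

In the flat image the object appears as two disconnected pieces, and after the rotation it is a single piece. A geometry-aware model should give consistent labels in both views.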

1.2 SphereDiffusion: Spherical Geometry-Aware Distortion-Resilient Diffusion Model

Having obtained good results on spherical image segmentation, we took the work further and turned to spherical image generation.

Spherical panoramic image generation faces two difficulties. First, because of spherical distortion, text-object pre-training knowledge cannot be used effectively and feature extraction is difficult, which leads to semantic bias. Second, existing models lack a geometry-aware design, so it is hard for them to learn and use spherical geometric features. We studied how to make a model learn and use these features to improve the quality of controllable spherical image generation.

Spherical image generation is the reverse of the segmentation process above (see Figure 5). We treat the generation task as a diffusion model: noise is added, the model learns to remove it, and the process is trained, iterated, and used for inference. The aim is to add spherical geometry to the model and then, guided by prompts, handle the final boundaries and reuse and integrate the existing knowledge. The key idea is to build this specific spherical geometry into the generation framework.

Figure 5 The spherical image generation process
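The noise-adding and noise-removing steps can be illustrated with the standard DDPM formulation, which is swapped in here as a generic stand-in: the text does not specify SphereDiffusion's noise schedule, so the linear schedule, the step count, and the function names below are all assumptions.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level alpha_bar_t for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def estimate_x0(xt, t, alpha_bar, eps_pred):
    """Invert the forward process given a prediction of the noise."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

rng = np.random.default_rng(0)
alpha_bar = make_schedule()
x0 = rng.uniform(-1, 1, size=(8, 16))          # stand-in for a small panorama
xt, eps = add_noise(x0, t=500, alpha_bar=alpha_bar, rng=rng)

# A trained network would predict eps from (xt, t, conditions). The true
# noise is used here only to show that the inversion is exact.
x0_rec = estimate_x0(xt, 500, alpha_bar, eps_pred=eps)
```

Training teaches a network to predict `eps`. At inference time the prediction is applied step by step, starting from pure noise.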

Figure 6 shows our core denoising framework, which has three basic modules. The first is the Spherical SimSiam Contrastive Learning module, which applies spherical rotation; a shared ControlNet condition is added to keep the results consistent. The second is the Deformable Distortion-aware Block (DDaB), which ensures that the region it covers is deformable. The third is Spherical Reprojection: at each generation step we deliberately rotate the map by a certain angle and then reproject it, which preserves rotation consistency throughout generation. In this way diffusion and geometric knowledge are fully connected, and very good results are obtained.

Figure 6 The denoising process for spherical image generation
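The contrastive objective behind a SimSiam-style module can be sketched as follows. This is the generic SimSiam loss (symmetric negative cosine similarity between a predictor output `p` and a target feature `z` from the other view), given here as an assumption about what the module optimizes. In the spherical setting the two views would be features of the same panorama under different rotations.

```python
import numpy as np

def negative_cosine(p, z):
    """Mean negative cosine similarity between the rows of p and z."""
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(-(p * z).sum(axis=-1).mean())

def simsiam_loss(p1, z1, p2, z2):
    """Symmetric SimSiam loss. In a deep learning framework z1 and z2 would
    be wrapped in a stop-gradient; NumPy has no gradients, so that step is
    only noted here."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 16))       # stand-in features for four samples

loss_same = simsiam_loss(feat, feat, feat, feat)         # identical views
loss_opposite = simsiam_loss(feat, feat, -feat, -feat)   # opposite views
print(loss_same, loss_opposite)
```

The loss reaches its minimum of -1 when the features of the two views agree, which is the sense in which it encourages rotation-consistent representations.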

As Table 2 shows, with the same hyperparameter settings and training epochs as the standard model for a fair comparison, our method substantially improves the FID, FIDs, and IS metrics, addresses the specific problem of panoramic image generation, and is better suited to applications.

Table 2 Comparison with existing methods on the Structured3D dataset

The end goal is generation from a text prompt, such as "A bedroom with white walls and a pink bed", which accompanies the segmentation. As Figure 7 shows, the results of our method are very good and controllable: the panoramic image is generated directly on the three-dimensional sphere, without first producing a planar two-dimensional image.

Figure 7 Image Generation Results

2 LayoutDiffusion: A controllable diffusion model for layout-to-image generation

Building on these results, we went further and brought planar image generation into the model. Layout is a kind of structural knowledge used in advertising design, and we want to make a planar layout for the sphere. Layout knowledge means controlling bounding boxes: their sizes, positions, and labels form a controllable map from which the image is generated in reverse. For example, we encode the layout, such as objects, locations, and coordinates, and then add a text prompt to generate the desired decoration design. In other words, we start from the design and generate an image that satisfies the original request, which is a decoding problem. The first step of decoding is to combine the position, size, and semantics of each box with the target and background structures. The most important result is that generation becomes better controlled. Unlike Midjourney, our generation is aimed at editing, because generating from scratch requires training on a large number of images. In an application, for example, you can simply drag a box to change the position and size of a specific object in the image. We have also built an interface for this, and it has been open sourced.
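The kind of layout encoding and box dragging described here might look like the sketch below. The `Box` fields, the normalized corner coordinates, the padding convention, and the helpers `encode_layout` and `drag` are all assumptions made for illustration. They are not LayoutDiffusion's actual data format or interface.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass(frozen=True)
class Box:
    label: int    # class id; 0 is reserved for padding (an assumption)
    x0: float     # corners in normalised [0, 1] image coordinates
    y0: float
    x1: float
    y1: float

def encode_layout(boxes, max_boxes=8):
    """Pack boxes into a fixed-size array of rows [label, x0, y0, x1, y1]."""
    layout = np.zeros((max_boxes, 5), dtype=np.float32)
    for i, b in enumerate(boxes[:max_boxes]):
        layout[i] = [b.label, b.x0, b.y0, b.x1, b.y1]
    return layout

def drag(box, dx, dy):
    """Move a box by (dx, dy) while keeping it inside the unit square."""
    dx = min(max(dx, -box.x0), 1.0 - box.x1)
    dy = min(max(dy, -box.y0), 1.0 - box.y1)
    return replace(box, x0=box.x0 + dx, y0=box.y0 + dy,
                   x1=box.x1 + dx, y1=box.y1 + dy)

bed = Box(label=3, x0=0.1, y0=0.5, x1=0.6, y1=0.9)
layout_before = encode_layout([bed])
layout_after = encode_layout([drag(bed, dx=0.3, dy=0.0)])
clamped = drag(bed, dx=1.0, dy=0.0)    # an over-long drag stops at the border
```

A generator conditioned on `layout_after` instead of `layout_before` would be asked to draw the same object in its new position, which is what makes this kind of interface suitable for editing.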

3 Referring Expression Comprehension Using Language Adaptive Inference

We want to take this idea deeper into the network. Doing so is cross-modal and requires natural language to resonate with the network's visual processing of structural knowledge. People looking at different pictures find different focal points depending on verbal cues; for example, a cue may direct attention to the adults or to the children, who have different visual characteristics. We therefore want language to generate an adaptive visual structure: the visual feature pathways activated by different expressions and prompts differ, just as different prompts activate different neural circuits in the human brain while the overall structure of the network stays the same. We want to build such a bionic, network-like structure.

In the language adaptive dynamic subnet framework shown in Figure 8, the expression is encoded with BERT to generate on/off vectors, such as for Blockbone and Christmas. The switch is a sigma filter. The Transformer, that is, the visual pathway, is then modeled with gate variables filtered by language features, which finally gives an adaptive subnet: different expressions select different subnets. Cross-modality is therefore a mapping between language features and the inference structure of the neural network itself, and this mapping makes adaptive control possible.

Figure 8 The Language Adaptive Dynamic Subnet Framework

Figure 9 shows the technical principle. After the features are generated, they pass through a fully connected (FC) layer; the resulting binary feature is applied to the feature map to form the gate vector, and Softmax normalizes the features to give the final features.

Figure 9 Principles of gated network technology
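A minimal sketch of language-driven gating follows. It assumes a sigmoid followed by a 0.5 threshold to obtain binary gates and residual blocks that are skipped when their gate is off. The text also mentions Softmax normalization, which is omitted here, and all names and weights are illustrative stand-ins rather than the published design.

```python
import numpy as np

def gate_vector(lang_feat, fc_weight, fc_bias, threshold=0.5):
    """Language feature -> FC layer -> sigmoid -> one on/off gate per block."""
    logits = fc_weight @ lang_feat + fc_bias
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs > threshold

def run_subnet(x, blocks, gates):
    """Execute only the blocks whose gate is on and record the path taken."""
    executed = []
    for i, (block, on) in enumerate(zip(blocks, gates)):
        if on:
            x = x + block(x)    # residual connection, so skipping is safe
            executed.append(i)
    return x, executed

# Four toy blocks and a hand-set FC layer (stand-ins for learned weights).
blocks = [lambda x, s=s: s * x for s in (0.1, 0.2, 0.3, 0.4)]
fc_weight = 4.0 * np.eye(4)
fc_bias = -2.0 * np.ones(4)
x = np.ones(3)

lang_a = np.array([1.0, 0.0, 1.0, 0.0])   # stand-in for one encoded expression
lang_b = np.array([0.0, 1.0, 1.0, 1.0])   # stand-in for another expression
_, path_a = run_subnet(x, blocks, gate_vector(lang_a, fc_weight, fc_bias))
_, path_b = run_subnet(x, blocks, gate_vector(lang_b, fc_weight, fc_bias))
print(path_a, path_b)
```

The two expressions activate different subsets of the same network, and the cost of a forward pass depends on how many gates are on.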

Figure 10 shows this more intuitively. The gray bars are layers that are skipped and not executed, so the execution path through the network is completely different for different images. Because the amount of computation is closely tied to both the model and the language input, we want different expressions to trigger different paths, which gives controllable, dynamic adaptation. This is the core idea.

Figure 10 Dynamic characteristics of REC

4 Language Adaptive Weight Generation for Multi-task Visual Grounding

The previous section described how language modulates which modules are executed. Going further, we studied generating the feature parameters directly from language, that is, language control: the controlled variables are the W parameters generated from natural language, such as the visual parameters F(l:W,A), F(l:W,A) and F(l:W,A) shown in Figure 11, which are then used to do the task or Cross. That is the core idea.

Figure 11 Technical Principle

The technical principle is to use language features to structure the image, use external methods to compute the query, key, and value, and finally generate the desired result.
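One way to read "weights generated from language" is sketched below: a linear generator maps a language feature to the query, key, and value projection matrices of a single attention layer applied to visual tokens. This is a simplified illustration under that assumption. The generator, its shapes, and the function names are hypothetical and are not the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def generate_weights(lang_feat, generator, dim):
    """Map a language feature to three (dim x dim) projection matrices."""
    flat = generator @ lang_feat               # shape (3 * dim * dim,)
    wq, wk, wv = flat.reshape(3, dim, dim)
    return wq, wk, wv

def language_conditioned_attention(visual_tokens, lang_feat, generator):
    """Self-attention over visual tokens whose Q/K/V weights come from language."""
    dim = visual_tokens.shape[1]
    wq, wk, wv = generate_weights(lang_feat, generator, dim)
    q, k, v = visual_tokens @ wq, visual_tokens @ wk, visual_tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    return attn @ v

rng = np.random.default_rng(0)
dim, lang_dim = 4, 6
generator = 0.1 * rng.standard_normal((3 * dim * dim, lang_dim))
tokens = rng.standard_normal((5, dim))         # five visual tokens
lang_a = rng.standard_normal(lang_dim)         # stand-ins for two expressions
lang_b = rng.standard_normal(lang_dim)

out_a = language_conditioned_attention(tokens, lang_a, generator)
out_b = language_conditioned_attention(tokens, lang_b, generator)
```

The same visual tokens are processed with different parameters for each expression, so the extracted features depend on what the language asks for.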

(References omitted)

Li Xi

Li Xi is Vice Dean and Professor at the Shanghai Advanced Research Institute of Zhejiang University. He is a recipient of the National Outstanding Young Scholar Award, an IET Fellow, and a National Leading Talent. He has led a Major Science and Technology Project of Science and Technology Innovation 2030 "New Generation Artificial Intelligence" of the Ministry of Science and Technology, a Key Project of the Joint Fund of the National Natural Science Foundation of China, and a Key Planning and Research Project of the Ministry of Education. He has published more than 180 papers in authoritative international journals and conferences, many of which are ESI highly cited papers. He has won the SAIL Award of the World Artificial Intelligence Conference, the International Conference Paper Award, the first prize of the Entrepreneurship and Innovation Award of the China Association of Inventions, the first prize of the Science and Technology Progress Award of the Ministry of Education, and the second prize of the CSIG Natural Science Award.

Excerpt from "Newsletter of Chinese Society of Artificial Intelligence"

Vol. 14, No. 2, 2024

Special topics on the frontiers of science and technology
