With just one image and a camera trajectory, AI can fill in the surrounding environment | CVPR 2022

Standing at the doorway and taking a glance, the AI can fill in what the rest of the room looks like:

Doesn't it have a bit of an online VR house-tour feel?

And it is not limited to indoor scenes; long-range aerial fly-through shots are just as easy:

The rendered frames are all high-fidelity, as if they had been shot with a real camera.

Research on synthesizing 3D scenes from 2D images has been heating up recently, wave after wave.

But in many previous studies, the synthesized scene was confined to a relatively small space.

NeRF, which went viral earlier, for example, essentially expands the view around the main subject of the image.

The new development this time pushes the viewpoint further out, focusing on letting the AI predict views far from the input image.

Given a view of a room's doorway, for example, it can synthesize the scene after stepping through the door and walking down the corridor.

The paper on this work has been accepted to CVPR 2022.

Input a single frame and a camera trajectory

Having the AI infer what lies beyond a single image: doesn't this feel a bit like having an AI continue writing an article?

In fact, the researchers turned to the Transformer, a staple of NLP.

Using an autoregressive Transformer, they feed in a single scene image and a camera motion trajectory, and generate frames that correspond one-to-one with positions along the trajectory, producing a long-range, long-shot fly-through.

The whole process can be divided into two stages.

The first stage pre-trains a VQ-GAN that maps the input image to discrete tokens.

VQ-GAN is a generative model that pairs a codebook-based image autoencoder with a Transformer; its hallmark is that the generated images are very high-fidelity.

In this stage, the encoder compresses the image into a discrete representation, and the decoder maps that representation back to a high-fidelity output.
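
To make this concrete, here is a minimal sketch of the vector-quantization step at the heart of a VQ-GAN-style autoencoder. The encoder, decoder, codebook size, and shapes below are illustrative placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous encoder features to the nearest codebook entries (discrete tokens)."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                     # z: (B, C, H, W) encoder features
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)           # (B*H*W, C)
        # squared L2 distance from each feature vector to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        tokens = dist.argmin(dim=1)                           # discrete token ids
        z_q = self.codebook(tokens).view(b, h, w, c).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                          # straight-through gradient to the encoder
        return z_q, tokens.view(b, h, w)

# toy usage: a 256x256 image downsampled 16x yields a 16x16 grid of tokens
encoder = nn.Conv2d(3, 256, kernel_size=16, stride=16)           # placeholder encoder
decoder = nn.ConvTranspose2d(256, 3, kernel_size=16, stride=16)  # placeholder decoder
quant = VectorQuantizer()

image = torch.randn(1, 3, 256, 256)
z_q, tokens = quant(encoder(image))
recon = decoder(z_q)               # maps tokens back to an image (high-fidelity after real training)
print(tokens.shape)                # torch.Size([1, 16, 16])
```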

In the second stage, once the image has been turned into tokens, the researchers use a GPT-like architecture for autoregressive prediction.

During training, the input image and the starting position on the camera trajectory are encoded as modality-specific tokens, and a decoupled positional embedding (P.E.) is added.

The tokens are then fed into an autoregressive Transformer to predict the next image.

The model starts from the single input image and keeps extending its input by appending each newly predicted frame.
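
A rough sketch of what this rollout loop can look like is below. The camera encoding, positional embedding, and GPT-like backbone are stand-ins (`predict_frame_tokens`, `cam_emb`, and all shapes are assumptions for illustration), but the loop captures the idea: condition on the token history plus the next camera pose, predict the next frame's tokens one by one, append them, and repeat.

```python
import torch
import torch.nn as nn

# placeholder components (hypothetical, for illustration only)
VOCAB, D, TOKENS_PER_FRAME = 1024, 256, 16 * 16

tok_emb = nn.Embedding(VOCAB, D)                     # image-token embedding
cam_emb = nn.Linear(12, D)                           # flattened 3x4 camera pose -> one token
pos_emb = nn.Embedding(4096, D)                      # positional embedding (stand-in)
gpt = nn.TransformerEncoder(                         # stand-in for the GPT-like backbone
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(D, VOCAB)                           # predicts the next image token

def predict_frame_tokens(prev_tokens, cam_pose):
    """Autoregressively predict one frame's tokens given the history and the target camera pose."""
    tokens = prev_tokens
    for _ in range(TOKENS_PER_FRAME):
        x = tok_emb(tokens) + pos_emb(torch.arange(tokens.shape[1]))
        x = torch.cat([cam_emb(cam_pose).unsqueeze(1), x], dim=1)   # prepend the camera token
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        logits = head(gpt(x, mask=mask))[:, -1]                     # next-token distribution
        nxt = logits.argmax(dim=-1, keepdim=True)                   # greedy sampling for simplicity
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens

# rollout: start from the tokens of the single input frame, then extend frame by frame
frame_tokens = torch.randint(0, VOCAB, (1, TOKENS_PER_FRAME))       # from the VQ-GAN encoder
trajectory = [torch.randn(1, 12) for _ in range(3)]                 # camera poses along the path
for pose in trajectory:
    frame_tokens = predict_frame_tokens(frame_tokens, pose)         # history keeps growing
```

In a real system the predicted tokens for each frame would be decoded back to pixels by the first-stage VQ-GAN decoder.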

The researchers found that not every frame generated along the trajectory is equally important, so they also introduce a locality constraint that guides the model to focus more on key frames.

This locality constraint is introduced through the camera trajectory.

Given the camera poses corresponding to two frames, the model can locate where they overlap and determine where the next frame should lie.

To pull this together, they use an MLP to compute a "camera-aware bias."

This makes optimization easier and plays a crucial role in keeping the generated frames consistent.
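
One way to picture the camera-aware bias: a small MLP maps the relative camera pose between two frames to a value that is added to the attention logits, so tokens from overlapping viewpoints attend to each other more strongly. The sketch below follows that reading under assumed shapes and is not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CameraAwareBias(nn.Module):
    """MLP that turns a relative camera pose into an additive attention bias between two frames."""
    def __init__(self, pose_dim=12, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(pose_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, attn_logits, rel_pose):
        # attn_logits: (B, heads, Q, K) raw attention scores between two frames' tokens
        # rel_pose:    (B, pose_dim) relative camera pose, e.g. a flattened 3x4 relative transform
        bias = self.mlp(rel_pose).view(-1, 1, 1, 1)      # one scalar bias per frame pair (assumption)
        return attn_logits + bias                        # frame pairs with more overlap get boosted

# toy usage
bias_fn = CameraAwareBias()
logits = torch.randn(2, 8, 256, 256)                     # attention between two 16x16-token frames
rel_pose = torch.randn(2, 12)
attn = bias_fn(logits, rel_pose).softmax(dim=-1)         # bias shifts attention toward overlapping views
```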

Experimental results

The study ran experiments on the RealEstate10K and Matterport3D datasets.

The results show that the method produces higher-quality images than models that are not given a camera trajectory.

It also clearly outperforms methods that use discretized camera trajectories.

The authors also visualized the model's attention.

The visualization shows that more attention is allocated near the current position on the motion trajectory.

Ablation results on the Matterport3D dataset show that both the camera-aware bias and the decoupled positional embedding help improve image quality and frame-to-frame consistency.

Both authors are Chinese

Xuanchi Ren is an undergraduate student at the Hong Kong University of Science and Technology.

He has interned at Microsoft Research Asia and began working with Professor Xiaolong Wang in the summer of 2021.

Xiaolong Wang is an assistant professor at the University of California, San Diego.

He received his Ph.D. in Robotics from Carnegie Mellon University.

His research interests include computer vision, machine learning, and robotics, particularly self-supervised learning, video understanding, common-sense reasoning, reinforcement learning, and robotics.

Project page: https://xrenaa.github.io/look-outside-room/
