
The framework developed by the researchers can generate scene images based on textual descriptions of the spatial relationships of objects.
Author | Berry Tincture
Editor | Twilight
When describing a scene, humans usually describe the spatial relationships between objects. Biological visual recognition involves the interaction of top-down and bottom-up pathways, while deep neural networks simulate only the bottom-up one. The top-down pathway reflects the global, topological, and many-solutions character of biological visual perception; understanding an image, in particular, is mathematically a problem with infinitely many possible solutions. These characteristics may point to the next direction of improvement for deep neural networks.
"Visual scene understanding includes detecting and identifying objects, reasoning about the visual relationships between detected objects, and using statements to describe image areas." According to The Metaphor We Live, object relations are more fundamental than semantic relationships, because semantic relationships contain assumptions about object relationships.
In the picture below, two cats are "fighting" with each other while a third cat watches the commotion from the side. From this example, humans can clearly and directly capture each cat's location, behavior, and how the cats relate to one another. But many deep learning models cannot understand complex reality this way; they fail to capture and parse all of that information because they do not understand the "entangled" relationships between individual objects.
The problem, then, is that if these relationships remain unclear, it is difficult for a robot designed for the kitchen to carry out an instruction such as "pick up the iron pot on the stove below the cabinet to the left of the cutting board and stew the goose".
To enable robots to perform such tasks precisely, Shuang Li and Yilun Du from the Massachusetts Institute of Technology and Nan Liu from the University of Illinois at Urbana-Champaign collaborated on a model that understands the spatial relationships between objects in a scene. The model generalizes well, generating or editing complex images by combining the spatial relationships of multiple objects. The paper was accepted to NeurIPS 2021 as a Spotlight presentation.
Paper link: https://arxiv.org/abs/2111.09297
Overall, the study makes three main contributions:
1. It proposes a framework that decomposes and composes the relationships between objects, and by combining descriptions can generate and edit images according to the spatial relationships between objects, significantly outperforming the baseline approaches.
2. It can infer the underlying scene description of the objects in an image and understands semantic equivalence between descriptions. Semantically equivalent descriptions express the same scene in different ways, e.g. "the apple to the left of the banana" and "the banana to the right of the apple" (a small illustration follows this list).
3. Most importantly, by composing descriptions of object relations, the method generalizes to more complex relational descriptions it has never seen before.
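As a tiny illustration of the semantic equivalence in point 2 (not taken from the paper): mirrored relations describe the same scene, so equivalent (subject, relation, object) triples can be mapped to one canonical form before comparison. The relation names and the MIRROR table below are invented for this sketch.

```python
# Hypothetical sketch: canonicalize mirrored spatial relations so that
# semantically equivalent descriptions compare as equal.
MIRROR = {"right of": "left of", "behind": "in front of", "below": "above"}


def canonical(triple):
    """Map a (subject, relation, object) triple to a canonical form."""
    subj, rel, obj = triple
    if rel in MIRROR:                      # flip mirrored relations
        return (obj, MIRROR[rel], subj)
    return triple


# "banana right of apple" and "apple left of banana" describe the same scene.
assert canonical(("banana", "right of", "apple")) == ("apple", "left of", "banana")
assert canonical(("apple", "left of", "banana")) == ("apple", "left of", "banana")
```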
Such generalization could be applied to industrial robots performing complex multi-step manipulation tasks, such as stacking items in warehouses or assembling appliances. It would let machines learn from and interact with their environment more like humans do, and by continually decomposing and recombining what they learn, adapt quickly to new environments and new tasks.
Co-author Yilun Du said: "When we see a table, we do not express the positions of objects with the XYZ axes of a spatial coordinate system, because that is not how the human brain works. Our understanding of our surroundings is based on the relationships between objects. By building systems that understand these relationships, machines can manipulate and change the surrounding scene more effectively."
A single relationship at a time
The highlight of the researchers' framework is that it interprets the relationships between objects in a scene much the way a human would.
For example, take the text input: the wooden table is on the right side of the blue sofa, and the wooden table is in front of the wooden cabinet.
The system first splits the sentence into two parts, "the wooden table is on the right side of the blue sofa" and "the wooden table is in front of the wooden cabinet". It then models each of these spatial relationships separately as a probability distribution and merges the separate pieces through an optimization process to generate a complete and accurate scene image.
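A rough sketch of the decomposition step, assuming the clauses are joined by "and" (the helper name and string handling below are invented for illustration, not the authors' code):

```python
# Hypothetical helper (not the authors' code): split a scene description
# into single-relation clauses, assuming clauses are joined by "and".
from typing import List


def split_into_relations(description: str) -> List[str]:
    """Return one clause per object relation."""
    parts = description.replace(", and ", " and ").split(" and ")
    return [p.strip().rstrip(".") for p in parts if p.strip()]


description = (
    "the wooden table is on the right side of the blue sofa, and "
    "the wooden table is in front of the wooden cabinet"
)
for clause in split_into_relations(description):
    print(clause)
# the wooden table is on the right side of the blue sofa
# the wooden table is in front of the wooden cabinet
```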
Energy-Based Model
The researchers used energy-based models from machine learning to encode the spatial relationship of each pair of objects, then combined them like Lego bricks to describe the entire scene.
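A minimal PyTorch sketch of that composition idea, assuming a toy energy network and random relation embeddings: each relation contributes one energy term, the terms are summed, and an image is refined by noisy gradient descent (Langevin-style sampling) on the total energy. The RelationEBM class and all sizes here are placeholders, not the paper's architecture.

```python
# Toy sketch of composing per-relation energies and sampling an image with
# Langevin dynamics. RelationEBM and every size below are placeholders.
import torch
import torch.nn as nn


class RelationEBM(nn.Module):
    """Toy energy function E(image | relation embedding)."""

    def __init__(self, img_dim: int = 3 * 64 * 64, rel_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + rel_dim, 256), nn.SiLU(), nn.Linear(256, 1)
        )

    def forward(self, img: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([img.flatten(1), rel], dim=1)).squeeze(1)


def compose_and_sample(ebm, rel_embeddings, steps=60, step_size=0.5, noise=0.005):
    """Refine a random image so the summed energy over all relations is low."""
    img = torch.rand(1, 3 * 64 * 64, requires_grad=True)
    for _ in range(steps):
        energy = sum(ebm(img, r.unsqueeze(0)) for r in rel_embeddings).sum()
        (grad,) = torch.autograd.grad(energy, img)
        with torch.no_grad():
            img -= step_size * grad                # descend the composed energy
            img += noise * torch.randn_like(img)   # Langevin noise term
            img.clamp_(0.0, 1.0)
    return img.detach().view(3, 64, 64)


ebm = RelationEBM()
relations = torch.randn(2, 16)  # one embedding per parsed relation clause
sample = compose_and_sample(ebm, relations)
print(sample.shape)  # torch.Size([3, 64, 64])
```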
Co-author Shuang Li explains: "By recombining the descriptions between objects, the system gains good generalization ability and can generate or edit scenes it has never seen before."
Yilun Du added: "Other systems treat the relationships between the objects in a scene as a whole and generate the scene image in one pass from the text description. Once the scene description becomes more complex, such models cannot truly understand it or produce the desired images. We integrate separate, smaller models to capture more relationships, so that novel combinations can be produced."
The model can also be run in reverse. Given an image and several candidate descriptions, it can accurately pick out the description whose object relationships match the scene.
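Under the same assumptions as the sketches above (and reusing the hypothetical RelationEBM and split_into_relations defined there), this reverse use could look like scoring each candidate description by its summed energy on the image and choosing the lowest; embed_relation is a made-up stand-in for a learned text encoder.

```python
# Sketch of the reverse direction: lower total energy means the description
# fits the image better. Reuses RelationEBM / split_into_relations above.
import torch


def embed_relation(clause: str, dim: int = 16) -> torch.Tensor:
    """Hypothetical deterministic embedding: a hash-seeded random vector."""
    gen = torch.Generator().manual_seed(abs(hash(clause)) % (2 ** 31))
    return torch.randn(dim, generator=gen)


def best_description(ebm, image, candidates):
    """Pick the candidate description whose summed relation energy is lowest."""
    scores = []
    with torch.no_grad():
        for text in candidates:
            rels = [embed_relation(c) for c in split_into_relations(text)]
            energy = sum(ebm(image, r.unsqueeze(0)).item() for r in rels)
            scores.append(energy)
    return candidates[scores.index(min(scores))]


image = torch.rand(1, 3, 64, 64)  # stand-in for a real scene image
candidates = [
    "the wooden table is on the right side of the blue sofa",
    "the wooden table is behind the blue sofa",
]
print(best_description(ebm, image, candidates))
```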
Understanding complex scenes
In every case, the model proposed by Nan Liu et al. outperformed the baselines.
"Our model only saw one description of the relationship between objects during training, but in testing, when the description of the relationship between the objects increased to two, three, or even four, our model still worked well, and other machine learning methods failed."
As shown in the figure, the image editing results compare the classification performance of different methods on the CLEVR and iGibson datasets. The method in this paper is far better than the baselines, StyleGAN2 and StyleGAN2 (CLIP). The model also performs well on the 2R and 3R test subsets, showing that the proposed method generalizes well to relational scene descriptions outside the training distribution.
The researchers also asked human participants to judge how well the generated images matched the scene descriptions. In the most complex examples, which describe relationships among three objects, 91 percent of participants thought the model performed better than the other baselines.
The interactive demo on the project's code page shows clearly that the new model can still accurately generate the desired image even when many object positions are layered together.
OpenAI's DALL·E is another neural network that creates images from natural-language captions covering a wide range of concepts. But although DALL·E understands individual objects very well, it cannot accurately understand the relationships between them.
The new model proposed by Nan Liu et al. is notably robust, especially when handling scene descriptions it has never encountered before, where other algorithms are left far behind.
While these early experiments worked well, the researchers hope the model will go on to handle more complex real-world scenes, for example with noisy backgrounds and objects occluding one another. A further step would let robots infer the spatial relationships of objects from video and then use that knowledge to interact with objects in their surroundings.
Josef Sivic, a distinguished researcher at the Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University, said: "Developing the ability to understand the relationships between things and to recognize new things through continual composition is one of the most important open problems in computer vision. The results of their experiments are truly amazing."
About the authors
Nan Liu is a master's student at the University of Illinois at Urbana-Champaign. He graduated from the University of Michigan, Ann Arbor, in 2021 with a bachelor's degree in computer science, and currently works on computer vision and machine learning.
Shuang Li is a Ph.D. student at MIT CSAIL, advised by Antonio Torralba, with research focused on using language as a tool for communication and computation and on building agents that can continually learn and interact with the world around them.
Yilun Du is a Ph.D. student at MIT CSAIL, advised by Professors Leslie Kaelbling, Tomas Lozano-Perez, and Josh Tenenbaum. He is interested in building intelligent agents that can perceive and understand the world as humans do, and in building modular systems. He won a gold medal at the International Biology Olympiad.
Joshua B. Tenenbaum is a professor in the Department of Brain and Cognitive Sciences at MIT and a CSAIL researcher. He received his B.A. in Physics from Yale University in 1993 and his Ph.D. from MIT in 1999. Known for his contributions to mathematical psychology and Bayesian cognitive science, Tenenbaum was one of the first to develop and apply probabilistic and statistical modeling to the study of human learning, reasoning, and perception. In 2018, R&D Magazine named him Innovator of the Year, and in 2019 the MacArthur Foundation awarded him a MacArthur Fellowship.
Antonio Torralba is head of the Artificial Intelligence and Decision-Making faculty within MIT's Department of Electrical Engineering and Computer Science (EECS), a principal investigator at CSAIL, head of the MIT-IBM Watson AI Lab, and a 2021 AAAI Fellow. He received a degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and a Ph.D. in signal, image, and speech processing from the Institut National Polytechnique de Grenoble in 2000. He is an associate editor of the International Journal of Computer Vision and served as program chair of the Conference on Computer Vision and Pattern Recognition in 2015. He received the National Science Foundation CAREER Award in 2008, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition in 2009, and the J. K. Aggarwal Prize from the International Association for Pattern Recognition in 2010. In 2017, he received the Frank Quick Faculty Research Innovation Fellowship and the Louis D. Smullin Award for Excellence in Teaching.
Reference Links:
https://news.mit.edu/2021/ai-object-relationships-image-generation-1129
https://openai.com/blog/dall-e/
https://composevisualrelations.github.io/
https://arxiv.org/abs/2111.09297
Leifeng Network