
NVIDIA AI abstracts concepts from images and generates new images, a skill that young children have finally learned

Author: Yige, from Aofei Temple

QbitAI | Official account QbitAI

What can a 2-year-old human child do that AI has only just learned?

As early as 2017, some netizens complained: once a 2-year-old child has seen a photo of a rhinoceros, they can recognize a cartoon rhino in other pictures across different poses, perspectives, and styles, but AI can't do it.


Now, scientists have finally cracked this problem!

The latest research shows that, given just 3-5 images, an AI can abstract the objects or styles they depict and then generate new, personalized images.


One netizen commented: "Very cool; this is probably the best project I've seen in the past few months."


How does it work?

Let's start with a few examples.

Upload 3 photos of a ceramic cat taken from different angles, and you might get four new images like these: two ceramic cats fishing on a boat, a ceramic-cat school bag, a Banksy-style cat, and a ceramic-cat-themed lunch box.


The same goes for artworks:


An armored figurine:


Bowl:


Beyond extracting objects from images, the AI can also generate new images in a specific style.

For example, in the figure below, the AI extracts the painting style of the input images and generates a series of new paintings in that style.


Even more impressive, it can combine two sets of input images: it extracts the object from one set and the style from the other, then merges the two to produce a new image.


Beyond that, this capability also lets you riff on some classic images, adding new elements to them.


So, what is the principle behind this seemingly magical capability?

Over the past two years, large-scale text-to-image models such as DALL·E, CLIP, and GLIDE have demonstrated strong text-guided generation abilities.

But there's a catch: when users bring specific needs, such as generating a new photo containing a favorite childhood toy, or turning a child's doodle into a work of art, these large-scale models struggle.

To address this challenge, the study takes a fixed, pre-trained text-to-image model and a small set of images (3-5 supplied by the user) depicting a concept, with the goal of finding a single word embedding that can reconstruct images of that concept. Since this embedding is discovered through an optimization process, the method is called "Textual Inversion".

Specifically, the object or style in the user's input images is first abstracted into a pseudo-word "S∗". This pseudo-word can then be treated like any other word, and personalized new images are generated from natural-language prompts containing "S∗", such as:

"A photograph of an S∗ on the beach", "an oil painting of an S∗ hanging on the wall", or "an S1∗ in the style of S2∗".
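To make the idea concrete, here is a minimal toy sketch of that optimization loop, written from scratch in NumPy rather than taken from the paper's code. All names (`generate`, `s_star`, the frozen linear map standing in for the pre-trained model) are hypothetical; the point is only that every weight stays frozen while gradient descent updates a single new word embedding until the generated "images" match the user's examples.

```python
import numpy as np

# Toy sketch of Textual Inversion (hypothetical; NOT the paper's code).
# Everything is frozen except the embedding of one new pseudo-word "S*".

rng = np.random.default_rng(0)
dim = 8

# Frozen components: existing word embeddings and a fixed linear "generator"
# that stands in for the pre-trained text-to-image model.
vocab = {"photo": rng.normal(size=dim), "of": rng.normal(size=dim)}
W_frozen = rng.normal(size=(dim, dim))

def generate(prompt_embeddings):
    """Toy 'image': the frozen map applied to the mean prompt embedding."""
    return W_frozen @ np.mean(prompt_embeddings, axis=0)

# The user's 3-5 concept images; here, noisy renders of a hidden target vector.
target_concept = rng.normal(size=dim)
images = [generate([vocab["photo"], vocab["of"], target_concept])
          + 0.01 * rng.normal(size=dim) for _ in range(4)]

# The new trainable pseudo-word embedding S* (the ONLY parameter we update).
s_star = rng.normal(size=dim)

def loss(s):
    prompt = [vocab["photo"], vocab["of"], s]
    return np.mean([np.sum((generate(prompt) - img) ** 2) for img in images])

lr, initial = 0.01, loss(s_star)
for _ in range(500):
    # Analytic gradient of the squared error w.r.t. s_star alone;
    # the factor 1/3 comes from the mean over the 3 prompt tokens.
    prompt = [vocab["photo"], vocab["of"], s_star]
    grad = np.zeros(dim)
    for img in images:
        err = generate(prompt) - img
        grad += 2.0 * (W_frozen.T @ err) / 3.0
    s_star -= lr * grad / len(images)

print(initial, loss(s_star))  # optimization drives the loss down
```

After optimization, `s_star` plays the role of "S∗": dropping it into a new prompt ("a photo of S∗ on the beach") would steer the frozen generator toward the learned concept, which is the core trick the paper exploits at scale with a diffusion model.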


It's worth noting that because the method works from a small, user-curated image set, stereotypes can be mitigated when generating images.

For example, in the figure below, when prompted with "doctor", other models tend to generate images of white men, while this model generates more images of women and people of other ethnicities.


The project's code and data are now open source; interested readers can check them out.

About the author

The paper comes from a team of researchers at Tel Aviv University and NVIDIA: Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or.

First author Rinon Gal is a computer science PhD student at Tel Aviv University, advised by Daniel Cohen-Or and Amit Bermano. His main research direction is generating 2D and 3D models under reduced supervision, and he is currently working at NVIDIA.


Reference Links:

[1]https://textual-inversion.github.io/

[2]https://github.com/rinongal/textual_inversion

[3]https://arxiv.org/abs/2208.01618

[4]https://twitter.com/_akhaliq/status/1554630742717726720

[5]https://rinongal.github.io/