Xiao Cha, reporting from Aofei Temple | Official account QbitAI
Following GauGAN2, NVIDIA has launched PoE-GAN, a GAN that "stitches together" many kinds of input.
PoE-GAN accepts inputs in multiple modalities: text descriptions, image segmentation maps, sketches, and style reference images can all be converted into pictures.
It can also accept any combination of these input modalities at the same time, which is where the "PoE" in its name comes from.
PoE stands for "product of experts," a concept proposed by Hinton in 1999, in which each expert (an individual model) is defined as a probability distribution over the input space.
Each individual input modality is a constraint that the synthesized image must satisfy, so the set of images satisfying all constraints is the intersection of the constraint sets.
Assuming that each constraint's conditional distribution is Gaussian, the distribution over that intersection can be expressed as the product of the single-condition distributions.
Under this formulation, for the product distribution to have high density in a region, every individual distribution must have high density in that region, so every constraint is satisfied at once.
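The product-of-experts rule for Gaussians has a closed form: precisions (inverse variances) add, and the fused mean is the precision-weighted average of the expert means. The sketch below illustrates this for one-dimensional experts; the function name and example values are our own, not from the paper.

```python
def product_of_gaussian_experts(means, variances):
    """Combine independent Gaussian experts into a single Gaussian.

    The product of Gaussian densities is itself Gaussian:
    precisions (1/variance) add, and the mean is the
    precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    combined_var = 1.0 / total_precision
    combined_mean = combined_var * sum(p * m for p, m in zip(precisions, means))
    return combined_mean, combined_var

# Two experts: a sharp one centered at 0 and a broad one centered at 4.
mean, var = product_of_gaussian_experts([0.0, 4.0], [1.0, 4.0])
# Precisions 1.0 and 0.25 -> combined variance 0.8, combined mean 0.8.
```

Note how the sharper expert (smaller variance, larger precision) pulls the fused mean toward its own center, which is exactly the "intersection of constraints" intuition above.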
The core of PoE-GAN is how to blend the different inputs together.
Design of PoE-GAN
The PoE-GAN generator uses a Global PoE-Net to blend the different types of input.
Each modality is encoded into a feature vector, and the vectors are aggregated by the Global PoE-Net using a product of experts. The decoder not only consumes the Global PoE-Net's output, but also receives direct connections from the segmentation and sketch encoders when producing the output image.
The structure of the Global PoE-Net is as follows: a latent vector z is sampled from the product-of-experts distribution and then processed by an MLP to output a feature vector w.
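As a rough sketch of this fusion step, suppose each present modality's encoder outputs a Gaussian (mean and log-variance) over the latent space, and a standard-normal prior expert is always included so the model still works when modalities are missing. The PoE fuses them, a latent z is sampled by reparameterization, and an MLP maps z to w. All variable names and the single random linear layer standing in for the MLP are our own illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def poe_fuse(mus, logvars):
    """Fuse per-modality Gaussian latents (plus an N(0, I) prior expert)
    with a product of experts; returns the fused mean and variance."""
    # The prior expert keeps the product well-defined when modalities are absent.
    mus = [np.zeros_like(mus[0])] + list(mus)
    logvars = [np.zeros_like(logvars[0])] + list(logvars)
    precisions = [np.exp(-lv) for lv in logvars]
    var = 1.0 / sum(precisions)
    mu = var * sum(p * m for p, m in zip(precisions, mus))
    return mu, var

# Hypothetical encoder outputs for two present modalities (latent dim 8).
dim = 8
mus = [rng.normal(size=dim), rng.normal(size=dim)]
logvars = [rng.normal(size=dim), rng.normal(size=dim)]

mu, var = poe_fuse(mus, logvars)
z = mu + np.sqrt(var) * rng.normal(size=dim)  # reparameterized sample
W = rng.normal(size=(dim, dim))               # stand-in for the MLP
w = np.tanh(W @ z)                            # feature vector fed to the decoder
```

Dropping a modality simply removes its expert from the product, which is what lets a single generator handle any subset of inputs.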
In the discriminator section, the authors propose a multimodal projection discriminator that generalizes the projection discriminator to handle multiple conditional inputs.
Unlike a standard projection discriminator, which computes a single inner product between the image embedding and the conditional embedding, here an inner product is computed for each input modality and the results are summed to obtain the final loss.
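In a projection discriminator, the score is an unconditional term on the image plus an inner product with the condition's embedding; the multimodal version sums one such inner product per present modality. The sketch below shows only this scoring rule with made-up embeddings; the function name and numbers are illustrative, not from the paper.

```python
import numpy as np

def multimodal_projection_score(img_emb, cond_embs, unconditional_term):
    """Discriminator score: the unconditional image term plus the SUM of
    inner products between the image embedding and each present
    modality's conditional embedding (one product per modality)."""
    return unconditional_term + sum(float(img_emb @ c) for c in cond_embs)

img = np.array([1.0, 2.0, 0.5])                     # hypothetical image embedding
conds = [np.array([0.5, 0.0, 1.0]),                 # e.g. text embedding
         np.array([1.0, 1.0, 0.0])]                 # e.g. segmentation embedding
score = multimodal_projection_score(img, conds, unconditional_term=0.2)
# 0.2 + 1.0 + 3.0 = 4.2
```

Because absent modalities simply contribute nothing to the sum, the same discriminator scores any subset of conditions.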
A GAN whose inputs can be varied at will
PoE-GAN can generate images from single-modality input, multimodal input, or even no input at all.
When tested using a single input mode, PoE-GAN outperformed previous SOTA methods designed specifically for that mode.
For example, with segmentation input, PoE-GAN surpasses the earlier SPADE and OASIS models.
With text input, PoE-GAN outperforms the text-to-image models DF-GAN and DM-GAN+CL.
When conditioned on any subset of modalities, PoE-GAN can produce diverse output images. Below are random samples from PoE-GAN conditioned on pairs of modalities (text + segmentation, text + sketch, segmentation + sketch) on a landscape image dataset.
PoE-GAN can even take no input at all, in which case it becomes an unconditional generative model. Below are unconditional samples from PoE-GAN.
Team Introduction
The corresponding author of the paper is Ming-Yu Liu, a well-known NVIDIA researcher whose work focuses on deep generative models and their applications. Products such as NVIDIA Canvas and GauGAN were developed under his leadership.
The first author is Xun Huang, who earned his bachelor's degree from Beihang University (Beijing University of Aeronautics and Astronautics) and his Ph.D. from Cornell University, and now works at NVIDIA.
Paper: https://arxiv.org/abs/2112.05130
PoE: https://www.cs.toronto.edu/~hinton/absps/icann-99.pdf
Projection Discriminator: https://arxiv.org/abs/1802.05637