Cressy from Aofei Temple
QbitAI | WeChat official account QbitAI
A digital human that used to take two days to train can now be trained in just half an hour!
At inference time, it takes only about 16 milliseconds to render a frame with smooth motion and well-preserved detail.
And there is no need for complex capture or modeling: a casually shot video of 50-100 frames, just a few seconds of footage, is all it takes.
This is HUGS, a 3D digital human synthesis tool based on Gaussian splatting, released by Apple and Germany's Max Planck Institute.
It can extract a character's skeleton from an ordinary video, synthesize a digital avatar, and drive it to perform whatever motion you like.
The digital human can be blended seamlessly into other scenes, and the frame rate can even surpass that of the original footage, reaching 60 FPS.
After seeing it, Omar Sanseviero, Hugging Face's "Chief Llama Officer," sent HUGS a hug of his own.
So, what can HUGS achieve?
Generating 60 FPS video at 100x the speed
As you can see from the GIF below, the newly generated digital human can perform different actions in a scene that is different from the training material.
The newly synthesized footage is also much smoother than the source: even though the original video runs at only 24 FPS, the HUGS composite reaches 60 FPS.
At the same time, HUGS also supports merging multiple characters into the same scene.
In terms of detail, HUGS is also sharper and finer than the two previous SOTA methods, NeuMan and Vid2Avatar, and looks more realistic.
In side-by-side close-ups, the difference in detail between NeuMan and HUGS becomes even more obvious.
On the test data, HUGS reaches SOTA PSNR and SSIM scores across the five scenes of the NeuMan dataset, and its LPIPS error is the lowest.
On the ZJU-MoCap dataset, HUGS also surpasses baseline methods such as NeuralBody and HumanNeRF across 5 different subjects.
In terms of speed, HUGS training finishes in just half an hour, while the previously fastest method, Vid2Avatar, takes 48 hours, making HUGS nearly a hundred times faster.
The same goes for rendering speed: the baseline methods take 2-4 minutes to render a frame, while HUGS needs only 16.6 milliseconds, faster than a human blink. (The chart below uses a logarithmic scale.)
So, how does HUGS achieve the generation of 3D digital humans quickly and delicately?
Render like building blocks
HUGS first converts the human and the scene into two separate sets of 3D Gaussians (Gaussian "splats").
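For readers unfamiliar with 3D Gaussian splatting, each Gaussian carries a small set of optimizable parameters: a 3D center, an orientation, per-axis scales, an opacity, and color coefficients. Below is a minimal, generic illustration of that parameterization; the class and attribute names are our own and are not taken from Apple's implementation.

```python
import numpy as np

class Gaussian3D:
    """Illustrative container for one splat's parameters (standard 3DGS layout)."""
    def __init__(self):
        self.position = np.zeros(3)                      # center (mean) of the Gaussian in 3D
        self.rotation = np.array([1.0, 0.0, 0.0, 0.0])   # orientation as a quaternion (w, x, y, z)
        self.scale = np.ones(3) * 0.01                   # per-axis extent of the ellipsoid
        self.opacity = 0.5                               # alpha used during compositing
        self.sh_coeffs = np.zeros((16, 3))               # spherical-harmonic color coefficients

    def covariance(self):
        """Build the 3x3 covariance matrix Sigma = R S S^T R^T."""
        w, x, y, z = self.rotation
        # quaternion -> rotation matrix
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T

g = Gaussian3D()
print(g.covariance().shape)  # (3, 3)
```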
The human Gaussians are initialized from SMPL, a parametric human body model, and their properties are predicted by three multilayer perceptrons (MLPs).
With a very small number of parameters, SMPL maps a real person to a 3D mesh: just 10 principal shape parameters are enough to cover 99% of human body-shape variation.
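As a rough illustration of how the SMPL parameterization is used in practice, here is a minimal sketch built on the open-source smplx Python package; the model-file path is a placeholder, and this snippet is independent of HUGS itself.

```python
import torch
import smplx  # pip install smplx; the SMPL model files must be downloaded separately

# Load a neutral SMPL body model (the path below is a placeholder).
model = smplx.create("/path/to/smpl/models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)       # 10 shape parameters cover most body-shape variation
body_pose = torch.zeros(1, 69)   # axis-angle rotations of 23 body joints (23 * 3)
global_orient = torch.zeros(1, 3)

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)  # (1, 6890, 3): the posed SMPL mesh vertices
```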
At the same time, to capture details such as hair and clothing, HUGS allows the Gaussians to deviate from the SMPL surface to a certain extent.
The scene Gaussians, meanwhile, are encoded by their positions using a feature triplane and are likewise predicted by several MLPs.
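The triplane idea can be sketched roughly as follows: a 3D position is projected onto three axis-aligned feature planes, features are sampled bilinearly from each plane, combined, and decoded by a small MLP. This is a generic illustration of the technique with made-up sizes, not the paper's architecture.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TriplaneEncoder(nn.Module):
    """Generic triplane feature lookup + MLP decoder (illustrative sizes)."""
    def __init__(self, res=64, feat_dim=32, out_dim=16):
        super().__init__()
        # Three learnable 2D feature planes: XY, XZ, YZ.
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, xyz):
        # xyz: (N, 3) points assumed to be normalized to [-1, 1].
        coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
        feats = 0
        for plane, uv in zip(self.planes, coords):
            # grid_sample expects input (1, C, H, W) and grid (1, H_out, W_out, 2).
            grid = uv.view(1, -1, 1, 2)
            sampled = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)
            feats = feats + sampled.squeeze(0).squeeze(-1).T  # (N, feat_dim)
        return self.mlp(feats)

enc = TriplaneEncoder()
points = torch.rand(100, 3) * 2 - 1
print(enc(points).shape)  # torch.Size([100, 16])
```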
After obtaining the Gaussians for the human and the scene, the researchers optimize them jointly.
The resulting Gaussians are also cloned and split to increase their density and better approximate the true target geometry, a process known as densification.
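The clone-and-split step follows the densification heuristic from the original 3D Gaussian splatting work; the sketch below captures the gist, with thresholds and tensor layouts that are assumptions rather than the paper's exact values.

```python
import torch

def densify(positions, scales, grad_accum, grad_thresh=0.0002, scale_thresh=0.01):
    """Illustrative densification: clone small Gaussians with large positional
    gradients, split large ones into two smaller ones."""
    needs_densify = grad_accum.norm(dim=-1) > grad_thresh
    is_small = scales.max(dim=-1).values <= scale_thresh

    clone_mask = needs_densify & is_small    # under-reconstructed regions
    split_mask = needs_densify & ~is_small   # over-large Gaussians

    # Keep everything except the Gaussians being split (they get replaced),
    # and duplicate the cloned ones in place.
    new_pos = [positions[~split_mask], positions[clone_mask]]
    new_scale = [scales[~split_mask], scales[clone_mask]]

    # Split: replace each large Gaussian with two smaller ones near its center.
    for _ in range(2):
        offsets = torch.randn(int(split_mask.sum()), 3) * scales[split_mask]
        new_pos.append(positions[split_mask] + offsets)
        new_scale.append(scales[split_mask] / 1.6)
    return torch.cat(new_pos), torch.cat(new_scale)

pos = torch.rand(1000, 3)
scl = torch.rand(1000, 3) * 0.02
grads = torch.rand(1000, 3) * 0.001
new_pos, new_scl = densify(pos, scl, grads)
print(new_pos.shape, new_scl.shape)
```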
In addition, the researchers introduced linear blend skinning (LBS) to drive the Gaussians during motion.
After converting to the Gaussian representation, the researchers trained neural networks to predict the properties of the Gaussians so that they form a realistic human shape.
These networks also define how the Gaussians are bound to the human skeleton, which is what allows the character to move.
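Linear blend skinning itself is straightforward: each Gaussian center is transformed by a weighted combination of the skeleton's per-bone transforms. A minimal, generic sketch (not the paper's exact formulation) might look like this.

```python
import torch

def linear_blend_skinning(points, weights, bone_transforms):
    """
    points:          (N, 3)    Gaussian centers in the canonical (rest) pose
    weights:         (N, J)    per-point skinning weights over J bones (rows sum to 1)
    bone_transforms: (J, 4, 4) rigid transform of each bone for the target pose
    returns:         (N, 3)    posed Gaussian centers
    """
    N = points.shape[0]
    homo = torch.cat([points, torch.ones(N, 1)], dim=-1)           # (N, 4) homogeneous coords
    # Blend the bone transforms per point: (N, J) x (J, 16) -> (N, 4, 4)
    blended = (weights @ bone_transforms.reshape(-1, 16)).reshape(N, 4, 4)
    posed = torch.einsum("nij,nj->ni", blended, homo)               # apply per-point transform
    return posed[:, :3]

# Tiny usage example with random data.
J, N = 24, 5
pts = torch.rand(N, 3)
w = torch.softmax(torch.rand(N, J), dim=-1)
T = torch.eye(4).expand(J, 4, 4).clone()
print(linear_blend_skinning(pts, w, T))  # identity transforms return the input points
```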
In this way, rendering with HUGS is like assembling building blocks: the neural networks do not need to be re-run at render time, which is what enables such high-speed rendering.
Ablation experiments show that LBS, densification, and the triplane MLPs are all essential components of HUGS; removing any one of them degrades the synthesis quality.
The joint optimization of the human and the scene is also a key factor in achieving such a seamless blend between the two.
One More Thing
Apple has had its eye on digital humans for quite some time now.
Apple's MR headset, the Vision Pro, already features a high-end take on the digital-avatar concept:
During a FaceTime call, the headset can create a "digital human" of the user and use it to represent them on the call.
So, what do you think of Apple's "digital human generator"?
Paper:
https://arxiv.org/abs/2311.17910
Reference Links:
[1]https://appleinsider.com/articles/23/12/19/apple-isnt-standing-still-on-generative-ai-and-making-human-models-dance-is-proof
[2]https://twitter.com/anuragranj/status/1737173861756485875/
— END —