
Virtual Humans: A Hand-Crafted Past Life and an AI Present Life

Foreword

This time the theme crosses a few boundaries: a three-part series exploring virtual humans, Unreal Engine, and how the two will affect independent film and television creation.

The three articles are planned as follows:

First, the hand-crafted past life and AI present life of virtual humans

Second, how Unreal became the 3D engine best suited to film and television

Third, the impact of virtual humans and Unreal Engine on independent film and television creation: the possible and the impossible

"Independent" here refers to individuals and independent studios with the ability to control.

Just like the once brilliant small and medium-sized computers were eventually replaced by micro-personal computers in the mainstream market position, seven or two may not be unable to dial thousands of pounds. Is it possible for independent creators, with the blessing of new technologies, to leverage the current film and television creation ecology?

And let's take it slowly.

The author prefers to define the scope of the discussion first, and this time is no exception: what kind of virtual person is discussed in this article?

"Virtual human" is the hot term of the moment, so it has become a catch-all basket: anything can be thrown in. Anime-style streamers, corporate AI customer-service agents, NPCs in the metaverse... all of them can call themselves virtual humans.

The real focus of this series is the "photorealistic" hyper-realistic virtual humans: the kind we expect to pass for real people, stand in for real actors to some extent, and take part in film and television creation in the future.

The eras of virtual humans

A four-stage division of virtual human history circulates widely online; who first proposed it is unclear. Let's follow that division and start with a brief review.

Stage One: the 1980s, the budding period, the first generation of virtual singers

Let's start with Lynn Minmay: widely recognized as the first virtual singer and the first virtual idol.

For anime fans, Lynn Minmay has a more familiar identity: the heroine of the classic 1980s Japanese anime Super Dimension Fortress Macross. Her signature song "Do You Remember Love" is sung at the climax of the 1984 theatrical film that condensed the plot of the TV series.

"Do You Remember Love" is broadcast across the entire battlefield as the fortress Macross launches its all-out attack on the invading alien legions. With human civilization facing a devastating blow, Minmay, humanity's number one singer, sings the song, rousing everyone's fighting spirit and even moving part of the alien forces to defect, turning the tide of the war in one stroke. Carried by Minmay's song, the male lead Hikaru Ichijo (also her ex-boyfriend...), piloting his variable fighter, finally breaks into the enemy's heart and deals the alien supreme leader a fatal blow.

This six-minute live performance in the middle of an interstellar war, battle and song interwoven with the utmost grandeur and romance, left a classic scene that animation history has never surpassed. You could say that after Minmay's song, no virtual singer in animation has topped it: her debut was the peak.

Lynn Minmay was the beginning of the virtual idol. The animation studio released records under her image, and a virtual human stepped into the real world for the first time. Nearly 40 years on, Minmay's image is still deeply rooted in people's hearts.

By today's production standards, the pure hand-drawn animation that peaked in those years has long since given way to 3D, but the reasons Minmay is still talked about have nothing to do with technology; they rest entirely on her characterization and the magnificent story behind the symbol of the "song of the cosmos".

That makes Minmay a fitting starting point for this series:

Lynn Minmay is a virtual idol who emerged from film and television. Her lesson is this: beyond technical support, a successful virtual idol must, above all, have a soul given by the work.

Stage Two: the early 21st century, the exploratory period, trials in film and entertainment

Now jump ahead to 2000.

(This time jump alone shows just how remarkable the original songstress Lynn Minmay was...)

During this period, virtual humans finally shed the limitations of traditional hand-drawing, and the first generation of 3D virtual idols began to appear. The most representative is Hatsune Miku, who debuted in Japan in 2007.

At this stage, virtual idols were all simple anime-style figures, which suited the level of 3D CG at the time: complex realism was out of reach, while simple anime-style imagery was just right.

Interestingly, Hatsune Miku actually began as singing-synthesis software. Crypton Future Media built a voice library on Yamaha's Vocaloid synthesis engine and released it as a series of virtual singers, and Hatsune Miku entered the public eye.

China's counterpart, the virtual songstress Luo Tianyi, modeled on Hatsune Miku, appeared in 2012.

It is worth noting that the key to the growth of virtual singers has proved to be fan-created UGC (user-generated content).

After Hatsune Miku's release, large numbers of cover songs appeared on Japanese forums. The company then opened up derivative-work rights to encourage fan creation, and because Miku herself had only a thin official backstory, a flood of fan-made art, audio, and video appeared on Japanese UGC sites, even giving rise to "master-level" fan creators. This greatly enriched Miku's content, pushed her fan base far beyond the core circle, and produced one viral, ear-worm hit after another.

China's Luo Tianyi followed a similar path: after several years of lackluster PGC (professionally generated content) operation, the company began encouraging UGC, fan works poured in, and Luo Tianyi finally found solid footing as a virtual idol.

In the same period, a famous virtual character also appeared in film: Gollum (more accurately a virtual creature) in The Lord of the Rings trilogy, which began in 2001.

Gollum was created entirely with motion capture and CG, and the unprecedented combination of motion capture and 3D CG compositing amazed the world. Motion-capture-driven 3D characters are commonplace on screen now, but at the time Gollum caused a sensation in the film industry.

In 2009, the landmark Avatar reached another height: performances captured end to end with motion capture, and an entire virtual world built with CG. It remains a benchmark.

Let's dig a little deeper: why did this first wave of virtual human development start around 2000? What happened at that point in time?

In 1999, NVIDIA released the iconic GeForce 256, the first of its chips to support hardware T&L (transform and lighting), functions that are particularly important in 3D graphics.

Rendering 3D graphics boils down to large numbers of coordinate transformations and lighting calculations. Before the GeForce 256, all of that work fell to the CPU; once the graphics chip could handle T&L, the CPU was freed from the heavy graphics workload.

From then on, graphics chips could genuinely be called GPUs, standing alongside CPUs.
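To make "transform and lighting" concrete, here is a minimal numpy sketch of what that stage computes for a single vertex: a matrix transform followed by a simple Lambert diffuse term. It is only an illustration of the math the GeForce 256 moved into hardware, with an identity matrix standing in for a real camera; it is not how an actual GPU pipeline is implemented.

```python
# A minimal sketch of the "T&L" stage for one vertex: transform it with a
# model-view-projection matrix, then shade it with Lambert diffuse lighting.
import numpy as np

def transform_and_light(vertex, normal, mvp, light_dir, base_color):
    """Project one vertex and shade it with a clamped-cosine diffuse term."""
    v = mvp @ np.append(vertex, 1.0)         # transform in homogeneous coordinates
    ndc = v[:3] / v[3]                       # perspective divide -> device coordinates

    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    diffuse = max(float(np.dot(n, l)), 0.0)  # Lambert: cosine of normal/light angle
    return ndc, base_color * diffuse

mvp = np.eye(4)                              # identity stands in for a real camera matrix
position, color = transform_and_light(
    np.array([0.5, 0.2, -1.0]), np.array([0.0, 0.0, 1.0]),
    mvp, np.array([0.3, 0.5, 1.0]), np.array([1.0, 0.8, 0.7]))
```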

The rapid advance of 3D acceleration on personal computers put 3D graphics within reach of the public and enabled the first wave of 3D virtual humans. The market was still small, key technologies such as motion capture and CG were immature, and resources were limited, so this period is best seen as the entertainment industry's trial run with virtual humans.

Stage Three: 2016-2020, the growth period, technology breakthroughs and wider applications

The years from 2016 to 2020 are generally treated as the third stage of virtual human development.

Which virtual humans appeared in these years? Honestly, not many are worth remembering, but one must be mentioned: the birth and rise of the world's first virtual YouTuber, Kizuna AI.

On December 1, 2016, the YouTube channel "A.I.Channel" opened, and Kizuna AI became the first video creator to call herself a virtual YouTuber, establishing the VTuber concept and opening the era of anime-style virtual humans.

Within three months of her debut she passed 200,000 YouTube subscribers; by July 15, 2018, her main channel had passed 2 million. By 2022, her main channel and gaming channel together counted more than 4 million subscribers.

However, on February 26 of this year, Kizuna AI held the online concert "hello, world 2022" and then announced an "indefinite hiatus". The original virtual YouTuber had completed a full career cycle as a performer.

There is nothing new behind the rise and fall of virtual stars compared with real ones: waning fan attention, a fragmenting fan base, the operator's missteps, and so on.

Virtual idols, too, have ended up following the persona-driven path of show business.

And what was happening on the technical side in those years?

In 3D, modeling, rendering, and motion capture were all maturing; these were also the years of explosive growth in deep-learning-based artificial intelligence.

In March 2016, AlphaGo, the UK-developed Go program built on deep learning, defeated top professional Lee Sedol 4:1, becoming the first computer Go program to beat a nine-dan professional. As a landmark event, it pushed artificial intelligence into the public eye and set off the AI boom.

At this stage, AI also began to be applied to virtual humans, mainly by pairing AI voice capabilities with service-oriented avatars, such as the AI news anchor launched by Sogou and Xinhua News Agency in 2018 and the digital employee "Xiaopu" jointly developed by Shanghai Pudong Development Bank and Baidu in 2019.

Stage Four: 2020 to the present, a new era and a new atmosphere

In the past two years, "virtual humans" have become a hot topic, and in the current context the term mostly refers to so-called photorealistic, hyper-realistic virtual humans.

There are probably several reasons the bar for a "virtual human" has risen so high:

Hardware and software have finally advanced far enough to support this level of fidelity;

Users' tastes have grown pickier, and high-profile examples have raised public expectations;

Capital, too, needs the virtual human industry to tell a new story.

At the end of the day, what matters most is that users want a virtual human experience that is indistinguishable from the real thing. Hyper-realistic virtual characters are more immersive and connect more naturally to real-world commercial scenarios.

The virtual human boom became increasingly obvious in 2021, with internet companies across different fields trying to build virtual human businesses:

In May 2021, the hyper-realistic digital human AYAYI launched; she currently has about 126,000 followers on Xiaohongshu and 83,000 on Douyin.

In June 2021, Bilibili announced that more than 32,000 virtual streamers had started broadcasting on the platform in the previous year, making virtual streamers the fastest-growing category in its live streaming. The new generation of virtual streamers is more diverse and more down to earth.

On November 18, 2021, NVIDIA launched Omniverse Avatar, a platform for generating interactive avatars, and CEO Jensen Huang demonstrated "Toy-Me", a miniature toy version of himself generated on the platform that could hold natural question-and-answer conversations.

At the New Year's Eve galas on December 31, 2021, several mainstream satellite TV channels introduced virtual human elements, the most striking being "The Story of a Small Town" performed by Zhou Shen in a duet with a virtual "Teresa Teng".

Whatever the final quality of each program, the fact that multiple virtual idols landed on mainstream New Year's Eve galas at the same time says a great deal by itself.

The hottest virtual idol in China right now, though, has to be Liu Yexi.

On October 31, 2021, the virtual beauty influencer Liu Yexi released her first video on Douyin; it shot onto the trending list and gained her millions of followers. Her Douyin following has now passed 9 million and is still growing fast.

Such ferocious follower growth is eye-catching, but behind the rosy picture Liu Yexi has one small weakness: she publishes videos relatively infrequently. That is dictated by the production threshold and turnaround time of hyper-realistic virtual human video.

The team behind Liu Yexi numbers more than a hundred people, two-thirds of them in content creation. Even for such a strong, professional team, each 3-4 minute high-quality video takes roughly a month. That is close to the current limit, and such a long production cycle can easily make a virtual idol miss a traffic-boosting moment.

Imagine if Liu Yexi's video output were ten times faster.

If a technology allowed the team to put out a Liu Yexi-grade short drama every three days, the film and television industry might have to be rewritten.

This will not happen overnight, but the evolution of computer technology is rapidly lowering the bar for virtual human content production. That day may come sooner than we think.

Next, let's review how technology has propelled virtual humans to where they are today.

The evolution of virtual human production methods

Continuing from the previous section, let's first look at what it currently costs to produce a virtual human.

According to industry figures, creating a chibi-style or anime-style virtual idol currently costs around 100,000 yuan;

a virtual idol in an idealized realistic style runs to around 400,000 yuan;

and a hyper-realistic virtual human on the level of Liu Yexi is said to carry an industry price in the millions.

The per-minute cost of virtual human animation likewise ranges from tens of thousands of yuan to nearly a million.

Costs like these are far beyond independent content creators. Is there any room for independents here? How cool it would be if everyone could freely create their own hyper-realistic virtual human.

Let's dig into each link of the virtual human production chain and find out.

Creating a lifelike 3D likeness: the 3D sculptor's knife, camera arrays and light field reconstruction, face-pinching games

The 3D sculptor's knife

Creating a virtual human's appearance is known in the jargon as modeling.

The most traditional way to build a 3D model is by hand.

Much as in the real world, the sculpting tools of the 3D world, such as ZBrush, carve the model out bit by bit. In essence this is the handiwork of computer 3D artists.

The imaginative, detail-packed monster characters of the game world are in fact slowly polished by hand once the concept art is set.

The craft of 3D artists brings characters that exist only in the imagination to life before the audience.

Camera array scanning and dynamic light field reconstruction

Unfortunately, artists' hands alone cannot meet the efficiency and volume that industrial production requires, and hand-building high-precision virtual human models is very expensive.

To make 3D virtual human production more accessible, clever people have kept exploring more efficient ways to produce models.

The most direct idea is 3D scanning.

There are currently two main 3D scanning methods: camera array scanning and dynamic light field reconstruction.

Two somewhat unfamiliar terms; let's look at them one by one.

First, camera array scanning. What is this technology?

In March 2021, Epic announced on its website that it had acquired a company called Capturing Reality.

This is a photogrammetry company. They developed a rather magical piece of software, RealityCapture: walk around an object taking photos with a phone, feed all the photos into the software, and it can generate a 3D model of the object.

Photogrammetry is currently the industry's mainstream solution for face modeling. A user can complete a passable model scan with nothing more than a camera, or invest the effort to build a camera-and-light array for high-precision capture and reconstruction.

The principle is easy to grasp: the method matches the same feature points across different photos to reconstruct 3D space. Photo resolution, control of the camera's intrinsic and extrinsic parameters, evenness of the lighting on the face, and similar factors therefore all affect the final model quality, and a fairly controlled shooting environment is needed. Specialized scanning providers have emerged in China to take on a large amount of film and television work.
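For readers who want to see the feature-point idea in code, below is a minimal two-view sketch using OpenCV: detect and match features, recover the relative camera pose, and triangulate a sparse 3D point cloud. The image file names and the intrinsics matrix K are assumptions for illustration only; RealityCapture's actual pipeline is proprietary and far more elaborate (many views, dense reconstruction, meshing, texturing).

```python
# Two-view photogrammetry sketch: match feature points across photos,
# recover relative camera pose, and triangulate a sparse 3D point cloud.
import cv2
import numpy as np

img1 = cv2.imread("face_view_1.jpg", cv2.IMREAD_GRAYSCALE)   # assumed input photos
img2 = cv2.imread("face_view_2.jpg", cv2.IMREAD_GRAYSCALE)
K = np.array([[1200., 0., 960.], [0., 1200., 540.], [0., 0., 1.]])  # assumed intrinsics

orb = cv2.ORB_create(5000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Relative pose from the essential matrix, then triangulate the matched points.
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
points3d = (pts4d[:3] / pts4d[3]).T     # sparse cloud; a real pipeline densifies and meshes it
```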

Although the method is relatively simple, it has limits on fine detail. Most obviously, photogrammetry falls short when reconstructing the skin detail of a virtual human.

Push the camera in on the virtual face and the character's flat skin immediately gives the game away.

That may be acceptable if the virtual human only streams or appears in games; but for a film-grade hyper-realistic virtual human, facial close-ups are unavoidable, and skin realism cannot be sidestepped.

So could an artist take the photogrammetry-reconstructed model and add realistic skin detail by hand? The answer is no. However superb their skills, real skin stands before 3D artists like a peak that is hard to summit.

Why is it so hard to create real-textured skin by hand?

Because human skin behaves in particularly complex ways at the level of fine detail. Utterly unlike those smooth, plasticky virtual faces, real skin has considerable complexity:

when light enters the body, different wavelengths are absorbed differently, with different scattering distances and decay rates; light entering the skin is further shaped by the transmittance of the skin surface, by pore and wrinkle structure, and even by the structure of the subcutaneous tissue.

Hand work alone has therefore never reproduced photorealistic skin; skin with subtle, pore-level variation is simply too hard to craft by hand.
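To give a feel for the wavelength-dependent behavior just described, here is a toy Beer-Lambert style attenuation sketch. The per-channel coefficients are invented purely for illustration; production skin shaders use measured parameters and proper subsurface scattering models.

```python
# Toy model of wavelength-dependent decay in tissue: exponential falloff per
# RGB channel, with made-up coefficients (red penetrates deeper than blue).
import numpy as np

def transmitted_rgb(light_rgb, depth_mm, sigma_rgb=(0.9, 1.6, 2.4)):
    """Attenuate an RGB light value after traveling depth_mm through tissue."""
    sigma = np.asarray(sigma_rgb)
    return np.asarray(light_rgb) * np.exp(-sigma * depth_mm)

print(transmitted_rgb([1.0, 1.0, 1.0], depth_mm=0.5))
# Red survives best, one reason thin skin (ears, fingers) glows warm under strong light.
```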

The turnaround did not come until around 2008, when the tech geeks produced a piece of black technology that could reconstruct facial skin in three dimensions at high precision: on top of an accurate geometric model of the face, it could faithfully generate surface wrinkles and the structure of every pore (truly impressive...). At the same time, the decay of different wavelengths of light in subcutaneous tissue was described with a physical formula, finally yielding "photo-level" realistic skin.

This is the Light Stage, well known in the film industry, perhaps the most formidable face-scanning technology on the planet and the most representative industrial implementation of dynamic light field reconstruction.

Light Stage is a 3D capture and reconstruction platform led by Paul Debevec at the Graphics Lab of USC's Institute for Creative Technologies (ICT). The first generation appeared in 2000 and drew attention from day one; the system has since evolved through Light Stage 6, and the latest generation is named Light Stage X.

Light Stage's light field scanning has produced a string of top graphics papers. For the general reader, one thing is enough to understand: Light Stage collects lighting data from the face under many different illumination angles, then computes and recovers ultra-high-precision facial surface detail.

The technique can reconstruct the structure of every pore on a face. The Light Stage scan of former US President Barack Obama, pore detail and all, is genuinely stunning.

Interestingly, both photogrammetry rigs and the heavyweight Light Stage use similar spherical camera arrays to capture the face from many angles. The obvious difference is that the former shoots under soft, uniform light to avoid uneven highlights and shadows, while the latter deliberately photographs the face under strongly varying illumination. The two systems' lighting philosophies point in opposite directions.

So by watching how the array is lit during a shoot, a reader can tell at a glance whether the rig behind it is doing photogrammetry or dynamic light field reconstruction.

When Light Stage 2 took shape, Scott Stokdyk, visual effects supervisor at Sony Pictures Imageworks, worked with the Light Stage team to create virtual stand-ins for actors Alfred Molina (Doc Ock) and Tobey Maguire (Spider-Man) in Spider-Man 2. Used in nearly 40 shots, the technology helped the film win the 2004 Academy Award for Best Visual Effects.

Light Stage 2 went on to be used in a wider range of productions, including King Kong (2005) and Superman Returns (2006). The latter used Light Stage 2 scans to build a fully digital Superman for many of its action shots, and that digital Superman helped earn the film an Oscar nomination for Best Visual Effects.


"Superman Returns"

The mature Light Stage 5 has been widely used in Hollywood blockbusters, many of them familiar: The Curious Case of Benjamin Button, Spider-Man 3, Avatar...

Strictly speaking, Light Stage's core technology does not measure geometry itself; it still relies on a photogrammetry-like method to obtain an accurate 3D model of the head, then uses photometric stereo to compute high-precision surface detail on top of it. That is how Light Stage recovers pore-level skin structure, and it is why the technology is favored by so many Hollywood blockbusters.
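The photometric stereo step can be sketched compactly: under a Lambertian assumption, the intensities a pixel shows under several known light directions determine its surface normal and albedo by least squares. This is the textbook form of the idea, not Light Stage's unpublished processing.

```python
# Classic photometric stereo: per-pixel normal and albedo from images taken
# under known light directions, assuming a Lambertian surface.
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """intensities: (k, h, w) images; light_dirs: (k, 3) unit light vectors."""
    k, h, w = intensities.shape
    I = intensities.reshape(k, -1)                              # k observations per pixel
    G, _, _, _ = np.linalg.lstsq(light_dirs, I, rcond=None)     # solve L @ G = I
    G = G.T.reshape(h, w, 3)                                    # G = albedo * normal
    albedo = np.linalg.norm(G, axis=2)
    normals = G / np.maximum(albedo[..., None], 1e-8)
    return normals, albedo
```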

However, although Light Stage is a mature system with published papers, there is no public solution for its core high-precision surface detail computation, and many algorithmic details remain unknown. As a result, scanning technology at this level was absent in China for a long time.

Lacking the core algorithms, most spherical scanning rigs on the domestic market still use the photogrammetry methods described above; their spherical arrays serve only for uniform lighting control, camera calibration and the like. Such systems cannot match Light Stage where it matters most: the precision of skin detail.

(With upgrades to the RealityCapture software, photogrammetry can now get reasonably close to pore-level detail, and it is an affordable way to reconstruct.)

Recently, several domestic companies have been working on systems similar to Light Stage; here's hoping that Light Stage-level 3D face reconstruction becomes available in China soon.

Beyond Light Stage, there is another dynamic light field reconstruction concept. The idea behind this so-called "light field imaging" is simpler and cruder: ignore the object's geometry and surface material, directly capture the light the 3D object reflects from every angle under all kinds of conditions, and then recombine and replay that captured light at render time, letting viewers see a "real" three-dimensional world.

Notice that the supposedly most advanced method of creating a "realistic world" ultimately comes back to basics:

collect as much information as possible, then recompute and recombine it for output.

Whether it is 3D reconstruction or the future's big-data-fusion approaches to driving virtual humans, the core idea is the same: from reality, back to reality.

A face-pinching game blessed by big data

It took some space to introduce the most powerful 3D reconstruction technology on the planet, only to reach a slightly deflating conclusion: the cost and barrier to entry of that kind of face scanning are far too high, and independent film and television creators should not count on such a nuclear-grade system.

No matter: we also have big data and artificial intelligence.

The idea here is simple too: we may not have a fancy rig to scan real people directly, but if ready-made scan data is combined with an interactive system like a game's face-pinching screen, could that give ordinary users a photorealistic virtual human generation service?

Someone has done it, and that is Unreal's MetaHuman Creator.

(Yes, Unreal again.)


MetaHuman Creator interface

It is worth noting that MetaHuman Creator is a cloud rendering service: users connect and interact through a web page, all data operations and generation are rendered in the cloud, and the cloud backend runs on Unreal Engine itself.

At first glance MetaHuman looks like a very simple system, rather like the face-pinching screen when you create a game character.

But behind the simplicity lies the crystallization of recent advances across several computing disciplines:

ultra-large-scale 4D face scanning, machine-learning-based data processing and fusion, real-time 3D engine support, cloud rendering services... The combined ingenuity of countless computer scientists and engineers has produced a hyper-realistic virtual human generation system that ordinary people can use.

(Note: little more has been disclosed so far, but judging by the results, MetaHuman's 4D scan data is probably similar to Light Stage's light field scan reconstruction.)

In fact, MetaHuman's current big-data model fusion covers only the face (admittedly the hardest part), while the body offers only a traditional selection of preset styles. The reason is simple: there is no comparable big-data corpus of full-body scans.
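Epic has not published how MetaHuman's fusion works internally, but the general flavor of blending scan data can be sketched as a convex combination of template faces that share one mesh topology. Everything below is an illustrative stand-in, not MetaHuman's actual algorithm.

```python
# Blend a "new" face as a weighted (convex) combination of scanned template
# heads that share the same vertex topology. The templates here are random
# stand-ins; a real system would use processed scan data.
import numpy as np

def blend_faces(template_vertices, weights):
    """template_vertices: (n_templates, n_vertices, 3); weights: non-negative."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # keep the combination convex
    return np.tensordot(w, template_vertices, axes=1)

templates = np.random.rand(4, 5000, 3)               # four stand-in scanned heads
new_face = blend_faces(templates, [0.5, 0.2, 0.2, 0.1])
print(new_face.shape)                                # (5000, 3): one blended head
```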

Even so, MetaHuman Creator delivers a hyper-realistic virtual human generation service to ordinary consumers (it is actually free, and the virtual human generated online can be exported directly as data and put to use), which is already a remarkable thing.

It is no exaggeration to say that MetaHuman Creator is a technological breakthrough in virtual human production: it greatly simplifies the creation of hyper-realistic virtual humans and, to a degree, brings virtual human production into ordinary homes.

Thinking further from the user's point of view: how does an ordinary person design a handsome or beautiful virtual human? Pinching a face to match a celebrity photo is one way.

But celebrity faces usually involve likeness rights, and in film and television production, rights are a serious issue.

Is there a way to generate an attractive face from scratch?

In this small corner of face creation, big-data-powered AI has once again shown its strength. With deep learning we can already generate faces with various stylistic tendencies. Below are some East Asian celebrity-style faces and ordinary Western faces the author generated at random using publicly available deep learning models.


The faces above are purely computer-generated, yet mixed in among real photos they would pass for the real thing.
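For the curious, generating such faces typically means sampling random latent vectors and pushing them through a pretrained StyleGAN-style generator. In the sketch below, load_pretrained_generator is a hypothetical placeholder for whatever checkpoint-loading code the chosen open-source model actually provides.

```python
# Sampling faces from a StyleGAN-style generator (PyTorch).
# `load_pretrained_generator` is a hypothetical helper standing in for the
# loading code shipped with whichever open-source face model is used.
import torch

generator = load_pretrained_generator("face_generator.pt")   # hypothetical loader
generator.eval()

with torch.no_grad():
    z = torch.randn(8, 512)        # latent codes; 512 dims is typical for StyleGAN
    fake_faces = generator(z)      # (8, 3, H, W) images of people who do not exist
```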

If one day MetaHuman Creator adds a feature that takes an uploaded face photo and automatically learns and matches the face-pinch parameters, we will truly have one-click virtual humans.

Incidentally, the virtual human AYAYI mentioned earlier was created with MetaHuman Creator.

That more or less wraps up the story of creating a virtual human model.

Once the 3D model exists, there is actually another very challenging task: binding the parts of the character model to the controllers that will later drive its motion and expressions. It is like properly connecting the skin to the muscles and bones underneath, so that a static model can be driven at all.

Here a face-pinching system like MetaHuman has an edge: every character derives from the same base model, so the internal rig can be standardized. For a hyper-realistic model scanned directly with a camera array, by contrast, rigging is heavy work. This area, too, is getting help from big data and AI, but we will not dwell on it here.
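As a glimpse of what "binding" means mechanically, here is a minimal sketch of linear blend skinning, a common baseline technique: each vertex follows a weighted mix of its bones' transforms. Real production rigs layer facial controls, corrective shapes, and much more on top of this.

```python
# Linear blend skinning: deform rest-pose vertices by a weighted mix of bone
# transforms. weights[v, b] says how strongly bone b influences vertex v.
import numpy as np

def skin_vertices(rest_vertices, bone_matrices, weights):
    """rest_vertices: (v, 3); bone_matrices: (b, 4, 4); weights: (v, b), rows sum to 1."""
    v_h = np.hstack([rest_vertices, np.ones((len(rest_vertices), 1))])  # homogeneous coords
    per_bone = np.einsum('bij,vj->bvi', bone_matrices, v_h)[..., :3]    # every bone moves every vertex
    return np.einsum('vb,bvi->vi', weights, per_bone)                   # blend by skin weights
```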

Getting virtual humans moving: keyframe animation, motion capture, AI-driven animation

Keyframe animation

For a long time, whether the 3D model was a virtual human or a virtual creature, the way to make it move was keyframe animation.

Keyframe animation is an easy concept to grasp, somewhat like clay stop-motion: pose the clay figure, shoot a frame, repeat, and when the finished sequence is played back at 24 frames per second you get continuous motion. Obviously, making animation this way demands astonishing patience.

Keyframe animation works much the same way: replace the physical clay figure with a 3D model in software, have the animator pose the model at key points (keyframes) on the timeline, and let the software fill in the transitions between poses, producing keyframe animation of a 3D character.

Many tricks have been invented to make keyframing more efficient, but in essence keyframe animation is still carved out by 3D animators, frame by frame, by hand.
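A tiny sketch of the "software fills in the in-betweens" step: sample a 24 fps clip by interpolating between two keyed poses. Real animation tools use spline curves and quaternion interpolation for rotations; plain linear interpolation is shown only to make the idea concrete.

```python
# Generate in-between frames for two keyframe poses at 24 fps using simple
# linear interpolation per joint.
import numpy as np

def interpolate_keyframes(pose_a, pose_b, seconds, fps=24):
    """pose_a/pose_b: (n_joints, 3) joint values at the two keys."""
    n = max(int(seconds * fps), 1)
    frames = [(1.0 - i / n) * pose_a + (i / n) * pose_b for i in range(n + 1)]
    return np.stack(frames)

clip = interpolate_keyframes(np.zeros((20, 3)), np.ones((20, 3)), seconds=1.0)
print(clip.shape)   # (25, 20, 3): 25 frames of a 20-joint pose
```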

Motion capture

As with 3D modeling, hand-keyed animation cannot satisfy an industrialized pipeline in productivity, output quality, or labor cost. Ever since Gollum in The Lord of the Rings, motion capture has been in the public eye.

As the name suggests, motion capture records a performer's movements directly and then maps them onto the virtual human model to drive it. This is currently the main way virtual human motion is generated.

Motion capture has an interesting split between technical routes, somewhat like the one in self-driving car perception:

at the heart of autonomous driving is sensing the surroundings, and there are two camps: pure-vision cameras versus lidar. Which route is better? The vision faction and the lidar faction are still happily arguing about it.

Motion capture likewise splits into two major factions: optical motion capture and inertial motion capture.

Optical motion capture rings a studio with rig-mounted cameras, covering the performer from 360 degrees with no blind spots. The performer wears many infrared-reflective markers, and by tracking those reflective points synchronously across multiple cameras, the computer solves for the actor's movements.

Inertial motion capture straps inertial measurement units (accelerometer + gyroscope + magnetometer, and so on) to specific skeletal points on the body and computes the motion from the sensor readings.
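The core of inertial capture is sensor fusion: the gyroscope is responsive but drifts over time, while the accelerometer's gravity reading is noisy but drift-free. A toy one-axis complementary filter below shows the principle; real suits fuse all the sensors in 3D, typically with Kalman-style filters.

```python
# Toy one-axis complementary filter: integrate the gyro for responsiveness,
# and let the accelerometer's tilt estimate slowly cancel the gyro's drift.
def complementary_filter(gyro_rates, accel_angles, dt=0.01, alpha=0.98):
    """gyro_rates: angular velocity per sample (deg/s); accel_angles: tilt from gravity (deg)."""
    angle = accel_angles[0]
    estimates = []
    for rate, acc_angle in zip(gyro_rates, accel_angles):
        gyro_angle = angle + rate * dt                          # fast but drifts
        angle = alpha * gyro_angle + (1 - alpha) * acc_angle    # accel pulls it back
        estimates.append(angle)
    return estimates
```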

Optical motion capture is the film industry's main production method today because its accuracy is high enough.

The problem is that however good the results, optical motion capture means little for mass adoption. Ordinary people can hardly afford such an expensive studio, and its demands on space and equipment doom it to remain a rarefied technology.

Inertial motion capture is far cheaper. For tens of thousands of yuan you can now get a full-body inertial capture rig, motion capture gloves included. That is no problem for established influencers, and ordinary creators can manage it if they grit their teeth.

More important than the relatively low price, inertial capture equipment places no demands on the size of the space.

It is now not uncommon for higher-end virtual streamers to broadcast live using inertial motion capture gear.

Inertial capture does have a small drawback: as continuous use stretches on, the sensors accumulate drift and need recalibrating after a while. And although it is relatively friendly, it is still not exactly convenient: sensors must be strapped all over the body, and magnetic interference in the environment has to be avoided...

Tech geeks who want an even lazier workflow will not be fully satisfied.

Some readers must be thinking: our personal computers already have cameras. How nice it would be if, instead of rigging a whole room with cameras, one or two cameras could recognize motion accurately, the way human eyes do, and deliver motion capture.

Plenty of people have had the same idea, and the one that really pursued it was Microsoft, with Kinect for Xbox, which combined an optical camera with a depth camera.

Microsoft poured enormous human and financial resources into mass-producing Kinect hardware and into the algorithms behind it. Unfortunately, after two console generations of trying, Kinect was eventually abandoned. Reportedly, cumulative Kinect sales exceeded 35 million units, so it cannot be called a failed product; at the very least it helped expand the Xbox brand in its early and middle years. But in the end Kinect's ambitions went unrewarded, which many at Microsoft found regrettable.

Kinect's technology continued to shine after the product left the market. PrimeSense, the original provider of Kinect's core technology (which Microsoft later improved in house), was acquired by Apple for about $360 million in 2013, three years after Kinect launched. It is no surprise, then, that today's iPhones carry a built-in depth camera, and that Face ID shares some of its principles with Kinect.

In console entertainment there was likewise a contest between Sony PlayStation's motion-sensing controllers and Xbox's Kinect, in other words between the inertial route and the visual route. In that arena, the visual route ultimately lost.

Beyond the user-experience problems, Kinect was also constrained by the hardware of its day. It was, after all, only an accessory to a home entertainment system: its camera resolution, limited compute, and the Xbox's memory constraints all capped the accuracy of its human motion recognition.

Despite Kinect's setback, single-camera visual motion capture has kept evolving. For individual users, motion capture with a single camera is a very practical need.

Today, single-camera face and upper-body capture is already standard in some anime-style virtual streaming software and short-video apps. To be fair, though, these visual capture applications are still toys: fine for entertainment, but nowhere near the precision industrial production requires.

Why call them toys? One simple example: there is still no commercial visual motion capture software on the market that tracks all ten fingers really well (if one has appeared, please correct me). Without the ability to capture fine limb detail, visual capture tools cannot enter production use.

Excitingly, though, the deepening combination of big data, deep learning, and computer vision opens up many possibilities. Reportedly, research labs at some large tech companies can already obtain very accurate single-camera gesture recognition by pairing large motion databases with deep-learning-based visual recognition.

Optimistically, within the next year or two we may see single-camera visual recognition products accurate enough for production.

For independent film and television production the requirement is simple: motion capture that is easy to use and accurate enough. Optical capture is too extravagant; for now the preferred option is inertial capture gear, with acceptable price and acceptable results.

The ideal setup, in the author's view, will combine inertial motion capture with single-camera visual recognition.

Such a hardware-plus-software combination keeps costs bearable, and the two independent capture systems can cross-check and correct each other for more precise motion capture.
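As a toy illustration of that cross-correction idea (the author's proposal, not an existing product): per joint, keep the smooth inertial estimate as the baseline and pull it toward the camera's estimate in proportion to whatever confidence the vision model reports. All names and shapes here are assumptions for illustration.

```python
# Blend an inertial pose estimate with a single-camera visual estimate,
# weighting the camera per joint by its reported detection confidence.
import numpy as np

def fuse_poses(inertial_pose, camera_pose, camera_confidence):
    """inertial_pose, camera_pose: (n_joints, 3); camera_confidence: (n_joints,) in [0, 1]."""
    w = np.clip(camera_confidence, 0.0, 1.0)[:, None]
    return (1.0 - w) * inertial_pose + w * camera_pose
```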

AI-driven

Motion capture keeps getting better, but laziness knows no bounds:

could we skip the capture step altogether and let artificial intelligence drive the virtual human's movements?

There are already attempts at this, such as Baidu's AI sign-language anchor, a typical AI-driven virtual human.

AI-driven virtual humans matter very practically for the film and television creation this series focuses on:

With AI-driven virtual humans, the virtual extras and bit parts in future productions could be driven by AI.

The director would then only need to focus on the virtual lead's performance. A performer conveys the body movements (and of course expressions and dialogue) the director wants to the lead character through motion capture, while the AI-driven background characters only need to be arranged with preset instructions, or, going further, could parse the script's intent directly with natural language understanding and react to the lead's performance.

It sounds a touch like science fiction, but none of the individual links described here is an especially hard bone to gnaw.

The hard part may not be getting the virtual human's AI to understand semantic instructions and turn them into a performance, especially when supplemented by interactive tuning. We do not expect AI virtual humans to truly understand acting; we only need the AI characters, after corrections at a few key points, to interact convincingly with the motion-captured performance of the virtual lead.

The real difficulty is more likely whether the virtual human's performance is natural enough to fool the audience's eyes.

The AI-driven virtual human motion we see today is still fairly primitive. But the author is optimistic: much as with MetaHuman's emergence, given a sufficiently large database of human motion, convincingly lifelike AI-driven motion is only a matter of time.

In closing

On the topic of driving virtual humans, two aspects have not been covered: the virtual human's voice, and the driving of its facial expressions.

On the former: with today's hundred-billion-parameter large language models, AI-generated text conversation from a virtual human is already nearly indistinguishable from a person's, and turning generated chat text into natural speech already has good solutions.

In other applications, such as building an interactive metaverse, an AI that can converse autonomously matters more. Back on our film and television theme, whether a virtual human can chat on its own is less critical; what matters more is that it can perform to the script: delivering lines with emotion, facial expression, body language, and so on.

In the most basic implementation, the virtual human's lines and movements can come from a performer behind the scenes, which leaves the core question: how do we make the virtual human's facial expressions convincing enough to pass for real?

Let me leave that hanging for now; character expression, a crucial topic in film and television performance, will be taken up again in the third article.


Jasmine, the virtual representative of Meta Sky City

Before wrapping up this already over-long article, let me introduce Jasmine, a MetaHuman virtual human created by the author. As the representative of Meta Sky City in the metaverse, Jasmine will have more chances to meet you in future installments :)
