
"Procurement and Sales Dongge" opens the era of live streaming 3.0, as more than 100 enterprise CEOs have custom digital humans in training

Author: Titanium Media APP

Image source: JD live-streaming room

In 2023, the popularity of ChatGPT once again put AI at the center of the technology stage. At that time, however, generative AI was largely limited to text-to-text generation.

In February 2024, OpenAI dropped another "bombshell" on the technology world with Sora, a text-to-video large model. Sora can generate videos of up to 60 seconds from a user's text prompt. According to OpenAI, the model understands how the objects described in a prompt exist in the physical world and can simulate that world closely enough to generate complex scenes with multiple characters and specific types of motion. Building on the image quality and instruction-following capabilities of DALL-E 3, it can accurately interpret what the user asks for in the prompt. The arrival of Sora ushered in a new era of large-model applications.

Beyond Sora, the past five months have seen a wave of large-model applications across industries, and this year looks set to be one of rapid deployment for industry-specific large models. In live streaming, digital humans powered by multimodal large models are also raising the curtain on a new era.

That moment has already arrived. On April 16, the "Procurement and Sales Dongge" AI digital human built by JD Cloud Yanxi made its live-streaming debut, appearing simultaneously in the JD Home Appliances and JD Supermarket live rooms and opening the era of AIGC-driven e-commerce live streaming 3.0. The head of JD Cloud Yanxi told Titanium Media that since Dongge's debut, more than 100 enterprise CEOs have requested custom digital humans, and these are now being trained at full speed.

A year into deployment, the competition shifts to capabilities and applications

If 2023 was the year AIGC competed on computing power and parameter counts, then from 2024 onward the next few years will be an era of competing on applications and capabilities. The head of JD Cloud Yanxi told Titanium Media that engineering will play an increasingly important role, and that the real value of these technologies will ultimately show up in applications. "When a technology emerges, we hope it can eventually be deployed effectively in one or more real scenarios, not just in a demo. Digital human live streaming is one of the scenarios where we have found such value," the head of JD Cloud Yanxi said.

On the prospects for large-model digital humans in live streaming, the head of JD Cloud Yanxi told Titanium Media that digital human live streaming has a strong chance of becoming a hot spot in the field. "This is mainly because digital humans have reached a new level in content quality, Yanxi has built up deep experience in operating methodology, and people's acceptance of and trust in digital humans are now high," the head said.

Discussing how digital humans are used today, the head of JD Cloud Yanxi told Titanium Media that their value in live streaming lies mainly in human-machine collaboration. According to JD Cloud's statistics, live rooms that pair human hosts with digital humans perform significantly better than those run purely by humans or purely by digital humans. "At this stage, the value of digital humans is not to replace real people, but to take over from them in relay, creating a live room where 'the sun never sets' and capturing the value of off-peak hours," the head pointed out. "So far, Yanxi digital humans have increased conversion rates during off-peak hours by more than 30%."

End-to-end technology and 50,000 hours of voice data: what a large-model digital human should look like

Digital humans may be useful, but building one that is truly real-time, interactive, and lifelike is far harder than generating a one-minute video with Sora.

Titanium Media understands that, to create more realistic digital humans, Yanxi chose an end-to-end technical route more than two years ago, integrating modeling, driving, and rendering into a single pipeline. Coincidentally, Sora also takes an end-to-end approach.

Current end-to-end approaches fall into two broad categories: fully end-to-end, and partially modeled.

The fully end-to-end approach does not explicitly model any intermediate step; everything is learned implicitly.

The partially modeled approach first builds a model from facial data, then uses it to control the digital human's expressions and lip movements, and finally renders the textures. "JD Cloud chooses between these two solutions depending on the scenario, and uses both," the head of JD Cloud Yanxi pointed out.

Throughout the modeling process, the hardest problem is reproducing a character's large-range movements and poses. The head of JD Cloud Yanxi told Titanium Media that the inability to move through a wide range of motion is one of the main reasons many digital humans do not look like real people, and solving it is essential to making them look just like real hosts.

To address this, Yanxi refined its training process at several stages, including data collection and data cleaning, compressed and quantized the model, and adjusted the model's numerical precision. Through these techniques, the resulting digital human can move like a real person.

Another difference between Yanxi digital humans and Sora is that Yanxi must deliver synchronized speech in real time. The head of JD Cloud Yanxi told Titanium Media that JD Cloud trained Yanxi on more than 50,000 hours of diverse speech data, so that the underlying foundation model learned the basic patterns of human pronunciation and formed a strong voice model. "After training on more than 50,000 hours of data, the foundation model can imitate anyone's voice, and this ability is not limited to Chinese; it can even speak fluent English," the head emphasized.

Notably, after this 50,000-hour training, the foundation model needs only a few voice clips from the person being imitated. Without any further training, it can directly synthesize speech in that person's original timbre and speaking style.

Combining voice and video, the "Procurement and Sales Dongge" digital human showed no noticeable flaws in several recent live streams, which can arguably be regarded as passing the Turing test.

Despite these strong capabilities, the head of JD Cloud Yanxi believes that the essence of live streaming is operations. "Operations should be results-oriented and drive planning, laying out every live stream in terms of image, performance, set decoration, interaction, product display, and so on, while products and technology are designed and built closely around that plan," the head pointed out. "Following the core principle that operations are king, Yanxi also began offering managed operation services to some key brand partners this year, using these leading brands as anchors to explore effective digital human live-streaming practices, spread them quickly, and help the industry grow rapidly."

Lower cost and a lower barrier to entry are the way forward

Large-model digital humans are useful, but this is only the beginning: cost and barrier to entry are what determine whether a technology can be applied at scale. As AIGC develops rapidly, a different view has already emerged in the industry: bigger models are not necessarily better.

Robin Li, founder, chairman, and CEO of Baidu, has said publicly that future large-scale AI-native applications will mainly use a mix of large and small models. He described this approach as MoE: instead of relying on a single large model to solve every problem, it selects the appropriate model for each scenario. "In some specific scenarios, a fine-tuned small model can even perform as well as a large model," Robin Li said.
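The idea of choosing a model per scenario can be illustrated with a minimal routing sketch. Everything in it is hypothetical: the model names, the scenario labels, the stub functions standing in for real inference, and the rule that sends a request to a fine-tuned small model whenever one exists for its scenario and otherwise falls back to a general large model. It is not Baidu's implementation, and it is a simple dispatcher rather than a mixture-of-experts layer in the neural-network sense.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Model:
    name: str
    generate: Callable[[str], str]  # stand-in for a real inference call


def make_stub(name: str) -> Callable[[str], str]:
    # Hypothetical placeholder: tags the prompt with the model name instead of running inference.
    return lambda prompt: f"[{name}] {prompt}"


# Hypothetical catalog: fine-tuned small models for known scenarios...
SMALL_MODELS: Dict[str, Model] = {
    "product_qa": Model("small-product-qa", make_stub("small-product-qa")),
    "live_script": Model("small-live-script", make_stub("small-live-script")),
}
# ...and one general-purpose large model as the fallback.
LARGE_MODEL = Model("general-large", make_stub("general-large"))


def route(scenario: str) -> Model:
    """Pick the fine-tuned small model for a known scenario, else fall back to the large model."""
    return SMALL_MODELS.get(scenario, LARGE_MODEL)


def answer(scenario: str, prompt: str) -> str:
    return route(scenario).generate(prompt)


if __name__ == "__main__":
    print(answer("product_qa", "Does this fridge have a variable-speed compressor?"))
    print(answer("open_chat", "Write a short poem about spring."))
```

A production system would more likely route on richer signals, such as a classifier's confidence or a latency and cost budget, rather than a fixed scenario label; the dictionary lookup only keeps the idea visible.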

Similarly, Zhou Hongyi, founder and chairman of 360 Group, has pointed out that in specific deployment scenarios, making large models "small" while also making them "big" is an important trend, so that they can run on phones, computers, and all kinds of Internet of Things devices, especially intelligent connected vehicles. He expects more devices to be equipped with large models in 2024.

Xie Dong, chief technology officer of IBM Greater China and general manager of its R&D center, has also said publicly that for enterprises, the purpose of applying a model is to solve specific problems at a lower cost.

.......

These industry leaders' remarks make it clear that although large models are capable, their cost and barrier to entry are high, and ordinary enterprises can struggle to afford them. Smaller models, by contrast, let AI truly specialize in an industry and deliver the most value at the lowest cost.

Notably, JD Cloud Yanxi's algorithm director told Titanium Media that Yanxi digital humans currently support both cloud and on-premises deployment. On-premises deployment supports GPUs and also CPU-only setups, with no loss in quality. Compared with the many large-model products on the market that require GPUs, this offers clear advantages in both chip procurement and ongoing operating costs, lowering the barrier for brand owners. "JD Cloud's model can accurately estimate a character's pose and jointly optimize texture modeling for each image, so it produces natural, realistic results even with a lightweight model," the algorithm director said.

"There are many lightweight techniques in the industry, such as quantization and model compression, but JD Cloud Yanxi can run inference directly on CPUs, which is very important for saving costs," the algorithm director added.
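To make "quantization" concrete: a common form maps 32-bit floating-point weights to 8-bit integers with a single scale factor, cutting storage roughly fourfold and letting CPUs do most of the arithmetic with integers. The sketch below shows generic symmetric per-tensor int8 quantization in plain Python. It is a textbook illustration under that assumption, not a description of how Yanxi's models are compressed, which the article does not disclose; the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid division by zero for all-zero tensors
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]


def int8_dot(qa: List[int], sa: float, qb: List[int], sb: float) -> float:
    """Dot product with integer multiply-accumulate, rescaled to float only once at the end."""
    acc = sum(a * b for a, b in zip(qa, qb))  # pure integer accumulation
    return acc * sa * sb


if __name__ == "__main__":
    w = [0.52, -1.27, 0.003, 0.9]  # illustrative weights
    x = [1.0, 0.5, -2.0, 0.25]     # illustrative activations
    qw, sw = quantize_int8(w)
    qx, sx = quantize_int8(x)
    exact = sum(a * b for a, b in zip(w, x))
    print("int8 weights:", qw, "scale:", sw)
    print("exact:", exact, "int8 approx:", int8_dot(qw, sw, qx, sx))
```

Each dequantized value lands within half a scale step of the original, so the int8 dot product stays close to the float result while the inner loop uses only integer multiplies. Real deployments typically add per-channel scales and calibration on representative data.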

The head of JD Cloud told Titanium Media that the digital human is generated directly in real time by a trained generative network model. "It can support thousands of simultaneous live streams directly in the cloud, further cutting broadcasting costs by 30%," the head emphasized.

Looking ahead, the head of JD Cloud Yanxi told Titanium Media that digital humans can be divided into three levels: the first is being on par with real people, the second is being comparable to real people, and the third is integrating a real person's thinking, cultural background, and so on. "Yanxi digital humans have already reached the level of being comparable to real people. But there is still a long way to go before a digital human has a real person's cultural background and reasoning and becomes a true digital clone. This is an important direction the Yanxi model will keep pursuing," the head noted.
