Xiaoice enters the workplace: the technical and business leaps behind "virtualizing" a person

In 8 years, from chatbots to virtual people, from a team at Microsoft to a billion-dollar startup.

Text 丨 He Qianming

Edited by 丨 Huang Junjie

In 1957, the first man-made object entered the universe and flew around the Earth for three weeks. People looked up and saw a small flash of light crossing the night sky, moving alongside the stars of myth.

Such a feat naturally evoked joy across the globe, but not the triumphant joy one might expect a human achievement to inspire. According to the political philosopher Hannah Arendt, the emotion was closer to long-awaited relief: science had finally caught up with expectations, and "humanity has finally taken the first step on the road out of the cage of the earth."

People quickly adjust their expectations of the world as technology advances. When a science fiction writer's imagination becomes reality, it is often technology finally catching up with what people already expected. Or, in Arendt's words, "Technology realizes and affirms that people's dreams are neither crazy nor nihilistic."

When soulless artificial "life" finally appeared on screens, doing almost the same work as a human being, people reacted in much the same way.

Science fiction writers have imagined many forms of artificial life that take over human jobs. The "replicants" in Blade Runner are laborers sent by humans to off-world colonies, with a lifespan of only four years. Samantha in "Her", an assistant without a body, evolves through chatting with people into an intelligence beyond that of humans. HAL 9000 in 2001: A Space Odyssey, with its deep red eye, turns from the astronauts' supercomputer assistant into a murderer.

These beings still exist only in science fiction, and we cannot even see a path to creating them. But some junior "virtual people" have taken over jobs that originally belonged to people, even if they exist only on screens.

For the past three years, at the Winter Games Center of the General Administration of Sport of China, "Guan Jun" has served as an assistant coach of the national freestyle skiing aerials training team. When an athlete is 15 meters up, with 2 to 3 seconds in the air to flip and twist, "Guan Jun" analyzes the trajectory and body posture in real time and points out mistakes immediately after the jump is completed, without having to rewind and watch slow motion as a human coach would.

At the Daily Economic News, once "N Xiao Hei" and "N Xiao Bai" receive the financial reports of listed companies, they pick out the key data and publish it within seconds. They also read the news live 24 hours a day.

At Vanke, "Cui Xiaopan" monitors the company's financial database, finds overdue payments, and immediately contacts colleagues to urge them to settle the matter.

These are among the more than 20 virtual people that Xiaoice has put to work across industries and enterprises.

In the wake of the "metaverse" boom, more than 60,000 companies newly registered in China over the past year are associated with "virtual people". But the startup whose business is currently thriving the most does not carry the genes of Chinese internet giants such as Tencent, Alibaba, and Baidu.

Xiaoice began as a text chatbot that Microsoft launched in 2014: "witty" and "dumb", but "useless". No one knew what this chattering robot in a dialog box was for.

Over the following seven-plus years, Xiaoice gained a voice, talked with people on the phone, sang songs, and learned to write poetry and paint. It grew eyes and began to understand the memes and pictures people sent. Later still, it began to generate others of its own kind that chat with people and play the role of lovers.

After becoming independent from Microsoft in 2020, Xiaoice entered the workplace and started earning money, and it can now support a technical team of hundreds of people. Its latest funding round, completed last July, valued the company at more than $1 billion.

Some investors describe Xiaoice's turnaround as "magical", and more than one entrepreneur has marveled at its transition: "I can't figure out how a company that originally did NLP (natural language processing) became a virtual person company."

"Virtualizing" a person at low cost

On the first Monday after the National Day holiday last year, the Daily Economic News began testing a 24-hour uninterrupted live video broadcast of financial information. Bloomberg uses more than 20 anchors for a similar live broadcast. The Daily Economic News has only two, sleepless and tireless. The male anchor always wears a red T-shirt, and the female anchor has only two formal outfits to alternate between.

The two anchors are based on living people, both professional anchors at the Daily Economic News, but the figures in the live broadcast are not real. They are avatars that Xiaoice built from the real anchors, able to imitate a real person's voice, lip shape, and facial expressions. Feed them enough content and they can broadcast without interruption.

The first step in making such an avatar has nothing to do with the two anchors. Xiaoice's engineering team first trained a speech model on a large number of clips of different people speaking, so that the model could learn the common characteristics of human speech, such as when intonation rises and when words are stressed. After this step, the virtual person knows how to imitate a person's tone of voice.

Engineers then spent half a day filming the two anchors broadcasting news in front of a green screen, aiming a multi-camera high-definition rig at each anchor's face to capture, in high resolution, the slightest changes in the lips and facial muscles as they spoke. This data was fed to artificial intelligence models so they could learn the relationship between the mouth, the facial expressions, and the eyes when a person speaks.

Next, algorithm engineers built the anchor's avatar from the collected data and trained a neural rendering model. Under the supervision of the two models above, it drives the avatar from the anchor's voice (or from speech converted from text), generates near-lifelike imagery, facial expressions, and lip shapes in real time, and stitches the frames into a video.

In the final picture that the audience saw, the body and clothing of the virtual anchor were filmed in advance, but the voice, lip shape, facial expression, and even blinking were all computer-generated.

Figure: Comparison of the Daily Economic News's real anchors and the virtual anchors generated by Xiaoice. Source: Xiaoice.

The difficulty in this process is making the virtual anchor more lifelike. That means more than matching a person's voice or lip shape: the face cannot be stiff while speaking, the teeth that show must be rendered clearly, and the avatar must blink, a detail that many companies making virtual people tend to overlook.

In December 2021, Xiaoice and the Daily Economic News officially announced that the two anchors of the live program, which had been online for more than two months, were virtual people. The wide discussion of how much of NVIDIA CEO Jensen Huang's keynote video had been virtually synthesized had only just died down.

"At the time, a lot of people asked me which clips in the video were real and which were fake," said Xu Yuanchun, chief operating officer of Xiaoice, "and I would tell them that they were actually all AI-generated."

After the success of the Daily Economic News case, enterprises began contacting Xiaoice one after another to consult and cooperate. Before the start of the Beijing Winter Olympics, Xiaoice made an avatar of Feng Shu, the host of "China Weather", which broadcast weather indices in real time to the competitors and spectators at each venue.

In late February, Xiaoice and the public relations company BlueFocus launched a platform for producing and driving virtual people, named "Fen You Shu". It creates avatars of busy business executives that take part in events on their behalf and automatically generates speech videos from speeches written in advance.

Zhao Wenquan, chairman of BlueFocus, was Xiaoice's first customer in this business. His avatar delivered Spring Festival greetings to his employees on his behalf.

For now, simulating a person in real time at low cost means simulating only part of the body, such as the face, and combining it with body-movement video shot in advance. As a result, the virtual person's range of movement is very limited.

In order to make the virtual person move, there are more elaborate video production solutions in the industry, but the cost is higher.

To produce a 14-second avatar video of Jensen Huang, NVIDIA used hundreds of digital cameras and took thousands of photos from multiple angles to model Huang and his leather jacket. It then had professional actors study Huang's movements and play the role of the "middle person", the performer behind the avatar, in an 8-hour recording session.

The same is true of Liu Yexi, the virtual influencer who went viral on Douyin last year, and of A-SOUL, the virtual group created by ByteDance and Yuehua Entertainment. Their videos can move, but they also rely on a "middle person" to perform the actions first, which are then mapped onto the virtual person and finished with post-rendering. At the most expensive, such video costs nearly 10,000 yuan per second to produce. Any new action has to be recorded all over again.

A "middle person" can also make a virtual person move in real time, but they have to wear a motion capture suit that usually costs 100,000 yuan and a head-mounted device that captures facial expressions, and the final picture is still very rough.

According to Xu Yuanchun, Xiaoice's cost for having a virtual person simulate a human news broadcast in real time is more than an order of magnitude lower than schemes that rely on a "middle person" to record video.

Replacing the middle person with technology, so that virtual people can move in real time at low cost, will most likely be achieved in the future. It depends on accumulating enough useful data to train models that handle body movement, and on cheaper computing power.

Anthropomorphism as a business model

Xiaoice became a separate company in 2020. After the Spring Festival that year, Li Di, the head of the Xiaoice project, convened an online meeting of product and technology leaders and said that Xiaoice might become independent. In July, Xiaoice announced its independence.

Before that, Xiaoice was a project under the Microsoft (Asia) Internet Engineering Institute. Before Xiaoice, the Institute's main project was the Bing search engine.

Xiaoice was an outlier at Microsoft, starting with its leader: Li Di is not very Microsoft in style. He was admitted to Tsinghua University's electrical engineering department, switched to law partway through, and after graduating in 2002 once considered becoming a painter, but ultimately started his career in the technology industry. He built products at LG and Sina, then started his own business, and also held a senior position at a subsidiary of a central state-owned enterprise.

In 2013, Li Di joined the Engineering Institute to work on products. Within a few months, he persuaded the internal team to build Xiaoice, a technical project far removed from Bing's image.

As soon as Xiaoice was born in 2014, its distinctive small talk carried it beyond tech circles. Just two days after it went online, it had been pulled into 1.5 million WeChat groups, one-tenth of all WeChat groups at the time. But even as Xiaoice became famous, the point of its existence was constantly questioned, both by the outside world and within Microsoft.

"Why do you want to do EQ (emotional intelligence)? Why make a chatbot? All the doubts on the outside existed on the inside too," Li Di said.

But the wealth of research data helped Li Di win support. Three years after launch, Xiaoice's cumulative conversations exceeded 30 billion. Although Xiaoice was at first active in WeChat groups for less than 60 hours, it soon moved onto platforms such as Weibo and NetEase News, and later into smart hardware from Xiaomi, OPPO, vivo, Huawei, and Tmall Genie, all of which yielded interaction data at low cost.

Li Di did not agree with the view, popular at the time, that "algorithms determine everything". He believed that "data determines everything". This is also the core reason why Xiaoice launched "Xiaoice Island" in 2021, an application in which multiple virtual people live together. It helps Xiaoice obtain data on interactions between people and multiple virtual people, as well as among the virtual people themselves.

The Xiaoice team has always had good "luck". Less than two years after Xiaoice went online, AlphaGo defeated a top Go player, filling artificial intelligence with imaginative promise again after more than 20 years of quiet. And by the time it became independent in 2020, Xiaoice had built up new skills for "virtualizing" human likenesses, just in time to catch the "metaverse" and "virtual people" wave.

By the time Xiaoice became independent, a whole generation of Chinese artificial intelligence companies had tried various ways to monetize, mainly settling on two paths:

Some companies chose security, embedding image processing technologies such as facial recognition into camera networks and selling them to government agencies and corporations. Examples are SenseTime, Megvii, Yitu, and CloudWalk, known as the "four little dragons" of Chinese AI.

Other companies chose hardware, putting voice assistants into smart speakers, fitness mirrors, and other devices, hoping that selling hardware would let them seize the entry point of human-computer interaction and make a lot of money. The most representative in China is Xiaodu, which was spun off from Baidu.

During his time at Microsoft, Li Di had already ruled out both business models. He felt that in neither business is artificial intelligence the reason customers pay. "If people spend 2,000 yuan on a smart speaker, it is because the hardware is worth that price. If people had to pay monthly for the AI assistant inside the speaker, no company would be confident."

The four little dragons that took the security route wanted to cut in from the technology side and seize business from the traditional security companies Hikvision and Dahua. But they could not beat those companies' stronger sales systems, hardware manufacturing capabilities, and understanding of user needs accumulated over many years. In 2020, Hikvision's revenue reached 63.5 billion yuan, ten times the combined revenue of the four little dragons.

At the peak of the hype, people were full of fantasies about artificial intelligence. Li Di said the whimsical requests he received included AI stock picking: "If AI companies could guarantee they pick the right stocks, they should invest directly rather than sell the technology to fund companies."

Li Di also has no expectation that artificial intelligence will develop "autonomous consciousness". While still at Microsoft, Xiaoice worked with PPTV on live broadcasts of La Liga matches, joining the discussion with viewers during the stream. Late one night, Xiaoice's engineers suddenly noticed that Xiaoice was behaving like a real person: it was randomly @-mentioning viewers of the live broadcast and deliberately ignoring some people's questions. None of this had been designed by the Xiaoice team.

"By 3 a.m., we found it was a bug," Li Di said. "A lot of the time, when you watch it (Xiaoice) talk, it seems to make a lot of sense, but it is usually a bug. There is no consciousness at all."

Xiaoice ultimately chose the path it had been on all along: adding emotion to a machine's functions and making it more like a person.

"Cui Xiaopan", built by Xiaoice and Vanke, is a representative project.
