
Meta shares AI voice system CAIRaoke: Building AR/VR voice assistants for natural dialogue

Author: Yingwei Nweon

(Nweon, February 25, 2022) Meta recently held an artificial intelligence event called "Meta AI: Inside the Lab". In addition to demonstrating the latest breakthroughs made by its AI team, Meta also wanted to illustrate how AI will power the company's metaverse in the future.

In an effort called Project CAIRaoke, Meta wants to build a future AI assistant that can have a natural conversation with you. Nweon has summarized the team's blog post below:


If we could interact with AI assistants in natural conversational language, just as we do in ordinary human-to-human communication, it would greatly improve our quality of life. But whether we communicate with them by voice or by text message, today's AI assistants always feel mechanical. When you make a common request like "Mute all notifications for the rest of the day, unless it's my mom," they often fail to respond properly, let alone handle questions like "Can I rent a local community center for a private party?" or complex tasks like "Plan an affordable family beach vacation for the weekend of July 4."

So, it's time to deliver better conversational AI.

To achieve this goal, Meta has officially announced Project CAIRaoke. The team developed an end-to-end neural model and has already used the model produced by Project CAIRaoke in Portal. It can hold more personal and contextual conversations than the systems people are familiar with today. The company's goal is to integrate it with augmented reality and virtual reality devices to enable immersive, multi-modal interactions with AI assistants in the future.

Perhaps the biggest obstacle to better conversational AI is the architecture that drives today's advanced digital assistants. Although such a system appears to provide a single service, it actually relies on four separate components: natural language understanding (NLU), dialog state tracking (DST), dialog policy (DP), and natural language generation (NLG). These disparate AI systems must be joined together, so they are difficult to optimize, are poor at adapting to new or unfamiliar tasks, and are highly dependent on labor-intensive annotated datasets.
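The four-component pipeline described above can be sketched roughly as follows. This is a minimal illustration only, not Meta's implementation: every function name, intent label, and template here is a hypothetical placeholder chosen to show how the stages hand data to one another.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    """Slots accumulated across turns (what the DST stage maintains)."""
    slots: dict = field(default_factory=dict)


def nlu(utterance: str) -> dict:
    """Toy NLU (hypothetical): map an utterance to an intent frame."""
    if "weather" in utterance.lower():
        return {"intent": "get_weather", "slots": {}}
    return {"intent": "unknown", "slots": {}}


def dst(state: DialogState, frame: dict) -> DialogState:
    """Toy DST: fold the latest NLU frame into the running state."""
    state.slots["intent"] = frame["intent"]
    state.slots.update(frame["slots"])
    return state


def policy(state: DialogState) -> str:
    """Toy DP: choose the next system action from the tracked state."""
    if state.slots.get("intent") == "get_weather":
        return "inform_weather"
    return "fallback"


def nlg(action: str) -> str:
    """Toy NLG: render the chosen action with a fixed template."""
    templates = {
        "inform_weather": "Here is today's forecast.",
        "fallback": "Sorry, I didn't understand that.",
    }
    return templates[action]


def respond(utterance: str, state: DialogState) -> str:
    """Run one turn through the NLU -> DST -> DP -> NLG chain."""
    return nlg(policy(dst(state, nlu(utterance))))
```

In this sketch, a follow-up such as "Is it hotter than last week?" falls through to the fallback template, because each stage only sees what the previous stage was written to pass along.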

This is one of the reasons why the digital assistants on most devices today offer only rigid options, forget the context of the conversation, and follow a prescribed conversation flow. For example, you can ask your assistant about the local weather forecast, but if you follow up with a simple but unexpected question like "Is the weather hotter than last week?", it will not be able to respond well.

With the model created by Project CAIRaoke, people will be able to talk to conversational assistants naturally: they can refer back to something said earlier in the conversation, change the topic completely, or mention things that depend on understanding complex, subtle context. They will even be able to interact with assistants in entirely new ways, such as by using gestures.

Meta has already started using the model on video calling device Portal to make it easier to create and manage reminders. For example, you can quickly clarify a request like this:

You: Set a reminder for 6:30.

A: 6:30 a.m. or 6:30 p.m.?

You: In the evening, and call the reminder "Buy Eggs".

A: Okay, your reminder to buy eggs is set for 6:30 p.m. tomorrow.

Even in this early test, Meta believes that the model performs better than the standard approach. In evaluations on Portal, the team found that Project CAIRaoke delivered a significant improvement on reminders compared with the existing approach. The evaluation measured the success rate in completing a set of reminder goals while keeping the number of dialog turns on par.
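As a rough illustration of the kind of measurement described, the sketch below computes a goal success rate and an average turn count over a set of dialog sessions. The `Session` structure and the sample sessions are invented for illustration; they are not Meta's evaluation code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    """One evaluated dialog (hypothetical record format)."""
    goal_completed: bool  # did the assistant complete the reminder goal?
    turns: int            # number of dialog turns the session took


def success_rate(sessions: List[Session]) -> float:
    """Fraction of sessions in which the goal was completed."""
    if not sessions:
        raise ValueError("need at least one session")
    return sum(s.goal_completed for s in sessions) / len(sessions)


def average_turns(sessions: List[Session]) -> float:
    """Mean number of turns per session."""
    if not sessions:
        raise ValueError("need at least one session")
    return sum(s.turns for s in sessions) / len(sessions)


# Invented sample data, purely to exercise the functions.
baseline = [Session(True, 4), Session(False, 6), Session(True, 4), Session(False, 6)]
candidate = [Session(True, 4), Session(True, 5), Session(True, 5), Session(False, 6)]
```

Comparing `success_rate` between two systems is only meaningful if `average_turns` stays similar, since a system could otherwise raise its success rate simply by asking many more clarifying questions.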

But this is only the first step in taking advantage of the new technology. The team believes that Project CAIRaoke's progress will enable richer communication between humans and AI and become an important tool for building the metaverse. In the future, a Project CAIRaoke digital assistant built into AR glasses may interact with you in a range of ways that feel natural. For example, if you ask, "What goes with these pants?", it could answer, "Here is a shirt in your favorite color, red," and even show images of related items. If you then say, "I like it, but the stripes are too wide," it would show a pinstriped version.

In the future, Meta hopes to use the models produced by this project in everyday applications for millions of people around the world.

1. Build truly interactive conversational AI

A necessary step in advancing conversational AI is to understand the full scope of the problem. You may know of the many recent advances in NLU, such as BART and GPT-3, and think that the challenge of understanding and generating human-like text has been solved. But we have not reached that milestone yet. To see why, we must distinguish between AI for understanding and AI for interaction. The former has been well researched and developed throughout the industry. It is used to extract meaning from various input modalities, as in automatic speech recognition, image classification, and NLU. The latter is how we use our understanding of the world to interact with the people who use the technology. This can mean sending text, voice prompts, haptic feedback, images, videos, or a combination of these.

Researchers and engineers across the industry agree that a good conversational system requires a solid understanding layer powered by AI models. But many people see interaction as an engineering problem, not an AI problem. On this view, an engineer who knows the state of the world can write elaborate logic to handle the required interaction. The engineering approach makes it easy to understand how the system works and to debug the logic quickly when necessary. However, this widespread belief has led to less robust conversational AI, which is one of the main reasons why you cannot easily plan your vacation through these assistants.

2. A new, unified approach


The example dialog above demonstrates the key skills that Meta wants the assistant to have: not only providing accurate, up-to-date real-world knowledge, but also working across multiple modalities (in this case, vision and speech) and across domains (sending messages and estimating arrival times), while allowing you to drive the conversation without having to follow rigid conversation templates.

The canonical approach to building AI assistants requires four sets of inputs and outputs, one for each layer of the pipeline (NLU, DST, DP, and NLG). It also requires defining standards for the input and output of each layer. For example, in the case of NLU, a traditional conversational AI system requires a defined ontology.

However, Meta's model uses a neural network and does not prescribe a conversational flow at all. With this model, the team needs only a single set of training data.

Project CAIRaoke reduces the amount of work required to add new domains. In the canonical approach, expanding into a new domain requires building and fixing each module in turn before the next module can be reliably trained. In other words, if the NLU and DST change every day, the DP cannot be trained effectively. Changes to one component can affect other components, triggering retraining of all subsequent modules. These interdependencies slow down progress on the downstream modules. With the end-to-end technique, Meta eliminates the reliance on upstream modules, which increases development and training speed and enables teams to fine-tune other models with less effort and data.

In this new approach, dialogs are more robust because the model can make decisions by looking at all the information in one place. Previously, even a small error in one component could propagate to other components in unexpected and hard-to-resolve ways. For example, current rule-based assistants are explicitly programmed to look for specific words or phrases after a number, such as "p.m." to indicate the afternoon. Project CAIRaoke, by contrast, leverages advanced pre-trained language models that better understand context and can recognize different ways of expressing the same thing.
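To make the rule-based limitation concrete, here is a minimal sketch of a time parser that only recognizes an explicit "a.m." or "p.m." marker after the digits. The function name and the regular expression are illustrative assumptions, not code from any real assistant.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule: digits in H:MM form followed by a literal "a.m." or "p.m.".
_TIME_RULE = re.compile(r"\b(\d{1,2}):(\d{2})\s*(a\.m\.|p\.m\.)", re.IGNORECASE)


def parse_time_rule_based(utterance: str) -> Optional[Tuple[int, int]]:
    """Return (hour, minute) in 24-hour form, or None if the rule does not match."""
    match = _TIME_RULE.search(utterance)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    is_pm = match.group(3).lower() == "p.m."
    if is_pm and hour != 12:
        hour += 12  # 6:30 p.m. becomes 18:30
    if not is_pm and hour == 12:
        hour = 0    # 12:15 a.m. becomes 00:15
    return hour, minute
```

A request phrased as "6:30 in the evening" carries the same meaning but does not match the rule, so this parser returns `None` for it. That is the kind of paraphrase a pre-trained language model is expected to handle without a hand-written pattern.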

Finally, Project CAIRaoke incorporates the technology behind BlenderBot 2.0, Meta AI's latest conversational bot. This means that assistants built with the model can use empathetic language, relay knowledge found through real-time internet searches, and exhibit a consistent personality.

When systems generate natural language, they must address potential safety and privacy challenges. Today, most NLG components are scripted, so that content moderators can ensure that assistants do not give offensive responses to users. But when the assistant's generated output goes directly to the user, there is a risk of mistakes or offensive interactions.

Importantly, Meta has added safeguards to BlenderBot that help reduce offensive responses. The team also develops assistant technologies with privacy in mind. For example, on Ray-Ban Stories and Portal, the use of voice commands is optional, you can view and delete transcripts of your voice commands, and you always have the option to turn off voice storage.

To reduce the risk of generating objectionable responses to users, Project CAIRaoke's first milestone is to generate both dialog actions and natural language. In the short term, the model generates dialog actions and relies on a tested and tightly constrained NLG system to produce the user-facing response. In the long run, once the end-to-end integrity of the model has been ensured, the team will expose the generated sentences directly.
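The short-term arrangement described above, in which the model emits a structured dialog action and a constrained NLG layer turns it into text, might look something like the following sketch. The action names, slot names, and templates are hypothetical; the point is only that every user-facing sentence comes from a vetted template rather than from free-form generation.

```python
from dataclasses import dataclass, field


@dataclass
class DialogAction:
    """Structured output of the model (hypothetical format)."""
    name: str
    slots: dict = field(default_factory=dict)


# Vetted templates: the only sentences the assistant is allowed to say.
TEMPLATES = {
    "ask_am_pm": "Is that a.m. or p.m.?",
    "confirm_reminder": "Okay, your reminder to {title} is set for {time}.",
}

FALLBACK = "Sorry, I can't help with that yet."


def render(action: DialogAction) -> str:
    """Render an action with its template; refuse anything unrecognized."""
    template = TEMPLATES.get(action.name)
    if template is None:
        return FALLBACK
    try:
        return template.format(**action.slots)
    except KeyError:
        # A required slot is missing, so fall back rather than guess.
        return FALLBACK
```

For example, `render(DialogAction("confirm_reminder", {"title": "buy eggs", "time": "6:30 p.m. tomorrow"}))` yields the confirmation sentence, while an action name outside the template table yields the fallback message.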

Another problem is that the model may confidently state incorrect information. This is a major challenge for end-to-end techniques, because the model may introduce or alter entities in the dialog based on its training data. For example, if you ask the assistant to set a reminder to call someone with an uncommon name, it might instead set a reminder to call a person with a similar but more common name. Meta is using a variety of data augmentation techniques and attention networks to make Project CAIRaoke more robust, and is leveraging BlenderBot 2.0 to reduce such issues.
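One simple way to detect the entity-substitution problem described above is to check that every entity in the model's output actually appears in what the user said. The sketch below is an illustrative post-hoc check under that assumption; it is not the data augmentation or attention-based approach Meta describes, and the function name and example names are made up.

```python
import re
from typing import List


def ungrounded_entities(utterance: str, slot_values: List[str]) -> List[str]:
    """Return the slot values that do not occur as whole words in the utterance."""
    missing = []
    for value in slot_values:
        # Word boundaries stop a shorter name matching inside a longer one.
        pattern = r"\b" + re.escape(value) + r"\b"
        if re.search(pattern, utterance, flags=re.IGNORECASE) is None:
            missing.append(value)
    return missing
```

With the hypothetical request "Set a reminder to call Marianne", an output slot of "Marianne" passes the check, while an output slot of "Maria" is flagged because it never appears as a whole word in the request.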

3. Use your voice to complete countless daily tasks

While the near-term implementation of the Project CAIRaoke model is for reminders on Portal, the team hopes to soon apply it to larger domains, which will help personalize people's shopping experiences and let people drive the flow of the conversation.

Meta also believes that this advance is particularly useful for building AI conversational capabilities for augmented reality. In the near future, people will regularly use voice assistants in AR glasses, just as they do today with smart speakers, smart watches, and other devices. With that in mind, the team is working to shrink the size of end-to-end models like this one. Researchers are also working to make the model easier to debug. This is a complex challenge because in the new framework information is represented in an embedding space, whereas in the canonical model it is explicit. To fully realize the vision of Project CAIRaoke, the team also needs to extend it to multiple languages and find ways to run the model efficiently.


The company concluded: "We can imagine that in a few years, Project CAIRaoke's technology will be the basis for the next generation of interaction between people and devices. For devices such as VR headsets and AR glasses, we expect this kind of communication to eventually become the ubiquitous, seamless method of navigation and interaction, just as touchscreens replaced keypads on smartphones. Our current model is an important step forward, but we have more to do to fully realize this vision. We are very excited about both the progress made so far and the challenges ahead."