
The AI keyboard warrior is coming: DeepMind begins training agents to "play" computers like humans

Report from Machine Heart

Machine Heart Editorial Department

Humans spend billions of hours a day using digital devices. If we can develop agents that assist with some of these tasks, we may enter a virtuous cycle: the agent assists, humans give feedback on its failures, and the agent improves and gains new capabilities. DeepMind has new research in this area.

If machines could use computers the way humans do, they could help us with everyday tasks. Such a setting would also let us exploit large-scale expert demonstrations and human judgments of interactive behavior, two factors that have driven much of AI's recent success.

Recent work on imitating natural-language behavior, generating code, and interacting multimodally in 3D worlds (DeepMind Interactive Agents Team, 2021) has produced models with impressive expressiveness, contextual awareness, and general knowledge. These studies are powerful demonstrations of two ingredients: a rich, compositional output space that is shared between machines and humans, and large amounts of human data and judgments to shape machine behavior.

One area that has received less attention despite having both ingredients is digital device control, that is, using digital devices to accomplish the many useful tasks they enable. Because the field deals almost exclusively with digital information, it scales well in data acquisition and parallelized control (compared with robotics or fusion reactors). It also combines diverse, multimodal inputs with expressive, composable, and human-compatible affordances.

Recently, in DeepMind's new paper "A Data-driven Approach for Learning to Control Computers," researchers focused on training agents to perform basic computer control with a keyboard and mouse, as humans do.

Paper: https://arxiv.org/pdf/2202.08137.pdf

The benchmark DeepMind used for its initial investigation of computer control is the MiniWob++ task suite, a challenging set of computer-control problems whose instructions call for clicking, typing, form filling, and other basic computer interactions (Figure 1b below). MiniWob++ also provides programmatically defined rewards. These tasks are a first step toward more open-ended human-computer interaction, in which humans specify tasks in natural language and provide follow-up judgments of performance.

The researchers focused on training agents to solve these tasks with methods that are, in principle, applicable to any task performed on a digital device and that scale favorably with data and compute. They therefore directly combined reinforcement learning (RL) with behavioral cloning (BC), where behavioral cloning is aided by the alignment between the human and agent action spaces (i.e., keyboard and mouse).
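To make this combination concrete, here is a minimal sketch of a joint BC + RL objective in PyTorch. The names (`policy`, `demo_batch`, `env_batch`, `bc_weight`) and the specific policy-gradient surrogate are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of mixing behavioral cloning (BC) with reinforcement
# learning (RL). All names here (policy, demo_batch, env_batch, bc_weight)
# are hypothetical; the paper's actual RL algorithm and loss weighting
# may differ.

def combined_loss(policy, demo_batch, env_batch, bc_weight=1.0):
    # BC term: maximize the likelihood of human actions on demonstration data.
    demo_logits = policy(demo_batch["obs"])
    bc_loss = F.cross_entropy(demo_logits, demo_batch["actions"])

    # RL term: a simple policy-gradient surrogate on environment rollouts,
    # weighting action log-probabilities by estimated advantages.
    logits = policy(env_batch["obs"])
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(
        env_batch["actions"])
    rl_loss = -(log_probs * env_batch["advantages"]).mean()

    # The two objectives are simply summed; BC acts as a behavioral prior.
    return rl_loss + bc_weight * bc_loss
```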

Specifically, the researchers explored computer control via keyboard and mouse, with goals specified through natural language. And instead of relying on hand-designed curricula and specialized action spaces, they developed a scalable approach based on reinforcement learning combined with behavioral priors derived from actual human-computer interactions.

This combination was already proposed in the vision of MiniWob (a 2016 benchmark from OpenAI for reinforcement-learning agents that interact with websites, of which MiniWob++ is an extension), but at the time it was not found to produce high-performing agents. Subsequent work therefore tried to improve performance by giving agents access to DOM-specific actions, and by reducing the number of actions available at each step through constrained exploration with carefully curated guidance. By revisiting the simple, scalable combination of imitation and reinforcement learning, the researchers found that the main missing ingredient for high performance was simply the size of the human trajectory dataset used for behavioral cloning: performance improves reliably as human data increases, using a dataset roughly 400 times larger than in previous studies.

The researchers achieved state-of-the-art and human-average performance across all tasks in the MiniWob++ benchmark and found strong evidence of cross-task transfer. These results show that a unified human-machine interface is very useful when training machines to use computers. Taken together, they suggest a recipe for human-like computer control that may extend well beyond the MiniWob++ benchmark.

Many netizens reacted to DeepMind's research with exclamations of "incredible."


Method

MiniWob++

MiniWob++ is a web-browser-based suite proposed by Liu et al. in 2018 as an extension of the earlier MiniWob (Mini World of Bits) task suite, a reinforcement-learning benchmark for interacting with websites in which agents perceive the raw pixels of small web pages (210×160 pixels) and produce keyboard and mouse actions. MiniWob++ tasks range from simple button clicking to complex form filling, such as booking a flight from a given instruction (Figure 1a).

Previous research on MiniWob++ considered architectures with access to DOM-specific actions, allowing agents to interact with DOM elements directly, without navigating to them by mouse or keyboard. DeepMind's researchers chose to use only mouse- and keyboard-based actions, hypothesizing that this interface would transfer better to general computer-control tasks, which offer no compact DOM to interact with. Finally, some MiniWob++ tasks require clicking or dragging operations that cannot be achieved with DOM-element-based actions (see the example in Figure 1b).

As in previous MiniWob++ studies, DeepMind's agents have access to a dictionary of text strings provided by the environment, which are entered into input fields for a given task (see Appendix Figure 9 for an example).

The figure below shows the computer-control environment running MiniWob++. Both humans and agents control the computer with a keyboard and mouse: humans provide demonstrations for behavioral cloning, and agents are trained to imitate this behavior or to act in pursuit of reward. Humans and agents attempt to solve the MiniWob++ task suite, which includes clicking, typing, dragging, form filling, and so on.


Environment interface

For agents to use computers like humans, they need an interface for transmitting and receiving observations and actions. The original MiniWob++ task suite provided a Selenium-based interface. DeepMind instead implemented an alternative environment stack designed to support any task an agent can perform in a web browser, optimized for security, functionality, and performance (Figure 1a).

The original MiniWob++ environment implementation accessed internal browser state and issued control commands through Selenium. DeepMind's environment instead interacts directly with the browser via the Chrome DevTools Protocol (CDP) to retrieve information from inside it.
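The details of DeepMind's stack are not public, but the basic pattern of driving Chrome over CDP looks roughly like the following sketch. It assumes a Chrome instance launched with `--remote-debugging-port=9222` and uses only standard CDP methods (`Page.captureScreenshot`, `Input.dispatchMouseEvent`).

```python
import asyncio
import base64
import json

import requests    # assumes Chrome launched with --remote-debugging-port=9222
import websockets

async def demo():
    # Find the WebSocket debugger endpoint of the first open tab.
    targets = requests.get("http://localhost:9222/json").json()
    ws_url = targets[0]["webSocketDebuggerUrl"]

    async with websockets.connect(ws_url) as ws:
        # Visual observation: capture a screenshot of the current page.
        await ws.send(json.dumps({
            "id": 1,
            "method": "Page.captureScreenshot",
            "params": {"format": "png"},
        }))
        reply = json.loads(await ws.recv())
        png_bytes = base64.b64decode(reply["result"]["data"])  # raw pixels

        # Action: a left click at pixel (100, 50), expressed as a press
        # followed by a release, mirroring the human mouse interface.
        for msg_id, event in ((2, "mousePressed"), (3, "mouseReleased")):
            await ws.send(json.dumps({
                "id": msg_id,
                "method": "Input.dispatchMouseEvent",
                "params": {"type": event, "x": 100, "y": 50,
                           "button": "left", "clickCount": 1},
            }))

asyncio.run(demo())
```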

Agent architecture

DeepMind found no need for a specialized DOM-processing architecture. Instead, influenced by recent work on multimodal architectures, it applied minimal modality-specific processing and relied primarily on a multimodal transformer to flexibly attend to relevant information, as shown in Figure 2.


Perception. The agent receives visual input (165×220 RGB pixels) and language input (a sample is shown in Appendix Figure 9). The pixel input passes through a series of four ResNet blocks with 3×3 kernels, each with stride 2 and output channels (32, 128, 256, 512). This yields a 14×11 feature map, which DeepMind flattens into a list of 154 tokens.
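The shape arithmetic can be checked with a small sketch. The plain stride-2, padding-1 convolutions below are a simplification (the paper's blocks are residual ResNet blocks), but they reproduce the stated token count:

```python
import torch
from torch import nn

# Simplified visual stem: four strided conv layers standing in for the four
# stride-2 ResNet blocks with output channels (32, 128, 256, 512).
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

pixels = torch.zeros(1, 3, 165, 220)          # one 165x220 RGB observation
features = stem(pixels)                       # -> (1, 512, 11, 14)
tokens = features.flatten(2).transpose(1, 2)  # -> (1, 154, 512): 154 tokens
```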

Three types of language input (the task instruction, the DOM, and the task fields) are processed with the same module: each text string is split into tokens, and each token is mapped to an embedding of size 64.
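A toy version of that shared text module is sketched below; the whitespace tokenizer and vocabulary size are illustrative assumptions not specified in this summary.

```python
import torch
from torch import nn

# Toy shared language module: split a string into tokens and map each token
# to a 64-dimensional embedding, as described above.
vocab = {"click": 0, "the": 1, "submit": 2, "button": 3}   # toy vocabulary
embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)

instruction = "click the submit button"
token_ids = torch.tensor([vocab[tok] for tok in instruction.split()])
text_tokens = embed(token_ids)   # shape (4, 64): one embedding per token
```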

Policy: The agent policy produces four outputs: an action type, cursor coordinates, a keyboard-key index, and a task-field index. Each output is modeled by a single discrete probability distribution, except the cursor coordinates, which are modeled by two discrete distributions (one per screen axis).

The action type is selected from a set of 10 possible actions: one no-op (do nothing), seven mouse actions (move, click, double-click, press, release, wheel up, wheel down), and two keyboard actions (key press, text typing).
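Put together, the policy heads might look like the following sketch. The summary embedding size `d` and the key/field head sizes are illustrative assumptions; only the 10-way action type and the two per-axis cursor distributions come from the description above.

```python
import torch
from torch import nn

# Sketch of the four policy heads. `d` and the key/field head sizes are
# assumed; the 10 action types and per-axis cursor distributions follow
# the description above.
d = 512
action_type = nn.Linear(d, 10)   # no-op + 7 mouse actions + 2 keyboard actions
cursor_x = nn.Linear(d, 220)     # discrete distribution over horizontal pixels
cursor_y = nn.Linear(d, 165)     # discrete distribution over vertical pixels
key_index = nn.Linear(d, 128)    # which keyboard key to press (assumed size)
field_index = nn.Linear(d, 16)   # which task-field string to type (assumed size)

h = torch.zeros(1, d)            # placeholder transformer summary embedding
action = torch.distributions.Categorical(logits=action_type(h)).sample()
x = torch.distributions.Categorical(logits=cursor_x(h)).sample()
y = torch.distributions.Categorical(logits=cursor_y(h)).sample()
```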

DeepMind collected more than 2.4 million demonstrations of 104 MiniWob++ tasks from 77 human participants, totaling approximately 6,300 hours, and trained agents with a simple blend of imitation learning and reinforcement learning (RL).

Experimental results

Human-level performance on MiniWob++

Because most prior studies address only a subset of MiniWob++ tasks, the study takes the best published performance on each individual task and compares the aggregate of these per-task results with the agent proposed here. As shown in Figure 3 below, the agent significantly exceeds this SOTA baseline.


In addition, the agent reaches mean human-level performance across the MiniWob++ task suite. This is achieved by combined BC and RL training.


The researchers found that while the average performance of the agent was comparable to that of humans, there were tasks where humans performed significantly better than the agent, as shown in Figure 4 below.


Task transfer

The researchers found that training a single agent on all 104 MiniWob++ tasks significantly improves performance compared with agents trained on each task individually, as shown in Figure 5 below.


Scaling

As shown in Figure 7 below, the size of the human trajectory dataset is a key factor in agent performance. Using 1/1000 of the dataset (equivalent to roughly 6 hours of data) leads to rapid overfitting and no significant improvement over RL alone. As the amount of data grows three orders of magnitude from this baseline to the full dataset, agent performance improves continuously.


The researchers also note that changes to the algorithm or architecture may yield even better performance at a given dataset size.

Ablation experiments

The agent uses pixel and DOM information and can be configured to support a range of different action spaces. The study ran ablation experiments to understand the importance of various architectural choices.

The study first ablates the agent's inputs (Figure 8a). The current agent configuration relies heavily on DOM information: removing this input degrades performance by 75%. By contrast, removing the visual input has a less pronounced effect on the agent.


As shown in Figure 8b, the study also removes the agent's ability to use the text input options given in context (the task fields). Interestingly, the ablated agent can still solve tasks involving form filling, but it learns from human trajectories to do so by highlighting text and dragging it into the relevant text box. Notably, such drag actions were not easy for agents to perform in the original Selenium-based environment.

Figure 8b also shows an ablation in which the agent instead uses actions that interact with specific DOM elements. Such an agent cannot solve tasks that require clicking a specific location within a canvas, dragging, or highlighting text.

