
Apple releases Ferret-UI, a multimodal model that surpasses GPT-4V on some mobile UI tasks

Author: AI Tech Review

Will on-device model features developed by phone manufacturers surpass the work of pure large-model teams?

Compiled by | Lai Wenxin

Edited by | Chan Choi Han

The birth of large models has pushed tech giants and startups into a new round of competition, and the rise of stars such as OpenAI, Anthropic, and Mistral proves that, under the influence of the new technology, the big incumbents hold no absolute advantage.

Not long ago, Apple halted its multibillion-dollar self-driving electric car project, more than a decade after its launch, laid off more than 600 employees at its U.S. headquarters, and moved nearly 2,000 employees to its AI department.

However, among today's mainstream smartphone brands, Apple is almost the only manufacturer that has not officially launched a large model. Apple, long the leader, seems, unusually, to have fallen behind in the large-model race.

On April 8, Apple unveiled new work called "Ferret-UI", a multimodal model that can "read" a phone screen and perform tasks. It is tailored to improve understanding of mobile UI screens and is equipped with referring, grounding, and reasoning capabilities.


Paper link: https://arxiv.org/pdf/2404.05719.pdf

Half a year ago, "Ferret", a multimodal large model jointly released by research teams from Apple and Columbia University, already showed a strong ability to associate images and text; "Ferret-UI" is more mobile-focused and centers on user interaction.

The research team believes Ferret-UI can understand and effectively interact with user interface (UI) screens, a capability most existing general-purpose multimodal large models lack.

1

UI task performance surpassing GPT-4V

With its focus on the UI, how does Ferret-UI actually perform?

Apple's team compared the performance of Ferret-UI-base, Ferret-UI-anyres, Ferret, and GPT-4V on all UI tasks, and included the open-source UI multimodal models Fuyu and CogAgent in the comparison for advanced tasks.

The first is basic UI task performance testing.

Ferret-UI has demonstrated superior performance on most basic UI tasks, especially on iPhone-related tasks, outperforming Ferret and GPT-4V on all tasks except the "Find Text" task.


On basic UI tasks such as OCR (Optical Character Recognition), icon recognition, and control classification, Ferret-UI has an average accuracy of 72.9%, 82.4%, and 81.4%, respectively, far exceeding the average accuracy of GPT-4V, which is 47.6%, 61.3%, and 37.7%, respectively.

GPT-4V's performance drops significantly on Android tasks, especially on positioning tasks, which may be due to the fact that there are more and smaller widgets on the Android screen, making the positioning task more challenging.


It is worth mentioning that in the OCR task, the model sometimes predicts the text next to the target region rather than the text within it, an error that is common with small text and text very close to other content.

On the other hand, Ferret-UI is able to accurately predict partially cut-off text, even when the reference OCR text is incorrect.


Ferret-UI also shows superior performance on positioning tasks such as finding text, finding icons, and finding controls.

And Ferret-UI also performs well in the competition on advanced UI tasks.

On advanced tasks such as detailed description (DetDes), perceptual dialogue (ConvP), interactive dialogue (ConvI), and functional inference (FuncIn), Ferret-UI demonstrates performance comparable to GPT-4V and surpasses it on some tasks.

Compared with the open-source UI multimodal models Fuyu and CogAgent, Ferret-UI surpasses them on most tasks. On the iPhone platform in particular, Ferret-UI's performance scores are significantly higher than those of Fuyu and CogAgent.

Moreover, although Ferret-UI's training dataset contains no Android-specific data, it still shows considerable performance on advanced tasks on the Android platform, indicating that the model can transfer UI knowledge between different operating systems.

2

Anyres technology solves the problem of screens with different aspect ratios

So, how does Ferret-UI excel in multiple UI tasks?

A key innovation of Ferret-UI is the introduction of any resolution (anyres) technology on top of Ferret. This technology was proposed to solve the problem of diverse aspect ratios in the UI of mobile devices.

While Ferret-UI-base closely follows Ferret's architecture, Ferret-UI-anyres incorporates additional fine-grained image features, notably a pre-trained image encoder and projection layer that generates image features for the entire screen.

For each sub-image obtained based on the original image's aspect ratio, additional image features are generated, and for text with region references, a visual sampler generates the corresponding continuous regional features.

The large language model (LLM) then uses the full-image representation, the sub-image representations, the regional features, and the text embeddings to generate responses.

The Ferret-UI-anyres architecture
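
To make this input assembly concrete, here is a minimal PyTorch sketch. The module names (`project_image`, `visual_sampler`), the hidden size, and the feature shapes are all assumptions made for illustration, not Apple's actual implementation.

```python
# Minimal sketch of how the anyres inputs might be assembled for the LLM.
# Module names, hidden sizes, and shapes are illustrative assumptions.
import torch
import torch.nn as nn

D_MODEL = 4096  # hypothetical LLM hidden size

# Stand-ins for the pre-trained image encoder + projection layer and
# for the visual sampler (both assumed, not the actual modules).
project_image = nn.Linear(768, D_MODEL)
visual_sampler = nn.Linear(768, D_MODEL)

def assemble_llm_inputs(full_img_feat, sub_img_feats, region_feats, text_embeds):
    """Concatenate full-image, sub-image, region, and text tokens into
    the single sequence the LLM conditions on."""
    tokens = [project_image(full_img_feat)]              # whole-screen tokens
    tokens += [project_image(f) for f in sub_img_feats]  # one block per sub-image
    tokens += [visual_sampler(r) for r in region_feats]  # referenced-region tokens
    tokens.append(text_embeds)                           # instruction text
    return torch.cat(tokens, dim=0)                      # (seq_len, D_MODEL)

# Toy usage: one screen, two sub-images, one referenced region, 32 text tokens.
full_img = torch.randn(576, 768)                  # e.g. a 24x24 patch grid
subs = [torch.randn(576, 768) for _ in range(2)]
regions = [torch.randn(1, 768)]
text = torch.randn(32, D_MODEL)
print(assemble_llm_inputs(full_img, subs, regions, text).shape)
# torch.Size([1761, 4096])
```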

But what's so special about Anyres technology?

Traditional models may require fixed-size inputs, but phones and other mobile devices come in different screen sizes and aspect ratios, which poses an obvious challenge for model inputs.

To accommodate this, Ferret-UI splits the screen into multiple sub-images, which allows each sub-image to be zoomed in to capture more detail.

Specifically, additional image features are generated for each sub-image obtained based on the original image's aspect ratio, and for text with region references, the visual sampler generates the corresponding continuous regional features.

This method not only accommodates screens with different aspect ratios, but also improves the model's ability to recognize the details of UI elements: small objects on the screen, such as icons and text, are effectively magnified, which is essential for improving the model's recognition and positioning accuracy.
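
As a concrete illustration of the splitting step, here is a minimal sketch using Pillow. The two-way split along the longer axis and the 336x336 encoder input size are assumptions for illustration; the paper's exact grid choices may differ.

```python
# Minimal sketch of "anyres"-style screen splitting with Pillow.
# The 2-way split and the 336x336 encoder size are illustrative assumptions.
from PIL import Image

def split_screen(img: Image.Image, encoder_size=(336, 336)):
    """Split a screenshot into two sub-images along its longer axis and
    resize each crop (plus the full screen) to the encoder's input size."""
    w, h = img.size
    if h >= w:
        # Portrait screen: divide horizontally into top and bottom halves.
        boxes = [(0, 0, w, h // 2), (0, h // 2, w, h)]
    else:
        # Landscape screen: divide vertically into left and right halves.
        boxes = [(0, 0, w // 2, h), (w // 2, 0, w, h)]
    full = img.resize(encoder_size)
    subs = [img.crop(box).resize(encoder_size) for box in boxes]
    return full, subs

# Toy usage: an iPhone-sized portrait screenshot.
screenshot = Image.new("RGB", (1170, 2532), "white")
full, subs = split_screen(screenshot)
print(full.size, [s.size for s in subs])  # (336, 336) [(336, 336), (336, 336)]
```

Because each half is resized to the same encoder input size as the full screen, small icons and text occupy roughly twice as many pixels in a sub-image as in the resized full image, which is where the "zoomed in" detail comes from.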

In addition, Apple's research team has designed a hierarchical experimental approach, from simple to complex, to gradually improve the capabilities of the Ferret-UI model.

Starting with basic recognition and classification tasks, the Ferret-UI model builds a foundational understanding of UI elements, learning to identify and classify them and laying the groundwork for more complex tasks.

This is followed by a gradual transition to conversational and inference tasks that require a higher level of understanding. As the model's capabilities grow, the tasks become more complex, requiring the model not only to recognize UI elements but also to understand their function and context. The design of the advanced tasks gives the model the background knowledge and understanding needed to handle complex UI interactions.

Hierarchical task design not only helps the model learn step-by-step, but also ensures that the model has sufficient background knowledge and understanding in the face of more complex UI interactions. In this way, Ferret-UI is better able to understand and respond to the user's instructions, providing more accurate and useful interactions.
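
As a schematic of this simple-to-complex progression, the task groups described above could be organized as a staged curriculum like the following. The task names are taken from this article; the structure, field names, and staging are hypothetical.

```python
# Hypothetical staged training curriculum, for illustration only.
# Task names follow this article; the structure and staging are assumptions.
CURRICULUM = [
    {
        "stage": "elementary",
        "goal": "recognize and classify individual UI elements",
        "tasks": [
            "ocr",                     # read the text in a referenced region
            "icon_recognition",        # name the icon in a referenced region
            "control_classification",  # classify the control/widget type
            "find_text",               # ground a text string on the screen
            "find_icon",
            "find_control",
        ],
    },
    {
        "stage": "advanced",
        "goal": "understand element function and whole-screen context",
        "tasks": [
            "detailed_description",  # DetDes
            "perceptual_dialogue",   # ConvP
            "interactive_dialogue",  # ConvI
            "functional_inference",  # FuncIn
        ],
    },
]

for stage in CURRICULUM:
    print(f"{stage['stage']}: {', '.join(stage['tasks'])}")
```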

From basic recognition and classification to advanced description and inference, Ferret-UI provides accurate and useful responses to real-world UI interactions. Combined with Anyres technology to handle screens of different resolutions, it further enhances its effectiveness and user experience in real-world applications.

3

Epilogue

In the face of today's fierce large-model "battle", tech giants urgently need to think about how to position market strategies and products that keep pace with the times, and Apple is no exception.

Whether it is Ferret-UI, its predecessor Ferret, or ReALM, which aims to improve interaction with voice assistants, Apple is steadily advancing research on models that can read information on the screen.

Ferret-UI can provide high-quality UI understanding and interaction on mobile devices, but can it become a powerful tool for bringing AI to the iPhone and lifting Apple out of its slightly lagging position?

Let's wait and see.

