
Musk's Grok 1.5V version flexes its muscles, demonstrating its "ability to understand the physical world"


On the morning of April 13, Beijing time, xAI's official account on X released a preview of "Grok-1.5 Vision" and announced at the same time that the company is hiring designers, engineers, product managers, data specialists, architects, and annotators in the Bay Area. Musk added his own appeal: "Please join xAI!"

xAI's Careers Page: Building AI to Drive Human Progress [1]

As the preview page and its seven showcase examples make clear, Grok-1.5V focuses on multimodal visual information processing, positions itself against GPT-4, and emphasizes "the ability to understand the physical world".

More interesting still, ahead of the model's release xAI has published "RealWorldQA", a set of benchmark questions on understanding the physical world. Readers mainly interested in this question set can skip straight to the second part of this article.

The first part below is xAI's official introduction, in a rough translation by Rubble (the author). [2]

1. Official introduction


Introducing Grok-1.5V, our first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide range of visual information, including documents, diagrams, charts, screenshots, and photos. Grok-1.5V will soon be available to our early testers and existing Grok users.

Capabilities

From multidisciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photos, Grok-1.5V is competitive with existing frontier multimodal models in a number of domains. We are particularly excited about Grok's ability to understand the physical world. On our new RealWorldQA benchmark, Grok outperforms its peers. For all of the datasets below, we evaluated Grok without chain-of-thought prompting.

[Figure: benchmark results across datasets, not reproduced here]

Example 1: Generate code from a flowchart

User: Can you translate this into Python code?


Grok: Absolutely! The flowchart you provided depicts a simple guessing game: the computer generates a random number and the user has to guess it. Here's the Python code that represents the logic in the flowchart:

[Code listing not reproduced here]

This code will run the guessing game described in the flowchart. Users will be prompted to enter their guess until they guess the correct number.
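Grok's actual listing did not survive in this copy. As a rough idea of what such a program looks like, here is a minimal sketch of a guessing game of the kind described. The 1 to 10 range, the messages, and the function name `play_guessing_game` are assumptions for illustration, not Grok's output.

```python
import random

def play_guessing_game(low=1, high=10, read=input, show=print):
    """Ask for guesses until the secret number is found; return the guess count.

    `read` and `show` default to input/print and can be swapped out for testing.
    The range low..high is an assumption, not taken from the flowchart.
    """
    secret = random.randint(low, high)  # inclusive on both ends
    attempts = 0
    while True:
        raw = read(f"Guess a number between {low} and {high}: ")
        try:
            guess = int(raw)
        except ValueError:
            show("Please enter a whole number.")
            continue
        attempts += 1
        if guess == secret:
            show("You guessed it!")
            return attempts
        show("Wrong guess, try again.")

# To play interactively, call: play_guessing_game()
```

Passing the input and output functions as parameters keeps the loop identical to the interactive version while making it easy to drive from a script.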

Example 2: Counting calories

User: How many calories are in 5 slices?


Grok: If one serving is 3 slices and contains 60 calories, then 5 slices contain about 100 calories. The calculation first determines the calories per slice: 60 calories / 3 slices = 20 calories per slice. Then multiply the calories per slice by the number of slices: 20 calories per slice * 5 slices = 100 calories.
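The same two-step calculation can be written as a small function. This is only an illustrative sketch; the function and parameter names are made up here, and the defaults are the serving size and calorie figures quoted in Grok's answer.

```python
def calories_for(slices, serving_slices=3, serving_calories=60):
    """Scale the calories of one serving to an arbitrary number of slices."""
    per_slice = serving_calories / serving_slices  # 60 / 3 = 20 calories per slice
    return per_slice * slices

print(calories_for(5))  # 100.0
```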

Example 3: From a child's drawing to a bedtime story

User: My son drew this. Can you tell a short bedtime story based on his drawing?


Grok: Once upon a time, in a land of adventure and wonder, there was a brave little boy named Timmy. Timmy lived in a cozy little house with his family, and his favorite thing to do was explore the world around him.

One sunny day, Timmy decided to go on an adventure in the nearby forest. He packed his favorite snacks, a compass, and his trusty notebook, ready to draw all the exciting things he would see.

As Timmy walked deeper into the forest, he discovered a magical river that sparkled in the sun. The river was so enchanting that Timmy decided to make a small paper boat and send it down the river as a special message to the forest elves.

Timmy carefully folded the paper boat and drew a smiling face on it, hoping to bring joy to the forest elves. He then set the boat on the river and watched as it floated away, carrying his friendship and love for nature.

As the sun set, Timmy knew it was time to go home. He was glad he had spent this magical day in the forest, knowing that he would always remember the adventure.

Upon returning home, Timmy shared his story with his family, who were very happy to hear about his magical day in the forest. They agreed that Timmy's brave, kind spirit was a real inspiration, and they knew that Timmy would have many more great adventures to come.

And so, with a heart full of love and a head full of dreams, Timmy drifted off to sleep, eager to discover what new adventures awaited him in his imaginary world.

Example 4: Explaining a meme

User: I don't understand, please explain.


Grok: This image humorously contrasts startups with large companies. On the left side, labeled "Startups", a group of people are digging a hole together. They are all actively involved, and the hole is already very deep.

On the right side, labeled "Big Companies", a group of people are standing around a hole. Unlike at the startup, however, only one person is actually digging, and everyone else is watching or doing something else.

The humor comes from exaggerating the differences between startups and large companies. In startups, there is often a sense of urgency, and everyone is expected to contribute directly to the task at hand. Large companies, by contrast, tend to be portrayed as more bureaucratic, with fewer team members directly involved in any given project.

Example 5: Convert a table to a CSV file

User: Please convert this table to a CSV file.


Grok: The table in the image can be converted to CSV format as follows:

[CSV output not reproduced here]

Note that the CSV format does not support styling or formatting, so the information is displayed in plain text.
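Grok's CSV output is not available in this copy. To illustrate the point about plain text, the sketch below writes a small table with Python's standard `csv` module. The column names and rows are invented placeholders, not the table from xAI's example.

```python
import csv
import io

def table_to_csv(header, rows):
    """Serialize a header and rows to CSV text (plain text, no styling)."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder data, purely for illustration.
header = ["Item", "Quantity", "Note"]
rows = [
    ["Widget", 3, "in stock"],
    ["Gadget, large", 1, "back-ordered"],  # the comma forces quoting of this field
]
csv_text = table_to_csv(header, rows)
print(csv_text)
```

Bold text, colors, and merged cells have no representation in this format; only the cell values survive, with quoting added where a value contains a comma.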

Example 6: Dealing with rotting wood on a patio

User: These holes have appeared around the screws on my patio. Is it rotting? Should I replace the planks?


Grok: It looks like the wood on your patio has started to rot. The holes that have appeared around the screws are a common sign of wood decay.

To ensure the safety and longevity of the patio, it is advisable to replace the affected planks. This will help prevent further damage and maintain the patio's structural integrity.

If you're not sure how to replace the planks, it may be helpful to consult with a professional patio builder or contractor. They will be able to assess the extent of the damage and provide you with expert advice on the best course of action for the patio.

Example 7: Solving a programming problem

User: Can you write Python code to solve this problem?


Grok: [Code listing not reproduced here]

Real-world understanding

To develop useful real-world AI assistants, it is essential to improve a model's understanding of the physical world. To that end, we are introducing a new benchmark: RealWorldQA. It is designed to assess the basic real-world spatial understanding of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.

The initial release of RealWorldQA contains more than 700 images, each with a question and an easily verifiable answer. Alongside other real-world images, the dataset includes anonymized images taken from vehicles. We are excited to release RealWorldQA to the community and intend to keep expanding it as our multimodal models improve. RealWorldQA is released under the CC BY-ND 4.0 license. The 677 MB dataset can be downloaded from link [3].

Looking to the future

Advancing both our multimodal understanding and our generation capabilities is an important step toward building beneficial artificial general intelligence (AGI) that can understand the universe. Over the next few months, we expect to significantly improve both capabilities across modalities such as image, audio, and video.

2. The RealWorldQA benchmark test set

Interested readers can download the RealWorldQA test set from link [3] at the end of the article. The test set consists of 765 images in webp format and a JSON file containing all the corresponding questions and answers. That is, each image is paired with one question, and the test set supplies a reference answer for it.
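For readers who want to inspect the question file programmatically, the following is a minimal sketch. The article does not describe the JSON layout, so the structure used here (a list of records with `image`, `question`, and `answer` fields) is an assumption; adjust the field names after opening the real file.

```python
import json
from collections import Counter

def parse_items(json_text):
    """Parse the annotation JSON, assumed to be a list of question records."""
    items = json.loads(json_text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of question records")
    return items

def answer_histogram(items, answer_field="answer"):
    """Count how often each reference answer occurs (for example A/B/C or Yes/No)."""
    return Counter(str(item[answer_field]).strip() for item in items)

# Tiny made-up sample in the assumed layout, not real dataset content.
sample = """[
  {"image": "0001.webp", "question": "Which cup is taller?", "answer": "B"},
  {"image": "0002.webp", "question": "Is the car moving towards us?", "answer": "Yes"},
  {"image": "0003.webp", "question": "How many bags are there?", "answer": "B"}
]"""
items = parse_items(sample)
print(len(items), answer_histogram(items).most_common())
```

With the real file, the text would come from something like `open(path, encoding="utf-8").read()`, where `path` points at the JSON file inside the downloaded archive.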

From Rubble's preliminary look, the questions are essentially multiple-choice, true/false, or answerable in a single word. The images fall broadly into two categories:

A. Lanes, road surfaces, nearby vehicles, and the surrounding environment, apparently captured by on-board vehicle cameras. This category accounts for about 70% of the set, more than 530 images.


The question for this image is: Which direction are the front wheels of the car on the right pointing? A. Left B. Straight ahead C. Right

B. Outdoor scenes, indoor objects, small items, animals, and plants commonly seen in daily life. This category accounts for about 30% of the set, more than 230 images.


The corresponding question is: Which is the second highest shot? A. The middle shot B. The first shot on the left C. The first shot on the right

The images and questions in categories A and B map, at the very least, onto the perception and understanding of the physical world that AI needs in autonomous-driving and home-robot scenarios.

Questions like these may well be what the FSD and Optimus "students" at the company next door have been, and still are, frantically drilling every day. We look forward to seeing Grok and other real-world AIs compete on this benchmark question bank.

Here we have selected 2 category A questions and 8 category B questions, 10 in total. Interested readers can try them on their own, or invite the children at home to work through them together and get a taste of Grok's daily grind of practice questions.

Note down each question number with your answer, then check your answers against the key at the end of the text. You are welcome to leave a comment at the end of the article with your score and impressions.

3. Work through the AI's questions, so the AI has none left to work through

9. How many meters is the distance between the pickup truck ahead and the nearest curb?

A. 5.5 m

B. 9.5 m  

C. 13.5 m


18. How many plastic bags are there?

A. There are three plastic bags here.

B. There are four plastic bags here.

C. There are two plastic bags here.


22. What is the direction of travel of the car closest to us?

A. Towards us 

B. Away from us


33. How many short cones with stripes are there in this picture?

A. There are 2 short cones with white stripes.

B. There are 5 short cones with white stripes.

C. There is 1 short cone with white stripes.


43. How many plants are in this picture?

A. There are no plants in this picture.

B. There are two plants in this picture.

C. There are three plants in this picture.


44. Where is this squirrel relative to the window?

A. The squirrel is not near the window.

B. The squirrel is looking in through the window.

C. The squirrel is far away from the window.


48. How many pink headbands are in this picture?

A. There is 1 pink headband.

B. There are 2 pink headbands.

C. There are 3 pink headbands.


388. Where is the elephant relative to the bird?

A. The elephant is on top of the bird

B. The elephant is on the right side of the bird

C. The elephant is underneath the bird


715. Which cup is taller?

A. They are the same height.

B. The cup in front is taller.

C. The cup at the back is taller.


764. In which direction is this dinosaur facing?

A. North

B. South

C. East

D. West


Answer:

9A / 18C / 22A / 33C / 43B / 44B / 48A / 388C / 715C / 764C
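For readers who recorded their answers by question number, a short script can do the checking. This is just a convenience sketch: the answer key is copied from the list above, while the function name `grade` and the sample answers are made up for illustration.

```python
# Answer key copied from the article: question number -> correct option.
ANSWER_KEY = {
    9: "A", 18: "C", 22: "A", 33: "C", 43: "B",
    44: "B", 48: "A", 388: "C", 715: "C", 764: "C",
}

def grade(my_answers):
    """Return (score, wrong_question_numbers) for a dict of question -> option."""
    wrong = sorted(
        q for q, correct in ANSWER_KEY.items()
        if str(my_answers.get(q, "")).strip().upper() != correct
    )
    return len(ANSWER_KEY) - len(wrong), wrong

# Hypothetical answer sheet with two mistakes (questions 9 and 764).
mine = {9: "B", 18: "c", 22: "A", 33: "C", 43: "B",
        44: "B", 48: "A", 388: "C", 715: "C", 764: "A"}
score, wrong = grade(mine)
print(f"{score}/{len(ANSWER_KEY)} correct, wrong: {wrong}")
```

Unanswered questions count as wrong, and answers are compared case-insensitively.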
