Intelligence Weekly|This Summer's Battle of Large Models: Real Reasoning Ability

Written by | Neocortex Group

Editor | Wu Yangyang

More and more commercial products are being developed based on generative AI (GenAI). This week, Google is considering charging for AI-based search services, TikTok will launch AI virtual anchors, and the AI Pin, known as the "world's first AI-native hardware", also went on sale this week at $699. Compared to the previous generation of menu-based applications developed with rule-based algorithms, these new products open up new ways to interact with users, allowing users to get services by talking directly to the AI.

However, the commercialization of these GenAI-native applications has been slower than expected, and one of the main reasons is still the limitation of model capabilities. Test results from the model evaluation agency Vals.AI show that GPT-4 still ranks first or second across the test dimensions, and Anthropic's latest large model, Claude 3 Opus, surpasses it only in some cases. In other words, GPT-4, a model released a year ago, is still the "smartest" in the world. This is good news for OpenAI, but not for the industry as a whole.

The lack of accuracy makes it difficult for GenAI to enter more productive fields such as finance, tax, and law. Vals.AI tests show that only three models, GPT-4, Claude 3 Opus, and Claude 3 Sonnet, exceed 60% accuracy on finance-related tasks. On tax-related tasks, GPT-4, the best-performing model, reaches only 54.5%, and most other models fall below 40%. On legal tasks, five models, including Opus and GPT-4, exceed 70%, with GPT-4 reaching 77.7%. However, whether it is 60% or 77.7%, this level of accuracy is not enough for commercial use in serious scenarios such as finance, taxation, and law. The same is true for autonomous driving.

There is a consensus in the industry that GenAI, represented by GPT, lacks real reasoning capabilities, and major companies plan to address this problem in their upcoming models. This week, executives at OpenAI and Meta both said they were preparing to launch the next version of their large language models. "Today's AI systems are very good at small one-time tasks," said OpenAI's chief operating officer, Brad Lightcap, adding that the next generation of GPT will show progress in solving "difficult problems" such as reasoning. Yann LeCun, Meta's chief AI scientist, said that Meta is developing AI "agents" that can plan and book every step of a journey from someone's office in Paris to another office in New York, which requires strong reasoning and planning capabilities to decompose, sequence, and execute tasks.

Just as OpenAI plans to release GPT-5 this summer, Meta is preparing to release Llama 3 in a series of model sizes over the coming months, with the smaller versions arriving next week. Whether GenAI can live up to its capital market valuation depends on whether the new generation of models with reasoning capabilities, such as GPT-5 and Llama 3, can deliver on their promises. Otherwise, the commercial value of GenAI will be greatly reduced: serving as a language and image translation tool and serving as a decision-making tool correspond to different levels of industrial value.

The following is a summary of the AI news worth watching from the past week, compiled by the Neocortex team.

Key Points

Large Models

Cohere launches new model Command R+, with more emphasis on RAG;

Meta plans to launch smaller versions of Llama 3 next week;

Apple has released a large model "Ferret-UI" that tries to read the screen of a mobile phone;

Applications

Google considers charging for AI-based search services;

TikTok to launch photo-sharing app "TikTok Notes";

TikTok to launch AI virtual anchors;

AI Pin goes on sale at $699;

Vals.AI wants to build a large model evaluation business;

Talent & Funding

Microsoft opens new AI center in London;

xAI seeks to raise $3 billion at an $18 billion valuation;

Facewall Intelligent has completed hundreds of millions of yuan in financing.

Large Models

Cohere launches new model Command R+, with more emphasis on RAG

On April 5, Cohere officially announced its new-generation large model, Command R+, less than a month after the launch of its previous large model, Command-R. Command R+ has 104 billion parameters, a 128K-token context window, and support for 10 languages including English, Chinese, French, and German. According to Cohere, Command R+ performs better than Mistral Large and is second only to GPT-4 Turbo. Compared to its predecessor, Command R+ has enhanced its built-in RAG (Retrieval-Augmented Generation) capability. Neocortex previously reported that Cohere's goal is shifting from chasing cutting-edge models to RAG. Cohere had spent a lot of money chasing the latest model capabilities of OpenAI and Anthropic, but its leaders recently decided to stop treating the race with companies like OpenAI to develop the largest and most advanced AI models as a top priority, and to focus instead on strengthening the RAG technology of its large models.

Meta plans to launch smaller versions of Llama 3 next week

Meta is about to launch a new large language model, Llama 3, which is benchmarked against OpenAI's GPT-4. The company plans to launch two smaller versions of Llama 3 next week, with the largest version scheduled for this summer. Unlike the two small models that will be released soon, the largest version of Llama 3 is multimodal and may have more than 140 billion parameters. The previous version, Llama 2, was launched in July 2023 and is also available in three sizes.

Apple has released a large model called "Ferret-UI" that tries to read the screen of a mobile phone

On April 8, Apple released Ferret-UI, a multimodal model tailored to mobile UI screens, which can "understand" a phone's interface and perform corresponding tasks. Ferret-UI is trained to perform tasks on mobile user interface screens, such as widget classification, icon recognition, and optical character recognition, through different input formats (points, boxes, scribbles) and basic tasks (find widgets, find icons, find text, widget lists). Ferret-UI is the second large model Apple has developed for AI to understand UI. Neocortex previously reported that, in a paper published on March 29, Apple described a model called "ReALM" that can understand information on the phone screen; Apple believes this is a key step toward letting voice AI such as Siri operate the phone. Currently, Apple is considering introducing a third-party model to implement smart features on the iPhone, and Google's Gemini and Baidu's Wenxin Yiyan are both potential partners. The release of ReALM and Ferret-UI means that Apple has not given up on controlling mobile phones with its own models.

Applications

Google is considering charging for AI-based search services

Google is reportedly considering overhauling its main source of revenue, the search engine, including adding an AI-based search feature to its premium subscription service. Since 2000, Google's search business has relied primarily on advertising, so this move could be one of the biggest changes in Google's history. Google's AI search service generates a complete answer based on the search content and serves it to the user, rather than just displaying a list of relevant web pages like a traditional search engine. Google began testing the AI search service in May last year. Currently, Google's premium subscription costs $20 per month and gives users access to the latest Gemini Ultra 1.0 chatbot, as well as the AI capabilities provided by Gemini in productivity tools such as Gmail, Docs, and Sheets. If AI search is added to that subscription, it will be the first time that Google has placed its core business behind a paywall.

TikTok to launch photo-sharing app "TikTok Notes"

On April 9, local time, TikTok users received an app pop-up window showing that the company will launch a new app for sharing photos, called TikTok Notes. TikTok later confirmed the news, saying that the company is developing an exclusive space for photo sharing, but has not yet finalized the design and release date of TikTok Notes.

TikTok Notes is an app for sharing photos. According to the plan, its initial content will come from image-and-text posts already published on TikTok, although TikTok users can opt out of having these photos shared to the new app. Last month, it was revealed that the app was originally called TikTok Photos. Because its posting format and content focus on photos, TikTok Notes is also seen as a competitor to Instagram. But compared to Instagram's more insular community style, TikTok's nonjudgmental, casual social tone may help Notes reach a wider audience.

TikTok to launch AI virtual anchors

On April 11, it was reported that TikTok is considering using AI to generate virtual anchors that star in short videos and present products in ads on behalf of advertisers, and that may even compete with live streamers for ad deals. The feature TikTok is working on will generate video ad scripts based on prompts provided by advertisers, as well as avatars that perform the scripts in videos. The feature will be available not only to advertisers but also to TikTok Shop merchants, who can use it to promote their products. It has been in beta for months but is still some way from launch. According to people familiar with the matter, current test results show that these AI-generated videos drive far fewer transactions than videos by human anchors. However, the feature is still in development and the final version is subject to change.

AI Pin goes on sale at $699

On April 11, startup Humane announced that its first AI hardware product, the AI Pin, had officially gone on sale, with a starting price of $699. In addition to purchasing the hardware, users need to subscribe to a $24/month service to use the device's basic features. Neocortex previously reported that Humane unveiled the AI Pin in November last year. The device's main feature is that it has no screen; interaction methods include voice and tapping. For example, users can press and hold to talk to the voice assistant AI Mic to look up information, translate languages, or play music. Humane has raised $230 million in funding, and OpenAI CEO Sam Altman holds the company's largest outside stake. In addition to the AI Pin, Humane plans to launch other AI hardware products in the future. In January this year, the startup Rabbit also launched an AI voice assistant hardware product, the R1, which uses conversation to carry out operations such as looking up information, hailing rides, and shopping. After the product was unveiled at CES, more than 50,000 units were sold within five days. However, the product has faced a lot of controversy since its launch, the biggest being that the features it provides can be delivered through an app, so there is no need to design and purchase separate hardware for them.

Vals.AI wants to build a large model evaluation business

On April 11, Vals.AI released a third-party industry evaluation of multiple large models. Every time a new large language model is released, its developers claim that it performs as well as or better than GPT-4, but the test results lack independence. As more and more companies consider whether to use AI for specific tasks, the need for "unbiased testing" is intensifying. Arash Afrakhteh, a partner at Pear VC, said companies need more "nuance" to understand whether a particular AI model is "performing better" or "able to handle tasks at a lower cost". Performance evaluation reports for the major models in the fields of taxation, law, and finance have been published on the official Vals.AI website. The data shows that the performance of different models can vary from industry to industry. For example, Anthropic's Claude 3 Opus and OpenAI's GPT-4 both have an accuracy rate of around 77% on legal reasoning tasks, much higher than their performance on tax issues.

Talent & Funding

Microsoft opens a new AI hub in London

On April 8, Microsoft announced that its newly formed AI organization, Microsoft AI, plans to set up an AI center in London, led by Jordan Hoffmann, a former AI scientist and engineer at Inflection and DeepMind. Microsoft AI London will collaborate with Microsoft's AI teams and OpenAI to develop large language models and their supporting infrastructure. Microsoft will also start recruiting talent for the new London hub. Jordan Hoffmann joined Microsoft not long ago through its acqui-hire of Inflection. According to the latest deal details reported by The Information, Microsoft CEO Nadella met with all Inflection employees at a Hyatt hotel on March 19, after which two of Inflection's three co-founders and 60 of its 70 employees joined Microsoft, leaving the remaining 10 employees to maintain Inflection's B2B services for existing enterprise customers.

xAI seeks to raise $3 billion at an $18 billion valuation

On April 5, it was reported that Musk's artificial intelligence company xAI is raising a new round of $3 billion, which would value the company at $18 billion once completed. The terms of the financing have not yet been finalized. Venture capital firm Gigafund and investor Steve Jurvetson are considering participating in the round. Both have deep ties to Musk: Gigafund founder Luke Nosek is one of the co-founders of PayPal, and Steve Jurvetson is a former board member of Tesla and a current board member of SpaceX. Earlier, in January, media reports said that xAI was raising $6 billion at a valuation of $20 billion, which Musk publicly denied at the time.

Facewall Intelligent has completed hundreds of millions of yuan in financing

On April 11, AI startup Facewall Intelligent announced the completion of a new financing round of hundreds of millions of yuan, led by Primavera Venture Capital and Huawei Hubble, with Beijing Artificial Intelligence Industry Investment Fund and Zhihu, as a strategic shareholder, also participating. The company was founded in August 2022 by Zhiyuan Liu, a tenured associate professor in the Department of Computer Science at Tsinghua University, and its core team members come from the Natural Language Processing Laboratory of Tsinghua University. In April 2023, Facewall Intelligent received tens of millions of yuan in angel round financing led by Zhihu, with Zhipu AI participating. In June, Li Dahai, partner and CTO of Zhihu, became the CEO of Facewall Intelligent.

-END-
