OpenAI quietly launches a mysterious model, suspected to be GPT-4.5, for public testing

Author: The mountain monster Atu, 2024-04-30 13:26:00

All articles here come from the WeChat public account "Mars AIGC"; the author of this article is Kaishan Monster. For more cutting-edge AI news and hands-on AI tool guides, follow the WeChat public account "Mars AIGC".

A mysterious model called gpt2-chatbot recently appeared on lmsys.org. Based on my own testing and on feedback from professionals online, its output quality is surprisingly high, and its performance is no worse than that of the latest GPT-4 Turbo. Some suspect that it is actually GPT-4.5.

I covered chat.lmsys.org in an earlier article. It is a public testing platform for large AI models, jointly launched by the University of California and Carnegie Mellon University, and it is free to use without logging in. Users can chat with a variety of large language models and rate their output.

Earlier article: "Free access to 33 large AI models on lmsys.org, including GPT-4 and Claude 3"

The performance of this mysterious gpt2-chatbot model far exceeds that of any known pre-GPT-4 model. You can chat with it in Direct Chat on LMSYS or in Arena (Battle). However, Direct Chat with gpt2-chatbot is rate-limited, so it is hard to get access on any given day.


Arena (Battle) is the blind-test mode used for benchmarking: two randomly matched models answer the same question at the same time. Although the matching is random, gpt2-chatbot seems more likely than any other model to be one of the two candidates. It appears far more often than chance would suggest, so it is easy to get matched with it.

After being matched with gpt2-chatbot, I ran several classic reasoning tests. It answered all of them correctly, and its analysis of the results was even better than that of gpt-4-turbo-2024-04-09.

Prompt: "I wish yesterday was tomorrow, then today is Friday. What day of the week is today?" The puzzle turns on a hypothetical shift in the timeline that can be read from two different perspectives, so both Sunday and Wednesday are correct answers.
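The two readings of the day-of-the-week puzzle can be checked by brute force. The following is a minimal sketch; the function name `solve` and the labels for the two readings are my own, not taken from any model's answer. It tries every possible "today" and keeps the days that satisfy each reading.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
FRIDAY = DAYS.index("Friday")


def solve():
    """Return the real 'today' under each reading of the puzzle."""
    answers = {}
    for today in range(7):
        yesterday = (today - 1) % 7
        tomorrow = (today + 1) % 7
        # Reading A: the real yesterday plays the role of "tomorrow" in the
        # imagined week where today is Friday, so real yesterday is Saturday.
        if yesterday == (FRIDAY + 1) % 7:
            answers["yesterday becomes tomorrow"] = DAYS[today]
        # Reading B: the real tomorrow plays the role of "yesterday" in the
        # imagined week, so real tomorrow is Thursday.
        if tomorrow == (FRIDAY - 1) % 7:
            answers["tomorrow becomes yesterday"] = DAYS[today]
    return answers


print(solve())
```

Under the first reading the search gives Sunday, and under the second it gives Wednesday, which matches the two accepted answers.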

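For this day-of-the-week question I blind-tested several models, including Gemini-1.5-pro, Llama3-70b, Claude3-sonnet, Phi-3-mini, and Tongyi Qianwen qwen1.5-14b. Apart from Gemini-1.5-pro, which answered correctly, all the other models got it wrong.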

Prompt: "I went to a party, I arrived before John, David arrived after Joe, Joe arrived before me, John arrived after David. Who arrived first?" The answer is Joe. I was matched with gemini-1.5-pro and gpt2-chatbot, and both answered correctly. gpt2-chatbot even pointed out the "Davie" misspelling in its answer.
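The party puzzle is a small ordering problem, so the answer can be confirmed by enumerating every arrival order. This is a minimal sketch; the names `GUESTS`, `BEFORE`, and `consistent_orders` are illustrative choices of mine. It encodes the four statements as "a arrived before b" pairs and keeps only the orders that satisfy all of them.

```python
from itertools import permutations

GUESTS = ["me", "John", "David", "Joe"]

# Each pair (a, b) means "a arrived before b", taken from the prompt.
BEFORE = [
    ("me", "John"),     # I arrived before John
    ("Joe", "David"),   # David arrived after Joe
    ("Joe", "me"),      # Joe arrived before me
    ("David", "John"),  # John arrived after David
]


def consistent_orders():
    """Return every arrival order that satisfies all four statements."""
    orders = []
    for order in permutations(GUESTS):
        position = {name: i for i, name in enumerate(order)}
        if all(position[a] < position[b] for a, b in BEFORE):
            orders.append(order)
    return orders


orders = consistent_orders()
first_arrivals = {order[0] for order in orders}
print(orders)
print(first_arrivals)
```

Two orders survive, differing only in whether David or the narrator arrived second, and Joe is first in both.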

Prompt: "There are 10 birds in the tree, the hunter has shot down one bird, how many birds are left in the tree?" Microsoft's phi-3-mini gave the rather naive answer of 9.

There are also tests of gpt2-chatbot's programming skills online, for example generating a rotating 3D cube in code. The code from gpt2-chatbot ran successfully, while the code generated by gemini-1.5-pro failed with the error "OpenGL.error.NullFunctionError: Attempted to call undefined function glutInit, check bool(glutInit) before calling".

Prompt: Write a Python script that draws a rotating 3D cube, using PyOpenGL.

(You need the following Python packages for this: pip install PyOpenGL PyOpenGL_accelerate pygame)
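For readers who want to try the task themselves, here is a minimal sketch of such a script. It is not the code produced by gpt2-chatbot or gemini-1.5-pro, which the article does not reproduce. As an assumption on my part, it uses pygame to create the window, so it never calls glutInit; that error often indicates that the GLUT library is not available on the machine. The names `VERTICES`, `EDGES`, and `run` are illustrative.

```python
from itertools import combinations

# The 8 corners of a cube centred on the origin.
VERTICES = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]

# An edge joins two corners that differ in exactly one coordinate.
EDGES = [
    (i, j)
    for i, j in combinations(range(len(VERTICES)), 2)
    if sum(a != b for a, b in zip(VERTICES[i], VERTICES[j])) == 1
]


def run(width=800, height=600):
    """Open a pygame window and spin a wireframe cube until it is closed."""
    # Imported here so the geometry above can be inspected without a display.
    import pygame
    from pygame.locals import DOUBLEBUF, OPENGL, QUIT
    from OpenGL.GL import (
        GL_COLOR_BUFFER_BIT, GL_DEPTH_BUFFER_BIT, GL_LINES,
        glBegin, glClear, glEnd, glRotatef, glTranslatef, glVertex3fv,
    )
    from OpenGL.GLU import gluPerspective

    pygame.init()
    pygame.display.set_mode((width, height), DOUBLEBUF | OPENGL)
    gluPerspective(45, width / height, 0.1, 50.0)
    glTranslatef(0.0, 0.0, -6.0)  # move the cube away from the camera

    clock = pygame.time.Clock()
    running = True
    while running:
        for event in pygame.event.get():
            if event.type == QUIT:
                running = False
        glRotatef(1, 3, 1, 1)  # rotate 1 degree per frame about axis (3, 1, 1)
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
        glBegin(GL_LINES)
        for i, j in EDGES:
            glVertex3fv(VERTICES[i])
            glVertex3fv(VERTICES[j])
        glEnd()
        pygame.display.flip()
        clock.tick(60)  # cap the loop at roughly 60 frames per second
    pygame.quit()
```

Calling `run()` on a machine with a display should open the window; closing the window ends the loop.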

All in all, the quality of the output, especially its format, structure, and overall understanding, is excellent, if not top-notch. No information about the specific model name gpt2-chatbot can be found online. The results generated by the LMSYS benchmark are available through its API for all models except gpt2-chatbot. So there is reason to suspect that this is the rumored GPT-4.5, and that OpenAI may be disguising it as gpt2-chatbot to get unbiased feedback from an ordinary benchmark test, rather than ratings skewed by the high expectations that a name like GPT-4.5 or GPT-5 would bring.

Another theory is that it could be a genuinely new model called GPT-2. The main reason is that a paper published earlier this month mentions a new variant of the GPT-2 architecture. The April 7 paper, "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws", states: "The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train."

Link to paper: arxiv.org/abs/2404.05405

The paper was submitted from the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE and was written by two Chinese researchers, Zeyuan Allen-Zhu and Yuanzhi Li. MBZUAI is the world's first graduate-level AI university. Professor Shao Ling was its initiator, founding provost, and executive vice president, and its president is Dr. Eric Xing, formerly associate director of research in the Machine Learning Department at Carnegie Mellon University.


So what exactly is gpt2-chatbot? I certainly hope it turns out to be the latter, so that we see more new players joining the AI field.
