
Apple releases four open-source "small models" that score less than half of Microsoft's Phi-3 on benchmarks. Is it competing on efficiency instead of performance?

36Kr

2024-04-25 19:55 · Posted on the official account of 36Kr, Beijing

By Li Ran, Chen Xing

Editor: Su Jianxun

On April 24, US local time, Apple released its own family of open-source "small models" on Hugging Face: OpenELM, a set of four pre-trained models.

[Image source: X]

The four models are all very small, with 270M, 450M, 1.1B, and 3B parameters respectively.

[Image source: Hugging Face]

On the Hugging Face page, Apple says OpenELM (Open-source Efficient Language Models) can handle text tasks such as email writing with high efficiency. The series has been open-sourced and is available to developers.

In the accompanying paper, published on April 22, the researchers describe the entire OpenELM framework, including data preparation, training, fine-tuning, and evaluation results.

[Image source: paper]

Paper: https://arxiv.org/pdf/2404.14619.pdf

CoreNet: https://github.com/apple/corenet

Model downloads: https://huggingface.co/apple/OpenELM
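For readers who want to try the models, here is a minimal loading sketch through the standard Hugging Face transformers API. It is an illustration rather than Apple's documented workflow: trust_remote_code is needed because OpenELM ships custom modeling code, and the tokenizer id is a stand-in, since the models use the LLaMA tokenizer, whose official weights are gated on Hugging Face.

```python
# Minimal sketch, not Apple's documented workflow: assumes the `transformers`
# library, the public apple/OpenELM-270M checkpoint, and access to a LLaMA
# tokenizer (the official meta-llama repo is gated on Hugging Face).
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-270M",
    trust_remote_code=True,  # OpenELM ships custom modeling code
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Once upon a time there was", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```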

The models are genuinely open source, but their capabilities are genuinely average

Apple, long known for its closed ecosystem, has suddenly joined the open-source camp in the large-model era, and with an unusually aggressive posture.

OpenELM not only provides model downloads but also open-sources the key assets around the models:

- the model weights and inference code;
- a complete framework for training and evaluating the models on public datasets, covering training logs, multiple checkpoints, and pre-training configurations;
- CoreNet, a deep neural network training library.

CoreNet enables researchers and engineers to develop and train a variety of standard and novel models, small and large, for tasks such as foundation models (e.g., CLIP and large language models (LLMs)), object classification, object detection, and semantic segmentation.

OpenELM adopts a layer-wise parameter-allocation strategy, which uses each Transformer layer's parameter budget more effectively and significantly improves model accuracy. With a budget of about one billion parameters, OpenELM improves accuracy by 2.36% over OLMo while requiring only half as many pre-training tokens.

[Image source: paper]

The paper reveals that the models were trained on 128 A100/H100 GPUs, with the longest training run taking 13 days.

[Image source: paper]

The largest model is only 3B parameters; clearly, this series is designed solely for local deployment on devices and desktops.

The paper also shows that all the test benches were consumer-grade machines:

- PC: Intel i9-13900KF CPU, 64 GB RAM, NVIDIA RTX 4090 GPU with 24 GB of VRAM
- Apple MacBook Pro: M2 Max, 64 GB RAM

In terms of performance, the models appear to be built for research purposes: results on several common benchmarks are not high, and the gap with mainstream SLMs such as Microsoft's Phi series is obvious.

[Image source: paper]

Phi-3 reaches roughly 70 on 5-shot MMLU, while OpenELM scores below 30.

[Image source: paper]

Netizens have offered some speculation about the reasons for this gap.

[Image source: X]

One commenter pointed out that the dataset used is small and consists only of publicly available data, and speculated that Apple is simply doing targeted research ahead of training larger models in the future.

Open-source community users quickly put the models to the test, and the overall feedback was that they seem over-"aligned"; in other words, they can talk a fair amount of nonsense.

[Image source: X]

[Image source: X]

Judging from current open-source community feedback, OpenELM does not appear to be a carefully designed and trained muscle-flexing model, which is why its performance falls well short of leading models of the same size.

In the paper, the researchers likewise do not dwell on the model's capabilities, focusing instead on accuracy and inference performance.

Open-source moves started last year, but true technical strength must wait for June

Since abandoning its car project, Apple has made increasingly frequent moves in the large-model war.

"Buy, buy, buy" has often been the dominant impression left by Apple's AI strategy.

On March 15, Apple acquired Canadian AI startup DarwinAI, expanding its in-house AI team by dozens of engineers in one stroke. On April 23, it emerged that as early as last December Apple had quietly acquired Datakalab, a Paris-based AI startup. Founded in 2016, the company also focuses on low-power, high-efficiency deep learning algorithms.

Both recent acquisitions revolve around on-device large models: DarwinAI wants to make AI systems "small and sophisticated", and Datakalab specializes in low-power, high-efficiency deep learning algorithms that can run without relying on cloud systems.

Also in March, Apple was reported to be in talks with Google to integrate Gemini into the new iPhone, and to have held discussions with OpenAI about using its models.

It is not just "recruiting": on the research side, Apple, despite starting a little late, has not forgotten to join the fray.

In October 2023, Apple released an open-source LLM called Ferret. The model combines computer vision and natural language processing to identify objects and regions in images, convert text into visual elements, and carry on text conversations about images.

In early April 2024, building on Ferret, Apple released Ferret-UI, a multimodal large language model (MLLM) with extraordinary UI-screen comprehension: it not only beats most open-source UI MLLMs but also surpasses GPT-4V on all elementary UI tasks.

[Image source: paper]

Previously, Apple's culture of secrecy went hand in hand with a closed ecosystem that kept external developers out. At first, the Ferret work drew little attention; it was open-sourced under a non-commercial license and could not be used commercially.

But at the end of December, two months after the release, Bart De Witte, who runs an AI medical nonprofit, reacted: it turned out that Apple had joined the open-source community back in October, and he had not noticed this important release.

[Image source: X]

It was at this point that Ferret was hotly discussed once again: the move ran counter to Apple's long-standing secrecy and showed its openness on AI.

Indeed, even before Cook announced the company's generative AI plans at this February's earnings call, Apple had already made considerable progress in its own AI research. In December 2023 it launched MLX, an open-source array framework built specifically for machine learning on Apple silicon. In February 2024 it released MGIE, an image-editing model that lets users describe in plain language what they want changed in a photo, without going through photo-editing software.

In March 2024, Apple introduced the multimodal large model "MM1" in a paper; it, too, has image recognition and natural-language reasoning capabilities. Compared with other large models, however, MM1's results are not stunning. Apple is mainly experimenting around MM1 to discover the key factors that affect model performance.

The MM1 paper points out that neither open-source nor closed-source models truly share the process by which their algorithm designs were reached. Apple therefore hopes to break this pattern with its MM1 research by disclosing model-training details in the paper.

Likewise, the OpenELM models do show progress on on-device modeling, but the technology does not seem to meet outside expectations.

Perhaps this time, by releasing a complete training and evaluation framework, Apple has once again declared its determination to be "open". As the paper puts it:

This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors.

So even though OpenELM's results are middling, netizens are still pleasantly surprised by Apple's openness.

[Image source: X]

[Image source: X]

Apple's true AI prowess won't be revealed until June's Worldwide Developers Conference (WWDC). But the open-source "gesture" of the past few months can already be counted as a solid performance.

Paper highlights

Model Architecture

Apple's researchers used a decoder-only Transformer architecture, with several specific adjustments:

- no learnable bias parameters in the linear layers;
- RMSNorm for pre-normalization, and rotary position embedding (RoPE) to encode positional information;
- grouped-query attention (GQA) in place of traditional multi-head attention (MHA);
- SwiGLU in place of the traditional feed-forward network (FFN);
- flash attention for computing scaled dot-product attention;
- the same tokenizer as LLaMA for text processing.
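To make these choices concrete, here is a hedged PyTorch sketch of one decoder block combining RMSNorm pre-normalization, grouped-query attention, SwiGLU, and bias-free linear layers. It is a simplified illustration, not Apple's implementation: RoPE is omitted for brevity, and nn.RMSNorm assumes PyTorch 2.4 or later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN used in place of a plain two-layer FFN; no biases."""
    def __init__(self, d_model: int, ffn_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, ffn_dim, bias=False)
        self.w_up = nn.Linear(d_model, ffn_dim, bias=False)
        self.w_down = nn.Linear(ffn_dim, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DecoderBlock(nn.Module):
    """One pre-norm decoder block with grouped-query attention (GQA).
    RoPE is omitted; a real block would rotate q and k before attention."""
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int, ffn_dim: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.attn_norm = nn.RMSNorm(d_model)   # requires PyTorch >= 2.4
        self.ffn_norm = nn.RMSNorm(d_model)
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)
        self.ffn = SwiGLU(d_model, ffn_dim)

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, s, _ = x.shape
        h = self.attn_norm(x)
        q = self.q_proj(h).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        rep = self.n_heads // self.n_kv_heads  # query heads per shared KV head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
        return x + self.ffn(self.ffn_norm(x))
```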

The biggest difference between OpenELM and conventional large language models is that large models usually use an identical configuration in every Transformer layer, whereas OpenELM gives each layer its own configuration (such as the number of attention heads and the width of the feed-forward network), so the parameter count varies from layer to layer.

This approach lets OpenELM use its parameter budget more efficiently, yielding higher model accuracy. The non-uniform allocation of parameters across layers is achieved through "layer-wise scaling" (also known as block-wise scaling).
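A minimal sketch of the idea, with illustrative scaling ranges rather than the paper's actual values: the attention-head count and FFN width are interpolated across depth instead of held constant, so shallower layers get fewer parameters and deeper layers more.

```python
# Sketch of layer-wise (block-wise) scaling; the alpha/beta ranges are
# illustrative, not the values used in the OpenELM paper.
def layerwise_configs(num_layers: int, d_model: int, head_dim: int,
                      alpha=(0.5, 1.0),   # scales the attention-head count
                      beta=(0.5, 4.0)):   # scales the FFN width
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)    # 0.0 at the first layer, 1.0 at the last
        a = alpha[0] + t * (alpha[1] - alpha[0])
        b = beta[0] + t * (beta[1] - beta[0])
        configs.append({
            "layer": i,
            "num_heads": max(1, round(a * d_model / head_dim)),
            "ffn_dim": round(b * d_model),
        })
    return configs

for cfg in layerwise_configs(num_layers=4, d_model=1024, head_dim=64):
    print(cfg)   # each layer gets a different head count and FFN width
```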

Pre-training data and training details

The researchers used only publicly available datasets for pre-training.

Specifically, the data includes subsets of RefinedWeb, a deduplicated PILE, RedPajama, and Dolma v1.6, totaling about 1.8 trillion tokens.

Judging from the public data sources Apple lists, the data spans mainstream online communities and knowledge platforms such as arXiv, Wikipedia, Reddit, and GitHub.

[Image source: paper]

It is worth noting that Apple does not use pre-tokenized data; instead, text is filtered and tokenized on the fly. This makes it easy for researchers to experiment with different tokenizers, greatly simplifying prototyping and research. In their experiments, they used the same tokenizer as LLaMA.
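As a hedged sketch of what on-the-fly filtering and tokenization can look like (not Apple's actual pipeline), swapping tokenizers amounts to changing a single object; the length filter and tokenizer id below are assumptions for illustration.

```python
# Illustrative on-the-fly pipeline: raw text in, token ids out, with no
# pre-tokenized corpus on disk. The quality filter and tokenizer id are
# assumptions; the official LLaMA tokenizer is gated on Hugging Face.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def stream_token_ids(texts, min_chars=200):
    for text in texts:
        if len(text) < min_chars:   # toy quality filter applied on the fly
            continue
        yield tokenizer(text)["input_ids"]

for ids in stream_token_ids(["short", "a document long enough to pass " * 20]):
    print(len(ids), ids[:8])
```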

Training results

The researchers compared OpenELM with several publicly available large language models, including Pythia, Cerebras-GPT, TinyLlama, OpenLM, MobiLlama, and OLMo.

[Image source: paper]

The models closest to OpenELM in performance are MobiLlama and OLMo, both of which were pre-trained on larger datasets.

As the figure above shows, OpenELM's accuracy improves as the number of training iterations increases, with significant gains on most tasks.

In addition, averaging the last five checkpoints (collected every 5,000 iterations) yields accuracy comparable to, or slightly better than, the final checkpoint obtained after 350k iterations.
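Checkpoint averaging is a standard trick; here is a minimal sketch under the assumption that each checkpoint is a plain state_dict of floating-point tensors saved with torch.save (the file names are hypothetical).

```python
import torch

def average_checkpoints(paths):
    """Average model weights across several checkpoints (hypothetical paths)."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")   # assumes a plain state_dict
        if avg is None:
            avg = {k: v.double().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.double()
    return {k: (v / len(paths)).float() for k, v in avg.items()}

# Hypothetical names mirroring "last five checkpoints, every 5,000 iterations":
ckpts = [f"ckpt_{step}.pt" for step in range(330_000, 355_000, 5_000)]
# model.load_state_dict(average_checkpoints(ckpts))
```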

[Image source: paper]

The experimental results in the figure above show that OpenELM outperforms existing methods across the various evaluation frameworks. For example, the 1.1B-parameter OpenELM variant improves accuracy over the 1.2B-parameter OLMo by 1.28%, 2.36%, and 1.72% on the respective evaluation suites, while using about half as much pre-training data.

[Image source: paper]

The results in the figure above show that instruction tuning consistently improves OpenELM's average accuracy by 1-2% across the different evaluation frameworks.

Inference performance

The researchers tested the models' inference performance on both PC and Mac platforms; the two machines are the consumer devices described earlier in this article.

On the M2 Max platform, representative of a mainstream Mac configuration, the 3B model achieves 34 tokens per second of inference throughput, which already exceeds human reading speed.

[Image source: paper]

On the top-of-the-line PC configuration, the 3B model's inference speed reaches about 70 tokens per second.

[Image source: paper]

Although OpenELM achieves higher accuracy at similar parameter counts, its inference is slower than OLMo's.

The analysis shows that a significant share of OpenELM's processing time is attributable to its naive implementation of RMSNorm (shown in the figure below).

[Image source: paper]

Specifically, the naive RMSNorm implementation launches many separate kernels, each processing a small amount of input, rather than a single fused kernel as with the optimized LayerNorm.
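To see why this matters, here is a hedged sketch of a naive RMSNorm in eager PyTorch: each elementwise or reduction step below can launch its own GPU kernel, whereas an optimized implementation (such as Apex's fused RMSNorm, or LayerNorm's single fused kernel) does the whole computation in one launch.

```python
import torch

def naive_rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Each line is a separate op in eager mode, so several small kernels
    # launch per call instead of one fused kernel.
    variance = x.pow(2).mean(-1, keepdim=True)   # square, then reduce
    x = x * torch.rsqrt(variance + eps)          # add eps, rsqrt, multiply
    return weight * x                            # final elementwise scale

x = torch.randn(2, 8, 16)
print(naive_rms_norm(x, torch.ones(16)).shape)   # torch.Size([2, 8, 16])
```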

Replacing the naive RMSNorm with Apex's fused RMSNorm significantly improves OpenELM's inference speed.

However, a significant performance gap remains compared with models that use optimized LayerNorm, in part because:

- OpenELM has 113 RMSNorm layers, while OLMo has only 33 LayerNorm layers;
- Apex's RMSNorm is not optimized for small inputs.

To further illustrate the performance degradation due to RMSNorm, the researchers replaced the LayerNorm in OLMo with RMSNorm, and observed a significant decrease in generation throughput. In future work, the researchers plan to explore optimization strategies to further improve the inference efficiency of OpenELM.
