Crushing LLaMA, "Falcon" is fully open source! 40 billion parameters, trained on a trillion tokens

Author: New Zhiyuan

Editor: Run Layan

A large open source model from the UAE that is free for commercial use has topped the Hugging Face leaderboard, and spring has arrived for entrepreneurs building on large AI models.

In the era of large models, what matters most?

LeCun's answer: open source.

When the weights of Meta's LLaMA leaked online, developers around the world gained access to the first LLM said to reach GPT level.

Since then, a wave of LLMs has offered different takes on what open-sourcing an AI model means.

LLaMA paved the way for models such as Stanford's Alpaca and Vicuna, setting the stage for them to become leaders in open source.

Now a new contender, "Falcon", has entered the fight.

Falcon

Developed by the Technology Innovation Institute (TII) in Abu Dhabi, UAE, Falcon outperforms LLaMA on benchmarks.

Currently, "Falcon" comes in three versions: 1B, 7B and 40B.

TII says Falcon is the most powerful open-source language model to date. Its largest version, Falcon 40B, has 40 billion parameters, still somewhat smaller than the largest LLaMA, which has 65 billion parameters.

Despite its smaller size, it comes out ahead on performance.

Faisal Al Bannai, Secretary General of the Advanced Technology Research Council (ATRC), believes that the launch of "Falcon" will change how LLMs are accessed and allow researchers and entrepreneurs to come up with the most innovative use cases.

Two versions of FalconLM, Falcon 40B Instruct and Falcon 40B, took the top two spots on the Hugging Face Open LLM Leaderboard, while Meta's LLaMA was in third place.

It's worth mentioning that Hugging Face evaluated these models on four currently popular benchmarks: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA.

Although the "Falcon" paper has not yet been publicly released, Falcon 40B is known to have been trained on a carefully screened web dataset of 1 trillion tokens.

The researchers said that the training of "Falcon" placed great emphasis on achieving high performance with large-scale data.

As is well known, LLMs are very sensitive to the quality of their training data, which is why the researchers put a lot of effort into building a data pipeline that runs efficiently across tens of thousands of CPU cores.

The aim is to extract high-quality content from the web through filtering and deduplication.

TII has since released RefinedWeb, a carefully filtered and deduplicated web dataset. It has proven to be very effective.

Models trained on this dataset alone can match or even outperform other LLMs. This demonstrates the excellent quality and influence of "Falcon".

In addition, the Falcon model is also multilingual.

It understands English, German, Spanish and French, and to a lesser extent several other European languages such as Dutch, Italian, Romanian, Portuguese, Czech, Polish and Swedish.

Falcon 40B is also the second truly open source model, following the release of the H2O.ai model. However, since the H2O.ai model has not been benchmarked against the other models on this leaderboard, the two have yet to go head to head.

Looking back at LLaMA, although its code is available on GitHub, its weights have never been open sourced.

This means that the commercial use of the model is somewhat limited.

Moreover, all LLaMA variants remain bound by the original LLaMA license, which makes them unsuitable for small-scale commercial applications.

At this point, "Falcon" again came out on top.

The only large model free for commercial use!

Falcon is currently the only open source model that can be used commercially for free.

Initially, TII required that commercial use of Falcon generating more than $1 million in attributable revenue be subject to a 10% royalty.

But it didn't take long for the deep-pocketed Middle Eastern backers to lift the restriction.

At least for now, all commercial use and fine-tuning of Falcon is free of charge.

The backers say they have no need to make money from this model for the time being.

In addition, TII is soliciting commercialization proposals from around the world.

For promising research and commercialization proposals, it will also provide additional "training compute support" or further commercialization opportunities.

Project submission email: Submissions.falconllm@tii.ae

In effect, the message is: if your project is good, the model is free to use, the compute is plentiful, and if funding falls short, we can help cover that too!

For start-ups, this amounts to a "one-stop solution for large-model AI entrepreneurship" from the deep-pocketed backers in the Middle East.

High-quality training data

According to the development team, an important aspect of FalconLM's competitive advantage is the choice of training data.

The research team developed a process for extracting high-quality data from public web crawl datasets and deduplicating it.

After duplicate content was thoroughly removed, 5 trillion tokens remained, enough to train a powerful language model.
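The filter-then-deduplicate idea can be sketched in a few lines of Python. This is only a toy illustration, not TII's pipeline: the word-count threshold, the whitespace normalization and the function names are all assumptions made for the example, and exact-hash matching stands in for whatever deduplication method the team actually used.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def passes_filter(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality heuristic: keep documents with enough words."""
    return len(text.split()) >= min_words


def filter_and_dedup(docs):
    """Keep docs that pass the filter, dropping exact (normalized) duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an equivalent document was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


# Made-up sample pages, for illustration only.
pages = [
    "Falcon is a language model trained on web data.",
    "falcon is a   language model trained on web data.",
    "Click here",
    "High-quality text matters for training large models.",
]
cleaned = filter_and_dedup(pages)
print(len(cleaned), "of", len(pages), "pages kept")
```

Here the second page is dropped as a duplicate of the first, and the third is dropped by the length filter. A pipeline at web scale would need far more careful filters and near-duplicate detection, distributed across many machines.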

The 40B version of FalconLM was trained on 1 trillion tokens, and the 7B version was trained on 1.5 trillion tokens.
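For a rough sense of scale, the token counts can be divided by the parameter counts. The arithmetic below uses only the figures quoted in this article and treats "40B" and "7B" as exactly 40 and 7 billion parameters, which is an approximation.

```python
# (parameters, training tokens) as quoted in the article
models = {
    "Falcon 40B": (40e9, 1.0e12),
    "Falcon 7B": (7e9, 1.5e12),
}

# Training tokens seen per model parameter
tokens_per_param = {
    name: tokens / params for name, (params, tokens) in models.items()
}

for name, ratio in tokens_per_param.items():
    print(f"{name}: about {ratio:.0f} training tokens per parameter")
```

By this measure, the smaller model saw far more data relative to its size than the larger one.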

(With the RefinedWeb dataset, the research team aimed to keep only the highest-quality raw data from Common Crawl.)

More controllable training costs

TII says Falcon achieves significant performance gains over GPT-3 while using only 75% of GPT-3's training compute budget.

And inference requires only 20% of the compute.

The training cost of Falcon is only 40% of that of Chinchilla and 80% of that of PaLM-62B.
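Taken at face value, these percentages imply relative training budgets for the other models. The sketch below normalizes Falcon's training budget to 1.0 and inverts each quoted percentage. It assumes that "training cost" here means training compute, and it says nothing about absolute figures, which the article does not give.

```python
# Falcon's training budget as a fraction of each model's, as quoted above
falcon_share = {"GPT-3": 0.75, "Chinchilla": 0.40, "PaLM-62B": 0.80}

# Each model's implied budget relative to Falcon (Falcon = 1.0)
relative_budget = {name: 1.0 / share for name, share in falcon_share.items()}

for name, budget in relative_budget.items():
    print(f"{name}: about {budget:.2f}x Falcon's training budget")
```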

This reflects an efficient use of computing resources.

Resources:

https://analyticsindiamag.com/open-source-ai-has-a-new-champion/
