
New SOTA for multilingual large models! Cohere open-sources Aya-23: 23 languages supported, available in 8B and 35B sizes

Editor: LRS

Aya-23 strikes a balance between model performance and language coverage, and the largest 35B-parameter model achieves the best results across all evaluated tasks and covered languages.

While LLMs have flourished over the past few years, much of the work in this area has been English-centric: the models are very capable, but mostly in languages with large numbers of speakers, such as English and Chinese, and they tend to perform poorly on low-resource languages.

Breaking out of this situation requires two things: a powerful multilingual pre-trained model and a sufficient amount of instruction-tuning data covering many languages.

To address these issues, Cohere, a Canadian AI unicorn, recently open-sourced Aya-23, a multilingual model in two sizes (8B and 35B), with Aya-23-35B achieving the best results across all evaluated tasks and covered languages.


Paper:

https://cohere.com/research/papers/aya-command-23-8b-and-35b-technical-report-2024-05-23

Aya-23-8B: https://huggingface.co/CohereForAI/aya-23-8B

Aya-23-35B: https://huggingface.co/CohereForAI/aya-23-35B

The 23 languages covered are Arabic, Chinese (simplified and traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian and Vietnamese.

As part of the Aya initiative, Cohere initially worked with more than 3,000 independent researchers from 119 countries to create a vast multilingual instruction dataset, the Aya Collection, containing 513 million prompt-completion samples. That data was used to train Aya 101, a language model covering 101 languages, which was open-sourced in February 2024.

However, Aya 101 is built on mT5, which is now dated in both knowledge and performance, and because Aya 101 prioritized breadth of coverage, it does not perform well in some specific languages.

The newly open-sourced Aya-23 is designed to strike a balance between language breadth and depth. Like previous Aya models, it builds on Cohere's Command family of models and the Aya Collection, but this time more capacity is allocated to the 23 focus languages to improve generation in those target languages.

Multilingual model Aya 23

Pretrained model architecture

The Aya 23 family is built on pre-trained models from the Cohere Command series, trained on text data in 23 languages; Aya-23-35B is a further fine-tuned version of Cohere's Command R model.

The models use a standard decoder-only Transformer architecture (minimal sketches of the block structure and tokenizer setup follow the list below):

1. Parallel attention and feed-forward network (FFN) layers: as in PaLM-2, a parallel block architecture significantly improves training efficiency without compromising model quality, especially in tensor-parallel (TP) settings where different parts of the model are trained simultaneously on multiple devices.

2. SwiGLU activation function: SwiGLU delivers higher downstream performance than other activation functions; the dimensions of the FFN layer are adjusted to keep roughly the same number of trainable parameters as a non-SwiGLU FFN.

3. No bias: all bias terms are removed from the dense (linear) layers to improve training stability.


4. RoPE (rotary position embeddings): RoPE helps the model better understand and reason over contextual information in long texts, and it also performs better on short text than other relative position encoding methods such as ALiBi.

5. Tokenizer: the models use a byte-pair encoding (BPE) tokenizer with a 256k vocabulary. Text is NFC (Normalization Form C) normalized before tokenization to ensure consistency, and numbers are split into individual digit tokens so the model can better handle numerical information. The tokenizer is trained on a balanced subset of the pre-training data so that text in different languages is represented efficiently.

6. Grouped-query attention (GQA): each key-value (KV) head is shared by multiple query (Q) heads, which reduces memory usage and improves efficiency during inference.
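To make these architectural choices concrete, here is a minimal PyTorch sketch of a decoder block combining the parallel attention/FFN layout, SwiGLU, bias-free linear layers, RoPE, and grouped-query attention. This is an assumed illustration, not Cohere's code, and the hyperparameters are placeholders rather than the released 8B or 35B configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings (RoPE) to a (batch, heads, seq, head_dim) tensor."""
    _, _, seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward layer; the hidden size is scaled (2/3 of 4*d) so the parameter
    count stays roughly comparable to a standard 4*d FFN. No bias terms, per the report."""

    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(2 / 3 * 4 * d_model)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: several query heads share each key/value head,
    shrinking the KV cache and speeding up inference."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)  # rotary position embeddings on queries and keys
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))


class ParallelDecoderBlock(nn.Module):
    """PaLM-style parallel block: attention and FFN read the same normalized input
    and both outputs are added to the residual stream in one step."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = GroupedQueryAttention(d_model, n_heads, n_kv_heads)
        self.ffn = SwiGLU(d_model)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.ffn(h)  # parallel attention + FFN


if __name__ == "__main__":
    block = ParallelDecoderBlock()
    print(block(torch.randn(1, 16, 512)).shape)  # torch.Size([1, 16, 512])
```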
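The tokenizer setup described in point 5 can likewise be sketched with the Hugging Face tokenizers library; the corpus file name and special tokens below are placeholders, and the actual Aya-23 tokenizer training recipe is not public beyond the description above.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Build a BPE tokenizer with NFC normalization and digit splitting.
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFC()  # standardize text (NFC) before tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),    # split numbers into single-digit tokens
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])

# 256k vocabulary, trained on a balanced multilingual subset of the pre-training data.
trainer = trainers.BpeTrainer(vocab_size=256_000, special_tokens=["<BOS>", "<EOS>", "<PAD>"])
tokenizer.train(files=["balanced_multilingual_subset.txt"], trainer=trainer)  # hypothetical corpus file
```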

Instruction fine-tuning

Because multilingual instruction data is relatively scarce, the researchers employed several strategies to expand the usable data:

1. Multilingual templates: structured text templates convert specific natural language processing (NLP) datasets into instruction-response pairs (a hypothetical example follows this list). Drawing on samples from the xP3x collection and the Aya collection, this yields 55.7 million samples covering 23 languages and 161 different datasets.

2. Human annotation: the Aya dataset contains 204,000 human-curated prompt-response pairs written by native speakers of 65 languages. Filtering to the 23 languages used to train the model leaves 55,000 samples.

3. Translated data: samples translated from widely used English instruction datasets, randomly sampled across datasets and languages to maintain diversity, totaling 1.1 million samples.

4. Synthetic data: starting from the human-written prompts in ShareGPT and Dolly-15k, Cohere's Command R+ was used to generate translated multilingual responses for all 23 languages, yielding 1.63 million samples.
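As a rough illustration of the multilingual-template idea in point 1, the snippet below converts a labeled natural language inference example into a prompt-completion pair using a per-language template. The template wording, field names, and label mapping are invented for exposition; they are not the actual xP3x or Aya Collection templates.

```python
# Hypothetical per-language templates and label verbalizations.
TEMPLATES = {
    "en": "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis? Answer yes, no, or maybe.",
    "fr": "Prémisse : {premise}\nHypothèse : {hypothesis}\nLa prémisse implique-t-elle l'hypothèse ? Répondez par oui, non ou peut-être.",
}
LABELS = {"en": ["yes", "maybe", "no"], "fr": ["oui", "peut-être", "non"]}


def to_instruction_pair(example: dict, lang: str) -> dict:
    """Turn one labeled NLI example into a (prompt, completion) training sample."""
    prompt = TEMPLATES[lang].format(premise=example["premise"], hypothesis=example["hypothesis"])
    completion = LABELS[lang][example["label"]]
    return {"prompt": prompt, "completion": completion}


print(to_instruction_pair(
    {"premise": "A man is playing a guitar.", "hypothesis": "A person is making music.", "label": 0},
    lang="en",
))
```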

Experimental evaluation

Discriminative tasks

The researchers evaluated the models on multilingual MMLU (Massive Multitask Language Understanding) in 14 languages, a subset of the multilingual MMLU languages supported by the Aya 23 family.

Following the standard English MMLU protocol, a 5-shot evaluation was used. Aya-23-8B performed best among the smaller models compared, achieving an average accuracy of 48.2% across all tested languages and the highest score in its class on 11 of the 14 languages.


Among the larger models, Aya-23-35B outperformed Mixtral-8x7B-Instruct on average (58.2% vs. 57.1%).

While Mixtral performed slightly better on resource-rich languages, Aya-23-35B was particularly strong on non-European languages: on Arabic, Hindi, and Vietnamese it improved accuracy by 12.1%, 10.0%, and 6.5%, respectively, suggesting stronger performance on lower-resource and non-European languages.
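For reference, the 5-shot multiple-choice protocol mentioned above can be sketched as follows; the data format and the `generate` callable are placeholder assumptions, not the exact evaluation harness used in the Aya-23 report.

```python
CHOICES = "ABCD"


def format_example(q: dict, include_answer: bool) -> str:
    """Render one multiple-choice question in the usual MMLU prompt format."""
    text = q["question"] + "\n"
    text += "\n".join(f"{c}. {opt}" for c, opt in zip(CHOICES, q["options"]))
    text += "\nAnswer:"
    if include_answer:
        text += f" {q['answer']}\n\n"
    return text


def five_shot_accuracy(dev_examples: list, test_examples: list, generate) -> float:
    """5-shot accuracy: five solved dev examples are prepended to every test question,
    and the first letter of the model's continuation is compared to the gold answer.
    `generate(prompt)` is any callable that returns the model's continuation as a string."""
    header = "".join(format_example(q, include_answer=True) for q in dev_examples[:5])
    correct = 0
    for q in test_examples:
        pred = generate(header + format_example(q, include_answer=False)).strip()[:1]
        correct += pred == q["answer"]
    return correct / len(test_examples)
```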

Multilingual Mathematical Reasoning

On the multilingual grade-school math benchmark MGSM, the Aya 23 models performed best among all comparable baselines, demonstrating strong mathematical reasoning across languages.

Specifically, Aya-23-8B averaged 36.6 points across 7 languages, while the second-best model in its class, Gemma-1.1-7B, scored 34.0.


In particular, Aya-23-8B scored roughly 4.5 times higher than Aya-101-13B (8.1), which once again underscores the importance of a high-quality pre-trained base model.

Among the larger models, Aya-23-35B outperformed Mixtral-8x7B-Instruct-v0.1 with a score of 53.7.

Looking at individual languages, except for Aya-23-8B on French and Russian and Aya-23-35B on Japanese, the Aya 23 models outperformed the strongest models in their class in every language. This indicates that the Aya 23 family is generally stronger at mathematical problem solving than its peers, though further optimization may still be needed for some specific languages.

Generative tasks

The researchers also evaluated translation to and from English (FLORES) for the 23 languages covered by the Aya 23 models, and summarization (XLSum) in 15 languages.

On both benchmarks, the Aya 23 models performed significantly better than other models of similar scale.

Specifically, Aya-23-8B achieved an average spBLEU of 37.2 on the translation task, 4 points higher than the second-ranked Aya-101-13B; on the summarization task, Aya-23-8B and Aya-101-13B both achieved an average RougeL of 27.5, 14.5 points higher than the next-best model, Gemma-1.1.


Among the larger models, Aya-23-35B beat Mixtral-8x7B by 7.8 spBLEU points on translation (40.4 vs. 32.6) and by 23.8 RougeL points on summarization (30.9 vs. 7.1).

It is also worth noting that Mistral-7B and Mixtral-8x7B tend to respond in English even when prompted in other languages, which contributes to their poor performance on multilingual summarization.
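For readers who want to reproduce these kinds of numbers, the two generation metrics can be computed with common open-source tools, as sketched below. The exact tokenizer variant and scoring settings used in the Aya-23 report are assumptions here: sacrebleu's SentencePiece-based spBLEU tokenizer and the rouge_score package are stand-ins.

```python
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["La maison est bleue."]   # placeholder model outputs
references = ["La maison est bleue."]   # placeholder gold translations / summaries

# spBLEU: corpus-level BLEU computed over SentencePiece subwords (FLORES-style).
spbleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="spm")
print(f"spBLEU: {spbleu.score:.1f}")

# ROUGE-L: longest-common-subsequence F-measure, averaged over examples.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
rouge_l = sum(scorer.score(r, h)["rougeL"].fmeasure for r, h in zip(references, hypotheses)) / len(hypotheses)
print(f"ROUGE-L: {100 * rouge_l:.1f}")
```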
