INTSIG's embedding model takes first place on the C-MTEB leaderboard

Author: Fast and easy to talk about

The rapid development of large language models (LLMs) has drawn attention across many sectors, and the embedding models that support LLM applications have become a focus for the industry as well. INTSIG recently released its text embedding model acge_text_embedding (the "ACGE model"), which took first place on C-MTEB, the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB). The result is expected to help large models deliver practical value across a wide range of industries more quickly.

Figure 1: C-MTEB results

The Massive Text Embedding Benchmark (MTEB) is a suite of evaluation tasks and metrics for text embedding models and is an important industry reference for assessing their performance. C-MTEB is its counterpart for Chinese text embeddings. It is regarded as one of the most comprehensive and authoritative benchmarks for Chinese semantic vectors, providing a reliable platform for testing how broadly and how reliably they perform. Alibaba, Tencent, SenseTime, Baichuan and many other vendors have evaluated and published models on this leaderboard.

An embedding model converts high-dimensional discrete data, such as words, sentences, or image features, into low-dimensional continuous vectors that capture the semantic features of the data and the relationships within it. Such models are widely used in search, recommendation, question answering, retrieval-augmented generation (RAG), data mining and other fields. In the Internet era, as the volume of information has grown rapidly and people have gained ever more channels for accessing it, large amounts of irrelevant information have become noise that interferes with retrieval.

Team members gave an example. If you want to learn how to make your own coffee at home, you might type "how to make coffee at home" into a search engine. A traditional search engine simply matches articles that contain the keywords and returns keyword-related content. With an embedding model, the engine can understand the user's intent more accurately and offer more practical guidance, including, but not limited to, coffee machine selection, bean grinding techniques, and different brewing methods.

Figure 2: Schematic diagram of the embedding model
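
The contrast in the coffee search example can be sketched in a few lines of Python. This is a minimal illustration, not the ACGE model: the three-dimensional vectors are hand-made stand-ins for real embeddings, and the names `TOY_DOCS`, `TOY_QUERY`, `keyword_rank` and `semantic_rank` are hypothetical. It shows that counting shared keywords favors an off-topic page, while ranking by cosine similarity between vectors favors the pages that are actually about coffee.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keyword_overlap(query, title):
    """Number of distinct query words that also appear in the title."""
    return len(set(query.lower().split()) & set(title.lower().split()))

def keyword_rank(query, titles):
    """Traditional ranking: most shared keywords first."""
    return sorted(titles, key=lambda t: keyword_overlap(query, t), reverse=True)

def semantic_rank(query_vec, doc_vecs):
    """Embedding-based ranking: highest cosine similarity first."""
    return sorted(
        doc_vecs,
        key=lambda t: cosine_similarity(query_vec, doc_vecs[t]),
        reverse=True,
    )

# Assumption: hand-made toy vectors, NOT output of any real embedding model.
# The first two axes loosely stand for "coffee" topics, the third for "tea".
TOY_DOCS = {
    "choosing a grinder for espresso beans": [0.9, 0.8, 0.1],
    "pour-over brewing guide": [0.8, 0.9, 0.0],
    "how to make tea at home": [0.1, 0.2, 0.9],
}
QUERY_TEXT = "how to make coffee at home"
TOY_QUERY = [0.9, 0.9, 0.1]  # assumed vector for QUERY_TEXT

if __name__ == "__main__":
    print("keyword :", keyword_rank(QUERY_TEXT, list(TOY_DOCS)))
    print("semantic:", semantic_rank(TOY_QUERY, TOY_DOCS))
```

In a real system the vectors would come from an embedding model and the documents would be indexed in a vector store, but the ranking step is the same idea.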

To help large models deliver their full value in applications, the INTSIG technical team built the ACGE model. Compared with the top five open-source models on the C-MTEB leaderboard, the ACGE model is smaller and uses fewer resources, and it accepts input text up to a length of 1024, which meets the needs of most scenarios. The ACGE model also supports variable output dimensions, so enterprises can allocate resources according to the needs of each scenario.
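
The article does not say how the ACGE model implements variable output dimensions. One common approach, for models trained to support it, is to keep only the leading components of a vector and rescale the result to unit length. The sketch below assumes that approach; `truncate_embedding` is a hypothetical helper, not part of any ACGE API.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and rescale to unit length.

    Assumption: the model was trained so that leading components carry the
    most information; the source article does not confirm this for ACGE.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]

if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]             # toy vector, not a real embedding
    print(truncate_embedding(full, 2))  # shorter vector, unit length
```

A shorter vector takes proportionally less storage and less time per similarity comparison, usually at some cost in accuracy, which is the kind of trade-off that lets resources be matched to the scenario.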

According to team members, compared with traditional approaches that pre-train or fine-tune a model for a single vertical domain, the ACGE model supports building general-purpose classification models across different scenarios, improves the accuracy of information extraction from long documents, and has relatively low application costs. This can help large models create value quickly in multiple industries and provide strong technical support for building new quality productive forces.

INTSIG is an artificial intelligence and big data technology company. Building on its self-developed intelligent text recognition and business big data technologies, it provides digital and intelligent products and services to consumers worldwide and to enterprise customers in a range of industries. According to public information, the company's consumer products serve hundreds of millions of users in more than 100 countries and regions, and its enterprise services cover customers in nearly 30 industries. Its customers include more than 125 of the companies on the Fortune 500 list published by Fortune magazine in 2022.