
"Code model" has become a new outlet for AI, and aiXcoder wants all enterprises to use it first

Author: Geek Park

On April 9, the aiXcoder team at Peking University's Institute of Software Engineering open-sourced its new, fully self-developed 7B code model. Coming from a professional team in the field of "AI + software development", the open-source 7B code model has the potential to bring new possibilities to enterprise "software engineering".

In the U.S., the AI software development tool GitHub Copilot has reached $100 million in ARR (annual recurring revenue), marking a milestone for AI in developer-facing applications. Microsoft CEO Satya Nadella has called it "Microsoft's most mature GenAI product" and "as necessary for programmers as spelling and grammar checking in Microsoft Word."

But "spelling and grammar checking" only scratches the surface of how AI can truly help developers. From a "software engineering" perspective, what is fundamental is helping developers solve problems in real software development scenarios.

The aiXcoder team from Peking University's Institute of Software Engineering was not only the first team in the world to apply deep learning to program code processing, but also the first to launch programming products built on deep learning.

The team's new-generation, fully self-developed 7B code model focuses on real enterprise development scenarios and is committed to solving software development problems in enterprises' private deployment settings. The Base version of the model has been open-sourced on platforms including GitHub, Gitee, and GitLink.

Results on multiple mainstream evaluation sets show that, in code generation and completion, aiXcoder-7B delivers the performance of code models with tens of billions of parameters while being faster and more accurate. And because it fully accounts for the needs of private enterprise deployment, it is better suited to real enterprise development scenarios.

In today's wave of AIGC startups, will the aiXcoder team's first-mover advantage help it stand out? And how far are we from the vision of automated software development?

01 The aiXcoder team has been "exploring" for 10 years

In 2014, researchers at Peking University's Institute of Software Engineering (the predecessor of the aiXcoder team) published a number of landmark papers, such as "Building Program Vector Representations for Deep Learning", which for the first time in the world verified that deep learning approaches work for code processing.

At the time, however, using deep neural networks to process program code was an extremely bold assumption. Software practitioners generally believed that "software automation is just a distant, beautiful vision", and the AI community's attitude was at best "worth a try".

Back then, after Turing Award winner Geoffrey Hinton cracked the problem of training multi-layer neural networks, deep learning had gone on to shine in speech, vision, and natural language, and the related technology was rapidly transforming industry, but no one had yet connected it to the processing of program code.

Computers can already compile and run programs, so why should they learn to understand them?

For the researchers at Peking University's Institute of Software Engineering, however, this question planted a seed from the very first day of their research. In their view, people write programs only because computers can only "understand" programming languages. "When people write programs, they are providing a service to the machine." The researchers' mission is to free people from this dilemma of "serving the machine".

But for a computer to write programs on its own, it must first learn to understand programs: not only run them, but also grasp the semantics of the programs it runs.

Based on this belief, they put forward a bold conjecture: if programs were "crawled" from the Internet and used to train a "deep learning engine", then given a piece of program code, the engine could automatically analyze its intent; conversely, given a description of an intent, it could generate the corresponding code.

In the winter of 2013, led by Professor Li Ge, the team began applying deep learning to program processing.

At the time there were none of today's familiar deep learning frameworks such as TensorFlow and PyTorch, so the group built its neural networks from the ground up in C++, including the backpropagation and gradient-computation algorithms.
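For readers unfamiliar with what "building a network from the ground up" entails, below is a minimal sketch of a hand-written forward and backward pass for a tiny two-layer network. It is written in Python with NumPy purely for illustration; it is not the team's C++ implementation, and the data and architecture are made up.

```python
import numpy as np

# Toy two-layer network trained with hand-written backpropagation (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))                          # 64 samples, 10 features
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # made-up binary labels

W1, b1 = rng.normal(scale=0.1, size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for step in range(500):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

    # Backward pass: apply the chain rule layer by layer
    dlogits = (p - y) / len(X)            # gradient w.r.t. the pre-sigmoid output
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = (dlogits @ W2.T) * (1 - h ** 2)  # tanh derivative
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)

    # Plain gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```

Modern frameworks generate these gradient computations automatically; in 2013 every derivative had to be derived and coded by hand.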

A second challenge for this exploratory study was the lack of GPUs.

They rounded up 17 "retired" PCs from the institute, spread them across the floor of Peking University's Room 1726 computer lab, and connected them over a local network to train a deep neural network.

The training efficiency can be imagined: a run over 50,000 data samples took more than a month.

Fortunately, the results were gratifying. According to the paper mentioned above, on program analysis tasks such as program functionality classification and program pattern detection, deep learning was more accurate than previous methods.

This gave the team a glimpse of the promise of applying deep learning to program processing, and they set out on a journey that would last ten years.

In 2015, they published their first program generation paper, "On End-to-End Program Generation from User Intention by Deep Neural Networks," which pioneered the use of deep learning techniques to generate program code.

In May 2017, the initial lab version of aiXcoder, deep-autocoder, was launched. From there the team kept exploring: scaling up model parameters for code completion and refining the product's interaction design, with the quality of code completion and generation improving steadily.

In 2018, as the product matured, the lab's work began to face the industry. The aiXcoder prototype was released in June, and the team went on to run proofs of concept with leading enterprises such as Huawei, Baidu, and Tencent on code generation and completion, code analysis, and other technologies. Subsequently, targeting different domains, the team began exploring the path of using different models to drive different tasks, launching domain-specific versions such as an Android version and a Python version.

It was not until OpenAI's Codex arrived in 2021 that the industry began to believe that code models trained on larger and deeper neural networks would keep getting better. Not long after, with computing power supported by Pengcheng Laboratory, the aiXcoder team developed its first large code model at the billion-parameter scale, aiXcoder L.

In June 2022, the aiXcoder team released aiXcoder XL, the first method-level code generation model in China. With tens of billions of parameters, it can generate complete program code from a natural-language description of a function in one click (NL to Code).
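To make "NL to Code" concrete, the sketch below shows the general pattern of prompting an open-source causal code model with a natural-language function description and decoding a completion. The model identifier and prompt are placeholders for illustration, not aiXcoder's actual API or format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model identifier, used for illustration only.
MODEL_ID = "some-org/some-7b-code-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# A natural-language function description serves as the prompt (NL to Code).
prompt = (
    "# Write a Python function that returns the n-th Fibonacci number\n"
    "def fibonacci(n: int) -> int:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```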

In August 2023, aiXcoder Europa was launched. Beyond regular features such as code auto-completion, code generation, and defect detection and repair, aiXcoder Europa is tailored to enterprise scenarios: it offers private deployment and personalized training services matched to an enterprise's data security and computing power requirements, effectively lowering the cost of applying large code models while improving R&D efficiency.

While the product keeps upgrading and surpassing one evaluation benchmark after another, the aiXcoder team has also been actively helping leading enterprise customers put code models to real use in banking, securities, high-tech, and the defense industry.

The capital market is likewise optimistic about aiXcoder's prospects, with leading investors such as Hillhouse, Qingliu, and Binfu increasing their bets.

Today, with ChatGPT having ignited enthusiasm for AI, a look back at the aiXcoder team's decade of exploration shows them to be pioneers who have accumulated deep algorithmic and engineering capability and market know-how. At a time when large-model technology is not yet mature enough to work "right out of the box", the team combines deep learning with professional software-engineering expertise to effectively help enterprises put software automation into practice.

02 Open-sourcing aiXcoder-7B to accelerate "software development automation"

The April 9 open-source release of aiXcoder 7B likewise adheres to the core principle that its products serve enterprises.

aiXcoder 7B not only surpasses existing models of the same size in basic code generation capability; on some specific tasks it even outperforms larger 15B and 34B code models.

As shown in the figure below, aiXcoder-7B shows clear advantages on the HumanEval, MBPP, and MultiPL-E evaluation sets.

"Code model" has become a new outlet for AI, and aiXcoder wants all enterprises to use it first

In real development environments, however, the NL to Code ability described above, generating code from a natural-language description, is only the baseline. Mainstream evaluation methods remain far from real scenarios: real-world programs involve all kinds of integration, extension, and call relationships, whereas producing standalone methods from natural language addresses actual development problems only in a fragmented way.

To this end, aiXcoder-7B was built with project-level code processing aligned to actual development scenarios, so that, more than other models, it can take into account important information contained in other relevant files of a software project, such as class definitions, class attributes, and method definitions.
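As a rough illustration of the idea, the sketch below gathers lightweight cross-file context (class and function signatures from sibling files) and prepends it to the completion prompt so the model can reuse existing project APIs. The helper functions, file layout, and prompt format are assumptions made for illustration, not aiXcoder's actual mechanism.

```python
from pathlib import Path

def extract_signatures(source: str) -> str:
    """Keep only class and def signature lines as lightweight cross-file context."""
    keep = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(("class ", "def ")):
            keep.append(stripped.rstrip(":"))
    return "\n".join(keep)

def build_prompt(project_dir: str, current_file: str, prefix: str) -> str:
    """Prepend signatures from sibling files so the model can call existing APIs."""
    context_parts = []
    for path in Path(project_dir).rglob("*.py"):
        if path.name == current_file:
            continue
        sigs = extract_signatures(path.read_text(encoding="utf-8"))
        if sigs:
            context_parts.append(f"# File: {path.name}\n{sigs}")
    cross_file_context = "\n\n".join(context_parts)
    return f"{cross_file_context}\n\n# File: {current_file}\n{prefix}"

# Example: complete a method in service.py given signatures from the rest of the project.
prompt = build_prompt("./my_project", "service.py",
                      "class OrderService:\n    def total(self, order_id):\n")
print(prompt[:500])
```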

On CrossCodeEval (Ding et al., 2023), an evaluation dataset that specifically measures code generation with project-level information, aiXcoder-7B demonstrated significantly better accuracy than other models (figure below).

"Code model" has become a new outlet for AI, and aiXcoder wants all enterprises to use it first

In the code completion scenario, meanwhile, developers expect the model not only to generate complete code blocks with well-formed syntax and complete processing logic, but also to call functions already implemented in the surrounding context wherever possible, so that the output matches the project's actual development style.
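Completion of this kind is commonly framed as fill-in-the-middle (FIM): the model sees the code before and after the cursor and must produce the missing span. The sketch below shows the general prompt construction with placeholder sentinel tokens; the actual tokens and format used by any given model (including aiXcoder-7B) may differ.

```python
# Generic fill-in-the-middle (FIM) prompt construction.
# The sentinel token names below are placeholders; real models define their own.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """The model is asked to generate the code that belongs between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prefix = (
    "def average(values: list[float]) -> float:\n"
    "    if not values:\n"
)
suffix = "    return total / len(values)\n"

prompt = build_fim_prompt(prefix, suffix)
# A completion model would be expected to fill in something like:
#         raise ValueError("empty list")
#     total = sum(values)
print(prompt)
```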

To evaluate completion in real-world development scenarios, aiXcoder was also tested on an open-source evaluation dataset of more than 16,000 real development-scenario code samples, where it again achieved the best results. Moreover, compared with other models, aiXcoder 7B tends to use shorter code to accomplish the task the user specifies.

"Code model" has become a new outlet for AI, and aiXcoder wants all enterprises to use it first

Why does aiXcoder-7B achieve such strong results on datasets that are close to real enterprise development scenarios?

The aiXcoder team told Geek Park that because its code-model technology "grew" out of customer needs, the aiXcoder series, including the 7B Base model, is easy to deploy, easy to customize, and easy to combine; it is therefore better suited to actual software development tasks, integrates deeply with enterprise application scenarios, and is easier to put into production.

When ChatGPT caught fire early last year, expectations for large-model technology ran high, and general artificial intelligence seemed only a step away. In practice, landing AI applications is harder. At the very first step, although many SOTA large models top the leaderboards, customers must first select a model based on evaluations in their own scenarios. And when evaluating large code models, enterprises place particular emphasis on effective personalized training, to ensure the model can meet their specific needs.

The difficulty is that most large models are trained on open-source datasets; once applied inside an enterprise, they face new business logic and coding conventions, which often degrades performance. The key, therefore, is doing personalized training well, so that the model learns and adapts to the enterprise's domain-specific knowledge.
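One common way to do such personalized training without retraining all of a 7B model's weights is parameter-efficient fine-tuning, for example with LoRA adapters. The sketch below is a generic recipe, not aiXcoder's method; the model identifier and data path are placeholders.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "some-org/some-7b-code-model"       # placeholder model identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many code tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Freeze the base weights and train only small LoRA adapter matrices.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Private enterprise code corpus: a JSONL file with a "text" field per sample (placeholder path).
dataset = load_dataset("json", data_files="enterprise_code.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # the small adapters can then be deployed alongside the frozen base model
```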

On landing applications, aiXcoder has long served customer needs and accumulated practical, distinctive technical methodologies from before large-model technology became popular. Inference speed is the most significant difference between a code model and a general-purpose model: a code model must give faster feedback and deliver an "imperceptible service" inside the integrated development environment (IDE), that is, instant feedback as programmers write code, without interrupting their train of thought. Toward this goal, the team has accumulated extensive experience in both algorithms and deployment.
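To illustrate what "imperceptible service" implies on the client side, here is a minimal sketch of a debounced, cancellable completion request as an IDE plugin might issue it. The timing values and the fake fetch call are assumptions for illustration, not aiXcoder's actual protocol.

```python
import asyncio

COMPLETION_DEBOUNCE_S = 0.15   # wait briefly after the last keystroke (assumed value)

class CompletionClient:
    """Debounce keystrokes and cancel stale requests so suggestions never block typing."""

    def __init__(self):
        self._pending: asyncio.Task | None = None

    async def _fetch(self, prefix: str) -> str:
        # Placeholder for a real call to a locally deployed model server.
        await asyncio.sleep(0.05)                     # simulated network + inference latency
        return prefix + "  # <completion>"

    async def _debounced(self, prefix: str) -> None:
        await asyncio.sleep(COMPLETION_DEBOUNCE_S)    # cancelled if another keystroke arrives
        suggestion = await self._fetch(prefix)
        print("show inline suggestion:", suggestion)

    def on_keystroke(self, prefix: str) -> None:
        # Cancel the previous in-flight request; only the latest context matters.
        if self._pending and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.ensure_future(self._debounced(prefix))

async def main():
    client = CompletionClient()
    for text in ["def tot", "def total(", "def total(order):"]:
        client.on_keystroke(text)
        await asyncio.sleep(0.05)                     # keystrokes arrive faster than the debounce
    await asyncio.sleep(0.5)                          # let the final request complete

asyncio.run(main())
```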

When asked about the changes in market demand that they have seen over the past decade, the aiXcoder team said, "The most obvious trend is that developers' attitudes towards AI-assisted software development have gradually shifted from being unfamiliar and wait-and-see to actively embracing change."

For the AI field, there is now light at the end of the tunnel for automating software development with large code models. For the aiXcoder team, software automation has been the goal since the founding of the software engineering discipline, and its ultimate aim is to free humans from the heavy labor of software development.

As for the future product form in the code field, the aiXcoder team believes it will be an intelligent integrated development environment that starts from conversational requirements and uses multi-agent collaboration for automated software development. As for how far end-to-end process automation can go, the team illustrated the prospect with its "iron tongs" model: "The entire software development process is clamped between 'requirements' on one side and 'tests' on the other, with a large-model-led automation process in the middle. That is the future software development can look forward to."

"It is a fortunate and happy career to take a step forward along the path pointed out by our predecessors", and the mission of "software automation" is not the business of the next generation.
