
Intelligent CommitMessage: A Large-Model Efficiency Tool

author:Flash Gene

I. Background

With the rapid development of large language models, applications built on them have sprung up in every field. Across the whole R&D pipeline, a series of efficiency and quality tools has emerged: RAG-based assistants that cut the high cost of Oncall; programming assistants such as Copilot, Comate, and Tabnine in the development stage; defect detection, security-compliance checking, and intelligent code review in the testing stage; and even automated agents that replace manual work in the delivery stage.

When you commit code with git commit, you have to write a tediously formatted CommitMessage. Sometimes it fails the commit hook because it does not meet the commit specification; sometimes the code reviewers cannot make sense of what you wrote. Intelligent CommitMessage is a small assistant that automatically generates a specification-compliant CommitMessage for you.

Take the commit specification of the Baidu APP as an example. It includes the commit category, product version, requirement card, change summary, and so on, where the categories include function, update, optimization, testing, release boarding, Merge, FixBug, etc. Writing all of this by hand is tedious.

By its composition, a CommitMessage can be divided into two parts: the specification format plus a change summary:

[Figure: components of a CommitMessage]

  • Normal commits: specification format + change summary
  • FixBug commits: specification format + change summary (including the cause of the bug, its impact, and how it was fixed)

The large model generates the change-summary part, while the specification format and other tags are produced by personalized plug-ins, which can be customized to satisfy the commit specification of each business line or product line.

Here is what the end result of Intelligent CommitMessage looks like:

[Figure: git aicommit usage example]

The following uses Intelligent CommitMessage as an example to walk through the development of a large-model efficiency tool, covering:

  • Simple functional design
  • App metrics and model evaluation metrics
  • Large model data processing process
  • Several ways to optimize model performance

II. Functionality & Design

User entry 1: git aicommit

Git is an efficient and convenient version control system. Although the mobile codebase of the Baidu APP has been split into multiple repositories, componentization has matured to the point where at least half of the requirements no longer need cross-repository commits, so plain Git suffices for them.

User entry 2: mgit aicommit

MGit (https://github.com/baidu/m-git) is an open-source, Git-based multi-repository management tool developed by Baidu that safely and efficiently manages multiple Git repositories in multi-repository scenarios.

Basic requirements for the two entries:

  1. The git aicommit / mgit aicommit entries must not affect the original git/mgit commit functionality; they only extend it
  2. While keeping the Git and MGit entries separate, the functionality must stay unified and cheap to maintain

Solution: abstract the shared logic into a common module, git-aicommit, which is called directly by both the MGit plug-in and a Git alias command. Ruby was chosen as the implementation language so that the MGit plug-in can call it directly.

The git-aicommit module extracts the changes in the Git staging area of every repository being committed and requests the model service to generate a CommitMessage.

[Figure: overall design of Intelligent CommitMessage]
  • MGit/Git entry, i.e. the entry the user invokes. For the MGit plug-in, see how MGit is extended (https://github.com/baidu/m-git); the Git alias can be configured as follows:
# Add a Git alias: git aicommit
$ git config --global alias.aicommit '!f() { ruby -e '\''require "git-aicommit"; MGit::GitAICommit.run(ARGV);'\'' -- "$@"; }; f'
  • Personalized plug-ins: encapsulate the commit-specification format; any distinct commit specification can be implemented as a separate plug-in (see the custom commit specification section below for details).
  • Model service: accepts requests from the git-aicommit module, calls the LLM to generate the summary content of the CommitMessage, and loads the corresponding personalized plug-in to assemble the final CommitMessage.

III. Evaluation Metrics

Peter Drucker, the father of modern management, said: "If you can't measure it, you can't manage it."

Metrics are critical for model selection, subsequent Prompt tuning, and SFT, because they define the criteria that optimization is judged against.

Generating a CommitMessage requires not only understanding the changed code but also producing a summary, assessing the impact, and so on, which is exactly the kind of task generative large models are suited to. With generative models now blooming on the market, we weighed usage cost (data management, deployment and operations, performance tuning, Prompt and model evaluation), generation quality, security risks, and other factors, and chose ERNIE4 on Baidu Intelligent Cloud's Qianfan platform.

Common metrics for this kind of summarization task include BLEU Score, ROUGE Score, BERT Score, PPL, and MSE. Combining these with the characteristics of CommitMessage generation, we settled on the core metrics for the model and the product:

  • Model performance metric: Mean Squared Error (MSE), which measures the semantic similarity between the generated text and the reference CommitMessage.
  • User-facing metric: Adoption Rate (AR), also called direct user satisfaction: the proportion of uses in which the user directly adopts the CommitMessage generated by the model service, out of the total number of uses.

丨3.1 Mean Squared Error (MSE)

A reference CommitMessage is a high-quality, concise, accurate, and specification-compliant CommitMessage. The objective criteria include at least: why the change was made (Why), what was changed (What), and the affected area (optional); the subjective criterion is manual screening and curation.

By definition, computing the MSE here means measuring the semantic difference between two pieces of text, in three simple steps:

  1. Embedding: convert both pieces of text into vector representations. There are plenty of embedding options in the large-model era; here we simply use Qianfan's embedding service.
  2. Distance calculation: compute the difference or distance between the two vectors. We use cosine similarity; Euclidean and Mahalanobis distance are also common, but cosine similarity is more robust for texts of differing lengths, as CommitMessages typically are.
  3. Mean squared error: square the per-dimension differences and average them:

MSE = (1/n) · Σᵢ (xᵢ − yᵢ)²

where xᵢ and yᵢ are the representations of the two texts in the i-th dimension and n is the number of dimensions.
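The computation above can be sketched in Python. The vectors here are toy values standing in for embeddings returned by an embedding service such as Qianfan's:

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def mse(x, y):
    """Per-dimension mean squared error between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Toy embeddings standing in for the generated and reference CommitMessage
generated = [0.8, 0.1, 0.6]
reference = [0.8, 0.1, 0.6]
sim = cosine_similarity(generated, reference)  # 1.0 for identical vectors
err = mse(generated, reference)                # 0.0 for identical vectors
```

Lower MSE (and higher similarity) means the generated CommitMessage is semantically closer to the reference.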

Two concepts appear repeatedly in this article:

Reference CommitMessage: either a CommitMessage already committed by an RD, or one generated by the large model and then manually reviewed and annotated to ensure quality; it serves as the evaluation standard.

Generated CommitMessage: a CommitMessage produced by the large model; it is the input xᵢ being evaluated.

丨3.2 Adoption Rate (AR)

The total number of uses comprises three outcomes:

  • Count of direct adoptions, CA
  • Count of adoptions after editing, CE
  • Count of rejections, CR

AR = CA / (CA + CE + CR)
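With the three counts defined above, the adoption-rate computation is trivial; a minimal sketch:

```python
def adoption_rate(ca: int, ce: int, cr: int) -> float:
    """Direct adoption rate: direct adoptions over all uses."""
    total = ca + ce + cr
    return ca / total if total else 0.0

# e.g. 70 direct adoptions, 20 adopted after editing, 10 rejected
ar = adoption_rate(70, 20, 10)  # 0.7
```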

IV. Data Processing

Achieving good performance (generation quality, efficiency, accuracy, and adoption rate) requires expensive data processing, which usually accounts for a considerable share of the total application-development investment and can sometimes exceed the effort spent on model training and tuning. In short, effective and efficient data processing is a key factor in model performance, so it deserves adequate attention and investment during project planning and resource allocation.

The goal of data processing is to manage the datasets (add, delete, modify, index); its output is the datasets themselves; and the datasets ultimately serve model performance optimization (model selection, Prompt optimization, SFT). In other words, if you are not optimizing performance, you do not need data processing.

The relationship between datasets and performance optimization is as follows:

  • Evaluation set: model selection, Prompt tuning, and SFT all need to be evaluated against it to see whether the new version beats the previous one and whether the tuning goal was met
  • Training set: the annotated data used for SFT, filtered from the full dataset by its features
  • Validation set: used to adjust model hyperparameters and avoid over- or underfitting
  • Test set: used after SFT to check whether the goal of SFT was achieved, e.g. evaluating generalization on abnormal cases
  • Anomaly set: low-quality CommitMessage data with clear annotation provenance, especially model-generated messages that were not directly adopted
[Figure: how the datasets feed performance optimization]

Here is the general pipeline and what each stage does (details omitted):

  • Define data structures: model data (requirement/bug card titles, change data), reference CommitMessage, category data (bug or not, changed lines, number of repositories), and auxiliary analysis data (product line, platform, author, topic), etc.
  • Data collection, from three sources: (1) CommitMessages generated by the online model service; (2) CommitMessages already committed by RDs; (3) other open-source datasets
  • Data cleaning: denoising, deduplication, and similar steps to ensure data quality and usability
  • Annotation: label whether a CommitMessage qualifies as a reference, plus other auxiliary analysis information
  • Classification and management: sampling ratios, filtering, browsing, etc.

Based on our current data volume, we chose Pandas (https://pandas.pydata.org/) as the data processing tool, which provides sufficient data processing and analysis capabilities for small-scale data and stand-alone environments. However, as the volume of data grows, Spark (https://spark.apache.org/) will be a good choice.
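The cleaning stage sketched above maps naturally onto Pandas. The column names and thresholds below are invented for illustration; the real dataset schema is whatever the "define data structures" step produced:

```python
import pandas as pd

# Hypothetical columns for the CommitMessage dataset described above
df = pd.DataFrame([
    {"product_line": "app", "commit_msg": "[优化] 修复布局错位", "changed_lines": 12},
    {"product_line": "app", "commit_msg": "[优化] 修复布局错位", "changed_lines": 12},  # duplicate
    {"product_line": "lib", "commit_msg": "wip", "changed_lines": 1},
])

# Cleaning: drop exact duplicates, then filter out trivial commits (denoising)
cleaned = (
    df.drop_duplicates()
      .query("changed_lines >= 5")
      .reset_index(drop=True)
)
```

The same DataFrame can then be sampled and filtered per product line when assembling the evaluation, training, and test sets.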

V. Performance Optimization

The goal of performance optimization is to improve the core metrics, mean squared error (MSE) and generation efficiency, and thereby the direct adoption rate (AR), through three means:

  • Stop tokens improve generation efficiency
  • Prompt optimization improves the MSE metric and generation efficiency
  • SFT improves the MSE metric

丨5.1 Stop Tokens

When the model does not fully understand the Prompt, it tends to generate useless content such as redundant explanations or caveats, and every extra token slows generation down (generation latency is directly tied to the number of tokens produced). Transformer-based models support stop tokens by design. In Intelligent CommitMessage, for example, the model is asked to output a JSON snippet in a Markdown code fence ending with "%STOP%", so the stop token can be set to "%STOP%" to improve generation efficiency.
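Where the serving API honors a stop parameter, the marker is enforced server-side; otherwise the same effect can be achieved client-side. A minimal sketch of the client-side fallback (the sample output string is invented):

```python
STOP_MARKER = "%STOP%"

def truncate_at_stop(raw: str, marker: str = STOP_MARKER) -> str:
    """Keep only the content before the stop marker."""
    return raw.split(marker, 1)[0].strip()

raw_output = '```json\n{"summary": "修复启动闪退"}\n```%STOP%以下是一些注意事项……'
clean = truncate_at_stop(raw_output)  # keeps the fenced JSON, drops the trailing chatter
```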

丨5.2 Prompt Optimization

Put simply, Prompt optimization is about designing and refining input prompts to get the desired output. It may look like a simple NLP task, so why is it called prompt engineering? Because getting a large model to understand what you really want does involve multidisciplinary knowledge, spanning linguistics, psychology, computer science, and data science, plus a full set of engineering methods: system design, experimental design, quality control, project management, and so on. Intelligent CommitMessage involves two optimizations:

1. Constrain the output and make the requirement explicit

If the model's output is malformed, parsing fails. So the Prompt explicitly demands "output only the content, without any explanation", which avoids useless output and improves both generation efficiency and accuracy.

2. Few-shot

One Prompt optimization that is very effective for constraining the output style is few-shot prompting: an example teaches the large model what the output must look like. Here we require a multi-line JSON snippet in a Markdown fence. Example:

Output the CommitMessage in the following format: a single Markdown code snippet wrapped in "```json" and "```". 『Output only the content, without any explanation』:
```json
{
    "summary": string  // Chinese, under 30 characters; a concise, accurate Git Commit Message
    "reason": string  // describe in detail the specific cause of the bug, may quote code, under 60 characters
    "fixup": string  // concise, accurate description of how the bug was fixed, may quote code, under 30 characters
}
```

Note that the example above is not valid JSON (the "," separators between fields are missing), so the large model may imitate the broken format or emit correct JSON. This class of parsing failures can be eliminated entirely by improving the few-shot example:

Output the CommitMessage in the following format: a single Markdown code snippet wrapped in "```json" and "```". 『Output only the content, without any explanation』:
```json
{
    "summary": string,  // Chinese, under 30 characters; a concise, accurate Git Commit Message
    "reason": string,  // describe in detail the specific cause of the bug, may quote code, under 60 characters
    "fixup": string  // concise, accurate description of how the bug was fixed, may quote code, under 30 characters
}
```
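Even with a well-formed few-shot example, it is prudent for the model service to extract the fenced block defensively before parsing. A sketch (the regex and error handling are illustrative, not the production code):

```python
import json
import re

# Matches the ```json fenced block the prompt asks for
FENCE_RE = re.compile(r"```json\s*(\{.*?\})\s*```", re.DOTALL)

def parse_commit_fields(raw: str) -> dict:
    """Pull the fenced JSON block out of the model output and parse it."""
    match = FENCE_RE.search(raw)
    if not match:
        raise ValueError("no fenced JSON block in model output")
    return json.loads(match.group(1))

raw = '```json\n{"summary": "修复启动闪退", "reason": "空指针", "fixup": "增加判空"}\n```'
fields = parse_commit_fields(raw)
```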

A related but distinct concept is Prompt Tuning. Few-shot prompting and Prompt Tuning are both ways to optimize the input side of a large language model, but they differ fundamentally:

| Approach | How it works | When applied |
|---|---|---|
| Few-shot | Put worked examples directly in the prompt; no parameter updates | At inference time, per request |
| Prompt Tuning | Learn soft prompt embeddings through training/fine-tuning | Trained offline, then reused |

Below is one of the Prompts used by Intelligent CommitMessage (continuously being refined):

Role name: Git Commit Message Generator
Plain-language role description: based on the requirement description and the git diff of the code that implements it, automatically generate a well-formed git commit message.
The title of the requirement is as follows: {{%title}}


The git diff of the change is as follows:
(DIFF-START)
{{%git_diff}}
(DIFF-END)


Task breakdown
1. Parse the requirement title:
Extract key information such as feature points and problem points.
Clean the text, removing irrelevant characters and formatting.
2. Analyze the git diff:
Identify the changed files and code blocks.
Classify the type of each change (addition, modification, deletion, etc.).
3. Generate the Commit Message:
Combine the requirement title with the code-change analysis to write the Commit Message.
Make sure each extracted field meets its requirement, e.g. "summary: Chinese, under 30 characters; a concise, accurate Git Commit Message".
4. Validate the Commit Message:
Check that the Commit Message is clear and accurate.
5. Output the CommitMessage in the following format: a single Markdown code snippet wrapped in "```json" and "```". 『Output only the content, without any explanation』:
```json
{
  "summary": string  // Chinese, under 30 characters; a concise, accurate Git Commit Message
}
```%STOP%

丨5.3 SFT

ERNIE4 already generates very well, but running SFT on a model that large is very expensive, so SFT is normally done on the ERNIE-Lite or ERNIE-Speed versions, whose capability is slightly inferior. The question, then: after SFT on ERNIE-Speed, how do we optimize the low-quality cases without regressing overall performance?

Here the MoE (Mixture of Experts) strategy applies: combine the strengths of ERNIE4 and (ERNIE-Speed + SFT) with a classifier. Each request first passes through the classifier, which routes it by its features to either ERNIE4 or the fine-tuned ERNIE-Speed model, as shown below:

[Figure: classifier routing between ERNIE4 and the fine-tuned ERNIE-Speed model]

Remember to run a full evaluation on the evaluation set before deployment, making sure the MSE of the newly fine-tuned ERNIE-Speed model beats the version currently online.

The whole SFT process consists of four steps:

  1. Set the goal: optimize a specific class of low-quality cases, fine-tuning until the target evaluation score is reached
  2. Prepare the data: extract the features of the low-quality cases and filter the training, validation, and test sets out of the dataset accordingly
  3. Run SFT: as shown in the figure above
  4. Evaluate and deploy: run a full evaluation on the sampled evaluation set to confirm the newly fine-tuned ERNIE-Speed model beats the previous one
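The routing classifier in the MoE strategy can be as simple as a rule over request features. This is a toy stand-in; the feature names and threshold are illustrative assumptions, not the production classifier:

```python
def route_request(features: dict) -> str:
    """Route a request to an expert model, MoE-style.

    Requests matching the low-quality-case profile that SFT targeted
    (here, assumed to be small FixBug changes) go to the fine-tuned
    ERNIE-Speed model; everything else goes to ERNIE4.
    """
    if features.get("is_fix_bug") and features.get("changed_lines", 0) < 50:
        return "ernie-speed-sft"
    return "ernie4"
```

In practice the classifier would be trained on the same features used to filter the SFT training set, so the fine-tuned model only sees the distribution it was tuned for.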

VI. Custom Commit Specifications

Since the large model only generates the core change summary (or FixBug-related information), which must then be assembled into any number of commit-specification formats, the variable part is abstracted into an interface. A Python package that implements the interface extends the system with a custom, specification-compliant CommitMessage, and implemented plug-ins are loaded dynamically on demand.

The abstract interface is as follows:

from abc import ABC, abstractmethod


class IPluginHook(ABC):
    """Interface that every commit-spec plugin implements."""

    @abstractmethod
    def hook_prepare(self, ctx):
        """Prepare; called before generation starts."""


    @abstractmethod
    def hook_is_fix_bug(self, ctx) -> bool:
        """Whether this is a FixBug commit; defaults to False."""


    @abstractmethod
    def hook_language(self, ctx) -> Language:
        """Language to generate in; defaults to Chinese.
        (Language is an enum defined elsewhere in git-aicommit.)
        """


    @abstractmethod
    def hook_generate_variables(self, ctx):
        """Produce the variables for the message template."""


    @abstractmethod
    def hook_generate_message(self, ctx) -> str:
        """Generate the CommitMessage from the template and variables.
        @warning: plugins must implement this method, otherwise an exception is raised
        """

When loading a given version of a plug-in, pkg_resources is used to determine whether it is already loaded; importlib's import_module or reload then loads it dynamically:

import importlib
import subprocess
import sys

import pkg_resources


def __install_plugin(pkg_name: str, version: str):
    """Install the plugin via pip, then (re)load its module."""
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', f"{pkg_name}=={version}"])
    return __load_module(pkg_name, force=True)


def __load_module(pkg_name: str, force: bool = False):
    """Load (or reload) the plugin module.

    __module_name maps the pip package name to its importable module
    name (defined elsewhere in git-aicommit).
    """
    module_name = __module_name(pkg_name)
    loaded_module = sys.modules.get(module_name)
    if loaded_module is not None:
        if force:
            m = importlib.reload(loaded_module)
            importlib.reload(pkg_resources)  # refresh pkg_resources' view of installed dists
            return m
        return loaded_module
    m = importlib.import_module(module_name)
    importlib.reload(pkg_resources)
    return m

VII. Future

Large models already understand code in many languages remarkably well, but their grasp of proprietary vocabulary, specific configurations, fixed formats, and the like is still lacking and needs to be improved gradually with suitable datasets. Moreover, the change content obtained from git diff is limited, constrained by the model's token limit; the missing code context and dependency relations cap the generation quality, so combining this with RAG may be a better path. The interactivity of the entry points and the custom commit specifications could also become more AI-driven. In short: AI Native has not yet been achieved, and comrades must keep striving.

Bibliography:

[1] LangChain:https://www.langchain.com/

[2] git:https://git-scm.com/book/en/v2/Git-Basics-Git-Aliases

[3] Pandas: https://pandas.pydata.org/

[4] Spark:https://spark.apache.org/

[5] Baidu Qianfan: https://console.bce.baidu.com/qianfan/overview

[6] Application and Practice of Prompt Engineering Large Model: https://zhuanlan.zhihu.com/p/668200325

Author: zy

Source-WeChat public account: Baidu App Technology

Source: https://mp.weixin.qq.com/s/IeENM0o59vVrvK9oD4plZA
