BugBuilder: An automated approach to building high-quality, large-scale defect libraries

Author: HUAWEI CLOUD Developer Alliance

This article is shared from the HUAWEI CLOUD Community post "BugBuilder: Automatic Construction Method of High-quality Large-scale Defect Library", authored by the HUAWEI CLOUD Software Analysis Lab.

1. Problem scenario

Software engineering research on fault localization, software testing, program repair, and defect prediction urgently needs large-scale, high-quality defect libraries. First, real-world defects and their precise fixes are essential for rigorously evaluating the many automated or semi-automated approaches to fault localization, defect prediction, and program repair. These approaches are expected to work on real-world applications, so before they can be widely adopted they must be evaluated against a large number of real-world defects and their fix patches.

Defects produced by automatic mutation or manual injection can also be used for evaluation, but they may differ fundamentally from real-world defects, so conclusions drawn from them may not carry over. Second, real defects and their fix patches can inspire researchers to devise new ways to find, locate, and fix software defects. For example, by analyzing a large number of real-world defects, researchers may learn which kinds of statements are more error-prone, so that an automated repair tool can try those statements first and repair programs more efficiently.

By reading human-written patches, researchers have also identified many common fix patterns, which in turn have significantly improved automated program repair. Finally, data-driven and learning-based approaches to automated repair and defect detection usually depend on a large number of diverse real-world defects and accurate patches. The quality of those defects, such as their diversity and the accuracy of their patches, can significantly affect how well such approaches perform.

Existing manually or semi-automatically built defect libraries (such as SIR, BugBench, and Defects4J) are expensive to construct, and the scale and diversity of their defects are very limited. Automatically built defect libraries (such as iBUGS and ManyBugs) contain fix patches of questionable quality, which often include code changes (such as refactorings) unrelated to the defect.

2. Our Contribution

To address these problems, we worked with Liu Hui's team at Beijing Institute of Technology to propose and develop BugBuilder, a fully automated approach to building high-quality, large-scale defect libraries. It automatically extracts complete and precise defect fix patches from the human-written patches recorded in version control systems. The workflow is shown in the following figure.

[Figure: BugBuilder workflow]

Specifically, for each defect fix commit, it works as follows.

• First, identify refactorings. BugBuilder detects the refactorings contained in the defect fix commit with an existing tool (RefactoringMiner) and reapplies them to the defective version, so that the remaining difference between the two versions no longer contains refactorings.

• Second, construct candidate patches. BugBuilder generates all potential patches automatically by enumerating every possible combination of the remaining non-refactoring changes.

• Finally, validate and select the patch. BugBuilder checks each candidate by running the test cases and discards the candidates that fail. If exactly one candidate remains, it is taken as the precise patch. If more than one candidate passes validation, a series of heuristics is used to select the most likely patch (see paper [1] for details).
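The second and third steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BugBuilder's implementation: `Change`, `candidate_patches`, `select_patch`, and the `passes_tests` callback are hypothetical names, and the "smallest valid patch" tie-breaker is only a placeholder for the heuristics described in paper [1].

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List, Optional

# Hypothetical: an identifier for one non-refactoring change (e.g. a diff hunk).
Change = str


def candidate_patches(changes: Iterable[Change]) -> List[FrozenSet[Change]]:
    """Enumerate every non-empty combination of the remaining changes."""
    items = list(changes)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def select_patch(
    changes: Iterable[Change],
    passes_tests: Callable[[FrozenSet[Change]], bool],
) -> Optional[FrozenSet[Change]]:
    """Keep the candidates that pass the tests and pick one of them."""
    valid = [patch for patch in candidate_patches(changes) if passes_tests(patch)]
    if not valid:
        return None  # no candidate survives validation
    # Placeholder heuristic (an assumption, not the paper's): prefer the
    # smallest valid patch, breaking ties by name for determinism.
    return min(valid, key=lambda patch: (len(patch), sorted(patch)))


if __name__ == "__main__":
    # Toy oracle: the tests pass only if both h1 and h3 are applied.
    oracle = lambda patch: {"h1", "h3"} <= patch
    print(len(candidate_patches(["h1", "h2", "h3"])))
    print(sorted(select_patch(["h1", "h2", "h3"], oracle)))
```

With n remaining changes there are 2^n - 1 non-empty combinations, so exhaustive enumeration of this kind is only practical when the number of non-refactoring changes in a commit is small.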

Note that if a human-written patch contains both refactorings and a bug fix, BugBuilder splits it into two ordered patches: a refactoring patch followed by a bug fix patch. This is similar to Defects4J, which splits human-written patches into bug-irrelevant changes and bug fix patches.
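To make the split concrete, here is a toy Python sketch of the idea: once the refactoring is reapplied to the defective version, the difference to the fixed version contains only the bug fix. It does not call RefactoringMiner; the `renames` mapping stands in for a detected rename refactoring, and the helpers `apply_renames` and `changed_lines` are hypothetical names used only for illustration.

```python
import difflib
import re
from typing import Dict, List


def apply_renames(lines: List[str], renames: Dict[str, str]) -> List[str]:
    """Reapply a (pre-detected) rename refactoring to source lines."""
    result = []
    for line in lines:
        for old, new in renames.items():
            line = re.sub(rf"\b{re.escape(old)}\b", new, line)
        result.append(line)
    return result


def changed_lines(before: List[str], after: List[str]) -> List[str]:
    """Return only the added and removed lines of a line-level diff."""
    return [
        entry
        for entry in difflib.ndiff(before, after)
        if entry.startswith(("- ", "+ "))
    ]


# Defective version: uses the old names and contains an off-by-one bug.
buggy = [
    "def total(xs):",
    "    s = 0",
    "    for x in xs:",
    "        s += x",
    "    return s - 1",
]
# Fixed version from the commit: renamed variables plus the actual fix.
fixed = [
    "def total(values):",
    "    s = 0",
    "    for value in values:",
    "        s += value",
    "    return s",
]

# Assumed output of refactoring detection: two renames.
renames = {"xs": "values", "x": "value"}
refactored_buggy = apply_renames(buggy, renames)

raw_diff = changed_lines(buggy, fixed)             # refactoring + fix mixed
fix_only = changed_lines(refactored_buggy, fixed)  # bug fix patch only

if __name__ == "__main__":
    print(len(raw_diff), "changed lines before removing the refactoring")
    print("\n".join(fix_only))
```

In this sketch the refactoring patch is the step from `buggy` to `refactored_buggy`, and the bug fix patch is the step from `refactored_buggy` to `fixed`.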

3. Evaluation of method effectiveness

This article evaluates the effectiveness of BugBuilder from two perspectives.

First, BugBuilder was applied to the 809 real bug fix commits collected by Defects4J. For each commit, BugBuilder tried to extract the precise patch automatically, and any patch it produced was compared with the manually constructed patch in Defects4J. For the 809 bug fix commits, BugBuilder automatically generated 350 patches, 334 of which were identical to the manually constructed patches in Defects4J. Manual analysis showed that 12 of the remaining 16 automatically generated patches are more complete and accurate than the manually constructed ones in Defects4J. Only 4 are inaccurate, mainly because refactorings were not fully detected. This shows that BugBuilder can extract defect fixes accurately.
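The figures above can be turned into two rates with a short calculation. The labels `coverage` and `precision` are our own shorthand for this sketch, not terms taken from the paper; only the five counts come from the text.

```python
# Counts reported in the evaluation on Defects4J commits.
total_commits = 809   # bug fix commits given to BugBuilder
generated = 350       # commits for which a patch was produced
identical = 334       # patches identical to the Defects4J patch
better = 12           # patches judged more complete and accurate
inaccurate = 4        # patches judged inaccurate

# The three outcome counts should account for every generated patch.
assert identical + better + inaccurate == generated

coverage = generated / total_commits        # share of commits with a patch
precision = (identical + better) / generated  # share of patches that are accurate

print(f"coverage:  {coverage:.1%}")   # 43.3%
print(f"precision: {precision:.1%}")  # 98.9%
```

In other words, BugBuilder produces a patch for a little under half of the commits, and almost all of the patches it does produce are accurate.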

Second, the method was used to build GrowingBugs, a large-scale defect library containing 1916 real defects and their precise fixes, collected automatically from 169 well-known Java applications. It contains more than twice as many defects as the well-known defect library Defects4J, and it continues to grow.

4. Summary

The proposed method makes it possible to construct high-quality large-scale defect libraries fully automatically. The defect library built based on this method can also be used as a benchmark to promote defect-related research.


[1] Jiang Y, Liu H, Luo X, Zhu Z, Chi X, Niu N, Zhang Y, Hu Y, Bian P, and Zhang L. BugBuilder: An Automated Approach to Building Bug Repository[J]. IEEE Transactions on Software Engineering, 2022.

This article is from the PaaS Technology Innovation Lab, a part of HUAWEI CLOUD that combines software analysis, data mining, machine learning, and other technologies to provide software developers with the core engine and intelligent brain of next-generation intelligent R&D tool services. The lab focuses on core capabilities in software engineering, continues to build R&D tools, and continues to deliver high-value business features.
