
A 10,000-word deep dive into safety testing of the visual perception module in autonomous driving systems

-- With reference to the Taxonomy of Driving Automation for Vehicles (GB/T 40429-2021) --

In recent years, deep-learning-based visual perception has greatly accelerated autonomous driving within the Internet of Vehicles, but frequent safety incidents involving autonomous driving systems have raised concerns about the technology's future. Because the behavior of deep learning systems lacks interpretability, safety testing of deep-learning-based autonomous driving systems is extremely challenging. Safety testing approaches for autonomous driving scenarios have been proposed, but they still fall short in test-scenario generation, safety-problem detection, and safety-problem explanation. For autonomous driving systems based on visual perception, this paper designs and implements a scenario-driven, explainable, and efficient safety test system. It proposes a scene description method that balances realism and richness and uses a real-time rendering engine to generate scenes usable for driving-system safety tests; it designs an efficient scene search algorithm for nonlinear systems that dynamically adjusts its search scheme to each system under test; and it provides a fault analyzer that automatically analyzes and localizes the causes of safety defects in the system under test. We reproduced an existing dynamic, real-time-rendering-based autonomous driving test system and ran safety tests on the CILRS and CIL systems with both; the results show that, in the same amount of time, this work discovers safety problems at 1.4 times the rate of the reproduced scenario-driven dynamic test method.
Further experiments show that for the representative deep-learning autonomous driving systems CIL and CILRS, out of 3,000 scenes dynamically generated across three environment types (wilderness, countryside, and city), 1,939 and 1,671 failure-inducing scenes were found respectively, with an average search time of 16.86 s per fault scene. Statistically, the analyzer found that the CILRS system is prone to failures caused by objects on both sides of the road, and that rain and red or yellow objects are more likely to make the autonomous driving system fail.

The Internet of Vehicles is booming as the Internet of Things integrates deeply with transportation. With advances in deep learning, autonomous driving technology in this field has made breakthrough progress and is trending toward a new automotive industrial revolution. New car makers such as Tesla and NIO, as well as traditional manufacturers such as Ford and BMW, have successively obtained road-test licenses for autonomous driving and are focusing on developing deep-learning-based autonomous driving technology. Rapidly developing deep autonomous driving is becoming one of the main supporting technologies of the Internet of Vehicles and is changing the future of transportation and travel.

The visual perception module is an important component of environmental perception in autonomous driving and an important basis for the vehicle's intelligent decision-making. Tesla, a leading company in the field, has made visual perception the sole environmental perception module of its driving system. The safety of the visual perception module is therefore key to the normal operation of the autonomous driving system. Although the module's performance has improved steadily with advances in deep vision, the semantics of the features it perceives from the driving environment are hard to understand and its decision process cannot be explained. How to thoroughly test the safety of the visual perception module of an autonomous driving system has become an urgent problem.

Although there have been breakthroughs in work on deep learning interpretability, analyzing the error-propagation mechanism of the autonomous driving visual perception module remains a distant goal. In recent years, progress on black-box attacks against neural networks has inspired scene-search-based safety testing techniques for autonomous driving visual perception modules. These scenario-driven test methods follow the idea of black-box testing: provide the driving system with as much driving-scene data as possible, observe the differences between the system's output and the test oracle, and from them analyze the safety of the visual perception module.

We believe that, until the interpretability of deep learning is clarified, scenario-driven black-box testing is the most important means of testing the safety of visual perception modules. At present, however, applying generated scenes to test the visual perception module still faces three challenges:

1) Balancing realism and richness in scene description. Scene generation rules are an important foundation of a scenario-driven test system. Overly conservative rules yield insufficient scene coverage; overly flexible rules break the relative relationships between objects and harm scene realism. Finding generation rules that combine realism with richness is challenging.

2) Ensuring the efficiency and stability of the search algorithm. The combinations of single-object properties (e.g., color, shape) and inter-object relationships (e.g., position, orientation) are complex. To test a system's safety defects efficiently and stably, personalized scene search schemes must be generated dynamically for the visual perception modules of different autonomous driving systems, so that the search takes fewer steps and each single step takes as little time as possible.

3) Explaining test results accurately and automatically. Past test systems relied on manual analysis of defect causes. To analyze test results automatically, the system must be able to manipulate every element in the scene at a fine granularity in order to localize the cause of a safety flaw.

The academic community has already made initial explorations of scenario-driven black-box safety testing for the visual perception module. Among scene-based test methods, those built on real-time rendering engines have received considerable attention due to their flexibility in scene generation. Initially, rendering-engine-based safety testing used predetermined scenarios. A representative work is CARLA 0.8.x, which uses Unreal Engine to create fixed driving routes for testing the system. Later, Scenic proposed a programming interface for scene generation, making such test programs more systematic and laying the foundation for static scene-based testing. However, its simulated environment is relatively fixed, lacks dynamic behavior, and offers little freedom in describing non-solid objects such as weather. Among the latest rendering-engine-based work, Paracosm proposes a dynamic scene generation method based on random search for safety testing of the visual perception module. Because its dynamic scene search is relatively simple and does not adapt to the visual perception module under test, its search for safety problems is not efficient enough. Building on the above work, this paper proposes a dynamic scene search algorithm based on result feedback, thereby improving the efficiency of safety testing of the visual perception module in autonomous driving systems. More details are covered in Section 1.5.

In this work, drawing an analogy to black-box attack strategies on deep learning and exploiting the openness of real-time rendering engines, we design and propose a reliable and interpretable safety test system for the autonomous driving visual perception module. The main contributions are threefold:

1) Targeting the safety problems of the perception module of autonomous driving systems in Internet of Vehicles scenarios, this work proposes and designs a scenario-driven black-box safety test system. Compared with existing work, the system introduces a dynamic testing strategy based on result feedback, continuously adjusting the generation of input data in the loop through an adaptive mechanism to achieve efficient, stable, and non-intrusive safety testing of the perception module.

2) Targeting the new requirements of dynamically testing the autonomous driving visual perception module, this work designs a fine-grained scene description method, an adaptive dynamic search algorithm, and an automatic defect-analysis technique, optimizing the proposed black-box safety test system in three respects: test granularity, feedback adaptation, and interpretability.

3) The test system was validated on two representative open-source autonomous driving systems: 3,000 scenes were dynamically generated, and 1,939 and 1,671 fault scenes were found respectively, with one fault scene found every 16.86 s on average. Experiments show that, thanks to the dynamic adaptive scene search method, this work's safety-problem discovery rate is 1.4 times that of existing real-time-rendering-based scenario-driven dynamic testing.

Related work

Autonomous driving testing based on neuron coverage

Neuron coverage is designed by analogy with branch coverage in traditional programs. Work in this line defines a neuron as covered (activated) by a test input when the input drives the neuron's output into a certain state, and searches for inputs with the optimization goal of maximizing neuron coverage. Since DeepXplore introduced the concept of neuron coverage and successfully applied it to vision-based deep autonomous driving, a large body of neuron-coverage work has emerged: various coverage criteria have been proposed, and traditional software testing methods such as metamorphic testing, fuzz testing, and symbolic execution have been successfully migrated to deep learning testing tasks. However, the analogy is mechanical: the state of a neuron's output value is a completely different concept from a branch in traditional software, so the effectiveness of the analogy has long been questioned. Moreover, neuron-coverage-based designs struggle to give a semantic explanation of why a test example fails, which does not help further improve the safety of autonomous driving systems.

Autonomous driving testing based on error injection

AVFI uses software fault injection to simulate hardware failures in autonomous driving systems and test their fault tolerance. DriveFI then experimented with Apollo and DriveAV on the CARLA simulator and DriveSim, using a Bayesian network to model the validation process downstream of the autonomous driving system and so accelerate error injection, maximizing the detection of faults that affect the system. Kayote builds on this work by adding the ability to describe error propagation and to inject bit flips directly into the CPU and GPU. These efforts examine autonomous driving systems from a fault-and-error perspective, and what they actually investigate is hardware failure. This line of work is orthogonal to the scenarios and problems discussed here, which concern the safety of the autonomous driving software, especially its visual perception module.

Search-based autonomous driving testing

The core idea of search-based autonomous driving safety testing is as follows: given the input space of the system under test, define its special output states, and determine by searching the input space which inputs drive the system into those states, thereby partitioning the input space. Abdessalem used a search algorithm to label inputs in the input space while using the labeled data to train a classifier that divides the decision boundary of the input space. He later extended the search from the state space of the few-body problem to that of the multi-body problem and proposed the FITTEST search-based test method. The limitation of such approaches is that introducing a classifier implicitly assumes the input space is locally continuous and linear; for example, partitioning with a decision tree may wrongly assume that states lying between two normal states are themselves normal. For systems defined in a linear domain, such as an AEB system, this classifier design is reasonable, but for highly nonlinear deep learning systems it clearly does not apply. In addition, Wicker used a two-player game formulation, manipulating pixels in the image and applying the asymptotically optimal strategy of Monte Carlo tree search to find counterexamples that cause model errors. The search space of this strategy is pixel-level, and its results lack real-world realism.

Test methods based on real data

Test methods based on real data mainly fall into two types: 1) improving the quality of an autonomous driving system by collecting large amounts of user driving data, as Tesla does; 2) real-world testing, driving prototype vehicles in real road environments, where safety considerations make the test conditions quite stringent. Besides the high cost of data collection, the distribution of collected data is very limited, so the test system cannot probe the safety of autonomous driving systems in new environments; at the same time, excessive collection of driving data raises user privacy concerns.

Test methods based on generated data

There are two main types of test methods based on generated data:

1) Methods based on generative adversarial networks (GAN). DeepRoad transforms road scenes from normal weather into rain and snow to test system safety in such weather, but the richness of the generated scenes is insufficient and scene content cannot be flexibly controlled;

2) Methods that create test scenes with real-time rendering engines. Richter et al. used the game GTA V to create a self-driving dataset, SYNTHIA synthesized a dataset with the Unity engine, and CARLA 0.8.x used Unreal Engine to create driving routes for testing the system. The problem with this work is that the test scenes are predefined and cannot be adjusted, or only to a limited extent, which wastes the ability of real-time rendering to adjust scenes dynamically. In particular, testing with fixed driving scenarios makes it difficult to cover all possible scenarios.

Scenic designed a scene description language that generates scenarios for safety testing of the autonomous driving visual perception module from predefined rules. Although the language offers a high degree of freedom, it struggles to describe non-physical elements of the scene, such as weather and the sun, and is hard to apply to scene transformation. In the latest work, Paracosm proposes a programmable method for generating autonomous driving test scenarios: it parameterizes the objects and environment in the scene and provides a set of programming interfaces, generating scenes by random search to test the visual perception module. However, given the large searchable parameter space, random search can hardly find the perception module's safety problems efficiently. We compare against this in detail in Section 4.1.

Compared with Scenic, our method adds descriptions of non-physical objects, so the safety of the autonomous driving visual perception module under different weather conditions is also fully tested. The second improvement is an adaptive scene search algorithm: compared with Paracosm, this paper achieves adaptive dynamic fault-scene search, which significantly improves the efficiency of detecting safety problems in the autonomous driving visual perception module.

System design

This section introduces the specific design of the safety test system for the autonomous driving visual perception module, including its formal description, workflow, scene description method, dynamic scene generator, and defect analysis method. Before that, Table 1 summarizes the key variables used in this paper.

Table 1 The variables used in this method are described


A formal description of the safety test system

An autonomous driving system is essentially a policy π that maps a sequence of environment observations O to control commands A, i.e. π: O → A. The input environment information includes the camera image I and the current vehicle speed V; the output control command consists of brake, throttle, and steering angle. Since the brake can be treated as negative throttle, the output command can be expressed as (s, t).

The scene-search-based test system first generates a specific scene w ∈ W, which induces the environment information o(w) ∈ O. Given an autonomous driving model m ∈ M, the vehicle's control command (s, t) = m(o(w)) can then be obtained.

Workflow of the safety test system

The architectural design of the test system is shown in Figure 1. To control scene generation precisely, this paper designs a set of attribute configuration schemes that govern the properties of individual objects in the scene and their interrelationships (Section 2.3). 1) The dynamic scene generator (Section 2.4) first reads the configuration file, obtains the distribution functions of the objects in the scene, and randomly samples an initial scene description. 2) A real-time rendering engine (such as Unreal Engine) renders a testable driving scene from the scene description. 3) The driving model under test reads the scene and 4) outputs its decision to the defect analyzer for analysis. 5) The defect analyzer (Section 2.5) generates constraints from the model output, adaptively guiding the dynamic scene generator to produce the scene description for the next round of testing. 6) If the defect analyzer finds a safety problem in the system under test, it automatically generates a defect report for subsequent improvement of the autonomous driving system. The core technologies and components of the system are described below.
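The six-step loop above can be sketched as a minimal Python skeleton. All names here (`generator`, `render`, `analyzer`, and their methods) are illustrative placeholders, not the paper's actual implementation:

```python
# Hypothetical sketch of the closed test loop described above.
# All object and method names are illustrative, not the paper's API.

def run_safety_test(generator, render, model, analyzer, max_rounds=100):
    """Drive the feedback loop: generate -> render -> infer -> analyze."""
    scene = generator.sample_initial()           # step 1: sample initial scene
    for _ in range(max_rounds):
        frame = render(scene)                    # step 2: real-time rendering
        action = model(frame)                    # steps 3-4: model inference
        verdict = analyzer.check(scene, action)  # step 5: oracle / constraints
        if verdict.is_defect:
            return analyzer.report(scene)        # step 6: defect report
        scene = generator.next_scene(scene, verdict)  # adaptive feedback
    return None
```

The key design point is that step 5 feeds back into step 1, so scene generation adapts to the system under test rather than sampling blindly.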

Figure 1 Workflow of the test system

Rich and realistic scene descriptions

To achieve rich and realistic scene descriptions, we designed object property profiles and an environment configuration scheme to describe the asset properties, property distributions, and newly added maps required by the real-time rendering engine; at the same time, we parameterized the description of every object in the scene to enable automatic search for defective scenes.

Object description and environment configuration

The system divides the objects that appear in the scene into 5 categories:

1) Environment E is the preset base road environment for scene generation. A road environment should include at least roads and roadside buildings. To keep described scenes realistic, for example by constraining objects to reasonable locations, we partition the environment into zones. A typical environment consists of off-road non-drivable areas, sidewalks, left and right lanes, and intersections.

2) Weather W includes the solar altitude angle, rainfall, fog concentration, and so on; its values are continuously variable. A basic weather parameter is expressed as a probability density function over a range. Weather parameters may interact, yielding a joint probability distribution. Therefore, a single scene needs multiple weather distributions, sampled via the joint probability density function.

The correlation between the weather distribution and the environment configuration is very low; when such a correlation matters, it is more reasonable to change the weather distribution directly, for example, the probability of heavy rain in arid areas is much smaller than in wet areas. The distributions of vehicles, pedestrians, and static objects, by contrast, are environment-dependent.

3) Vehicle V is a vehicle with collision and gravity simulation in the environment, including cars, bicycles, and motorcycles; notably, for realism, bicycles and motorcycles carry an additional rider. Vehicles appear with different probabilities in different zones of the environment. We also define two vehicle states, normal and abnormal, and constrain the vehicle's per-zone probability distribution in each mode. For example, in the normal state a vehicle never appears on a sidewalk or in an oncoming lane.

4) The description of pedestrian P is similar to that of the vehicle; the design difference is that pedestrians have no category distinction, only differences in clothing, body shape, and appearance.

5) Static item G is a solid object without collision or physics simulation. Directly sampling the initial distribution without considering interactions is likely to produce clipping (objects intersecting each other), so we use an oriented bounding box (OBB) collision-detection algorithm based on geometric computation: each solid object in space is enclosed by an OBB, and a collision is detected by checking whether the OBBs of different solid objects overlap.

Specifically, the object is first projected onto the ground, and the OBB stands in for the three-dimensional object in the intersection check. To compensate for the information lost along the vertical dimension, the concept of layers is introduced (Figure 2): each layer holds an OBB projection of the object, and collision detection is performed on all of an object's layers simultaneously. During scene generation, static items are randomly selected and added to the environment in turn; if a new object does not collide with the existing ones, its placement is valid.
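As a rough illustration of the ground-projection idea, the following sketch checks two 2-D OBBs for overlap using the separating axis theorem; in the full system this test would be repeated for each layer of an object, and the function names here are our own, not the paper's:

```python
import math

def obb_corners(cx, cy, hw, hh, angle):
    """Corners of an oriented bounding box given its center (cx, cy),
    half-extents (hw, hh), and rotation angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy)
            for dx, dy in [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]]

def obbs_overlap(a, b):
    """Separating Axis Theorem for two 2-D OBBs given as corner lists:
    if any edge normal separates the projections, there is no collision."""
    for poly in (a, b):
        for i in range(4):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % 4]
            ax, ay = y1 - y2, x2 - x1          # edge normal = candidate axis
            pa = [px * ax + py * ay for px, py in a]
            pb = [px * ax + py * ay for px, py in b]
            if max(pa) < min(pb) or max(pb) < min(pa):
                return False                   # separating axis found
    return True
```

A newly sampled static item would be accepted only if `obbs_overlap` is false against every already-placed object on every shared layer.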


Figure 2 Multilayer projection OBB

Parameterized description


For identical solid objects, each behaves differently in matrix D and is thereby naturally distinguished; in other words, multiple duplicate objects correspond to multiple rows in our representation.

Dynamic Scene Generator

To test systems adaptively for safety, we designed a dynamic scene generator, used in two stages. 1) Scene initialization: before each round of testing, the dynamic scene generator samples once from the environment configuration and object descriptions, in the order weather, vehicles, pedestrians, static objects, composes the objects' distribution functions, and generates a scene description. 2) Usage: the dynamic scene generator dynamically produces the next scene to test based on the output of the defect analyzer. At its heart is an adaptive scene search algorithm that generates different search schemes for different systems under test, enabling the test system to find defects quickly and stably.

An evaluation method based on metamorphic testing

A test oracle determines whether the output of the system under test is correct. For autonomous driving systems, the correct output is difficult to define for a specific scenario, because outputs within a certain range do not cause driving errors. Moreover, since the vehicle controlled by the system is physically continuous, a single erroneous output may not lead to serious safety consequences. It is therefore unreasonable to use a specific value as the oracle. Following our previous work, we adopt metamorphic testing and relax it into a test oracle. Since different test scenarios require different oracles, users of our test system can design oracle rules from experience. Here we give two oracles: deep autonomous driving tests usually focus on the correctness of the output steering angle, because the steering angle often determines whether the system causes dangerous consequences, and Oracle 1 follows this design; however, when there is a vehicle in front of the test vehicle but a scene change eliminates braking, the autonomous driving system should also be judged wrong, and Oracle 2 follows this design.
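A minimal sketch of the two oracles described above, assuming a relaxed steering tolerance `eps` and the paper's encoding of brake as negative throttle; the threshold values are illustrative assumptions, not values from the paper:

```python
def oracle_steering(steer_ref, steer_new, eps=0.1):
    """Oracle 1 (sketch): under a driving-semantics-preserving scene change,
    the steering angle should stay within a relaxed tolerance eps of the
    reference output. eps is an illustrative value."""
    return abs(steer_new - steer_ref) <= eps

def oracle_braking(vehicle_ahead, throttle, t_max=0.0):
    """Oracle 2 (sketch): if a vehicle is ahead, the model must brake,
    i.e. throttle (brake encoded as negative throttle) must not exceed
    t_max. t_max is an illustrative value."""
    return (not vehicle_ahead) or throttle <= t_max
```

A scene is flagged as failure-inducing when either oracle returns false for the transformed scene.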


Adaptive scene search algorithm

This search algorithm has 3 design requirements:

1) The searched scenes should be reasonable, never exceeding the valid states defined by the environment configuration and object descriptions.

2) Any two adjacent scenes in the search should be invariant in driving semantics.

3) The search algorithm should be efficient; a single search step should not take too long.

We ensure the plausibility of searched scenes by rejection sampling: each scene transformation is checked for plausibility, and if it does not conform to the distributions defined in the original object property file, the transformation is redone. Driving-semantics invariance is guaranteed by pinning the positions of objects within a certain distance in front of the camera vehicle. For example, if a car ahead of the autonomous vehicle would trigger braking, its spatial position must not change during the search; only its orientation angle and color may change. Finally, efficiency is guaranteed by a variable-step random search design: during the search, the step-size budget is adjusted according to whether the previous search step was accepted. For a detailed description of the algorithm, see Algorithm 1.
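Algorithm 1 is not reproduced here, but its three ingredients, rejection sampling, acceptance feedback, and a variable step size, can be sketched as follows; all names and the grow/shrink factors are assumptions for illustration, not the paper's parameters:

```python
import random

def adaptive_search(scene, perturb, is_valid, causes_failure,
                    step=1.0, budget=200, grow=1.5, shrink=0.5):
    """Variable-step random search sketch. `perturb(scene, step)` proposes
    a neighbouring scene, `is_valid` is the rejection-sampling check against
    the configured distributions, and `causes_failure` queries the system
    under test. The step size shrinks on rejection and grows on acceptance."""
    for _ in range(budget):
        candidate = perturb(scene, step)
        if not is_valid(candidate):   # rejection sampling: keep scenes legal
            step *= shrink            # rejected: search more locally
            continue
        if causes_failure(candidate):
            return candidate          # failure-inducing scene found
        scene = candidate
        step *= grow                  # accepted: explore more aggressively
    return None
```

In the real system the "scene" is the full parameterized description and `perturb` respects the pinned objects described above; here a scene is reduced to a single number purely for illustration.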


Figure 3 is a schematic diagram of one scene search performed by the test system. In the figure, red boxes mark dynamically generated vehicles, blue boxes mark dynamically generated pedestrians, and green boxes mark dynamically generated static mesh objects; the weather affects the overall rendering, such as the building and tree shadows in Figure 3 and the swing angle of the leaves. The test system dynamically adjusts the spatial positions and internal properties of these objects, changes the rendered image, and looks for a scene that causes a failure.


Figure 3 Schematic diagram of a scene search

Accurate and automated defect analysis

For scenes whose indicator function value is 1, the defect analyzer analyzes and explains which objects or properties caused the system anomaly. Note that the cause of an anomalous model output is the entire path of the search process, not one specific iteration; for systems with highly nonlinear deep learning modules, it is difficult to determine the exact cause of an anomaly by analyzing the path alone.

We look for the cause of the autonomous driving system's problem by zeroing the weather parameters in turn and removing solid objects from the scene. Error causes may be coupled with one another; for example, an oncoming vehicle may stop because fog causes a misrecognition. To find the group of objects that cause the error, we adopt an iterative greedy search, carrying out the scene search with δ0 as the stopping criterion. For details of the algorithm, see Algorithm 2.
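The greedy ablation of Algorithm 2 can be sketched as follows. Representing weather as a dict and solid objects as a list is our assumption, and `fails` stands for re-running the system under test on the modified scene:

```python
def localize_causes(weather, objects, fails):
    """Greedy ablation sketch: zero each weather parameter and remove each
    solid object in turn; elements whose removal flips the verdict from
    fail to pass are reported as (possibly coupled) causes. `fails` re-runs
    the system under test on the modified scene and returns True on failure."""
    causes = []
    weather = dict(weather)
    objects = list(objects)
    for name in list(weather):
        if not fails(weather, objects):
            break                          # failure already explained: stop
        trial = dict(weather)
        trial[name] = 0.0                  # zero one weather factor
        if not fails(trial, objects):
            causes.append(('weather', name))
            weather = trial                # keep the fix, continue greedily
    for obj in list(objects):
        if not fails(weather, objects):
            break
        trial = [o for o in objects if o is not obj]   # delete one object
        if not fails(weather, trial):
            causes.append(('object', obj))
            objects = trial
    return causes
```

Each ablation step corresponds to one re-render and one re-run of the model, so the analysis is fully automatic.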


System implementation

System under test

CILRS, currently the best conditional autonomous driving system, was selected as the test object; it uses ResNet-34 as the convolutional neural network for image feature extraction, with weights from a model pre-trained on CARLA's NoCrash dataset. To show the test system's ability on different autonomous driving systems, the basic conditional autonomous driving system CIL was also selected as a comparison baseline. Finally, CIL and CILRS were wrapped and deployed into the test system.

Test platform

Considering the richness of prefabricated assets and the flexibility of its API, CARLA 0.9.11 was selected as the development platform for the test system. To conveniently import static assets and maps, we compiled Unreal Engine 4.24 and CARLA 0.9.11 from source and deployed them on the Windows platform. At runtime, CARLA is started from the Unreal Engine editor, allowing quick iteration on the built environment and verification of the correctness of the scene-generation algorithm.

Scene building

Limited by the inference efficiency of deep learning models, the input image resolution of CNNs typically deployed in deep autonomous driving is not especially high, and the data captured by the camera is preprocessed to crop out the region of interest (ROI). As a result, parts of the scene far from the road surface are not captured in the camera frame. However, the height of the buildings on both sides of the road does affect the lighting of the road, which in turn affects the predictions of the autonomous driving system. To test this effect, we designed three environments with different object heights, namely wilderness, countryside, and city, created with the Unreal Engine editor, as shown in Figure 4:


Figure 4 3 environments

The 10 weather parameters provided by CARLA 0.9.11 by default (solar azimuth, solar altitude angle, cloud cover, rainfall, accumulated rainfall, wind intensity, air humidity, fog density, fog distance, and fog falloff) were selected as the weather parameters adjustable by the test system. Of these, the solar azimuth and solar altitude angle must be present in every scenario, while the three fog parameters are correlated and must appear together.
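The sampling rules above (solar angles always present, fog parameters coupled) can be sketched as follows. The parameter names follow CARLA 0.9.11's `WeatherParameters`; the value ranges and toggle probabilities are assumptions for illustration, not the paper's configuration.

```python
import random

# Parameter names as in CARLA 0.9.11's WeatherParameters.
WEATHER_PARAMS = [
    "sun_azimuth_angle", "sun_altitude_angle",              # always sampled
    "cloudiness", "precipitation", "precipitation_deposits",
    "wind_intensity", "wetness",                            # independently optional
    "fog_density", "fog_distance", "fog_falloff",           # coupled: all or none
]

def sample_weather(rng=random):
    """Sample one weather configuration respecting the coupling constraints."""
    w = {p: 0.0 for p in WEATHER_PARAMS}
    # The two solar angles must exist in every scenario.
    w["sun_azimuth_angle"] = rng.uniform(0.0, 360.0)
    w["sun_altitude_angle"] = rng.uniform(-90.0, 90.0)
    # Each independent parameter is toggled on its own (probability assumed).
    for p in ("cloudiness", "precipitation", "precipitation_deposits",
              "wind_intensity", "wetness"):
        if rng.random() < 0.5:
            w[p] = rng.uniform(0.0, 100.0)
    # The three fog parameters must exist at the same time.
    if rng.random() < 0.5:
        w["fog_density"] = rng.uniform(0.0, 100.0)
        w["fog_distance"] = rng.uniform(0.0, 100.0)
        w["fog_falloff"] = rng.uniform(0.0, 5.0)
    return w
```

Sampling through a single seeded generator keeps every searched scene reproducible.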

We measured the attributes of the objects provided by CARLA: 89 static objects that can be effectively generated, 28 types of vehicles, and 26 types of pedestrians. The 28 vehicles differ in the size, shape, and color of their models. The 26 pedestrian types cover both genders and three age groups: juvenile, young adult, and elderly. Among the 89 static objects there is some redundancy (for example, six kinds of boxes, of which only two are visibly different), as well as objects, such as swings, that should not be dynamically generated on the roads and sidewalks of our scenes. In the end, all 28 vehicle types, all 26 pedestrian types, and 15 representative static objects were selected, and the measured data were written into the object attribute file in the required format.
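A minimal sketch of such an object attribute file follows. The field names, JSON layout, and the extent numbers are assumptions for demonstration; the paper does not publish its exact format.

```python
import json

# Illustrative catalog entries; half-extent measurements are placeholders.
catalog = [
    {"blueprint": "vehicle.tesla.model3", "category": "vehicle",
     "half_extent_m": [2.40, 1.08, 0.75]},
    {"blueprint": "walker.pedestrian.0001", "category": "pedestrian",
     "half_extent_m": [0.19, 0.19, 0.93]},
    {"blueprint": "static.prop.bench", "category": "static",
     "half_extent_m": [0.82, 0.30, 0.52]},
]

def dump_object_attributes(objects):
    """Serialize measured object attributes in the assumed file format."""
    return json.dumps({"objects": objects}, indent=2)

def write_object_attributes(path, objects):
    """Write the attribute file that the scene generator later reads."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(dump_object_attributes(objects))
```

Keeping measurements in a data file, rather than in code, lets the scene generator stay platform-agnostic.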

The real-time rendering engine requires a vehicle for the autonomous driving system to control and a camera sensor to capture an image of the current scene. Since the physics simulation of every vehicle in CARLA uses the same blueprint implementation, the choice of control vehicle does not matter on this platform. We selected the Tesla Model 3 from the 28 vehicle types as the control vehicle. The sensor is an ordinary RGB monocular camera, mounted 1.6 m ahead of the vehicle center and 1.4 m above the ground, with a FOV of 100°, an image resolution of 800×600, and a frame rate of 25 Hz. The scene captured by the camera is shown in Figure 5:
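The sensor setup above can be expressed against the CARLA 0.9.11 Python API roughly as follows. The configuration is kept separate from the spawn call so it can be inspected without a running simulator; the spawn helper itself requires a live CARLA server.

```python
# Camera configuration matching the parameters in the text.
SENSOR_CONFIG = {
    "blueprint": "sensor.camera.rgb",
    "attributes": {
        "image_size_x": "800",
        "image_size_y": "600",
        "fov": "100",
        "sensor_tick": str(1.0 / 25),   # 25 Hz capture rate
    },
    # Mounted 1.6 m ahead of and 1.4 m above the vehicle centre.
    "location": {"x": 1.6, "y": 0.0, "z": 1.4},
}

def spawn_camera(world, vehicle, config=SENSOR_CONFIG):
    """Spawn the RGB camera on a live CARLA world (requires a running server)."""
    import carla  # deferred: only needed when a simulator is available
    bp = world.get_blueprint_library().find(config["blueprint"])
    for name, value in config["attributes"].items():
        bp.set_attribute(name, value)
    loc = config["location"]
    transform = carla.Transform(carla.Location(x=loc["x"], y=loc["y"], z=loc["z"]))
    return world.spawn_actor(bp, transform, attach_to=vehicle)
```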


Figure 5 Footage captured by a vehicle camera in a rural environment

Experimental evaluation

Comparative analysis of security problem detection capabilities

We replicated the latest autonomous driving test system, Paracosm, and used the reproduced Paracosm to run safety tests on the autonomous driving systems CILRS and CIL alongside our own work. CARLA 0.9.11 was chosen as the scene generation platform for both Paracosm and our system. Paracosm does not specify concrete criteria for detecting security issues; its future-work section discusses a test oracle generation approach similar to ours, but gives no specific methods or parameter choices. Therefore, for fairness, the reproduced Paracosm system uses the same metamorphic testing mechanism as this paper for its test oracle, treating the output of the autonomous driving system before the scene transformation as the correct output; the detailed design is described in Section 3.4.

For the parameter selection: if only the weather changes in the searched scene, the system uses Oracle 1 as the basis for fault judgment and takes ε = 0.17, i.e., a fault is considered to have occurred when the vehicle's deflection angle deviates from the original output by more than 15°. For transformations of entity objects, the system uses Oracle 2 as the basis for fault judgment, since in the presence of a solid object whether the vehicle brakes properly should also be a criterion, and takes ε1 = 0.17 and ε2 = 0.2. For scenarios containing solid objects, the number of entities is sampled from a normal distribution:
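The two oracles and the entity-count sampling can be sketched as follows. The output convention (normalized steering and braking channels) and the mean/sigma of the count distribution are assumptions; the thresholds are those stated above.

```python
import random

def weather_fault(orig_steer, new_steer, eps=0.17):
    """Oracle 1: weather-only transform; steering must stay within eps."""
    return abs(new_steer - orig_steer) > eps

def object_fault(orig, new, eps1=0.17, eps2=0.2):
    """Oracle 2: entity transform; check steering and braking jointly.

    orig/new are (steer, brake) tuples; a fault is flagged when either
    channel drifts beyond its relaxation limit.
    """
    d_steer = abs(new[0] - orig[0])
    d_brake = abs(new[1] - orig[1])
    return d_steer > eps1 or d_brake > eps2

def sample_entity_count(mean=5, sigma=2, rng=random):
    """Normal-distribution sampling of the entity count (mean/sigma assumed)."""
    return max(0, round(rng.gauss(mean, sigma)))
```

Treating the pre-transformation output as correct means no hand-labeled ground truth is needed for any searched scene.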

We conducted safety tests on the CILRS and CIL systems in three types of environments: wilderness, countryside, and city. Note that the main difference between the three environments is the height of the buildings on both sides of the road, which affects the lighting of the ROI in the cameras of the autonomous vehicles. The vehicle is generated in a straight section of the scene and its driving branch is set to follow the road before testing begins. Each test system dynamically searched 300 scenarios in each type of environment. Table 2 shows the security issue detection rates of this work and the Paracosm system; the calculation of the detection rate is given in Equation (2).

Table 2 Security problem discovery rates of this work and the Paracosm system


Across the 300 scene searches in each of the three representative environments, our system's ability to discover safety problems exceeded that of the autonomous driving test system Paracosm. Overall, our detection rate of security issues on both systems under test was 1.4 times that of Paracosm. The experimental results show that an adaptive search algorithm tailored to the system under test is more efficient than a non-adaptive one.
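Equation (2) is not reproduced in this excerpt; the rates above are assumed here to follow the usual definition, the fraction of searched scenes that trigger a fault:

```python
def discovery_rate(num_faulty, num_searched):
    """Security problem discovery rate (assumed form of Equation (2))."""
    if num_searched <= 0:
        raise ValueError("no scenes searched")
    return num_faulty / num_searched
```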

Specific analysis of security problem detection capabilities

This section analyzes the system designed in this paper in detail. To exclude the influence of other factors, the wilderness environment was selected from the three designed environments as the test environment, with the same experimental parameters as in Section 4.1. Under this setting, CIL and CILRS were tested separately, with 1000 rounds of scene search for each system. The failure rate is calculated using Equation (2) and the results are listed in Table 3.

As Table 3 shows, scene transformations cause a higher failure rate when solid objects are present than in weather-only cases. With both weather and all physical objects considered, our failure discovery rate for CILRS reached 58.4%.

Table 3 Failure rates of autonomous driving systems

Comparing the experimental results of CILRS with those of CIL, the two perform similarly in the weather-only case, while in scenes containing solid objects, whose number indicates how crowded the scene is, CIL's failure rate is higher than CILRS's. CILRS was trained on the CARLA100 dataset specifically to improve the correctness of driving predictions in congested scenarios, and the experimental results confirm that CILRS does alleviate the high failure rate of autonomous driving systems in crowded scenes. In other words, CILRS is more secure than CIL.

Relaxation limits for metamorphic testing

As defined for the test oracle in Section 3.4, a relaxed metamorphic test is used to avoid fixing a single ground-truth output; without relaxation, the autonomous driving test would report a large number of false positives. In Section 5.2, weather-only scenes were tested with ε = 0.17 and scenes containing solid objects with ε1 = 0.17 and ε2 = 0.2; these two values were chosen empirically. This section analyzes the effect of the relaxation limit on the results by varying ε in the experiments.

We chose the follow-the-road mode of the CILRS system and tested it on the straight section, enabling generation of all solid objects. With ε2 fixed at 0.2 we varied ε1, and with ε1 fixed at 0.17 we varied ε2, running 100 experiments per setting; the failure discovery rates are plotted in Figure 6.


Figure 6 Relationship between failure discovery rate and relaxation limit under metamorphic testing

In Figure 6, the ε1 polyline shows the failure discovery rate as ε1 varies over [0, 0.5] with ε2 fixed at 0.2, and the ε2 polyline shows the rate as ε2 varies with ε1 fixed at 0.17. Note that at ε = 0 the failure discovery rate is not 100%, because ε1 and ε2 are never zero at the same time.

Figure 6 shows that the failure discovery rate decreases steadily as the relaxation limit increases. If the relaxation limit is too small, the system reports a large number of false positives; if it is too large, the system ignores potentially dangerous errors. As discussed earlier regarding the difficulty of designing test oracles for autonomous driving, choosing the relaxation limit is a complex trade-off: DeepTest uses the statistical standard deviation as the relaxation limit, while DeepRoad uses empirical values directly. According to the figure, as ε grows within the range 0 to 0.1 the failure discovery rate drops quickly, presumably because a large number of false positives are being excluded, while the curve is smoother between 0.1 and 0.22, which can be regarded as a more reasonable value range. Moreover, for autonomous driving tasks a slightly higher false positive rate avoids wrongly discarding fault scenarios, which is acceptable in practice given how hazardous a failure of the autonomous driving task can be.
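The sweep behind Figure 6 can be recomputed offline, without rerunning the simulator, if the per-scene output deviations have been recorded; a minimal sketch, with our own function and variable names:

```python
def rate_vs_epsilon(deviations, eps_grid):
    """Failure discovery rate at each relaxation-limit pair.

    deviations: list of (d_steer, d_brake) absolute deviations, one per
    transformed scene; eps_grid: list of (eps1, eps2) pairs to evaluate.
    """
    rates = []
    for eps1, eps2 in eps_grid:
        # A scene counts as a failure when either channel exceeds its limit.
        faults = sum(1 for ds, db in deviations if ds > eps1 or db > eps2)
        rates.append(faults / len(deviations))
    return rates
```

This makes the 0.1 to 0.22 plateau cheap to locate: one simulation pass produces the deviations, and every candidate ε is then evaluated in memory.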

Scene coverage analysis

This section tests the deep autonomous driving systems across different environments and driving modes to verify the coverage ability of the test system.

Environmental testing

The main difference between the wilderness, countryside, and city environments is the height of the buildings on both sides of the road, which changes the lighting of the ROI in the camera of the autonomous vehicle. CIL and CILRS were tested separately in the straight section of the map, and the results are listed in Table 4.

Table 4 Test results of the automatic driving system in different environments

Two conclusions can be drawn from Table 4:

1) Across the three environments (wilderness, countryside, and city), the fault detection rates are relatively close. This shows, on the one hand, that the environment has a comparatively small impact on the failure discovery rate; on the other hand, it confirms the rationale behind our environment design: the ROI cropping in autonomous driving makes the system attend more to the road surface than to either side of the road.

2) Interestingly, intuition suggests that the low-light city environment should yield a higher fault detection rate than the normally or brightly lit wilderness and countryside, since night driving is more error-prone than daytime driving. However, the data in Table 4 contradict this expectation: fault search is less efficient in low light. Observing the experimental results, we speculate that under normal lighting the steering angle of the autonomous vehicle depends partly on the physical objects and partly on the double yellow line in the center of the road, so that once the double yellow line is partially occluded, a faulty driving output is likely. In low light, the double yellow line is always obscured, so the driving output depends mostly on the other solid objects in the scene. As described earlier, our scene transformations are designed to preserve driving semantics, so the physical objects in the region that determines the driving semantics remain unchanged, which in turn weakens the fault search ability in low light.

Driving mode test

In a conditional autonomous driving system, in addition to the images captured by the on-board camera, a higher-level control instruction determines which branch the current driving action should take. The autonomous driving system has four driving modes: follow the road, turn left, turn right, and go straight. We chose a straight road in different scenarios to test the follow-the-road mode, and an intersection to test the left, right, and straight commands. Note that when testing at different locations on the map, the scene initializer must resample the object distribution according to the region configuration so that it fits the actual FOV of the on-board camera. The test results are shown in Table 5; under all branches, CILRS is more secure than CIL.

Table 5 Test results of the autonomous driving system under different branches
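The branch-switching behavior described above can be sketched as follows. The head ordering and the output format are assumptions, since conditional imitation models such as CIL/CILRS vary in implementation; the principle is that the high-level command acts as a switch over per-branch prediction heads.

```python
# One prediction head per high-level command; indices are assumed.
COMMANDS = {"follow_lane": 0, "left": 1, "right": 2, "straight": 3}

def select_branch(branch_outputs, command):
    """Pick the active head's (steer, throttle, brake) prediction.

    branch_outputs: list of 4 (steer, throttle, brake) tuples, one per head.
    """
    return branch_outputs[COMMANDS[command]]
```

Because only the selected head drives the vehicle, each driving mode must be tested separately, which is why the table reports results per branch.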

Vulnerability analysis of CILRS driving systems

Section 2.5 describes how to further explain the security issues found by the test system. In this section, the CILRS system is used as an example to demonstrate the automated testing capabilities of the test system by answering the following questions.

Question 1. Which weather is more likely to cause the CILRS system to fail?

Using the defect analyzer, the weather parameters are zeroed out in experiments with all objects controlled, in order to determine which weather is more likely to cause the autonomous driving system to fail. Since different weather conditions are obtained by sampling and thus appear with different frequencies, the ratio of the number of failures a weather condition caused to the number of times it occurred is used as the comparison value; the results are shown in Figure 7. Note that a weather condition may be only one of the causes of a failure, not necessarily the determining factor.


Figure 7 Proportion of CILRS system failures caused by weather

Because the defect analyzer may attribute a failure to multiple factors, the percentages in the figure sum to more than 100%. As Figure 7 shows, accumulated rainfall, which blurs road information (especially road markings), is the most likely to destabilize the autonomous driving system, followed by falling rain, which interferes with the camera sensor's image. Wind intensity mainly affects the tilt of falling rain and the movement of leaves on both sides of the road, which belong to the environmental content; according to the results in Table 4, its impact on the autonomous driving system is not prominent. Among these weather conditions, the CILRS system is therefore most susceptible to interference from rainfall.
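The normalization used in Figure 7 can be sketched as follows: because weather values are sampled, raw failure counts are divided by how often each condition appeared, so rarely sampled conditions are not under- or over-reported. Function and label names here are our own.

```python
from collections import Counter

def weather_failure_ratios(sampled, blamed):
    """Per-condition failure ratio.

    sampled: one weather label per searched scene;
    blamed: one label per failure the analyzer attributed to that condition
    (a failure with several causes contributes several labels, which is why
    the ratios can sum to more than 100%).
    """
    seen, bad = Counter(sampled), Counter(blamed)
    return {w: bad[w] / seen[w] for w in seen}
```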

Question 2. Which areas are critical to the CILRS driving system?

The critical region of a deep autonomous driving system is defined as the region where the appearance of a physical object most easily destabilizes the driving output. We analyze the results of the earlier straight-road experiments with solid objects. First, each fault-inducing scene found by the search, together with its original scene, is handed to the defect analyzer to identify the object that caused the failure. Next, the positions of the fault-inducing objects are plotted in Figure 8, with the abscissa the x-axis along the road direction and the ordinate the y-axis transverse to the road. In our setup, the coordinates of the autonomous vehicle are (0, -2.27). The statistics in Figure 8 show that the sensitive regions of the CILRS system are the sidewalks on both sides of the road.


Figure 8 Critical areas of the CILRS driving system
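The region statistic behind Figure 8 can be sketched by binning fault-inducing object positions by their lateral offset y from the road centerline. The lane half-width of 3.5 m used to separate carriageway from sidewalk is an assumption for illustration.

```python
def region_of(y, half_road_width=3.5):
    """Classify a lateral offset as road or sidewalk (half-width assumed)."""
    return "road" if abs(y) <= half_road_width else "sidewalk"

def region_counts(fault_positions, half_road_width=3.5):
    """Count fault-inducing objects per region; positions are (x, y) pairs."""
    counts = {"road": 0, "sidewalk": 0}
    for _, y in fault_positions:
        counts[region_of(y, half_road_width)] += 1
    return counts
```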

Question 3. Which objects are more likely to cause CILRS systems to fail?

The ratio of the number of times an object caused a system failure to the number of times it was sampled is used to compare the probability that each object causes a CILRS failure. After removing duplicate objects, the five solid objects with the highest failure rates are listed, together with those rates, in Table 6.

Table 6 Entity objects and their failure rates

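The Table 6 statistic, each object's fault-attribution count divided by its sampling count, with the top five reported, can be sketched as follows; the object names in the test are illustrative.

```python
from collections import Counter

def top_failure_objects(sampled, blamed, k=5):
    """Return the k objects with the highest failure rate.

    sampled: one object label per time the object was generated in a scene;
    blamed: one label per failure the analyzer attributed to that object.
    """
    seen, bad = Counter(sampled), Counter(blamed)
    rates = {obj: bad[obj] / seen[obj] for obj in seen}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:k]
```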

From Figure 8 and Table 6, two points can be observed:

1) Vehicles on the road may not be the main cause of instability; on the contrary, objects on the sidewalks on both sides of the road are more likely to destabilize the autonomous driving system. Examining the CARLA100 dataset, we found that the driving system was trained for complex road conditions while neglecting the complexity of objects on the sidewalk.

2) Yellow and red are sensitive colors for the CILRS system. This is natural, because traffic lights happen to be yellow and red; when a yellow or red object appears on the sidewalk, CILRS is likely to misjudge it as a signal light.

The vulnerabilities of CILRS found by the test system point directly to directions for optimizing CILRS, and after retraining the network its security can be verified again with this system. The optimization scheme may involve data augmentation, structural optimization, and so on; after optimization, the test system is used again to determine whether the deep network meets the security requirements. For the CILRS system, for example, we recommend adding training data with rainy weather and richer roadside scenes to improve stability, and designing a double-insurance mechanism to mitigate the sensitivity to yellow and red objects.

Feature analysis and heuristics

To further analyze the causes of CILRS failures, we opened up the CILRS system and examined the feature extraction layers of its ResNet. In most cases, the feature extraction results of the fault scene and the original scene differed, but not strikingly. In the course of this analysis we found an interesting example: a pedestrian dressed in red walking on the sidewalk.

As shown in Figure 9, the environment is the wilderness, the vehicle is in the straight-road area, and the test uses the follow-the-road driving mode with a normalized speed of 4. A pedestrian dressed in red walks forward along the center of the right sidewalk. The autonomous vehicle remains in its initial position, and by varying the pedestrian's position we obtain the change in the output predicted by the CILRS system, plotted in Figure 10.


Figure 9 Example of failure analysis: Pedestrians dressed in red walking by the road


Figure 10 The output of the CILRS system changes with the position of the pedestrian

As Figure 10 shows, the output of the CILRS system is relatively stable before the pedestrian appears. Once the pedestrian appears, the CILRS output fluctuates but stays within the relaxation limit. At about 7.5 m ahead of the vehicle's center, the predicted output changes to hard braking (the braking effect takes precedence over the steering effect). The output of the driving system then keeps fluctuating, and stabilizes only once the pedestrian is far away (bias < 0.1). This behavior is clearly abnormal: if CILRS judges that it should brake when the pedestrian is relatively close to the car, it should output the brake immediately, not only after the pedestrian has walked further forward. Figure 11 plots the camera input and the outputs of the first three convolutional layers of CILRS both without a pedestrian and with the pedestrian 7.5 m ahead of the vehicle center. For comparison, placing on the sidewalk the bench that causes the largest change in the driving output yields offsets, relative to the no-object case, of steer_bias = 0.030 < 0.17 and throttle_bias = 0.05 < 0.2, which is not judged a system failure. In Figure 11, the layer-3 convolutional output of the bench scene is similar to that of the object-free scene, while the convolutional output of the pedestrian scene differs significantly from it.

In Figure 11, columns 1 and 2 show obvious differences in the original picture and the first two feature layers but are relatively similar at the third layer, while columns 1 and 3 differ markedly in the original picture and across all three layers. This shows that even though objects appear at the same position in the picture, the extracted features differ. It suggests that monitors could be placed at one or more feature layers to give early warning of system failures by analyzing changes in those features.


Figure 11 CILRS input and convolutional layer output

System operational efficiency analysis

This section analyzes the operational efficiency of the test system. The hardware platform is an AMD Ryzen 5 3600X CPU with an RTX 3070 GPU; the software platform is Windows 10 + Unreal Engine 4.24.3 + CARLA 0.9.11 + Python 3.9.1.

The test time of the dynamic scene generator is defined as the time to search, starting from the initial scene, for a scene that causes the autonomous driving system to err, or until the iteration budget is exceeded. With all solid objects considered and the number of objects sampled from a normal distribution, the average time per experiment, including rendering, is 16.86 s. Excluding rendering, the efficiency of each module and its internal details was averaged, with results shown in Table 7.

Table 7 Average time consumption of each module of the test system (ms)

Compared with the more than ten seconds spent on a single experiment, the internal time consumption of each module is very low; the main performance bottleneck of the system is rendering efficiency.

To ensure the safety of autonomous driving systems in Internet of Vehicles scenarios, this paper designs and implements a scene-driven safety test system for the visual perception module of autonomous driving. The system constructs a realistic and rich scene description method that greatly expands the data distribution available to the test system; it dynamically generates safety test schemes for different autonomous driving models and detects safety defects efficiently and stably; finally, it provides a well-developed set of automated security analysis tools that help autonomous driving developers quickly locate the safety issues of their systems. We believe this work will inspire more test solutions for autonomous driving perception modules and provide an important safety foundation for autonomous driving in the Internet of Vehicles context.

Reproduced from the Intelligent Car Developer Platform. The views in the article are for sharing and exchange only and do not represent the position of this official account. For copyright or other issues, please let us know and we will handle them promptly.

-- END --
