
An In-Depth Analysis of Automotive Millimeter-Wave Radar Point Cloud Technology


Semantic segmentation of radar point clouds is a new challenge in radar data processing. We demonstrate how this task can be performed and provide a large dataset of manually labeled radar reflections. Instead of feeding the classifier feature vectors computed from clustered reflections, the entire radar point cloud is now taken as input and a class probability is obtained for each reflection. As a result, neither a clustering algorithm nor manual feature selection is needed anymore.

Over the past few years, image analysis has moved from simply classifying the central object in an image and detecting objects or object parts to a single combined task: semantic segmentation, i.e., assigning a class label to every pixel. Semantic instance segmentation goes one step further by distinguishing pixels that carry the same class label but belong to physically different objects, so that pixels are grouped into object instances in addition to being classified.

Semantic segmentation is usually performed by deep convolutional neural networks, typically with an encoder-decoder structure. These architectures all rely on a regular image structure, i.e., a rectangular grid of equidistant pixels. If a fully convolutional network is used, the dimensions of the grid, i.e., the width and height of the image, may vary. The rectangular grid induces distances and neighborhood relations between pixels, which are exploited by convolutional kernels that extend spatially over more than one pixel. These methods therefore work well when a camera is used as the sensor. For safety-critical functions, however, radar and lidar sensors complement the camera, and these additional sensors should be not only complementary but also redundant. It is therefore desirable to obtain a highly semantic understanding of the surroundings from radar and lidar as well.

In this article, we semantically segment radar data, that is, we assign a class label to each measured reflection point. We focus on dynamic objects and consider six categories: cars, trucks, pedestrians, pedestrian groups, bicycles, and static objects. The radar detections obtained after applying a constant false alarm rate (CFAR) algorithm form a point cloud, where a point cloud P is defined as a set of N ∈ ℕ points p_i ∈ ℝ^d, i = 1, ..., N, and the order of the points is irrelevant. For each reflection, two spatial coordinates (radial distance r and azimuth φ) are measured, along with the ego-motion-compensated Doppler velocity v_r and the radar cross-section (RCS) σ. A 4D point cloud therefore has to be handled in the semantic segmentation task. The spatial density of radar reflections can vary dramatically, so large-scale grid mapping approaches are computationally impractical, and the usual network architectures for camera images cannot be applied. The need for an algorithm that does not require image-like input can be seen in Figure 1, which shows radar detections collected by four radars over a period of 200 milliseconds. The plot contains large areas without any measurements as well as areas with a high number of reflections. A grid map of the entire scene, with its roughly 2000 individual reflections, would have to cover a spatial area of at least 150 m × 200 m; even at a very coarse resolution with a cell size of 1 m × 1 m, at most about 6% of the grid cells would have non-zero values.
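To make the sparsity argument concrete, here is a small back-of-the-envelope sketch in Python using the figures quoted above; the reflection count is only an approximation.

```python
# Rough check of the grid sparsity argument (numbers taken from the text).
n_reflections = 2000                 # approximate reflections in the example scene
area_cells = 150 * 200               # 150 m x 200 m grid at 1 m x 1 m cells
occupied_upper_bound = n_reflections / area_cells
print(f"at most {occupied_upper_bound:.1%} of the cells can be non-zero")
# -> roughly 6-7% as an upper bound; in practice less, since several
#    reflections often fall into the same cell.
```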


Figure 1 Radar point cloud accumulated over 200 ms. The reflections of three different objects are highlighted. Only an excerpt of the full field of view is shown.

Therefore, we use PointNet++ as the basis of our segmentation algorithm. PointNet++ works directly on point clouds and was originally designed to process 3D spatial data from laser scanners. In this article, we modify the architecture to handle two spatial dimensions and two additional feature dimensions.

In previous work, classification was performed on feature vectors, which in turn were computed from clustered radar reflections. With our new approach, we avoid both preprocessing steps: radar targets no longer have to be grouped into clusters, and predefined feature vectors no longer have to be generated from those clusters. Our experiments show that the new approach clearly outperforms the previous one.

The rest of this article is structured as follows: in the second part, we review related work and other approaches to this topic. After that, we describe our network structure in more detail and explain our training and testing procedures. In the fourth part, we present our results and compare them with previous methods. Finally, we give an outlook on future work.

Related work

Semantic segmentation is a popular approach when a camera is used as the sensor, and most algorithms are tailored to image data. The introduction of fully convolutional networks inspired many similar and later more advanced network architectures such as SegNet, U-Net, R-CNN and its successors Fast R-CNN, Faster R-CNN, and Mask R-CNN. To apply these techniques to radar data, some preprocessing is required. Grid maps provide a way to convert spatially non-uniform radar reflections into image-like data: the measured reflections are integrated over time and inserted at the corresponding positions in the map. Different kinds of maps can be created in this way, e.g., an occupancy grid map describing the posterior probability that a cell is occupied, or an RCS map containing information about the RCS values measured in each cell. This works well for static objects, because only the ego-motion (and not additional object velocities and trajectories) has to be taken into account in order to insert radar reflections from different times at the correct positions in the map. For the dynamic objects considered in this work, either accurate extended object tracking algorithms are required, or the object dynamics have to be accepted as a characteristic of the map, so that dynamic objects create elongated reflection tails. A further difficulty is that grid mapping is inefficient for sparse data, because a potentially large grid is needed to represent relatively few measurements.
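As a rough illustration of the grid-map idea (not the preprocessing used in this work), radar reflections can be binned into a coarse grid, for example to build a simple RCS map. The function name, grid extent, and cell size below are illustrative assumptions.

```python
import numpy as np

def rcs_grid_map(xy, rcs, x_range=(-75.0, 75.0), y_range=(0.0, 200.0), cell=1.0):
    """Sketch: insert reflections into a coarse grid, keeping the max RCS per cell.

    xy  : (N, 2) reflection positions in a common (ego-motion-compensated) frame
    rcs : (N,)   measured RCS values
    An occupancy grid map would instead update a posterior occupancy
    probability per cell; unvisited cells stay at -inf here.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    grid = np.full((nx, ny), -np.inf)
    ix = ((xy[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((xy[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for i, j, s in zip(ix[valid], iy[valid], rcs[valid]):
        grid[i, j] = max(grid[i, j], s)
    return grid
```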

As far as we know, no semantic segmentation of moving objects in automotive radar data has been published before; classification has so far only been performed on small datasets or on large amounts of simulated data.

Method

Network structure

Qi et al. introduced PointNet and PointNet++, which work directly on point clouds, so no prior mapping step is required. They performed semantic segmentation of 3D point clouds obtained by sampling points from 3D scan meshes of indoor scenes. We use their architecture as the basis of our approach. However, the radar data used in our experiments differ from the 3D indoor data in the following ways. First, each radar reflection has only two spatial coordinates instead of three, but with the ego-motion-compensated Doppler velocity and the RCS value as two additional features, each point p_i of the point cloud is again four-dimensional. Second, our data show much larger variations in density and sampling rate. The 3D scans of the Stanford 3D semantic parsing dataset provide high-density point clouds in which details inside an office are visible, whereas our radar data yield only a handful of reflections per object, so that for smaller or more distant objects not even the outline of the object is captured correctly, see Figure 1.

PointNet++ defines a multi-scale grouping (MSG) module and a feature propagation (FP) module. The MSG module considers neighborhoods at multiple scales around center points and creates a combined feature vector describing each of those center points. The module consists of three steps: selection, grouping, and feature generation. First, N_sample points are selected from the input point cloud by farthest point sampling, so that the input point cloud is sampled evenly. In the grouping step, a neighborhood is created for each of the N_sample selected points. In our network, a neighborhood consists of N_neigh points located within a radius r around the center point; only the two spatial components of the radar reflections are used for the neighborhood search. If a reflection has more than N_neigh neighbors within the given search radius, only the first N_neigh points found are used for further computation; if fewer are found, the first neighbor is repeated to guarantee a fixed-size data structure. In each MSG module, several neighborhoods with different values of r and N_neigh are created. In the final step, a feature is generated for each of the N_sample points by applying convolutional layers with filter size 1×1 to the neighborhood tensor of shape (N_sample, N_neigh, c_in), where c_in is the number of input channels. This produces a tensor of shape (N_sample, N_neigh, c_out), to which a final max pooling layer is applied, so that for each filter only the contribution of the neighbor with the highest activation is kept.
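The following NumPy sketch illustrates a single scale of a simplified MSG step under the assumptions stated in the comments; it is not the authors' TensorFlow implementation, and the shared 1×1 convolutions are reduced to one linear layer for brevity.

```python
import numpy as np

def farthest_point_sampling(xy, n_sample):
    """Select n_sample indices that cover the 2-D positions xy evenly."""
    n = xy.shape[0]
    chosen = [np.random.randint(n)]
    dist = np.full(n, np.inf)
    for _ in range(n_sample - 1):
        dist = np.minimum(dist, np.linalg.norm(xy - xy[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def ball_group(xy, features, centers, radius, n_neigh):
    """For each center, gather n_neigh neighbors within `radius` (2-D search only).
    If more neighbors exist, only the first n_neigh are kept; if fewer,
    the found neighbors are repeated so the output has a fixed shape."""
    groups = []
    for c in centers:
        idx = np.where(np.linalg.norm(xy - xy[c], axis=1) < radius)[0][:n_neigh]
        idx = np.resize(idx, n_neigh)            # repeat to fixed size
        groups.append(features[idx])
    return np.stack(groups)                      # (n_sample, n_neigh, c_in)

def msg_scale(xy, features, n_sample, radius, n_neigh, weights):
    """One scale of a simplified MSG step: group, apply a shared 1x1
    'convolution' (here a single linear layer + ReLU), then max-pool over
    the neighborhood dimension."""
    centers = farthest_point_sampling(xy, n_sample)
    grouped = ball_group(xy, features, centers, radius, n_neigh)
    hidden = np.maximum(grouped @ weights, 0.0)  # (n_sample, n_neigh, c_out)
    return xy[centers], hidden.max(axis=1)       # one pooled feature per center
```

In the actual module, several (radius, N_neigh) pairs are processed in parallel and their pooled features are concatenated per center point.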

After an MSG module, the output point cloud contains fewer points than the input point cloud, so the points in deeper layers carry increasingly abstract features that summarize information about the neighborhood points of the earlier layers. This is similar to convolutional networks for image processing, where the image size is reduced from layer to layer. Figure 2 shows the spatial positions and ego-motion-compensated Doppler velocities of the radar reflections, illustrating the subsampling of the input point cloud after each MSG module. The high-dimensional feature vector generated for each point in the MSG modules is not shown in the figure. A camera image of the scene is shown in Figure 3.

For semantic segmentation, the information from the subsampled point clouds has to be propagated back to the full input point cloud.


Figure 2 Excerpt of an example radar point cloud. The spatial coordinates and the ego-motion-compensated Doppler velocity are plotted. From left to right: the point cloud at the input layer and the subsampled point clouds after the first, second, and third MSG modules. The data are accumulated over 500 ms. A camera image of the scene is shown in Figure 3.


Figure 3 Camera image of the same scene as Figure 2

This task is performed by the feature propagation (FP) modules: the k MSG modules are followed by k FP modules, which repeatedly propagate the features of the sparser point cloud to the next denser one. For each point p_i of the denser point cloud, a weighted average of the feature vectors of its three closest neighbors in the sparser point cloud is computed, passed through a set of convolutional layers, and assigned to p_i. Skip connections from the corresponding level of the MSG modules improve the propagation of features.
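A minimal sketch of the interpolation step of an FP module, assuming 2-D positions and ignoring the subsequent convolutions and skip-connection concatenation:

```python
import numpy as np

def propagate_features(dense_xy, sparse_xy, sparse_feat, eps=1e-8):
    """For every point of the denser point cloud, interpolate the features of
    its three nearest neighbors in the sparser point cloud, weighted by
    inverse distance (PointNet++-style). A full FP module would afterwards
    concatenate skip-connection features and apply further 1x1 convolutions."""
    out = np.empty((dense_xy.shape[0], sparse_feat.shape[1]))
    for i, p in enumerate(dense_xy):
        d = np.linalg.norm(sparse_xy - p, axis=1)
        nn = np.argsort(d)[:3]                 # three closest sparse points
        w = 1.0 / (d[nn] + eps)
        out[i] = (w[:, None] * sparse_feat[nn]).sum(0) / w.sum()
    return out
```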

Our network structure is shown in Figure 4, where the parameter values of the MSG module are also defined.

Dataset

In this article, we use only real-world data collected by two different experimental vehicles, Vehicle A and Vehicle B. Vehicle A is equipped with four 77 GHz sensors mounted at the two front corners and on the sides of the vehicle. Only the sensors' short-range mode is used, which detects targets within a range of 100 m. The field of view of each sensor is ±45°.

Vehicle B is equipped with eight radar sensors with the same specifications as those of Vehicle A. The eight sensors are mounted at the four corners of the car and on the front left, front right, rear left, and rear right sides.

The dataset of Vehicle A (B) contains more than 4.5 hours (6.5 minutes) of driving, i.e., more than 100 million (5 million) radar reflections were collected, of which 3 million (100,000) belong to 6200 (191) different moving objects. All reflections belonging to the same object were manually grouped and annotated with a label from the following categories: car, truck, pedestrian, pedestrian group, bicycle, and static. Table I shows the distribution of reflections over the six categories. Unlike in our previous work, clutter points were not treated as an additional category but as static, because the goal in this work is to detect and classify real dynamic objects directly from the raw point cloud. Our previous classifier had to deal with clusters and feature vectors that did not originate from real objects, so a garbage class had to be distinguished from real objects. Those erroneous clusters and feature vectors were caused by the imperfect preprocessing steps which we avoid here.

Table I Radar reflection distribution for the six categories

Training and testing

Before the actual training, the hyperparameters must be fixed: the number of MSG modules, the number of sampled points N_sample, the number of neighborhoods in each MSG module and their respective radii r, the number of neighborhood points N_neigh per sampled point, and the number and size of the convolutional layers in each module. This is done by evaluating reasonable configurations on a randomly selected validation set and varying these configurations to further optimize network performance. Due to the sheer size of the parameter space and the corresponding computational cost, an exhaustive search of the parameter space is not feasible.

Figure 4 depicts the final choice of the best performing architecture.

Figure 4 Structure of our network. Red arrows indicate skip connections through which features extracted in the MSG modules are passed to the FP module of the corresponding layer. The convolution layer sizes of the three MSG modules are [[32, 32, 64], [64, 64, 128]], [[32, 32, 64], [64, 64, 128]] and [[64, 64, 128], [64, 64, 128]].

For evaluation, we performed five-fold cross-validation: the dataset is split into five folds of 20% of the data each, and each fold is used once for testing while the remaining four folds serve as training data.

Only data from Vehicle A is used for training; the measurements from Vehicle B are used solely to check the generalization capability of our classifier. The network is trained with stochastic gradient descent using a cross-entropy loss and the Adam optimization scheme. We used parts of the publicly available TensorFlow source code.

Due to the huge imbalance between static and dynamic data (roughly 97 million to 3 million reflections), the weight of the static class in the loss function is reduced so that the optimization does not simply assign almost all points to the static class.
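A minimal sketch of such a class-weighted cross-entropy; the weight vector and the class ordering (static as index 0) are illustrative assumptions, not the values used by the authors.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Per-point cross-entropy with a per-class weight.
    probs         : (N, 6) predicted class probabilities per reflection
    labels        : (N,)   ground-truth class indices
    class_weights : (6,)   e.g. a reduced weight for the static class so the
                           optimizer does not collapse to 'everything static'.
    """
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-class_weights[labels] * np.log(picked + 1e-12)))

# Hypothetical example: down-weight the dominant static class (assumed index 0).
weights = np.array([0.1, 1.0, 1.0, 1.0, 1.0, 1.0])
```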

Training lasts 30 epochs, during which data augmentation is applied: random noise is added to each feature dimension, changing the spatial positions of the reflections as well as the measured RCS values and the ego-motion-compensated Doppler velocities. The velocity feature is only modified for reflections of dynamic objects. In addition, a random number q ∈ [0, 0.3] is drawn for each dynamic object, and during that epoch each reflection of the object is dropped with probability q, which changes the shape and density of the dynamic object.
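A hedged sketch of this augmentation; the noise scales, the static label index, and the per-window (rather than per-object) dropout probability are simplifying assumptions.

```python
import numpy as np

def augment(points, labels, rng, sigma=(0.1, 0.1, 0.1, 0.5), static_label=0):
    """Training-time augmentation sketch.
    points : (N, 4) array with columns x, y, v_r, rcs
    labels : (N,)   per-reflection class labels
    """
    pts = points + rng.normal(0.0, np.asarray(sigma), size=points.shape)
    # The Doppler velocity is only perturbed for reflections of dynamic objects.
    static = labels == static_label
    pts[static, 2] = points[static, 2]
    # Drop dynamic reflections with probability q drawn from [0, 0.3]; the
    # paper draws one q per object, here a single q is used for brevity.
    q = rng.uniform(0.0, 0.3)
    keep = static | (rng.uniform(size=len(pts)) > q)
    return pts[keep], labels[keep]
```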

The network itself has no notion of the recording time of an individual reflection, but during training we present the network with a time window of length T = 500 ms, which makes the point cloud denser and allows more reflections per object to be taken into account. Reflections from different times are transformed into the vehicle coordinate system at the time of the earliest measurement.

The input size of the point cloud is fixed at 3072 reflections. If more than 3072 reflections are measured in a 500 ms window, reflections of the static category are removed; if fewer than 3072 are measured, reflections are resampled until the required number is reached. Because of the max pooling layers in the network structure, this oversampling does not change the result of the semantic segmentation.
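A sketch of how such a fixed-size training input could be assembled; the static label index and the assumption that enough static reflections exist to prune are illustrative.

```python
import numpy as np

def fix_input_size(points, labels, target=3072, static_label=0, rng=None):
    """Pad or prune a 500 ms window to exactly `target` reflections.
    If too many reflections were measured, static ones are removed first
    (assuming enough of them exist); if too few, reflections are duplicated.
    Because of the max pooling in the network, duplication does not change
    the segmentation result."""
    rng = rng or np.random.default_rng()
    n = len(points)
    if n > target:
        static_idx = np.where(labels == static_label)[0]
        drop = rng.choice(static_idx, size=n - target, replace=False)
        keep = np.setdiff1d(np.arange(n), drop)
        return points[keep], labels[keep]
    if n < target:
        extra = rng.choice(n, size=target - n, replace=True)
        return (np.concatenate([points, points[extra]]),
                np.concatenate([labels, labels[extra]]))
    return points, labels
```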

During testing, the next 3072 reflections, ordered by measurement time, are passed through the network, so no oversampling or undersampling is required.

The training was done on a Linux workstation equipped with an Nvidia GeForce GTX 1070 GPU.

Results

Our system is evaluated using a 6×6 confusion matrix and the macro-averaged F1 score (hereinafter simply the F1 score). The F1 score is the harmonic mean of precision and recall.[24] In the macro average, each category contributes equally to the total score regardless of its number of samples, because a separate F1 score is computed per category and the six values are then averaged.
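For reference, a small sketch of how the macro-averaged F1 score is computed from a confusion matrix (assuming rows are ground truth and columns are predictions):

```python
import numpy as np

def macro_f1(conf):
    """Macro-averaged F1 from a (C, C) confusion matrix: compute precision,
    recall, and F1 per class, then average the per-class F1 values with
    equal weight."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)   # column sums = predictions
    recall = tp / np.maximum(conf.sum(axis=1), 1)      # row sums = ground truth
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()
```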

Best-performing architecture

We start with the results obtained using our best-performing architecture, using only data from Vehicle A in a five-fold cross-validation. In addition to the two spatial coordinates x and y (in the vehicle coordinate system), the input point cloud is enriched with the ego-motion-compensated Doppler velocity and the RCS value, so a four-dimensional point cloud is provided as input.

The resulting confusion matrix is shown in Figure 5.


Figure 5 Relative confusion matrix after five-fold cross-validation using the network structure described in Figure 4. Input features of the point cloud: x, y, v_r, σ.

Unsurprisingly, the static class shows the highest true positive rate. However, one should keep in mind that distinguishing reflections of moving objects from those of non-moving objects is far harder than setting a threshold on the Doppler velocity and classifying every reflection below that threshold as static. In real-world scenarios, reflections from many non-moving objects show non-zero ego-motion-compensated Doppler velocities, caused by odometry errors, sensor bias, time synchronization errors, mirror effects, or other sensor artifacts. Conversely, a reflection with zero Doppler velocity does not necessarily belong to a static object, since the bottom of a rotating car wheel or body parts of a pedestrian moving perpendicular to the walking direction may also show no radial velocity.

Objects of the car class are classified second best, followed by the pedestrian group class. Truck objects are often confused with cars. There are two reasons for this confusion: first, at large distances only very few reflections are measured per object, making it difficult to infer the spatial extent of the object; second, the transition between car and truck instances is rather smooth, since, for example, large SUVs are hard to distinguish from small trucks.

Another notable behavior visible in the plot is the high degree of confusion between pedestrians and pedestrian groups. This can be caused by our training data: for a human annotator, it is sometimes possible to assign the reflections of two nearby pedestrians to the individual persons, creating two pedestrian instances, but sometimes this is not feasible or too time-consuming, so that all reflections are labeled as a single pedestrian group instance. Hence, in addition to the difficulty of the task itself, the network has to cope with inconsistencies in the ground truth data. For many driving tasks it is not important whether one or two pedestrians are in an area, so the two classes can be merged, which yields a true positive rate of more than 91 percent.

Because the dataset is highly imbalanced, looking only at the relative confusion matrix normalized by class counts can be misleading. We therefore also show the confusion matrix with absolute values in Figure 6. This visualization emphasizes that the network produces many false-positive dynamic reflections (last row of the matrix). The effect is most pronounced for the car class: only 68% of the predicted car reflections belong to dynamic objects (see the first column of Figure 6). For automotive applications, however, a high false-positive rate for dynamic objects may be preferable to a high false-negative rate. Reducing the weight of the static class in the loss function leads to more false positives, so this parameter allows a trade-off between false positives and false negatives.


Figure 6 Absolute confusion matrix after five-fold cross-validation of the network structure described in Figure 4. Input features of the point cloud: x, y, v_r, σ.

Note that the percentage of confusion between dynamic and static reflections (last column of the confusion matrix in Figure 5) does not correspond to the percentage of missed objects. If only one reflection of a dynamic object is classified correctly while the other reflections of the same object are classified as static, the object is still detected even though the false-negative count increases.

Changes in the input features

To gain a deeper understanding of which information is useful to the network, we repeat the five-fold cross-validation with three different sets of input features, f1 = (x, y, v_r), f2 = (x, y, σ), and f3 = (x, y), and compare the results with the original feature set f0 = (x, y, v_r, σ). Table II shows the F1 score for each input configuration. The more input features are presented to the network, the higher the performance: adding the RCS value of each reflection to the input features increases the F1 score only slightly (from 0.7303 to 0.7425), whereas adding the ego-motion-compensated Doppler velocity has a much larger effect and raises the score by almost 0.1. Although the Doppler velocity is clearly an important feature, it is interesting that even with the input features f2 and f3 the network performs far better than random guessing. This means that the spatial context of the reflections alone is already a very expressive feature that lays the foundation for the classification, which is then refined by the additional velocity and RCS features.

Table II Classification scores for different input features


Data from test vehicle B

So far, only data from Vehicle A has been used for training and testing. We now use the network trained solely on Vehicle A data to predict the classes of the reflections measured by Vehicle B. This setup differs in two ways. First, Vehicle B is equipped with eight instead of four radar sensors, providing a 360° view of the vehicle's surroundings, unlike Vehicle A's mainly front- and side-facing setup. Second, the data of Vehicle A was collected on urban and rural roads in Germany, while Vehicle B collected data only in the United States. The different road and street layouts and the on average larger vehicles pose a challenge to the algorithm.

Applying our best-performing network to this new data yields an F1 score of 0.46, significantly lower than the value obtained in the five-fold cross-validation. If the four sensors at the front of the test vehicle are evaluated independently from the four sensors at the rear, the F1 score increases to 0.48.

Since the dataset of Vehicle B is very small compared to that of Vehicle A, the results must be interpreted carefully. Nevertheless, it is clear that changing the sensor setup has a noticeable impact on classifier performance.

Comparison with previous methods

In previous work, we used DBSCAN for clustering and an LSTM network for classifying sequences of feature vectors to generate class labels. There, performance was measured on feature vectors generated from ground truth clusters. In this paper, that approach is evaluated by projecting the class label of each feature vector back onto the original reflections of the cluster.

We train the LSTM network and our new method on the same dataset and evaluate both on the same test set. For a fair comparison, the LSTM is not tested on feature vectors from ground truth clusters but on feature vectors generated from clusters obtained by applying DBSCAN to the point cloud. Unlike our current approach, the LSTM also learns to classify feature vectors as garbage if they originate from clusters that do not correspond to real objects. If the LSTM rejects such a feature vector, we treat the corresponding points as static in the comparison.

Our new method achieves an F1 score of 0.734 on this test set, while the DBSCAN+LSTM approach only reaches 0.597. The new method creates fewer false-positive dynamic objects and has a higher true-positive count in all classes. Most notably, three times fewer reflections are wrongly classified as static, so fewer objects are likely to be missed. The confusion between reflections of dynamic objects and the static class stems not only from poor classification by the LSTM, but mainly from insufficient clustering, which prevents the LSTM from classifying certain reflections at all.

Visualization

It is instructive to visualize the output of different network layers during a forward pass of a scene. Figure 2 shows the spatial positions and Doppler velocities of an example scene after the input stage and after the three MSG modules.

The convolutional kernels of the different layers are difficult to visualize because only 1×1 convolutions are performed, so there is no meaningful image of the filters themselves. However, we can pass different scenes through the network and collect the network output before the last convolutional layer. From this output, we randomly select 1000 points per class together with their 128-dimensional feature vectors and pass this high-dimensional point cloud through the t-SNE dimensionality reduction algorithm to obtain a two-dimensional point cloud. The result is shown in Figure 7, where four distinct clusters for the car, truck, bicycle, and static classes can be observed. In agreement with the confusion matrix in Figure 5, the reflections of pedestrians and pedestrian groups are not well separated. Reflections of cars and bicycles also populate the center of the point cloud, representing points that are difficult to classify. Finally, Figure 8 shows the same scene as Figure 2, but now with the predicted class labels instead of the Doppler velocity. All three categories present in the scene, pedestrians, trucks, and cars, are correctly identified. However, some clutter behind the rightmost pedestrian is incorrectly classified as a pedestrian group, and some reflections behind the car are incorrectly labeled as car as well. Nevertheless, the semantic content of the scene is captured well.


Figure 7 Two-dimensional embedding of the 128-dimensional feature vectors of the penultimate convolutional layer in our network. The embedding is performed using the nonlinear t-SNE method.


Figure 8 Predicted class label for each reflection of the example scene. Bounding boxes were added manually to aid association between point cloud and camera image.
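A sketch of the visualization procedure behind Figure 7, assuming scikit-learn's t-SNE implementation; the function name and sampling details are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE  # assuming scikit-learn is available

def embed_penultimate_features(features, labels, per_class=1000, seed=0):
    """Randomly pick up to `per_class` points per class together with their
    128-dimensional feature vectors (collected before the last convolution
    layer) and embed them in 2-D with t-SNE for plotting."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.where(labels == c)[0],
                   size=min(per_class, int((labels == c).sum())), replace=False)
        for c in np.unique(labels)
    ])
    emb = TSNE(n_components=2, random_state=seed).fit_transform(features[idx])
    return emb, labels[idx]
```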

Conclusions and outlook

In this article, we presented semantic segmentation results on radar data using PointNet++ as the classification algorithm. We showed that the new method outperforms our previous approach, which relied on two now-obsolete preprocessing steps, clustering and feature generation. In addition, we demonstrated that using the RCS value and the ego-motion-compensated Doppler velocity improves the classification results, with the Doppler velocity having the larger impact.

In future work, we will focus on two aspects. On the one hand, integrating temporal information into the network seems beneficial: the temporal evolution of an object is a descriptive feature that should at least improve the distinction between static and dynamic class instances. One possibility is to integrate a recurrent neural network structure into PointNet++; a simpler approach is to provide the measurement timestamp as an additional feature. On the other hand, we want to extend the approach to semantic instance segmentation. Currently, we only provide a class label for each reflection without any notion of the object instance the reflection belongs to; hence we do not know how many distinct objects exist in a scene, only how many reflections belong to each object category. Class-aware clustering algorithms are one way to generate instances from the reflections, but jointly learning instance and class associations may lead to higher overall performance.

Reproduced from the Intelligent Car Developer Platform.
