Text | Xiao Peng's brilliant notes
Editor | Xiao Peng's brilliant notes
Agriculture is one of the basic human activities that ensures global food security, but weeds in farmland can wreak havoc on crop growth and yields: weeds compete directly with crops for sunlight, water, and nutrients, and they can also be a source of crop diseases and pests.
Weed control improves agricultural production efficiency, reduces the waste of agricultural resources, and protects the ecological environment, thereby promoting sustainable agricultural development.
Recent advances in computer vision have been greatly facilitated by convolutional neural networks (CNNs), which, in contrast to traditional machine learning algorithms, perform feature extraction automatically and with greater accuracy.
These methods have been widely used in agricultural image processing: CNNs have been used to predict and estimate rice yields at the ripening stage from remote sensing images acquired by unmanned aerial vehicles (UAVs), and to build a robust deep learning detector for real-time identification of tomato pests and diseases.
A CNN was constructed to classify carrots and weeds at the seedling stage, significantly improving the accuracy of plant classification by utilizing texture information and shape features.
Others have built a large, public, multi-class weed dataset (DeepWeeds) and used ResNet50 for weed classification.
Since the fully convolutional network (FCN) was proposed, semantic segmentation models such as UNet, DeepLabV3, and DeepLabV3Plus have emerged and are widely used for agricultural weed segmentation. These models can quickly extract the characteristics of crops and weeds, with no need for complex background segmentation or data modeling during extraction.
The soybean dataset was collected in the soybean field of the Anhui Agricultural University National High-tech Agricultural Park in Luyang District, Hefei City, Anhui Province; data were collected from soybean seedlings at 15-30 days of growth.
Images were acquired with a DJI Pocket 2 handheld gimbal positioned about 50 cm above the ground, with video resolution set to 1920×1080 at 24 frames per second (fps).
Frames were then extracted from the video to obtain 553 images for the soybean dataset. To speed up training and ease manual annotation, the images were resized to 1024×768, a resolution that balances computational efficiency with enough visual detail for accurate analysis and is commonly used in computer vision applications and datasets. The images were randomly assigned to the training, validation, and test sets in a 7:2:1 ratio.
Original image and corresponding annotations in the soybean dataset (green: crop, red: weed)
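The 7:2:1 random split described above can be sketched in a few lines of Python; the helper name and fixed seed here are illustrative choices, not details from the original work.

```python
import random

def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Randomly partition items into train/val/test sets by the given ratios."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Example with the 553 soybean images described above
train, val, test = split_dataset(range(553))
print(len(train), len(val), len(test))  # 387 110 56
```

Rounding is handled by truncating the train and validation counts and giving the remainder to the test set, so every image lands in exactly one split.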
The beet dataset is derived from BoniRob: the images were captured in 2016 at a sugar beet farm near Bonn, Germany, using a pre-existing agricultural robot, and the dataset focuses on beet plants and weeds.
The robot is equipped with a JAI AD-130GE camera with an image resolution of 1296×966 pixels, mounted under the robot chassis about 85 cm above the ground. Data collection spanned more than three months, with recordings about three times a week.
The robot captured multiple stages of the beet plants' growth, and the official dataset contains tens of thousands of images, with labels in three categories: beet crop, all weeds, and background.
For convenience, 2677 randomly selected images were used to create the sugar beet dataset, randomly divided into 70% training, 20% validation, and 10% test sets, with green annotations representing sugar beet crops, red annotations weeds, and black annotations soil.
Original image and corresponding annotations in the beet dataset (green: crop, red: weed)
The carrot dataset is derived from the CWFID dataset; its images, collected from a commercial organic carrot farm in northern Germany, were captured using a JAI AD-130GE multispectral camera during the early true leaf growth stages of carrot seedlings.
The camera captures both visible and near-infrared light at a resolution of 1296×966 pixels. During acquisition, the camera was placed vertically above the ground at a height of about 450 mm with a focal length of 15 mm. To mitigate the effects of uneven lighting, artificial lighting was used in the shadowed area below the robot to maintain consistent illumination across the image.
The dataset consists of 60 images, randomly split into 70% training, 20% validation, and 10% test samples; as before, green annotations mark carrot seedlings, red annotations weeds, and black annotations soil and background.
Original image and corresponding annotations in the carrot dataset (green: crop, red: weed)
The rice dataset is derived from the rice seedlings and weeds dataset, with an image resolution of 912×1024 pixels; images were taken with an IXUS6HS camera with an f-s1000–36mmf/360.3–4.5ISSTM lens.
During image capture, the camera was 80-120 cm above the water surface of the field. The dataset contains 224 images with corresponding annotations in 8-bit grayscale format; the original annotations were converted to 24-bit RGB format, and the dataset was randomly split into training, validation, and test sets at a ratio of 7:2:1.
Rice seedlings are indicated with green annotations, weeds in red, and water or other backgrounds in black.
Original images and corresponding annotations in rice dataset (green: crops, red: weeds)
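The conversion from 8-bit grayscale class-id annotations to 24-bit RGB color maps can be sketched as below. The specific id-to-color palette is an assumption based on the legend above (green for rice seedlings, red for weeds, black for background); the dataset's actual encoding may differ.

```python
import numpy as np

# Hypothetical class ids in the 8-bit masks (assumed, for illustration)
PALETTE = {
    0: (0, 0, 0),      # water / background -> black
    1: (0, 255, 0),    # rice seedling      -> green
    2: (255, 0, 0),    # weed               -> red
}

def gray_to_rgb(mask):
    """Convert an (H, W) 8-bit class-id mask to a 24-bit RGB annotation."""
    rgb = np.zeros(mask.shape + (3,), dtype=np.uint8)
    for class_id, color in PALETTE.items():
        rgb[mask == class_id] = color  # paint all pixels of this class
    return rgb

mask = np.array([[0, 1], [2, 0]], dtype=np.uint8)
print(gray_to_rgb(mask)[0, 1])  # [  0 255   0]
```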
Using the multiscale convolutional attention mechanism (MCA) mentioned above, an MSFCABlock is built from an MCA module and an FFN network; the MSFCABlock strengthens the association between encoder and decoder features.
In the MSFCABlock, the concatenated features are processed by a 3×3 convolution and batch normalization (BN), connected with the output of the MCA module through a residual connection, and then processed by a feed-forward network (FFN), also with a residual connection.
The FFN in the MSFCABlock maps the input feature vectors to a high-dimensional space and then applies a nonlinear activation function, generating new feature vectors that carry more information than the originals.
A multilayer perceptron (MLP) models the global context and large convolutions capture long-range global context features, so the proposed MSFCABlock can extract features efficiently.
The proposed architecture of multiscale feature convolutional attention block (MSFCABlock).
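The expand-activate-project-residual pattern of the FFN described above can be sketched in NumPy. The GELU activation, the dimensions, and the plain matrix multiplies are illustrative assumptions; the actual MSFCABlock operates on convolutional feature maps rather than flat vectors.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation (assumed nonlinearity)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, w1, w2):
    """Map input to a higher-dimensional space, apply a nonlinearity,
    project back down, and add a residual connection."""
    hidden = gelu(x @ w1)   # (n, d) -> (n, d_hidden), with d_hidden > d
    return x + hidden @ w2  # project back to (n, d), plus residual

rng = np.random.default_rng(0)
d, d_hidden = 8, 32
x = rng.standard_normal((4, d))
y = ffn(x,
        rng.standard_normal((d, d_hidden)) * 0.1,
        rng.standard_normal((d_hidden, d)) * 0.1)
print(y.shape)  # (4, 8)
```

The residual connection means the block can fall back to the identity mapping, which is what makes stacking such blocks stable to train.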
The overall architecture of MSFCANet is shown in the figure below. The network consists of an encoder and a decoder; the encoder uses the VGG16 network as its backbone, with blue blocks representing convolutional layers. Because the encoder is based on VGG16, the convolutional layers are fully convolutional.
Pink blocks represent max pooling layers and green blocks represent upsampling layers. Upsampling uses transposed convolution, which can learn different parameters for different tasks and is therefore more flexible than methods such as bilinear interpolation.
Yellow blocks indicate concatenation and brown blocks indicate the proposed MSFCABlock, which combines features from different encoder layers during decoding, providing dense contextual information integration and richer scene understanding for weed segmentation.
This enhances the role of multi-scale feature aggregation in the network design, and the large convolutions with small parameter counts also reduce the number of network parameters.
Architecture of the proposed MSFCA-Net
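The parameter savings from asymmetric large kernels mentioned above can be checked with simple arithmetic: a dense k×k convolution is decomposed into a 1×k convolution followed by a k×1 convolution. The channel width and kernel size below are illustrative, not values from the paper.

```python
def conv_params(c_in, c_out, kh, kw, bias=False):
    """Parameter count of a single 2-D convolution layer."""
    return c_out * c_in * kh * kw + (c_out if bias else 0)

c = 64   # assumed channel width, for illustration
k = 11   # an example large kernel size

full = conv_params(c, c, k, k)                            # dense k x k kernel
asym = conv_params(c, c, 1, k) + conv_params(c, c, k, 1)  # 1 x k then k x 1

print(full, asym)  # 495616 90112
```

For a k×k kernel the asymmetric pair costs 2k weights per channel pair instead of k², a roughly k/2 reduction, which is why large receptive fields become affordable.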
To verify the performance of the proposed MSFCA-Net, experiments were conducted on the soybean weed dataset and the results were compared with other state-of-the-art methods, including FCN, FastFCN, OCRNet, UNet, SegFormer, DeepLabV3, and DeepLabV3Plus.
The results report the performance metrics of the proposed MSFCA-Net and the above models on the soybean weed test set, including MIoU, crop IoU, weed IoU, background IoU (BgIoU), recall, precision, and F1 score.
Quantitative analysis shows that the proposed MSFCA-Net performs efficiently on the soybean dataset and outperforms the other models: its MIoU, crop IoU, weed IoU, BgIoU, recall, precision, and F1 score are 92.64%, 95.34%, 82.97%, 99.62%, 99.57%, 99.54%, and 99.55%, respectively.
In particular, the proposed method's MIoU and weed IoU are 2.6% and 6% higher, respectively, than those of the second-ranked OCRNet. This is because MSFCA-Net uses skip connections to carry low-resolution features learned in the encoding stage into the semantic mapping onto high-resolution pixel space, and it performs well in the presence of sample imbalance and hard-to-learn classes.
Comparison of the proposed method with other state-of-the-art methods based on soybean datasets
Therefore, compared with currently popular models, this model has strong advantages in handling sample imbalance and in learning from difficult samples.
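The IoU and MIoU metrics reported above can be computed from predicted and ground-truth class masks as follows. This is a minimal sketch of the standard definitions, not the authors' evaluation code.

```python
import numpy as np

def per_class_iou(pred, label, num_classes):
    """Intersection-over-union for each class, from flat class-id arrays."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (label == c))
        union = np.sum((pred == c) | (label == c))
        # classes absent from both pred and label are undefined (NaN)
        ious.append(inter / union if union else float('nan'))
    return ious

# Toy example: classes 0 = background, 1 = crop, 2 = weed
label = np.array([0, 0, 1, 1, 2, 2])
pred  = np.array([0, 0, 1, 2, 2, 2])
ious = per_class_iou(pred, label, 3)
print(ious)              # [1.0, 0.5, 0.666...]
print(np.nanmean(ious))  # MIoU, ignoring undefined classes
```

MIoU averages the per-class IoUs, so a weak minority class (here, weeds) drags the score down even when the dominant background class is near perfect; this is why weed IoU is the most discriminating column in the comparison tables.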
Partial segmentation results from MSFCA-Net and the other methods on the test dataset are shown, where green represents soybeans, red weeds, black background, and "label" denotes the manually annotated image.
Analysis of the networks' prediction results shows that the segmentation results of MSFCA-Net are finer and have excellent noise immunity, because MSFCA-Net integrates multi-scale features through its multi-scale convolutional attention mechanism, effectively fusing local information with global context information.
Segmentation results obtained using different methods based on soybean datasets
OCRNet, UNet, and SegFormer tend to misclassify categories in the images and cannot accurately segment soybean seedlings and weeds; FCN, FastFCN, DeepLabV3, and DeepLabV3Plus produce segmentation results that capture the basic morphology of the predicted categories, but with blurred edges and low accuracy.
The proposed method gives the best segmentation: outlines are clear, details are complete, the output is smooth, and the results are closest to the manual annotations, indicating that the MSFCA-Net model can effectively and accurately segment weeds, soybeans, and background in images.
Beet dataset test
The experiment was also performed on the beet dataset, which contains 2677 images in total with 1874 in the training set used to train the eight different models; this training set is relatively large compared with the other datasets used in this work.
The results show that all models perform significantly worse on the sugar beet dataset than on the soybean dataset: although the beet dataset has more training images, its backgrounds are more complex and its data collection quality is relatively poor.
On the other hand, the proposed MSFCA-Net still shows good performance in this challenging situation.
Compared to the other models, MSFCA-Net performed well across the metrics, with MIoU, crop IoU, and weed IoU of 89.58%, 95.62%, and 73.32%, respectively, ahead of the second-ranked OCRNet by 3.5%, 3.4%, and 6.8%.
Comparison of the proposed method with other state-of-the-art methods based on sugar beet datasets
In the partial segmentation results of the various networks on the test set, red represents weeds, green represents sugar beet plants, black represents background, "image" refers to the original beet image, and "label" refers to the original annotated image.
Although other networks are able to recognize these categories, they perform poorly in detail and edge profiles compared to the proposed MSFCA-Net.
By comparing the pink boxes in these images, it can be observed that other networks exhibit segmentation errors to varying degrees, which is attributed to their poor performance in handling complex backgrounds.
On the other hand, the proposed MSFCA-Net produces better segmentation results, classifying sugar beets, weeds, and background more accurately.
Segmentation results obtained using different methods based on sugar beet datasets
Carrot dataset test
The carrot dataset contains 60 images; with the 70% random split, only 42 images were used to train the networks, and the predictions of the eight networks on the test set were recorded in a table.
The results show that with so few samples and dense per-pixel predictions, training is prone to problems such as overfitting and poor segmentation performance. The other models indeed perform poorly, indicating that existing models are ineffective at crop and weed segmentation on small datasets with severely scarce samples.
However, compared with the existing models, the proposed model performs significantly better: its MIoU, crop IoU, and weed IoU are 79.34%, 59.84%, and 79.57%, respectively, which are 4.2%, 1.1%, and 10.4% higher than those of the second-ranked OCRNet, demonstrating strong learning ability on small-sample datasets.
Comparison of the proposed method with other state-of-the-art methods based on carrot datasets
This work proposes MSFCA-Net, a multi-scale feature convolutional attention network for crop and weed segmentation. It designs an attention mechanism that aggregates multi-scale features using asymmetric large convolution kernels and adopts skip connections to effectively integrate local and global context information.
The proposed model's segmentation accuracy is significantly improved, and its handling of details and edge segmentation is enhanced. A hybrid loss combining dice loss and focal loss is also designed, with a separate loss function for crops and weeds; this mixed loss effectively improves the model's performance under class imbalance and enhances its ability to learn from difficult samples.
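The hybrid dice-plus-focal loss described above can be sketched for a single binary class as below. The equal weighting, the focusing parameter gamma, and the per-class application are assumptions for illustration; the exact formulation in the paper may differ.

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss: p = predicted probabilities, y = binary ground truth."""
    inter = np.sum(p * y)
    return 1 - (2 * inter + eps) / (np.sum(p) + np.sum(y) + eps)

def focal_loss(p, y, gamma=2.0, eps=1e-6):
    """Binary focal loss: down-weights easy, well-classified pixels."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)          # prob assigned to the true class
    return np.mean(-((1 - pt) ** gamma) * np.log(pt))

def hybrid_loss(p, y, alpha=0.5):
    """Weighted sum of Dice and focal losses (weighting is an assumption)."""
    return alpha * dice_loss(p, y) + (1 - alpha) * focal_loss(p, y)

y = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.9, 0.8, 0.2, 0.1])
print(hybrid_loss(p, y))
```

Dice loss directly optimizes region overlap and is insensitive to how many background pixels there are, while focal loss concentrates the gradient on hard pixels, which is why the combination helps with imbalanced classes such as sparse weeds.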
Compared with other models, the proposed model showed significantly better performance on the soybean, beet, carrot, and rice datasets, with mIoU scores of 92.64%, 89.58%, 79.34%, and 78.12%, respectively, confirming its strong generalization ability and its capacity to handle crop and weed segmentation against complex backgrounds.
Ablation experiments on the network confirm the contribution of the asymmetric large convolution kernels and spatial attention to the model's feature extraction.
In addition, field datasets of soybean seedlings and weeds were collected and manually labeled, enriching available agricultural weed data and providing rich, effective data for future research, which is of great significance to the development of intelligent weeding and smart agriculture.