
CS-UNet: Perfectly fusing Transformer and CNN to reach the pinnacle of the UNet family!

Author | AI Vision Engine

Source | AI Vision Engine

Editor | Jishi Platform


Link to the paper: https://arxiv.org/pdf/2308.13917v1.pdf

Transfer learning improves the performance of deep learning models by initializing them with parameters pre-trained on larger datasets. Intuitively, transfer learning is more effective when the pre-training data comes from the same domain. A recent NASA study showed that microstructure segmentation of microscopy images works better with a CNN encoder pre-trained on microscopy images than with one pre-trained on natural images. However, CNN models can only capture local spatial relationships in images.

In recent years, attention networks such as Transformers have increasingly been used in image analysis to capture long-range relationships between pixels. In this study, the segmentation performance of Transformer and CNN models pre-trained on microscopy images was compared with models pre-trained on natural images. The results partially validate NASA's conclusion: models pre-trained on microscopy images significantly improve the segmentation of out-of-distribution images, i.e., images acquired under different imaging and sample conditions.

However, for one-shot and few-shot learning, the performance improvement from the Transformer is limited. The authors also found that combining pre-trained Transformer and CNN encoders consistently outperformed pre-trained CNN encoders alone on image segmentation tasks. The authors' dataset (about 50,000 images) combines the publicly available portion of the NASA dataset with additional images collected by the authors. Even with less pre-training data, the authors' pre-trained models performed significantly better on image segmentation.

The results show that Transformer and CNN complement each other and are more beneficial for downstream tasks when they are pre-trained on microscope images.

1. Introduction

Microscopy provides direct information about materials, but obtaining quantitative information about morphology, size, and distribution requires manual measurement of micrographs, which is time-consuming, labor-intensive, and prone to bias. The length and time scales of material structures and phenomena vary significantly between components, adding to the complexity. Establishing the link between process, structure, and properties is therefore a challenging problem.

Deep learning (DL) is widely used for complex systems because of its ability to automatically extract important information. Researchers have applied DL algorithms to image analysis to identify structures and determine the relationship between microstructure and properties. DL has proven useful in materials design, complementing physics-based approaches. However, DL requires a large amount of training data, and the limited number of available microscopy images tends to reduce its effectiveness.

To make DL suitable for smaller datasets, techniques such as transfer learning, multi-fidelity modeling, and active learning have been developed. Transfer learning uses the parameters of a model pre-trained on a larger dataset to initialize a model trained on a smaller downstream dataset. For example, a neural network for image segmentation can be initialized with a convolutional neural network (CNN) pre-trained on natural images to improve its accuracy and reduce training time.
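As a concrete illustration of this kind of initialization (not the authors' code), the segmentation_models_pytorch library can build a UNet whose encoder starts from ImageNet-pre-trained weights; the encoder name, channel count, and class count below are placeholder choices.

```python
# Minimal transfer-learning sketch: a UNet whose CNN encoder is initialized from
# ImageNet-pre-trained weights and then fine-tuned on a small downstream dataset.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",      # placeholder encoder choice
    encoder_weights="imagenet",   # encoder starts from ImageNet pre-training
    in_channels=1,                # e.g. grayscale micrographs
    classes=3,                    # e.g. matrix / secondary / tertiary precipitates
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ...fine-tune on the small downstream segmentation dataset as usual...
```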

However, pre-training on natural images such as ImageNet is not ideal, because models pre-trained on natural images learn high-level features that are not present in microscopy images. A recent study by Stuckner et al. demonstrated the advantage of pre-training CNNs on a microscopy image dataset called MicroNet, which contains more than 110,000 images. They evaluated the accuracy of segmenting micrographs of nickel-based superalloys (Super) and environmental barrier coatings (EBCs) using CNN encoders pre-trained on MicroNet. Pre-training on MicroNet significantly improved the accuracy of one-shot and few-shot learning, as well as of out-of-distribution images with different compositions, etchings, and imaging conditions, as measured by IoU (intersection over union).
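For reference, the IoU metric used throughout the paper can be computed per class from predicted and ground-truth label maps as follows (a standard definition, not code from the paper):

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray, cls: int) -> float:
    """Intersection over union for one class, given integer label maps."""
    p, t = pred == cls, target == cls
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union > 0 else float("nan")
```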

In recent years, attention-based neural networks called Transformers have been widely adopted in computer vision. CNNs extract features from local regions of the image, using convolution filters to capture spatial relationships between nearby pixels, whereas a Transformer divides the image into patches and feeds them into Transformer-based encoders to capture long-range relationships between pixels across the image. The combination of CNN and Transformer may therefore be more effective for transfer learning than either model alone.

In this paper, the authors evaluate transfer learning for microscopic image segmentation using a combination of CNN and Transformer encoders. The transfer learning method is shown in Figure 1 and includes an encoder-decoder architecture for image segmentation. Each encoder converts the input image into a latent representation vector to extract semantic information. The decoder maps the extracted information back to each pixel in the input image, generating a pixel-level classification of the image.

The authors used a common Transformer variant, the Swin Transformer, specifically its smallest version, Swin-T, for efficiency. The authors' pre-training dataset contains about 50,000 microscopy images grouped into 74 classes, which the authors refer to as the MicroLite dataset. The Swin-T model can be initialized with the weights of a model pre-trained on ImageNet before being fine-tuned on MicroLite.

[Figure 1]

The authors used a CNN model pre-trained on MicroNet to initialize the weights of the blue encoder in Figure 1, while a Swin-T model was used to initialize the weights of the orange encoder and decoder. The outputs of the CNN and Swin-T encoders are fused before being passed to the decoder.

To evaluate the segmentation performance of transfer learning, the authors compare IoU scores for image segmentation on 7 datasets (subsets of Super and EBC) against models pre-trained only on ImageNet. The results show that although segmentation accuracy for one-shot and few-shot learning improved, the improvement was not as large as that reported in the NASA paper. For out-of-distribution images, models pre-trained on microscopy images remain significantly better than models pre-trained only on ImageNet. The authors also compare segmentation using CNNs, Swin-T, and their combination. The results show that in most cases the combination is superior to the CNN alone, and in some cases superior to Swin-T alone.

2. Method

The authors' goal is to demonstrate that Transformer-based models pre-trained on microscopy images are beneficial for downstream tasks such as image segmentation, and that they are more robust than CNN-based pre-trained models. To this end, the authors completed the following tasks:

  1. A microscopy dataset (MicroLite) containing about 50,000 images was collected and preprocessed.
  2. Transformer encoders were pre-trained on MicroLite and used to initialize several Transformer-based segmentation algorithms (Swin-Unet, TransDeepLabV3+, and HiFormer) as well as a hybrid segmentation network (CS-UNet) with both CNN and Transformer encoders.
  3. To demonstrate the advantages of CS-UNet, the authors compare the best performance of the CNN-based segmentation algorithm with the Transformer-based segmentation algorithm and CS-UNet. The algorithms were compared using 7 test sets from the NASA team, where the CNN encoder was pre-trained on MicroNet and the Transformer encoder was pre-trained on MicroLite.
  4. To evaluate the advantages of pre-training on domain data, the authors compared the best performance of CS-UNet when pre-trained on ImageNet and MicroLite.
  5. To examine the detailed effect of pretraining on domain data, the authors compare the average performance of CNN-based segmentation algorithms with different pre-training settings. Similarly, the authors compare the average performance of Transformer-based and hybrid segmentation algorithms with different pre-training settings and Transformer architectures.
  6. Finally, to illustrate the robustness of the author's hybrid strategy, the authors compare the performance of three types of segmentation algorithms averaged across all configurations.

2.1 Dataset preprocessing

The images in the authors' MicroLite dataset come from multiple sources and cover different materials and compounds imaged with different techniques, including light microscopy, scanning electron microscopy (SEM), transmission electron microscopy (TEM), and X-ray imaging. MicroLite brings together images from the Aversa dataset, ultra-high carbon steel micrographs, SEM images from the Materials Data Repository, and images from the authors' recent publications.

The Aversa dataset includes over 25,000 SEM images in 10 categories, each containing images at different scales (1, 2, 10, and 20 µm, and 100 and 200 nm) and contrasts. To classify these images more finely, the authors used a pre-trained VGG-16 model to extract feature maps and clustered them with the K-means algorithm, grouping images with similar feature maps into the same category. After this preprocessing step, the authors obtained 53 categories. The authors of the Aversa dataset had manually classified a small subset of images (1,038 images) into a hierarchical dataset in which the 10 categories are further divided into 27 subcategories. The authors' clustering of these 1,038 images is largely consistent with the manually assigned subcategories. Note that because the authors work with the entire Aversa dataset, they obtain more categories.
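A sketch of this clustering step is shown below; the pooling of the VGG-16 feature maps, the preprocessing, and the file list are assumptions, while the 53-cluster count follows the text.

```python
# Cluster SEM images by VGG-16 features with K-means (illustrative sketch).
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import models, transforms

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
prep = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),    # SEM images are single-channel
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def embed(path: str) -> np.ndarray:
    x = prep(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        fmap = vgg(x)                                # (1, 512, 7, 7) feature map
    return fmap.mean(dim=(2, 3)).squeeze(0).numpy()  # pooled 512-d descriptor

image_paths = ["img_0001.png", "img_0002.png"]       # placeholder file list
feats = np.stack([embed(p) for p in image_paths])
labels = KMeans(n_clusters=53, random_state=0).fit_predict(feats)
```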

In total, MicroLite includes 50,000 microscopy images labeled into 74 categories, obtained through the following preprocessing steps:

  1. Remove artifacts such as scale bars from the image.
  2. Divide the images into 512×512-pixel tiles, with overlap depending on the size of the original image (a tiling sketch is shown after this list).
  3. Apply data augmentation to increase the size of the dataset.
  4. Aggregate the raw images, image tiles, and augmented images to form the final dataset.
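A minimal sketch of the tiling in step 2, under the assumption that tiles are taken at a stride equal to the tile size, with one extra overlapping tile at each border when the image size is not a multiple of 512 (the exact overlap rule is not spelled out above):

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 512) -> list[np.ndarray]:
    """Split an image into 512x512 tiles; border tiles overlap their neighbours."""
    h, w = img.shape[:2]
    ys = list(range(0, max(h - tile, 0) + 1, tile))
    xs = list(range(0, max(w - tile, 0) + 1, tile))
    if ys[-1] + tile < h:
        ys.append(h - tile)     # extra overlapping row to cover the bottom border
    if xs[-1] + tile < w:
        xs.append(w - tile)     # extra overlapping column to cover the right border
    return [img[y:y + tile, x:x + tile] for y in ys for x in xs]
```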

2.2. Pre-training

The authors trained the Swin Transformer model to learn feature representations of microscope images in order to migrate them to tasks such as segmentation. The authors evaluated two types of training.

  1. Initialize the model with weights pre-trained on ImageNet and then fine-tune it on MicroLite (denoted ImageNet → MicroLite).
  2. Pre-train the model from scratch on MicroLite (denoted MicroLite).

The classification task uses Swin-T, the smallest version of the Swin Transformer. Two architectures are used: the original Swin-T, with Transformer block depths of [2, 2, 6, 2], and an intermediate network, with block depths of [2, 2, 2, 2].

[Figure 2]

Figure 2 shows the original Swin-T architecture. The authors speculate that the intermediate network may be sufficient for microscopy analysis tasks, since early layers learn edges and corners, middle layers learn textures and patterns, and the deeper layers of the original model learn high-level, object-specific features (such as eyes or tails in natural images). Both the original and intermediate Swin-T models were pre-trained on MicroLite from scratch with randomly initialized weights, and both were also pre-trained on ImageNet and fine-tuned on MicroLite.

Pre-training uses the AdamW optimizer for 30 epochs with a cosine-decay learning-rate schedule, a 5-epoch linear warm-up, a batch size of 128, and a weight decay of 0.05. Fine-tuning also uses the AdamW optimizer for 30 epochs with a batch size of 128, but with a reduced learning rate and weight decay. Models are trained until the validation score no longer improves, with an early-stopping patience of 5 epochs. The training data are augmented with the Albumentations library, including random changes in contrast and brightness, vertical and horizontal flips, photometric distortion, and added noise.
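A sketch of this pre-training setup is given below; the timm model name and the initial learning rate of 1e-3 are placeholders (the exact value is not stated above), while the optimizer, schedule, batch size, and weight decay follow the text.

```python
import math
import albumentations as A
import timm
import torch

# Swin-T classifier for the 74 MicroLite classes (timm model name is an assumption).
model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=True, num_classes=74)

# Augmentations named in the text (photometric distortion approximated by gamma/noise).
augment = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomGamma(p=0.3),
    A.GaussNoise(p=0.3),
])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 30
def lr_lambda(epoch: int) -> float:
    if epoch < warmup_epochs:                        # linear warm-up
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))       # cosine decay
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```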

For the downstream segmentation tasks, multiple models were trained for each task, including Swin-Unet, HiFormer, and TransDeepLabV3+. The results of these models pre-trained on ImageNet and on microscopy images were compared and analyzed.

2.3. Combining CNN and Transformer (CS-UNet)

Because CNNs are inherently local, they cannot capture long-range spatial relationships. Transformers were introduced to overcome this limitation, but they in turn have difficulty capturing low-level features. Studies have shown that dense prediction tasks such as segmentation in complex contexts require both local and global information.

Some researchers have introduced hybrid models that combine CNNs and Transformers for image segmentation, and initializing the weights of both the CNN and the Transformer in such a hybrid model can significantly improve performance. The authors therefore introduce CS-UNet, a U-shaped hybrid segmentation model that uses both CNN and Transformer encoders. As shown in Figure 3, the architecture consists of an encoder, a bottleneck, a decoder, and skip connections.

[Figure 3]

The encoder combines a CNN encoder and a Swin-T encoder: the CNN extracts low-level features, while Swin-T extracts global context features. The Swin-T encoder divides the input image into non-overlapping patches and applies a self-attention mechanism to capture global dependencies, so the encoder captures long-range dependencies and contextual information for the entire image at different scales. Inspired by TFCN (Transformers for Fully Convolutional dense Net) and the Lightweight Swin-Unet, the multilayer perceptron (MLP) in each pair of consecutive Swin-T blocks is replaced by a residual multilayer perceptron (ResMLP).

[Figure 4]

As shown in Figure 4, ResMLP is used to reduce feature loss during transmission and increase the context information extracted by the encoder.

[Figure 5]

The ResMLP, shown in Figure 5, consists of 2 GELU nonlinearities, 3 linear layers, and 2 dropout layers. The CNN encoder processes the input image through a series of convolutional layers, gradually reducing the spatial dimensions while extracting hierarchical features: low-level features in the early layers and high-level semantic features in the deeper layers.
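A minimal PyTorch sketch of such a ResMLP block (layer ordering and hidden width are assumptions; the counts of linear, GELU, and dropout layers follow the description above):

```python
import torch.nn as nn

class ResMLP(nn.Module):
    """Residual MLP: 3 linear layers, 2 GELU activations, 2 dropout layers."""
    def __init__(self, dim: int, hidden: int, drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Dropout(drop),
            nn.Linear(hidden, hidden), nn.GELU(), nn.Dropout(drop),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.net(x)   # residual connection reduces feature loss
```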

To fuse information from the two encoders, skip connections link the feature maps of the CNN encoder and the Swin-T encoder to the corresponding decoder layer. To ensure compatibility between the feature dimensions of the CNN and Swin-T encoders, the dimensions are normalized before fusion. This is done by passing the features from the CNN block through a linear embedding layer and reshaping the feature map from (B, C, H, W) to (B, C, H×W), where B, C, H, and W are the batch size, number of channels, height, and width of the feature map, respectively. The flattened feature map is then transposed to swap the last two dimensions, giving a shape of (B, H×W, C), and is fused with the features extracted from the Swin-T encoder.
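The reshaping can be illustrated as below; the linear embedding and the fusion by addition are assumptions about details not fully specified above.

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 96, 56, 56
cnn_feat = torch.randn(B, C, H, W)       # feature map from a CNN encoder stage
swin_feat = torch.randn(B, H * W, C)     # tokens from the matching Swin-T stage

embed = nn.Linear(C, C)                  # linear embedding to align dimensions
tokens = cnn_feat.flatten(2)             # (B, C, H*W)
tokens = tokens.transpose(1, 2)          # (B, H*W, C)
fused = swin_feat + embed(tokens)        # fused features passed on to the decoder
```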

By fusing information from the two encoder paths, the skip connections let the decoder benefit from both the local spatial details captured by the CNN encoder and the global context captured by the Swin-T encoder.

The decoder is similar to the Swin-Unet decoder: it uses a patch-expanding layer to upsample the extracted deep features by reshaping feature maps of adjacent dimensions, effectively achieving 2× upsampling while halving the feature dimension. This enables the decoder to reconstruct the output at higher spatial resolution while keeping the feature dimension small for efficient processing.
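A simplified patch-expanding layer in the style of the public Swin-Unet code is sketched below (an illustration, not the authors' implementation): a linear layer doubles the channel dimension, the tokens are rearranged onto a 2× larger spatial grid, and the per-token channel count is halved.

```python
import torch
import torch.nn as nn
from einops import rearrange

class PatchExpand(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        x = self.expand(x)                                # (B, h*w, 2*dim)
        x = rearrange(x, "b (h w) (p1 p2 c) -> b (h p1 w p2) c",
                      h=h, w=w, p1=2, p2=2)               # (B, 4*h*w, dim/2)
        return self.norm(x)

# Example: 7x7 tokens with 768 channels -> 14x14 tokens with 384 channels.
out = PatchExpand(768)(torch.randn(1, 49, 768), h=7, w=7)
```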

A final patch-expanding layer performs 4× upsampling to restore the feature-map resolution to the input resolution (W×H). A linear projection layer is then applied to these upsampled features to generate pixel-level segmentation predictions.

The encoder section can use different CNN families such as EfficientNet, ResNet, MobileNet, DenseNet, VGG, and Inception. The authors used MicroNet to initialize CNN weights and MicroLite to initialize Transformer weights.

3. Results

The pre-trained Swin-T models were used to classify microscopy images into the 74 classes. Swin-T models are either pre-trained on ImageNet (specifically ImageNet-1K) and then fine-tuned on MicroLite, or trained from scratch on MicroLite with randomly initialized parameters. Training stops when validation accuracy has not improved for 5 epochs. Model accuracy was evaluated using top-1 and top-5 accuracy: top-1 accuracy is the percentage of test samples whose highest-scoring prediction matches the true label, while top-5 accuracy is the percentage whose true label appears among the five highest-scoring predictions.
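For clarity, this is how top-k accuracy is typically computed from classifier logits (standard definitions, not code from the paper):

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices             # (N, k)
    hit = (topk == labels.unsqueeze(1)).any(dim=1)
    return hit.float().mean().item()

# top1 = topk_accuracy(logits, labels, 1); top5 = topk_accuracy(logits, labels, 5)
```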

[Table 1]

As shown in Table 1, Swin-T models trained from scratch take longer to converge: the original Swin-T model required 23 epochs and the intermediate version 19 epochs. In contrast, Swin-T models pre-trained on ImageNet and then fine-tuned on MicroLite converged faster, with the original Swin-T model requiring only 13 epochs and the intermediate version 12 epochs.

On average, models initialized with ImageNet weights converged about 40.16% faster than randomly initialized models. The original Swin-T model, fine-tuned on MicroLite after ImageNet pre-training, achieved a top-1 accuracy of 84.63%. Overall, Swin-T models pre-trained on ImageNet and fine-tuned on MicroLite have higher accuracy and faster convergence.

4.1. Microscopic image segmentation

To evaluate how well the Swin-T models extract feature representations, the pre-trained models are used to initialize models for the segmentation task. For comparison with the NASA study, the authors used the same 7 microscopy datasets derived from two materials: nickel-based superalloys (Super) and environmental barrier coatings (EBC). The EBC dataset has two classes, oxide layer and background (non-oxide), while the Super dataset has three classes: matrix, secondary precipitates, and tertiary precipitates.

[Table 2]

The number of images in each dataset split is shown in Table 2. Super-1 and EBC-1 contain the complete datasets of their respective materials. Super-2 and EBC-2 contain only 4 training images, to evaluate few-shot performance. Super-3 and EBC-3 contain only 1 training image, to evaluate one-shot performance. Super-4 contains test images taken under different imaging and sample conditions.

The EBC and Super datasets are augmented and trained in a similar way to the NASA study:

  • Randomly crop the image to 512×512 pixels, randomly change contrast, brightness, and gamma, and add blur or sharpening.
  • The EBC dataset was flipped horizontally and the Super dataset was flipped and rotated randomly.
  • Training uses the Adam optimizer until validation accuracy has not improved for 30 epochs; training then continues at a lower learning rate and stops after a further 30 epochs without validation improvement.
  • Because the datasets are imbalanced, the loss function is a weighted sum of balanced cross-entropy (BCE) and Dice losses, with BCE receiving 70% of the weight (sketched below).
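A sketch of the weighted loss described in the last item (the Dice formulation and the use of class weights for the balanced cross-entropy term are assumptions about details not spelled out above):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    prob = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (prob * onehot).sum(dim=(2, 3))
    union = prob.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def combined_loss(logits: torch.Tensor, target: torch.Tensor, class_weights=None) -> torch.Tensor:
    bce = F.cross_entropy(logits, target, weight=class_weights)  # balanced cross-entropy
    return 0.7 * bce + 0.3 * dice_loss(logits, target)           # 70% / 30% weighting
```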
[Table 3]

The CS-UNet architecture is flexible and can be trained with different CNN families and initialized with different pre-trained models. Table 3 shows the combinations of pre-training weights used to train CS-UNet: the second column lists the pre-training weights that initialize the Swin-T encoder, and the third column those that initialize the CNN encoder. In the last column, the authors use the term "microscopy" for the case where the CNN encoder is pre-trained on MicroNet and the Transformer encoder on MicroLite. Other combinations of pre-trained weights can also be used; for example, the Swin-T encoder can be initialized with MicroLite weights while the CNN encoder is initialized with ImageNet → MicroNet weights. This flexibility allows researchers to experiment with different combinations of pre-trained weights to find the best one for their task.

[Table 4]

Table 4 compares the best performance of UNet++/UNet pre-trained on MicroNet, Transformer models pre-trained on MicroLite (including Swin-UNet, TransDeepLabV3+, and HiFormer), and CS-UNet pre-trained on MicroNet and MicroLite. The highest accuracy for each experiment is shown in bold.

In most experiments, CS-UNet performed best, with the exception of EBC-2 and EBC-3. For experiments with sufficient training data, such as Super-1 and EBC-1, the differences between UNet++/UNet, the Transformer models, and CS-UNet are minimal. For the few-shot experiments, such as Super-2 and EBC-2, CS-UNet's accuracy gains are limited. For the one-shot experiments, the results are mixed, with a modest improvement on Super-3 and a significant improvement on EBC-3. For out-of-distribution images, CS-UNet shows significant improvements over both UNet and the Transformer models.

Overall, the results in Table 4 show that CS-UNet is a promising method for image segmentation. In all experiments, CS-UNet was similar to or significantly better than UNet++/UNet, and it outperformed the Transformer models in most experiments. It is worth noting that MicroLite is about half the size of MicroNet; nevertheless, the performance of Transformer + MicroLite is comparable to or better than that of UNet++/UNet + MicroNet. Appendix A shows the configurations of the best-performing Transformer models in Table 4. The configurations of the best-performing CS-UNet models are shown in the next section, where the authors compare CS-UNet pre-trained on microscopy images and on ImageNet.

4.2. Nickel-based superalloy (Super) segmentation

[Figure 6]

Figures 6 and 7 compare the best performance of CS-UNet on the Super datasets when pre-trained on microscopy images and on ImageNet. For Super-1 and Super-2, the IoU scores of the two pre-trained models are similar, while for Super-3 the IoU score of the microscopy model is significantly higher, reaching 93.5% compared with 87.01% for the ImageNet model. This differs from the NASA study, in which Super-2 performance also improved. For CS-UNet, the benefit of in-domain pre-training appears to be more significant for one-shot learning than for few-shot learning. The ImageNet model failed to identify tertiary precipitates in many dark-contrast images, as indicated by the orange triangles, and it also over-segmented and merged some secondary precipitates, as indicated by the green arrows.

[Figure 7]

For Super-4, which contains images taken under different imaging conditions, the microscopy model improved on the ImageNet model from 78.89% to 82.13%, consistent with NASA's findings. As shown in Figure 7, the Super-4 test images come from a different image distribution than the training images (Figure 6): the first row shows micrographs of a different alloy, the second and third rows show micrographs with different etching conditions, and the last row shows poorly imaged micrographs. Compared with the ImageNet model, the microscopy model separates secondary precipitates more accurately with less over-segmentation (differences indicated by green arrows) and segments both secondary and tertiary precipitates better.

4.3. Environmental barrier coating (EBC) segmentation

[Figure 8]

As shown in Figure 8, the results on the EBC datasets are consistent with NASA's findings: the IoU scores for EBC-1 and EBC-2 are similar for the microscopy and ImageNet models. EBC-3 has a single training image (outlined in red in Figure 8); here the IoU of the best microscopy model is 70.46%, much higher than the 61.74% of the best ImageNet model. This also confirms NASA's findings, although the improvement observed here is much larger. The ImageNet model cannot distinguish between the substrate and the thermally grown oxide layer, which makes accurate measurement of the oxide thickness impossible.

5. Discussion

The authors have shown that the combination of CNN and Transformer encoders in CS-UNet provides better segmentation performance than the CNN encoders in UNet alone. They also show that pre-training on microscopy images benefits CS-UNet more than pre-training on ImageNet, although the degree of improvement differs from that observed with UNet. However, these comparisons are based on the best-performing models; depending on the choice of CNN encoder, Transformer architecture, and pre-trained model, segmentation performance can vary greatly.

In this section, the authors compare the impact of pre-training on the average performance of CNN, Transformer, and hybrid segmentation algorithms. After that, the authors compare the average performance of the three types of segmentation algorithms. The authors' results show that pre-training on microscope images generally has a positive effect on performance. The authors' hybrid algorithm CS-UNet outperformed UNet in all experiments, while performance was similar or better than Transformer-based algorithms in most experiments.

5.1. CNN-based image segmentation


First, the authors examine the performance of UNet [15] under three encoder pre-training settings. This result is included because the CS-UNet configurations use only 19 of the 35 CNN encoders from the NASA paper [1]; as shown in Table 10, these 19 encoders ranked in the top 5 in accuracy for at least one segmentation task. This selection reduces the number of experiments needed for a fair comparison.

[Table 5]

The average performance of UNet when pre-trained on ImageNet or MicroNet is shown in Table 5: the ImageNet → MicroNet models (i.e., CNN encoders initialized from ImageNet and fine-tuned on MicroNet) achieved the best results in most cases. The configurations of the best-performing CNN encoders are shown in Table 11, where in most cases pre-training on MicroNet gives better results.


Not surprisingly, the authors' results largely agree with NASA's findings, since the authors used the NASA team's models pre-trained on MicroNet. Specifically, pre-training on MicroNet improves one-shot learning and out-of-distribution performance. Because the authors chose CNN encoders that ranked in the top 5 in at least one experiment, the IoU scores are higher than the average scores reported in the NASA paper. Performance on Super-2 (few-shot learning) is essentially the same across the different pre-trained models.

5.2. Image segmentation based on Transformer

[Table 6]

Next, Table 6 shows the average performance of the Transformer-based segmentation algorithms (Swin-UNet, HiFormer, and TransDeepLabV3+) under different pre-training settings and Swin-T architectures. The authors compare the algorithms using the original or intermediate Swin-T architecture and ImageNet or microscopy pre-trained models. The results show that these algorithms perform well with the MicroLite pre-trained model, and the original Swin-T architecture is slightly better for one-shot learning and out-of-distribution images.

Overall, pre-training on microscope images provides better results for Transformer-based segmentation algorithms compared to pre-training on natural images.

5.3. Hybrid Image Segmentation

[Table 7]

The authors also compare, in Table 7, the performance of their hybrid segmentation algorithm CS-UNet when it uses the original or intermediate Swin-T architecture and weights initialized from ImageNet or microscopy models. Because CS-UNet uses both CNN and Transformer encoders, the results vary, and pre-training on microscopy images does not always give better performance: a weaker CNN encoder reduces the advantage of a Transformer encoder pre-trained on microscopy images. However, when averaging the IoU scores across all experiments, pre-training on microscopy images still provides some benefit.

5.4. Comparison of segmented networks

[Table 8]

Finally, in Table 8, the authors compare the three types of segmentation algorithms (the CNN-based UNet, the hybrid CS-UNet, and the Transformer-based Swin-Unet, HiFormer, and TransDeepLabV3+), averaging performance over the different pre-trained models and Swin-T architectures. The results show that CS-UNet outperformed UNet on average in all experiments.

Although Transformer-based segmentation algorithms may be superior for one-shot or out-of-distribution learning, their performance is not consistently better than UNet. The authors' hybrid algorithm CS-UNet therefore appears to be the more robust solution, regardless of which pre-trained model is used.

References

[1]. Transfer Learning for Microstructure Segmentation with CS-UNet: A Hybrid Algorithm with Transformer and CNN Encoders.