
Image super-resolution reconstruction

Author: Flash Gene

1 Background

Image Super-Resolution (SR) is a technique that uses algorithms to restore low-resolution (LR) images to high-resolution (HR) images. In practice, for a variety of reasons such as limited sensor resolution and optical system resolution, high-resolution images are often costly and difficult to obtain, while low-resolution images are easy to acquire. How to recover high-resolution images from low-resolution ones has therefore long been a research hotspot in computer vision. Ordinary interpolation algorithms, such as bilinear or nearest-neighbor interpolation, increase the image resolution but cannot produce a clearer image, so maintaining clarity while improving resolution is the key challenge of image super-resolution reconstruction. In this article, we introduce image super-resolution reconstruction based on deep learning.

Before introducing the super-resolution content proper, it is necessary to clarify an easily confused question: what is the difference between image clarity enhancement and image super-resolution reconstruction?


Fig.1 Image clarity – improved clarity at the same resolution


Fig.2 Super-resolution reconstruction task – clarity is maintained or improved while resolution is increased from low to high

  • Image clarity enhancement: image enhancement transforms images with low clarity and poor visual quality into high-definition images at the same resolution
  • Image super-resolution reconstruction: converts a lower-resolution image into a higher-resolution one while maintaining or improving image clarity

2 Algorithm development route

Since the SRCNN algorithm was proposed in 2014, deep-learning-based image super-resolution algorithms have gone through five stages of development, traced in the sections below.


In the era of the SRCNN algorithm, the main problem in the super-resolution field was to explore the basic components of the task. As ESPCN and other algorithms were proposed one after another, these basic components were refined and the framework of the super-resolution task took shape. After SRResNet was proposed, the task architecture was gradually settled, and with its stabilization the research focus shifted to two main directions. On the one hand, efforts concentrated on continuously improving the deep network structure to extract richer feature information from the image. On the other hand, researchers began to tackle new challenges in the field, namely blind super-resolution and degradation modeling. In this article, super-resolution algorithms are introduced along this technical development route, so that readers can gain an in-depth understanding of image super-resolution.

3 Detailed explanation of the algorithm of the paper

In this part, we will select some classic papers from five stages to introduce and discuss the development process of image super-resolution algorithms in a simple way.

3.1 Exploration of the core components of the super-resolution task


Figure 3 SRCNN network structure

SRCNN is widely regarded as one of the pioneers in the field of super-resolution, introducing deep learning methods to replace traditional interpolation and filtering techniques. SRCNN is designed to achieve super-resolution of images by learning the mapping from low-resolution to high-resolution images through convolutional neural networks (CNNs). Its basic workflow includes the following key steps:

(1) Interpolation: First, SRCNN scales the input low-resolution image up to the target resolution by interpolation (bicubic in the original paper). Because all subsequent convolutions then operate at the output size, the network only needs to restore detail rather than change resolution.

(2) Feature extraction: Features are extracted from the interpolated image through a series of convolutional layers. The SRCNN network consists of three main convolutional layers, which learn local and global features of the image and perform the nonlinear mapping.

(3) Image reconstruction: After feature extraction, SRCNN reconstructs the image through a final convolutional layer. The purpose of this step is to map the high-resolution but low-definition image into the high-resolution, high-definition image space using the learned features.

SRCNN's contribution is to introduce an end-to-end deep learning approach, eliminating the manual feature design and multi-stage processing of traditional methods. Although its performance was gradually surpassed by the deeper and more complex network structures introduced later, SRCNN laid the foundation for the field of image super-resolution and provided useful experience for subsequent algorithms.
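The three-step workflow above maps directly onto a three-layer network. The following PyTorch sketch follows the 9-1-5 kernel sizes of the original paper; channel widths and the single-channel (luminance) input are illustrative choices, not a definitive reimplementation:

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """SRCNN sketch: patch extraction -> non-linear mapping -> reconstruction.
    The input is assumed to be the bicubic-upsampled LR image, already at
    the target resolution, so no layer changes the spatial size."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        return self.body(x)

# spatial size is unchanged: the interpolation step already set the resolution
y = SRCNN()(torch.randn(1, 1, 64, 64))
print(tuple(y.shape))  # (1, 1, 64, 64)
```

Note that the end-to-end mapping is learned entirely by these three convolutions; there is no hand-crafted feature stage.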


Figure 4 VDSR network structure

The main contribution of VDSR is the use of extremely deep network structures to better capture complex features and high-frequency information in images, resulting in more accurate super-resolution results. Here are the key features of VDSR:

  • Deep network structure: VDSR uses a very deep convolutional neural network with 20 convolutional layers. This depth enables the network to learn richer image features, improving super-resolution performance on low-resolution inputs.
  • Residual learning: VDSR adopts the idea of residual learning, built on a core assumption about high-resolution images: high-resolution image = high-resolution low-definition image + texture residual. The network therefore only needs to predict the residual, and this formulation is applied throughout the task architecture.

VDSR's contribution to the field of super-resolution lies in abstracting this core assumption (high-resolution image = high-resolution low-definition image + texture residual), which enables the network to better understand and restore the details in the image. Despite the subsequent emergence of more complex network structures, the success of VDSR is a testament to the effectiveness of increasing network depth to improve super-resolution performance.
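The residual formulation is a one-line change to the forward pass: the deep stack predicts only the texture residual, which is added back onto the interpolated input. A minimal sketch (depth reduced from the paper's 20 layers for brevity; width and depth here are illustrative):

```python
import torch
import torch.nn as nn

class VDSR(nn.Module):
    """VDSR sketch: a deep stack of 3x3 convs predicts the texture residual,
    implementing HR = HR-low-definition input + residual via a global skip."""
    def __init__(self, channels=1, depth=8, width=64):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, channels, 3, padding=1)]
        self.residual = nn.Sequential(*layers)

    def forward(self, x):
        # global residual connection: the network only learns the missing texture
        return x + self.residual(x)

out = VDSR()(torch.randn(1, 1, 48, 48))
print(tuple(out.shape))  # (1, 1, 48, 48)
```

Because the residual is mostly high-frequency detail with small magnitude, this decomposition is what makes training such a deep stack tractable.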


Figure 5 ESPCN network structure

ESPCN is designed to improve the efficiency of super-resolution models and make them lighter when working with large-scale image data. The core idea of ESPCN is to achieve efficient upsampling through sub-pixel convolution. Subpixel convolution is a special convolution operation that maps low-resolution feature maps to high-resolution output images. ESPCN's workflow includes the following key steps:

(1) Low-resolution feature extraction: The input low-resolution image goes through a series of convolutional layers to extract the features of the image. These convolutional layers help the network learn feature maps from low to high resolution.

(2) PixelShuffle upsampling: ESPCN uses a sub-pixel convolutional layer (PixelShuffle) to upsample the low-resolution feature maps. This upsampling method differs from traditional interpolation and increases the image resolution more efficiently.

(3) Reconstruction: The upsampled feature map is reconstructed by further convolution operations. This step helps to restore richer image detail, resulting in sharper high-resolution images.

The innovation of ESPCN is that it achieves lightweight upsampling through subpixel convolution, which makes the model relatively small in terms of parameter quantity and computational complexity. This is valuable for real-time image super-resolution processing in resource-constrained environments. Although more complex super-resolution algorithms have subsequently emerged, the design concept of ESPCN provides useful experience for the application of lightweight models in image super-resolution tasks.
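The efficiency gain comes from running every convolution at the low input resolution: the last convolution emits r² times as many channels, and PixelShuffle rearranges them into an r-times larger image. A sketch (layer widths and the Tanh activations follow the original paper; the single-channel input is an illustrative choice):

```python
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    """ESPCN sketch: all convs run at LR size; the final conv produces
    channels * scale^2 feature maps, and PixelShuffle (sub-pixel convolution)
    rearranges (B, C*r^2, H, W) into (B, C, H*r, W*r)."""
    def __init__(self, scale=3, channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 64, 5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, 3, padding=1), nn.Tanh(),
            nn.Conv2d(32, channels * scale ** 2, 3, padding=1),
        )
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.features(x))

hr = ESPCN(scale=3)(torch.randn(1, 1, 32, 32))
print(tuple(hr.shape))  # (1, 1, 96, 96)
```

Contrast this with SRCNN, where interpolation happens first and every convolution pays the cost of the full output resolution.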


Figure 6 EDSR

EDSR adopts the structure of SRResNet but, unlike some other deep networks, removes batch normalization (BN). The authors argue that BN stretches the color and contrast statistics of the image itself, which degrades the output, and that the model becomes lighter without it: a BN layer consumes as much memory as the convolutional layer preceding it, and the authors report that removing BN saves EDSR about 40% of memory compared with SRResNet. The resources freed up are used to insert more CNN sub-networks, such as additional residual blocks, to increase the expressiveness of the model.
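The resulting residual block is just conv-ReLU-conv with a skip connection and no normalization. A sketch of one such block (the residual scaling factor 0.1 is taken from the paper, where it stabilizes training of wide models; the width here is illustrative):

```python
import torch
import torch.nn as nn

class EDSRBlock(nn.Module):
    """EDSR-style residual block sketch: no BatchNorm anywhere, so the block
    preserves the image's color/contrast statistics and uses less memory."""
    def __init__(self, width=64, res_scale=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(width, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1)
        self.res_scale = res_scale

    def forward(self, x):
        res = self.conv2(torch.relu(self.conv1(x)))
        # scale the residual before adding; helps very wide/deep stacks converge
        return x + self.res_scale * res

x = torch.randn(1, 64, 24, 24)
out = EDSRBlock()(x)
print(tuple(out.shape))  # (1, 64, 24, 24)
```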

3.2 The construction stage of super-resolution task architecture


Figure 7 SRResNet

In SRResNet, the authors built a complete super-resolution task architecture, dividing the task into four core components: a shallow feature extraction layer, a deep feature extraction stage, an upsampling layer, and a reconstruction output layer.

(1) Shallow layer: The shallow component serves as the initial stage of the task, performing preliminary feature extraction with a large convolution kernel. It transforms the input image into a feature space, providing a basic feature representation for the subsequent deep network and helping to capture the high-level structure and texture information of the image more effectively.

(2) Deep layer: The deep component is the part of the entire network with the most parameters and the most time-consuming calculation. After shallow feature transformation, the deep network is responsible for further extracting richer and more abstract feature information from the image. The introduction of this depth structure allows the network to gradually understand the complex structure and texture of the image, so as to better perform super-resolution tasks.

(3) Upsampling layer: The upsampling component uses techniques such as PixelShuffle to convert channel information into resolution information. This step is critical for mapping feature maps from low to high resolution; with efficient upsampling, the network is able to preserve and amplify the details of the input image.

(4) Reconstruction output layer: This component converts the high-resolution feature map into the final high-resolution output image using a large convolution kernel. It plays a key role in the overall architecture, determining the quality and clarity of the resulting super-resolution image.

The main contribution of SRResNet is to propose a complete super-resolution task architecture divided into these four key components. In subsequent super-resolution research, most improvements focus on optimizing the deep network structure to enhance its feature extraction ability. This task architecture provides a solid foundation for the further development of the field; its ideas and components persist to this day and play an important guiding role in later algorithms.
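The four components compose into a single pipeline. The skeleton below is a simplified sketch in the SRResNet style (plain conv+PReLU pairs stand in for the residual blocks of the deep stage; kernel sizes 9 and 3 follow the paper, other choices are illustrative):

```python
import torch
import torch.nn as nn

class SRNet(nn.Module):
    """Four-component SR skeleton: shallow -> deep -> upsample -> reconstruct."""
    def __init__(self, scale=4, width=64, depth=4):
        super().__init__()
        # (1) shallow: large-kernel initial feature extraction
        self.shallow = nn.Conv2d(3, width, 9, padding=4)
        # (2) deep: the bulk of parameters and compute (residual blocks in SRResNet)
        self.deep = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.PReLU())
            for _ in range(depth)
        ])
        # (3) upsampling: PixelShuffle turns channels into resolution, x2 per stage
        up = []
        for _ in range(int(scale).bit_length() - 1):  # log2(scale) stages
            up += [nn.Conv2d(width, width * 4, 3, padding=1), nn.PixelShuffle(2)]
        self.upsample = nn.Sequential(*up)
        # (4) reconstruction: large-kernel output head
        self.reconstruct = nn.Conv2d(width, 3, 9, padding=4)

    def forward(self, x):
        return self.reconstruct(self.upsample(self.deep(self.shallow(x))))

out = SRNet(scale=4)(torch.randn(1, 3, 16, 16))
print(tuple(out.shape))  # (1, 3, 64, 64)
```

Most later architectures (EDSR, RDN, RCAN, SwinIR) keep this skeleton and swap only the deep stage.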


Figure 8 LapSRN

Earlier SRResNet-based super-resolution algorithms mainly focused on fixed-magnification reconstruction. The uniqueness of LapSRN is that it proposes a network architecture for non-fixed magnification, producing high-resolution images at multiple scales. The network adopts a Laplacian pyramid structure to achieve more comprehensive super-resolution through the fusion of multi-scale information. Here are the main features of LapSRN and how it works:

  • Laplacian pyramid structure: LapSRN uses the Laplacian pyramid, a variant of the image pyramid in which each level stores the residual between the image at that scale and the upsampled image from the next coarser scale. This structure allows the network to learn image details at multiple scales and thus more fully understand and reproduce high-resolution images.
  • Multi-scale processing: LapSRN processes the image progressively at multiple scales, each handled by a corresponding sub-network. This multi-scale design helps the network learn features at different resolutions and better handle the texture and details in the image.
  • Training: LapSRN performs supervised learning on a large number of high-resolution images and their corresponding low-resolution versions. The goal is to minimize the difference between the predicted and real high-resolution images, learning the low-to-high-resolution mapping through the multi-scale structure and feature fusion.
  • Pyramid reconstruction: Finally, the high-resolution image is reconstructed by adding the predicted residual at each scale to the upsampled image from the previous scale.

The unique design of LapSRN allows it to handle image super-resolution tasks more comprehensively, supporting non-fixed magnification of the image and improving performance to some extent. The Laplacian pyramid structure provides rich multi-scale information, and the feature-fusion design enhances the network's ability to integrate multi-scale features. These innovations make LapSRN one of the noteworthy algorithms in the field of image super-resolution.
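The progressive scheme can be sketched as a cascade of x2 levels, each upsampling a feature stream and adding a predicted residual to an upscaled image stream. This is a simplified illustration of the pyramid idea, not the paper's exact layer configuration (widths, kernel sizes, and the two-stream split are assumptions):

```python
import torch
import torch.nn as nn

class LapLevel(nn.Module):
    """One pyramid level sketch: upsample features x2, predict a residual image,
    and add it to an x2-upscaled copy of the previous level's image."""
    def __init__(self, width=32):
        super().__init__()
        self.feat_up = nn.ConvTranspose2d(width, width, 4, stride=2, padding=1)
        self.to_residual = nn.Conv2d(width, 3, 3, padding=1)
        self.img_up = nn.ConvTranspose2d(3, 3, 4, stride=2, padding=1)

    def forward(self, feats, img):
        feats = torch.relu(self.feat_up(feats))
        # coarse x2 image plus learned detail residual at this scale
        img = self.img_up(img) + self.to_residual(feats)
        return feats, img

# progressive x4: two cascaded x2 levels, each emitting an intermediate HR image
feats = torch.randn(1, 32, 16, 16)
img = torch.randn(1, 3, 16, 16)
for level in [LapLevel(), LapLevel()]:
    feats, img = level(feats, img)
print(tuple(img.shape))  # (1, 3, 64, 64)
```

Because every level emits a valid image, one trained network yields x2, x4, ... outputs, which is what enables the non-fixed magnification.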

3.3 Adversarial Training Phase


Figure 9 SRGAN network structure

SRGAN introduces adversarial training into the super-resolution task. The paper argues that super-resolution is an ill-posed problem: in real scenes, a high-resolution image can undergo a variety of unknown degradations to yield many different low-resolution images, and among high-resolution images the definition of clarity also varies, with many possible distributions, so image super-resolution is essentially a many-to-many task. Previous algorithms used only pixel loss as the metric, which cannot resolve this ill-posedness: measuring a many-to-many problem with pixel loss drives the prediction toward an average of multiple plausible HD images, making the result blurry and inconsistent with the distribution of true HD images. The GAN, by contrast, pulls the output toward the distribution of real HD images. Here are the main features of SRGAN and how it works:

  • Generative Adversarial Network (GAN): SRGAN is based on a generative adversarial network consisting of a generator and a discriminator. The generator aims to produce high-resolution images, while the discriminator tries to distinguish generated high-resolution images from real ones. This adversarial training pushes the generator to produce more realistic images.
  • Loss function: SRGAN uses adversarial loss as an optimization objective so that the images produced by the generator are visually indistinguishable from real images. In addition, to ensure higher structural fidelity, SRGAN introduces perceptual loss, which uses a pre-trained deep network, such as VGG, to compare feature representations between generated and real images.
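The generator's objective combines the two terms above. In the sketch below, a tiny stand-in network plays the role of the pre-trained VGG feature extractor (in practice a frozen torchvision VGG19 would be used); the 1e-3 adversarial weight follows the paper, everything else is illustrative:

```python
import torch
import torch.nn as nn

# stand-in for the frozen, pre-trained VGG feature extractor
feature_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()

def perceptual_loss(sr, hr):
    # compare images in feature space rather than pixel space
    with torch.no_grad():
        target = feature_net(hr)
    return nn.functional.mse_loss(feature_net(sr), target)

def generator_loss(sr, hr, disc_logits, adv_weight=1e-3):
    # adversarial term: push the discriminator's verdict on sr toward "real" (1)
    adv = nn.functional.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    return perceptual_loss(sr, hr) + adv_weight * adv

sr, hr = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
loss = generator_loss(sr, hr, disc_logits=torch.randn(1, 1))
print(loss.item() > 0)  # True
```

The perceptual term keeps the output structurally faithful while the adversarial term moves it toward the distribution of real HD images, which is exactly the remedy for the averaging effect of pure pixel loss.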

The main contribution of SRGAN is to introduce generative adversarial networks into image super-resolution tasks, and generate more realistic and detailed high-resolution images through adversarial training. The introduction of this idea has had a profound impact on the development of the field of super-resolution, and has inspired many other GAN-based super-resolution algorithms in the future.

3.4 Deep network structure optimization stage


Figure 10 RDN network structure


Figure 11 SAN network structure


Figure 12 SwinIR network structure

RDN and similar super-resolution algorithms belong to the fourth stage of image super-resolution, which focuses mainly on optimizing the deep part of the super-resolution architecture. During this period, attention mechanisms, Transformers, and other network components were proposed, and super-resolution researchers naturally introduced these new components into the design of the deep network. Besides RDN, algorithms such as WDSR, RCAN, SwinIR, and IGNN all optimize and improve the deep structure.

3.5 Blind super-fractionation and degenerative modeling stage

The image super-resolution task is a basic low-level vision problem whose goal is to reconstruct a high-resolution image from its low-resolution observation. A variety of deep-learning network architectures and training strategies have been proposed to improve performance. The task involves two images: a high-resolution HR image and a low-resolution LR image. A super-resolution model generates the former from the latter, whereas a degradation model generates the latter from the former. The classical super-resolution task assumes that the LR image is obtained by some degradation of the HR image, with the degradation kernel preset as a bicubic downsampling blur kernel.

That is, the downsampling blur kernel is predefined. In practice, however, the degradation is complex, its expression is unknown, and it is difficult to model simply, so there is a domain gap between bicubic-downsampled training samples and real images. When a network trained with bicubic downsampling as the blur kernel is applied in practice, this domain gap leads to poor performance. A super-resolution task in which the degradation kernel is unknown is called a blind super-resolution task.
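To make the fixed-degradation assumption concrete, the classical blur-downsample-noise pipeline can be sketched in NumPy. A uniform box kernel stands in for the blur (bicubic would be used in the classical setting), and all parameter values are illustrative:

```python
import numpy as np

def degrade(hr, scale=2, kernel_size=3, noise_sigma=5.0, seed=0):
    """Classical degradation sketch: blur the HR image with a fixed kernel,
    downsample by striding, then add Gaussian noise. In blind SR, the kernel
    and noise are unknown rather than fixed like this."""
    rng = np.random.default_rng(seed)
    k = kernel_size
    pad = k // 2
    padded = np.pad(hr, pad, mode="edge")
    # blur: average over the k x k neighbourhood of each pixel (box kernel)
    blurred = np.zeros_like(hr, dtype=float)
    for dy in range(k):
        for dx in range(k):
            blurred += padded[dy:dy + hr.shape[0], dx:dx + hr.shape[1]]
    blurred /= k * k
    lr = blurred[::scale, ::scale]                   # downsample by striding
    lr = lr + rng.normal(0, noise_sigma, lr.shape)   # additive noise
    return np.clip(lr, 0, 255)

hr = np.full((8, 8), 128.0)
lr = degrade(hr)
print(lr.shape)  # (4, 4)
```

A model trained only on pairs produced by one such fixed pipeline inherits exactly the domain gap described above.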

Blind image super-resolution methods fall into two main categories: explicit modelling and implicit modelling.

  • Explicit modeling schemes typically use a classical degradation model, i.e., a generalized degradation more complex than bicubic downsampling, commonly written as y = (x ⊗ k)↓s + n, where x is the HR image, k a blur kernel, ↓s downsampling by scale factor s, and n additive noise.
  • Implicit modeling differs from explicit modeling in that it does not rely on any explicit parameterization; instead it uses additional data to implicitly learn the latent super-resolution model through the data distribution. Existing methods often use GAN frameworks to explore the data distribution.

Fig.13 Degradation process in Real ESRGAN

Real-ESRGAN proposes a high-order degradation process to simulate real degradations; trained on purely synthetic data, it can restore most real-world images and achieves better visual results than previous works, making it more practical in real applications. When the classical degradation model above is used to synthesize training pairs, the trained model can indeed handle some real samples. However, the paper observes that fixed degradation steps such as interpolation and blurring still cannot cover some complex real-world degradations, especially unknown noise and complex artifacts, because a large gap remains between synthesized low-resolution images and realistically degraded ones. The classical degradation model is therefore extended to a higher-order process that applies the basic degradation operations repeatedly, simulating more realistic degradation.


Figure 14 Degradation process in PDM-SR

In PDM-SR, the authors treat the degradation as a random variable and model its distribution as a joint distribution of blur kernel and random noise. The proposed probabilistic degradation model (PDM) better decouples the degradation from the image content. Compared with previous degradation models, PDM can generate HR-LR training pairs with a greater diversity of degradations, which helps improve the performance of the SR model on test images. In addition, PDM provides flexible degradation effects that can be adjusted to different practical situations.
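The key shift is from one fixed kernel to sampling a fresh kernel and noise level per training pair. The sketch below illustrates that idea with hand-picked Gaussian-kernel and noise ranges; in PDM the distributions themselves are learned, so treat this as an assumption-laden illustration rather than the paper's method:

```python
import numpy as np

def sample_degradation(rng, size=7):
    """Sample one degradation: a random-width Gaussian blur kernel plus a
    random noise level. Each training pair gets its own draw."""
    sigma = rng.uniform(0.2, 3.0)          # random blur width (illustrative range)
    noise_level = rng.uniform(0.0, 10.0)   # random noise strength (illustrative)
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / kernel.sum(), noise_level

rng = np.random.default_rng(0)
kernels = [sample_degradation(rng) for _ in range(3)]
# every sampled kernel is normalized, but each draw differs
print(all(abs(k.sum() - 1.0) < 1e-9 for k, _ in kernels))  # True
```

Randomizing the degradation per pair is what gives the SR model exposure to a diverse degradation distribution instead of a single fixed pipeline.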


Figure 15 The training method of PDM-SR

PDM also proposes a unified training framework, as shown in the figure, in which PDM is trained jointly with the SR model; PDM can therefore be integrated with any SR model to form a unified framework for blind SR.

4 Combining super-resolution ideas with the matting task

In the matting task, the algorithm needs two key capabilities: semantic understanding and high-resolution fine-grained processing. In the past, to achieve fine matting at high resolution, a network whose input and output were both high-resolution images was usually used. However, this practice consumes enormous training and inference resources.


Based on this understanding, we split the task into two key parts: a low-resolution semantic base network, which solves the problem of identifying the subject, and a Refine network for super-resolution reconstruction and detail repair, which handles reconstruction quality and fineness. With this split, we can handle the matting task efficiently while reducing resource requirements. In the design of the Refine network, we borrowed the structure of LapSRN and adopted a progressive restoration strategy, which lets the network process image detail step by step, effectively fixing subtle problems while maintaining high resolution. This design not only improves matting accuracy but also further reduces the strain on computing resources.

5 Summary

The image super-resolution reconstruction task has advanced from the design of basic components, through the construction of the task architecture, to blind super-resolution and degradation modeling that more realistically simulate the distribution of low-resolution images in real scenes. Image super-resolution algorithms play an indispensable role in the application scenarios of the Hippocampus photo-studio image algorithms: although not used directly as the basic task architecture there, super-resolution components help realize high-resolution solutions under limited computing resources.

Therefore, it is important to have an in-depth understanding and application of image super-resolution reconstruction, and through the introduction of this article, I hope to help readers establish a deep understanding of this technology, and provide a solid foundation for its flexible application in practical work.

Author of this article

Sisyphus, from the algorithm team of Mantu Internet Center.

Source-WeChat public account: Mantu coder

Source: https://mp.weixin.qq.com/s/mcPTBK0TavHr55QlcRAqTQ