SwinIR Practice Guide: An Image Restoration Transformer

2021-12-06

Transformer architectures have become a key ingredient in many complex natural language processing and computer vision applications. In your own photos or visual work, you may have encountered a blurry image that matters to you. Various transformer-based techniques can be used to restore such images, but most of them do not produce the desired results. In this article, we therefore discuss how to use the SwinIR transformer to restore or reconstruct an image.

Let's start the discussion by understanding what image restoration is.

Image restoration techniques, such as image super-resolution (SR), image denoising, and JPEG compression artifact reduction, aim to reconstruct a high-quality clean image from a low-quality degraded one. In other words, image restoration is the process of estimating the clean, original image from a damaged or noisy observation.

Motion blur, noise, and a defocused camera are all examples of degradation. Image restoration works by reversing the blur process: the blur is characterized by imaging a point source, and the resulting point-source image (known as the point spread function, PSF) is used to recover the image information lost during blurring.
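To make the degradation process concrete, below is a minimal sketch of a synthetic degradation pipeline. The Gaussian filter acts as a stand-in PSF, and the blur and noise levels are illustrative values chosen for this sketch, not settings from any paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(clean: np.ndarray, blur_sigma: float = 2.0,
            noise_sigma: float = 0.05) -> np.ndarray:
    """Simulate a simple degradation: Gaussian blur (a stand-in for
    convolution with a PSF) followed by additive Gaussian noise."""
    blurred = gaussian_filter(clean, sigma=blur_sigma)   # blur with a Gaussian PSF
    noisy = blurred + np.random.normal(0.0, noise_sigma, clean.shape)
    return np.clip(noisy, 0.0, 1.0)                      # keep pixel values valid

# Restoration methods try to invert this mapping: given `degraded`,
# estimate `clean`.
clean = np.random.rand(64, 64)      # placeholder for a real image in [0, 1]
degraded = degrade(clean)
```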

Following many pioneering studies, convolutional neural networks (CNNs) have become the main tool for image restoration. Most CNN-based methods focus on elaborate architecture design, such as residual learning and dense connections. Although their performance is significantly better than that of traditional model-based techniques, they still suffer from two main flaws of the basic convolutional layer.

First, the interaction between the image and the convolution kernel is content-independent: using the same kernel to restore different image regions may not be the best choice. Second, because convolution only processes a local neighborhood, it is not effective at modeling long-range dependencies.

Compared with CNNs, Transformers use a self-attention mechanism to capture global interactions across contexts and have shown promising results on a variety of vision tasks. However, vision Transformers for image restoration divide the input image into patches of a fixed size (for example, 48×48) and process each patch independently.

This approach inevitably brings two shortcomings. First, border pixels cannot use neighboring pixels outside their patch to aid restoration. Second, the restored image may show border artifacts around each patch. Overlapping the patches can alleviate this problem, but it increases the computational load.

The SwinIR Transformer shows a lot of promise because it combines the advantages of CNNs and Transformers. Thanks to its local attention mechanism, it shares the CNN advantage of processing large images; thanks to its shifted-window design, it shares the Transformer advantage of modeling long-range dependencies. SwinIR is built on the Swin Transformer.
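The toy sketch below illustrates the window-partition and cyclic-shift operations behind this design. The real Swin layers additionally compute masked multi-head self-attention inside each window, which is omitted here; the tensor shapes are illustrative:

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a feature map (B, H, W, C) into non-overlapping windows,
    returning (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Attention is computed inside each window (cheap and local). Shifting the
# feature map by half a window between consecutive layers lets pixels near
# window borders attend across the previous partition, giving cross-window
# interaction without the cost of global attention.
x = torch.randn(1, 8, 8, 32)                            # toy feature map
shifted = torch.roll(x, shifts=(-4, -4), dims=(1, 2))   # half-window cyclic shift
wins = window_partition(shifted, window_size=4)         # shape: (4, 16, 32)
```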

SwinIR consists of three modules: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. To preserve low-frequency information, the shallow feature extraction module uses a convolutional layer to extract shallow features, which are passed directly to the reconstruction module.

The deep feature extraction module is composed of residual Swin Transformer blocks (RSTBs), each of which applies several Swin Transformer layers for local attention and cross-window interaction. In addition, a residual connection provides a shortcut for feature aggregation, and a convolutional layer at the end of each block enhances the features. Finally, the reconstruction module fuses the shallow and deep features to perform high-quality image reconstruction.
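As a rough structural sketch (not the official implementation), an RSTB can be expressed in PyTorch as follows. The stand-in layers replace real Swin Transformer layers, which operate on token sequences and include windowed attention and MLPs:

```python
import torch
import torch.nn as nn

class RSTB(nn.Module):
    """Sketch of a Residual Swin Transformer Block: a stack of Swin
    Transformer layers, a trailing 3x3 convolution for feature
    enhancement, and a residual connection around the whole block."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        # Placeholders for real Swin Transformer layers (windowed
        # attention + MLP); simple convs keep this sketch runnable.
        self.layers = nn.Sequential(*[nn.Conv2d(dim, dim, 3, padding=1)
                                      for _ in range(depth)])
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual shortcut: aggregated block features are added back
        # to the block input.
        return x + self.conv(self.layers(x))

block = RSTB(dim=60, depth=6)
y = block(torch.randn(1, 60, 48, 48))   # shape preserved: (1, 60, 48, 48)
```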

Compared with common CNN-based image restoration models, SwinIR offers several advantages: the content-based interaction between image content and attention weights can be interpreted as spatially varying convolution; the shifted-window mechanism enables long-range dependency modeling; and it achieves better results with fewer parameters.

As shown in the figure below, SwinIR consists of three modules: shallow feature extraction, deep feature extraction, and high-quality (HQ) image reconstruction. It uses the same feature extraction modules for all restoration tasks, but a different reconstruction module for each task.

SwinIR's architecture (source)

Given a low-quality input image, SwinIR uses a 3×3 convolutional layer to extract shallow features. A convolutional layer works well for early visual processing, makes optimization more stable, and yields better results; it also provides a simple way to map the input image space to a higher-dimensional feature space. The shallow features mainly contain low-frequency content, while the deep features focus on recovering the lost high-frequency content.
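Putting the three modules together, here is a structural sketch of the overall pipeline for ×4 super-resolution, reusing the RSTB stand-in above. The channel and block counts, and the PixelShuffle upsampler, are illustrative choices rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class SwinIRSketch(nn.Module):
    """Structural sketch of SwinIR's three modules for x4 super-resolution.
    `RSTB` is the stand-in block defined in the earlier sketch."""
    def __init__(self, dim: int = 60, num_blocks: int = 4, scale: int = 4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)   # shallow feature extraction
        self.deep = nn.Sequential(                       # deep feature extraction
            *[RSTB(dim, depth=6) for _ in range(num_blocks)],
            nn.Conv2d(dim, dim, 3, padding=1))
        self.reconstruct = nn.Sequential(                # HQ image reconstruction
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))                      # rearrange channels -> x4 upscale

    def forward(self, lq: torch.Tensor) -> torch.Tensor:
        shallow = self.shallow(lq)             # carries low-frequency content
        deep = self.deep(shallow) + shallow    # long skip fuses shallow and deep features
        return self.reconstruct(deep)

model = SwinIRSketch()
hq = model(torch.randn(1, 3, 48, 48))   # output shape: (1, 3, 192, 192)
```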

According to the research paper, SwinIR achieves state-of-the-art performance on six tasks: image super-resolution (classical, lightweight, and real-world), image denoising (grayscale and color), and JPEG compression artifact removal. In this part, we look at the effectiveness of SwinIR and compare it with previous SOTA methods such as Real-ESRGAN and BSRGAN.

Now let us see how Real-ESRGAN, BSRGAN, and SwinIR perform. The code snippets used to visualize the results are available here.
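A minimal way to produce such a side-by-side comparison is sketched below, assuming the restored output of each model has already been saved to disk. The file paths are hypothetical placeholders:

```python
import matplotlib.pyplot as plt
from PIL import Image

# Hypothetical paths: the same scene restored by each model beforehand.
results = {
    "Input (LQ)": "input_lq.png",
    "Real-ESRGAN": "out_realesrgan.png",
    "BSRGAN": "out_bsrgan.png",
    "SwinIR": "out_swinir.png",
}

fig, axes = plt.subplots(1, len(results), figsize=(16, 4))
for ax, (name, path) in zip(axes, results.items()):
    ax.imshow(Image.open(path))   # show each restored image in its own panel
    ax.set_title(name)
    ax.axis("off")
plt.tight_layout()
plt.show()
```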

Through this article, we have seen an overview of image restoration techniques. Specifically, we discussed what image restoration is and how it has evolved and been addressed by various techniques such as CNNs and Transformers, and we compared the outputs of SwinIR with those of BSRGAN and Real-ESRGAN. The results of SwinIR and BSRGAN are relatively close.

Vijaysinh is an enthusiast of machine learning and deep learning. He is skilled in machine learning algorithms, data manipulation, processing and visualization, and model building.

Copyright Analytics India Magazine Pvt Ltd