# [Paper Digest] CVPR 2020: Interesting Image Super-Resolution Algorithms (9 Papers) (1/2)

[Paper Digest] CVPR 2020: Interesting Image Super-Resolution Algorithms (9 papers) (1/2)

[Paper Digest] CVPR 2020: Important Image Super-Resolution Algorithms (9 papers) (2/2) [ongoing updates]

————————  Part 1  ————————

[Paper Digest] CVPR 2020: Interesting Image Super-Resolution Algorithms (9 papers) (1/2)

Unpaired Image Super-Resolution Using Pseudo-Supervision

[pdf] [supp] [bibtex]

Abstract

Loss Functions

Network Architecture

Structure-Preserving Super Resolution With Gradient Guidance

[pdf] [supp] [bibtex]

Abstract

Details in Architecture

Objective Functions

Learning Texture Transformer Network for Image Super-Resolution

[pdf] [supp] [bibtex]

Abstract

Texture Transformer

Cross-Scale Feature Integration

Loss Function

Deep Unfolding Network for Image Super-Resolution

[pdf] [bibtex]

Meta-Transfer Learning for Zero-Shot Super-Resolution

[pdf] [supp] [bibtex]

————————  Part 2  ————————

Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution

[pdf] [supp] [bibtex]

Residual Feature Aggregation Network for Image Super-Resolution

[pdf] [supp] [bibtex]

Correction Filter for Single Image Super-Resolution: Robustifying Off-the-Shelf Deep Super-Resolvers

[pdf] [supp] [bibtex]

Image Super-Resolution With Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining

[pdf] [supp] [bibtex]

1. Unpaired Image Super-Resolution Using Pseudo-Supervision  [pdf] [supp] [bibtex]

Abstract

In most studies on learning-based image super-resolution (SR), the paired training dataset is created by downscaling high-resolution (HR) images with a predetermined operation (e.g., bicubic). However, these methods fail to super-resolve real-world low-resolution (LR) images, for which the degradation process is much more complicated and unknown.

Motivation: conventional methods do not handle real-world low-resolution images well; their degradation process is complicated and unknown.

In this paper, we propose an unpaired SR method using a generative adversarial network that does not require a paired/aligned training dataset. Our network consists of an unpaired kernel/noise correction network and a pseudo-paired SR network. The correction network removes noise and adjusts the kernel of the inputted LR image; then, the corrected clean LR image is upscaled by the SR network. In the training phase, the correction network also produces a pseudo-clean LR image from the inputted HR image, and then a mapping from the pseudo-clean LR image to the inputted HR image is learned by the SR network in a paired manner. Because our SR network is independent of the correction network, well-studied existing network architectures and pixel-wise loss functions can be integrated with the proposed framework.

1. Proposes an unpaired SR method based on a generative adversarial network that requires no paired/aligned training dataset.

2. Network structure: an unpaired kernel/noise correction network and a pseudo-paired SR network.

Correction network: removes noise and adjusts the kernel of the input LR image; during training, it also produces a pseudo-clean LR image from the input HR image.

SR network: upscales the corrected clean LR image; during training, it learns the mapping from the pseudo-clean LR image to the input HR image in a paired manner (see the sketch below).
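
To make this data flow concrete, here is a minimal PyTorch-style sketch of the $\mathcal{L}_{rec}$ step, assuming hypothetical modules `G_yx` (clean LR to source-domain LR, i.e. $G_{Y_\downarrow X}$), `G_xy` (the correction network $G_{XY_\downarrow}$), and `U` (the SR network $U_{Y_\downarrow Y}$); this illustrates the idea and is not the authors' code.

```python
import torch
import torch.nn.functional as F

def pseudo_supervised_step(G_yx, G_xy, U, y, scale=4):
    """One L_rec step of pseudo-supervision (sketch, not the authors' code).

    y    : clean HR batch, shape (B, 3, H, W)
    G_yx : maps clean LR -> pseudo-noisy LR (trained adversarially, unpaired)
    G_xy : correction net, noisy LR -> clean LR
    U    : SR net, clean LR -> HR
    """
    # Bicubic downscaling of the HR image gives a clean LR image y_down.
    y_down = F.interpolate(y, scale_factor=1 / scale, mode="bicubic",
                           align_corners=False)
    # Degrade it into the source LR domain, then correct it back:
    # the result is a *pseudo-clean* LR image aligned with y.
    pseudo_clean = G_xy(G_yx(y_down))
    # The SR network can now be trained in a paired manner against y.
    sr = U(pseudo_clean)
    return F.l1_loss(sr, y)
```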

Figure 3: Data-flow diagram of the proposed method. The SR network $U_{Y_\downarrow Y}$ can be learned in a paired manner through $\mathcal{L}_{rec}$, even if the training dataset $\{X, Y\}$ is not paired. The whole network is end-to-end trainable.

Experiments on diverse datasets show that the proposed method is superior to existing solutions to the unpaired SR problem.

Loss Functions

1. Adversarial loss (the middle GAN in Figure 3).

2. Adversarial loss (the right GAN in Figure 3); the circle operator $\circ$ in its formula denotes function composition.

Cycle consistency loss

The normal CycleGAN learns one-to-one mappings because it imposes cycle consistency on both cycles (i.e., $X \rightarrow Y \rightarrow X$ and $Y \rightarrow X \rightarrow Y$).
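
For reference, a minimal sketch of the standard two-sided cycle-consistency term (the generic CycleGAN form, with hypothetical generators `G_xy` and `G_yx`; the paper's variant may weight or restrict these terms differently):

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, x, y):
    """Cycle consistency in both directions (standard CycleGAN form;
    a sketch, not the paper's exact formulation)."""
    loss_x = F.l1_loss(G_yx(G_xy(x)), x)  # X -> Y -> X
    loss_y = F.l1_loss(G_xy(G_yx(y)), y)  # Y -> X -> Y
    return loss_x + loss_y
```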

Identity mapping loss

Geometric ensemble loss

The operators represent eight distinct patterns of flip and rotation.
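
A sketch of those eight operators (the dihedral group of a square), assuming `torch` tensors in `(B, C, H, W)` layout:

```python
import torch

def dihedral_transforms(x):
    """The eight flip/rotation patterns of a feature map (the dihedral
    group D4); a sketch of the operators behind the geometric ensemble
    loss. x has shape (B, C, H, W)."""
    outs = []
    for k in range(4):                         # 0, 90, 180, 270 degrees
        r = torch.rot90(x, k, dims=(2, 3))
        outs.append(r)
        outs.append(torch.flip(r, dims=(3,)))  # plus a horizontal flip
    return outs                                # list of 8 tensors
```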

Full objective

Network Architecture

A few keywords are enough to get a rough idea of the architecture.

The RCAN consists of 10 residual groups (RGs), where each RG contains 20 residual channel attention blocks (RCABs).

Our $G_{XY_\downarrow}$ ($U_{Y_\downarrow Y}$) is a reduced version of the RCAN, consisting of five RGs with 10 (20) RCABs.

RCAN: Image super-resolution using very deep residual channel attention networks
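
For intuition, a minimal sketch of one RCAB; `channels` and `reduction` are assumed hyperparameters (the official RCAN uses 64 channels and reduction 16), and a residual group (RG) simply stacks several RCABs behind a skip connection:

```python
import torch.nn as nn

class RCAB(nn.Module):
    """Residual channel attention block, as in RCAN (minimal sketch)."""

    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Channel attention: global pooling -> bottleneck -> sigmoid gate.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.body(x)
        return x + res * self.attention(res)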

Several residual blocks with 5×5 filters and several fusion layers with 1×1 filters are used, where each convolution layer is followed by batch normalization (BN) [16] and LeakyReLU.

---------------------------------------------------------------------

Figure 4: Intermediate images of proposed method. x is image “0886” from the DIV2K realistic-wild validation set, and y is image “0053” from the DIV2K training ground-truth set.

2. Structure-Preserving Super Resolution With Gradient Guidance [pdf] [supp] [bibtex]

Abstract

Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial network (GAN) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images.

Motivation: undesired structural distortions always exist in the recovered images.

In this paper, we propose a structure-preserving super resolution method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptual-pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic, which can be potentially used for off-the-shelf SR networks.

1. A gradient branch restores high-resolution gradient maps, providing additional structural priors for the SR process;

2. A gradient loss imposes a second-order restriction on the super-resolved images (a minimal sketch of both ideas follows).
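
A minimal sketch of a gradient map and the resulting gradient-space loss; the exact gradient operator in SPSR may differ from these simple forward differences:

```python
import torch
import torch.nn.functional as F

def gradient_map(img):
    """Gradient magnitude map of an image batch (B, C, H, W); a sketch
    of the gradient operator behind SPSR-style gradient guidance."""
    dx = img[:, :, :, 1:] - img[:, :, :, :-1]   # horizontal differences
    dy = img[:, :, 1:, :] - img[:, :, :-1, :]   # vertical differences
    dx = F.pad(dx, (0, 1, 0, 0))                # pad back to (H, W)
    dy = F.pad(dy, (0, 0, 0, 1))
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-6)

def gradient_loss(sr, hr):
    """Second-order restriction: match the gradient maps of SR and HR."""
    return F.l1_loss(gradient_map(sr), gradient_map(hr))
```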

Experimental results show that we achieve the best PI and LPIPS performance and meanwhile comparable PSNR and SSIM compared with state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.

Details in Architecture

Figure 2. Overall framework of our SPSR method. Our architecture consists of two branches, the SR branch and the gradient branch. The gradient branch aims to super-resolve LR gradient maps to the HR counterparts. It incorporates multi-level representations from the SR branch to reduce parameters and outputs gradient information to guide the SR process by a fusion block in turn. The final SR outputs are optimized by not only conventional image-space losses, but also the proposed gradient-space objectives.

Structure-Preserving SR Branch

Objective Functions

Pixelwise loss

Perceptual loss

where $\phi_i(\cdot)$ denotes the $i$-th layer output of the VGG model.
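
A common way to implement such a perceptual loss is shown below; the `layer_index`, the L1 distance, and the VGG19 backbone are assumptions of this sketch, and inputs are assumed to be normalized to ImageNet statistics:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class PerceptualLoss(torch.nn.Module):
    """Perceptual loss on frozen VGG features (a common formulation;
    the paper's exact layer choice may differ)."""

    def __init__(self, layer_index=35):  # e.g. up to relu5_4 of VGG19
        super().__init__()
        self.phi = vgg19(pretrained=True).features[: layer_index + 1].eval()
        for p in self.phi.parameters():
            p.requires_grad = False     # the feature extractor is fixed

    def forward(self, sr, hr):
        return F.l1_loss(self.phi(sr), self.phi(hr))
```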

Figure 3. An illustration of a simple 1-D case. The first row shows the pixel sequences and the second row shows their corresponding gradient maps.

Gradient loss

3. Learning Texture Transformer Network for Image Super-Resolution [pdf] [supp] [bibtex]

Abstract

We study on image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from Ref images, which limits these approaches in challenging cases.

Motivation: existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from the reference (Ref) images.

In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively. TTSR consists of four closely-related modules optimized for image generation tasks, including a learnable texture extractor by DNN, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. Such a design encourages joint feature learning across LR and Ref images, in which deep feature correspondences can be discovered by attention, and thus accurate texture features can be transferred. The proposed texture transformer can be further stacked in a cross-scale way, which enables texture recovery from different levels (e.g., from 1x to 4x magnification).

Figure 2. The proposed texture transformer. Q, K and V are the texture features extracted from an up-sampled LR image, a sequentially down/up-sampled Ref image, and an original Ref image, respectively. H and S indicate the hard/soft attention map, calculated from relevance embedding. F is the LR features extracted from a DNN backbone, and is further fused with the transferred texture features T for generating the SR output.

Extensive experiments show that TTSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations.

Texture Transformer

In Figure 2, LR, LR↑ and Ref represent the input image, the 4× bicubic-upsampled input image and the reference image, respectively.

There are four parts in the texture transformer: the learnable texture extractor (LTE), the relevance embedding module (RE), the hard-attention module for feature transfer (HA) and the soft-attention module for feature synthesis (SA).

Learnable Texture Extractor

We design a learnable texture extractor whose parameters will be updated during end-to-end training. Such a design encourages a joint feature learning across the LR and Ref image, in which more accurate texture features can be captured. The process of texture extraction can be expressed as:

$$Q = LTE(LR\!\uparrow), \quad K = LTE(Ref\!\downarrow\uparrow), \quad V = LTE(Ref),$$

where $LTE(\cdot)$ denotes the output of our learnable texture extractor. The extracted texture features, Q (query), K (key), and V (value), indicate the three basic elements of the attention mechanism inside a transformer and will be further used in our relevance embedding module.

Relevance Embedding

Relevance embedding aims to embed the relevance between the LR and Ref image by estimating the similarity between Q and K. We unfold both Q and K into patches, denoted as $q_i~(i \in [1, H_{LR} \times W_{LR}])$ and $k_j~(j \in [1, H_{Ref} \times W_{Ref}])$. Then for each patch $q_i$ in Q and $k_j$ in K, we calculate the relevance $r_{i,j}$ between these two patches by normalized inner product:

$$r_{i,j} = \left\langle \frac{q_i}{\|q_i\|},\ \frac{k_j}{\|k_j\|} \right\rangle.$$
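
A sketch of this relevance computation with `torch.nn.functional.unfold`; the patch size `patch=3` is an assumption:

```python
import torch
import torch.nn.functional as F

def relevance(Q, K, patch=3):
    """Normalized inner product between all LR (Q) and Ref (K) patches.
    Q: (B, C, H_lr, W_lr), K: (B, C, H_ref, W_ref). Returns r of shape
    (B, N_lr, N_ref). A sketch of TTSR's relevance embedding."""
    q = F.unfold(Q, kernel_size=patch, padding=patch // 2)  # (B, C*p*p, N_lr)
    k = F.unfold(K, kernel_size=patch, padding=patch // 2)  # (B, C*p*p, N_ref)
    q = F.normalize(q, dim=1)   # each patch vector divided by its norm
    k = F.normalize(k, dim=1)
    return torch.einsum("bci,bcj->bij", q, k)               # r_{i,j}
```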

Hard-Attention

We propose a hard-attention module to transfer the HR texture features V from the Ref image. A traditional attention mechanism takes a weighted sum of V for each query $q_i$. However, such an operation may cause a blur effect and lacks the ability to transfer HR texture features. Therefore, in our hard-attention module, we only transfer features from the most relevant position in V for each query $q_i$.

More specifically, we first calculate a hard-attention map H in which the $i$-th element $h_i~(i \in [1, H_{LR} \times W_{LR}])$ is calculated from the relevance $r_{i,j}$:

$$h_i = \underset{j}{\arg\max}~r_{i,j}.$$

The value of $h_i$ can be regarded as a hard index, which represents the most relevant position in the Ref image to the $i$-th position in the LR image. To obtain the transferred HR texture features T from the Ref image, we apply an index selection operation to the unfolded patches of V using the hard-attention map as the index:

$$t_i = v_{h_i},$$

where $t_i$ denotes the value of T in the $i$-th position, which is selected from the $h_i$-th position of V. As a result, we obtain an HR feature representation T for the LR image, which will be further used in our soft-attention module.
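
A sketch of the hard-attention transfer; the patch size and the overlap-averaging at the end are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def hard_attention_transfer(r, V, out_size, patch=3):
    """Transfer the most relevant Ref patch for every LR position.
    r: relevance (B, N_lr, N_ref); V: Ref texture features (B, C, H, W);
    out_size: (H_lr, W_lr). A sketch of TTSR's hard attention."""
    h = r.argmax(dim=-1)                                    # h_i = argmax_j r_{i,j}
    v = F.unfold(V, kernel_size=patch, padding=patch // 2)  # (B, C*p*p, N_ref)
    idx = h.unsqueeze(1).expand(-1, v.size(1), -1)          # (B, C*p*p, N_lr)
    t = v.gather(2, idx)                                    # t_i = v_{h_i}
    # Fold the selected patches back into a feature map T, averaging
    # the contributions of overlapping patches.
    T = F.fold(t, output_size=out_size, kernel_size=patch, padding=patch // 2)
    ones = torch.ones_like(T)
    overlap = F.fold(F.unfold(ones, kernel_size=patch, padding=patch // 2),
                     output_size=out_size, kernel_size=patch, padding=patch // 2)
    return T / overlap
```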

Soft-Attention

We propose a soft-attention module to synthesize features from the transferred HR texture features T and the LR features F of the LR image from a DNN backbone. During the synthesis process, relevant texture transfer should be enhanced while less relevant transfer should be suppressed. To achieve that, a soft-attention map S is computed from $r_{i,j}$ to represent the confidence of the transferred texture features for each position in T:

$$s_i = \max_j r_{i,j},$$

where $s_i$ denotes the $i$-th position of the soft-attention map S. Instead of directly applying the attention map S to T, we first fuse the HR texture features T with the LR features F to leverage more information from the LR image. The fused features are then element-wise multiplied by the soft-attention map S and added back to F to get the final output of the texture transformer. This operation can be represented as:

$$F_{out} = F + \mathrm{Conv}(\mathrm{Concat}(F, T)) \odot S,$$

where $F_{out}$ indicates the synthesized output features. Conv and Concat represent a convolutional layer and a concatenation operation, respectively. The operator $\odot$ denotes element-wise multiplication between feature maps.

$s_i$ is interesting: it is the maximum of $r_{i,j}$ over the index $j$, while $h_i$ is the index $j$ at which that maximum ($s_i$) is attained.
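
A sketch of the soft-attention synthesis; `SoftAttentionFusion` is a hypothetical module name and the 3×3 fusion convolution is an assumption:

```python
import torch
import torch.nn as nn

class SoftAttentionFusion(nn.Module):
    """F_out = F + Conv(Concat(F, T)) ⊙ S (a sketch of TTSR's synthesis)."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, F_lr, T, r):
        # s_i = max_j r_{i,j}: confidence of the transferred texture.
        B, _, H, W = F_lr.shape
        S = r.max(dim=-1).values.view(B, 1, H, W)
        fused = self.conv(torch.cat([F_lr, T], dim=1))
        return F_lr + fused * S
```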

Cross-Scale Feature Integration

Our texture transformer can be further stacked in a cross-scale way with a cross-scale feature integration module. The architecture is shown in Figure 3. Stacked texture transformers output the synthesized features for three resolution scales (1×, 2× and 4×), such that the texture features of different scales can be fused into the LR image. To learn a better representation across different scales, inspired by [25, 37], we propose a cross-scale feature integration module (CSFI) to exchange information among the features of different scales. A CSFI module is applied each time the LR feature is up-sampled to the next scale. Each scale inside the CSFI module receives the exchanged features from other scales by up/down-sampling, followed by a concatenation operation in the channel dimension. Then a convolutional layer maps the features back to the original number of channels. In such a design, the texture features transferred from the stacked texture transformers are exchanged across each scale, which achieves a more powerful feature representation. This cross-scale feature integration module further improves the performance of our approach.

Figure 3. Architecture of stacking multiple texture transformers in a cross-scale way with the proposed cross-scale feature integration module (CSFI). RBs indicates a group of residual blocks.
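
A reduced two-scale sketch of one CSFI exchange (the full module spans three scales; the layer shapes and interpolation mode here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSFI2(nn.Module):
    """Cross-scale feature integration between a 1x and a 2x branch
    (a reduced sketch of TTSR's CSFI)."""

    def __init__(self, c=64):
        super().__init__()
        self.up = nn.Conv2d(c, c, 1)                         # after 1x -> 2x upsampling
        self.down = nn.Conv2d(c, c, 3, stride=2, padding=1)  # 2x -> 1x
        self.merge1 = nn.Conv2d(2 * c, c, 3, padding=1)      # back to c channels
        self.merge2 = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, f1, f2):
        up = self.up(F.interpolate(f1, scale_factor=2, mode="bicubic",
                                   align_corners=False))
        down = self.down(f2)
        # Each scale concatenates the exchanged features, then a conv
        # maps them back to the original channel count.
        f1 = self.merge1(torch.cat([f1, down], dim=1))
        f2 = self.merge2(torch.cat([f2, up], dim=1))
        return f1, f2
```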

Loss Function

Reconstruction loss

Perceptual loss

4. Deep Unfolding Network for Image Super-Resolution [pdf] [bibtex]

Abstract

Learning-based single image super-resolution (SISR) methods are continuously showing superior effectiveness and efficiency over traditional model-based methods, largely due to the end-to-end training. However, different from model-based methods that can handle the SISR problem with different scale factors, blur kernels and noise levels under a unified MAP (maximum a posteriori) framework, learning-based methods generally lack such flexibility. To address this issue, this paper proposes an end-to-end trainable unfolding network which leverages both learning-based methods and model-based methods. Specifically, by unfolding the MAP inference via a half-quadratic splitting algorithm, a fixed number of iterations consisting of alternately solving a data subproblem and a prior subproblem can be obtained. The two subproblems then can be solved with neural modules, resulting in an end-to-end trainable, iterative network. As a result, the proposed network inherits the flexibility of model-based methods to super-resolve blurry, noisy images for different scale factors via a single model, while maintaining the advantages of learning-based methods. Extensive experiments demonstrate the superiority of the proposed deep unfolding network in terms of flexibility, effectiveness and also generalizability.
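
The half-quadratic splitting this describes takes roughly the following standard form (my notation; the paper's exact formulation and weighting may differ):

```latex
% MAP energy for SISR with blur kernel k and s-fold downsampling:
%   min_x (1/2σ²) ||y − (x ⊗ k)↓_s||² + λ Φ(x)
% HQS introduces an auxiliary variable z and alternates two subproblems:
\begin{aligned}
z_k &= \arg\min_z \; \|y-(z\otimes k)\!\downarrow_s\|^2 + \mu\sigma^2\,\|z-x_{k-1}\|^2
      && \text{(data module, closed form via the FFT)} \\
x_k &= \arg\min_x \; \tfrac{\mu}{2}\,\|z_k-x\|^2 + \lambda\,\Phi(x)
      && \text{(prior module, a learned denoiser)}
\end{aligned}
```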

5. Meta-Transfer Learning for Zero-Shot Super-Resolution [pdf] [supp] [bibtex]

Abstract

Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance based on the external dataset, they cannot exploit internal information within a specific image. Another problem is that they are applicable only to the specific data conditions under which they were supervised. For instance, the low-resolution (LR) image should be a "bicubic" downsampled noise-free image from a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, they require thousands of gradient updates, i.e., long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Precisely, it is based on finding a generic initial parameter that is suitable for internal learning. Thus, we can exploit both external and internal information, where one single gradient update can yield quite considerable results. With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
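
A rough sketch of the zero-shot adaptation loop this describes; the bicubic "LR son" and the plain SGD step are simplifications of this sketch (MZSR downscales with an estimated kernel and starts from a meta-learned initialization):

```python
import torch
import torch.nn.functional as F

def zero_shot_adapt(model, lr_img, scale=2, steps=1, alpha=0.01):
    """Test-time internal learning in the spirit of ZSSR/MZSR (a sketch).
    A training pair is built from the test image itself by downscaling it
    again, then a few gradient steps are taken from the initialization."""
    son = F.interpolate(lr_img, scale_factor=1 / scale, mode="bicubic",
                        align_corners=False)   # "LR son" of the test image
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for _ in range(steps):                     # MZSR: even steps == 1 helps
        loss = F.l1_loss(model(son), lr_img)   # model upscales by `scale`
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(lr_img)                   # super-resolve the input
```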

6. Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution [pdf] [supp] [bibtex]

Abstract

Deep neural networks have exhibited promising performance in image super-resolution (SR) by learning a nonlinear mapping function from low-resolution (LR) images to high-resolution (HR) images. However, there are two underlying limitations to existing SR methods. First, learning the mapping function from LR to HR images is typically an ill-posed problem, because there exist infinite HR images that can be downsampled to the same LR image. As a result, the space of the possible functions can be extremely large, which makes it hard to find a good solution. Second, the paired LR-HR data may be unavailable in real-world applications and the underlying degradation method is often unknown. For such a more general case, existing SR models often incur the adaptation problem and yield poor performance. To address the above issues, we propose a dual regression scheme by introducing an additional constraint on LR data to reduce the space of the possible functions. Specifically, besides the mapping from LR to HR images, we learn an additional dual regression mapping that estimates the down-sampling kernel and reconstructs LR images, which forms a closed loop to provide additional supervision. More critically, since the dual regression process does not depend on HR images, we can directly learn from LR images. In this sense, we can easily adapt SR models to real-world data, e.g., raw video frames from YouTube. Extensive experiments with paired training data and unpaired real-world data demonstrate our superiority over existing methods.
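
A minimal sketch of the closed-loop objective, with hypothetical models `P` (primal, LR → HR) and `D` (dual, HR → LR) and an assumed weight `lam`:

```python
import torch.nn.functional as F

def dual_regression_loss(P, D, x_lr, y_hr=None, lam=0.1):
    """Closed-loop training objective (sketch). P: LR -> HR primal model,
    D: HR -> LR dual model. The dual cycle needs no HR ground truth, so
    it also applies to unpaired real-world LR images (y_hr=None)."""
    sr = P(x_lr)
    loss = lam * F.l1_loss(D(sr), x_lr)       # dual: reconstruct the LR input
    if y_hr is not None:
        loss = loss + F.l1_loss(sr, y_hr)     # primal: usual paired loss
    return loss
```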

7. Residual Feature Aggregation Network for Image Super-Resolution [pdf] [supp] [bibtex]

Abstract

Recently, very deep convolutional neural networks (CNNs) have shown great power in single image super-resolution (SISR) and achieved significant improvements against traditional methods. Among these CNN-based methods, the residual connections play a critical role in boosting the network performance. As the network depth grows, the residual features gradually focus on different aspects of the input image, which is very useful for reconstructing the spatial details. However, existing methods neglect to fully utilize the hierarchical features on the residual branches. To address this issue, we propose a novel residual feature aggregation (RFA) framework for more efficient feature extraction. The RFA framework groups several residual modules together and directly forwards the features on each local residual branch by adding skip connections. Therefore, the RFA framework is capable of aggregating these informative residual features to produce more representative features. To maximize the power of the RFA framework, we further propose an enhanced spatial attention (ESA) block to make the residual features more focused on critical spatial contents. The ESA block is designed to be lightweight and efficient. Our final RFANet is constructed by applying the proposed RFA framework with the ESA blocks. Comprehensive experiments demonstrate the necessity of our RFA framework and the superiority of our RFANet over state-of-the-art SISR methods.
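
A minimal sketch of one RFA module with four residual blocks; the block design, block count, and 1×1 fusion are assumptions read off the abstract (the ESA block is omitted):

```python
import torch
import torch.nn as nn

class RFA(nn.Module):
    """Residual feature aggregation (sketch): forward the local residual
    features of four residual blocks and fuse them with a 1x1 conv."""

    def __init__(self, c=64):
        super().__init__()
        def block():
            return nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(c, c, 3, padding=1))
        self.blocks = nn.ModuleList([block() for _ in range(4)])
        self.fuse = nn.Conv2d(4 * c, c, 1)

    def forward(self, x):
        feats, h = [], x
        for b in self.blocks:
            r = b(h)          # local residual features
            feats.append(r)   # skip connection straight to the aggregation
            h = h + r
        return x + self.fuse(torch.cat(feats, dim=1))
```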

8. Correction Filter for Single Image Super-Resolution: Robustifying Off-the-Shelf Deep Super-Resolvers [pdf] [supp] [bibtex]

Abstract

The single image super-resolution task is one of the most examined inverse problems in the past decade. In recent years, Deep Neural Networks (DNNs) have shown superior performance over alternative methods when the acquisition process uses a fixed known downscaling kernel---typically a bicubic kernel. However, several recent works have shown that in practical scenarios, where the test data mismatch the training data (e.g. when the downscaling kernel is not the bicubic kernel or is not available at training), the leading DNN methods suffer from a huge performance drop. Inspired by the literature on generalized sampling, in this work we propose a method for improving the performance of DNNs that have been trained with a fixed kernel on observations acquired by other kernels. For a known kernel, we design a closed-form correction filter that modifies the low-resolution image to match one which is obtained by another kernel (e.g. bicubic), and thus improves the results of existing pre-trained DNNs. For an unknown kernel, we extend this idea and propose an algorithm for blind estimation of the required correction filter. We show that our approach outperforms other super-resolution methods, which are designed for general downscaling kernels.
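
In symbols, the goal the abstract describes can be sketched as follows (my notation, not the paper's; the closed-form construction of $h$ is in the paper):

```latex
% y is observed with kernel k, but the pre-trained SR net expects
% bicubic-downscaled inputs. The correction filter h is chosen so that
% the filtered observation looks as if it had been acquired bicubically:
h \ast \big((x \ast k)\!\downarrow_s\big) \;\approx\; (x \ast k_{\mathrm{bic}})\!\downarrow_s
```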

9. Image Super-Resolution With Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining  [pdf] [supp] [bibtex]

Abstract

Deep convolution-based single image super-resolution (SISR) networks embrace the benefits of learning from large-scale external image resources for local recovery, yet most existing works have ignored the long-range feature-wise similarities in natural images. Some recent works have successfully leveraged this intrinsic feature correlation by exploring non-local attention modules. However, none of the current deep models have studied another inherent property of images: cross-scale feature correlation. In this paper, we propose the first Cross-Scale Non-Local (CS-NL) attention module with integration into a recurrent neural network. By combining the new CS-NL prior with local and in-scale non-local priors in a powerful recurrent fusion cell, we can find more cross-scale feature correlations within a single low-resolution (LR) image. The performance of SISR is significantly improved by exhaustively integrating all possible priors. Extensive experiments demonstrate the effectiveness of the proposed CS-NL module by setting new state-of-the-arts on multiple SISR benchmarks.