[Paper Quick Look] CVPR 2020: Interesting Image Super-Resolution Algorithms (9 Papers) (1/2)
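Keywords: Unpaired; Pseudo-Supervision; Gradient Guidance; Texture Transformer Network; Deep Unfolding Network; Meta-Transfer; Zero-Shot; Super-Resolution
This post is a quick tour of the interesting (important) SR papers at CVPR 2020. The goal is to get up to speed on the latest directions in SR: what problem each paper solves and what model it uses.
[Paper Quick Look] CVPR 2020: Important Image Super-Resolution Algorithms (9 Papers in Total) (2/2) [continuously updated]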
---------- Part 1 ----------
[Paper Quick Look] CVPR 2020: Interesting Image Super-Resolution Algorithms (9 Papers in Total) (1/2)
Unpaired Image Super-Resolution Using Pseudo-Supervision
[pdf] [supp] [bibtex]
Structure-Preserving Super Resolution With Gradient Guidance
[pdf] [supp] [bibtex]
Details in Architecture
Learning Texture Transformer Network for Image Super-Resolution
[pdf] [supp] [bibtex]
Cross-Scale Feature Integration
Deep Unfolding Network for Image Super-Resolution
Meta-Transfer Learning for Zero-Shot Super-Resolution
[pdf] [supp] [bibtex]
---------- Part 2 ----------
Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution
[pdf] [supp] [bibtex]
Residual Feature Aggregation Network for Image Super-Resolution
[pdf] [supp] [bibtex]
Correction Filter for Single Image Super-Resolution: Robustifying Off-the-Shelf Deep Super-Resolvers
[pdf] [supp] [bibtex]
Image Super-Resolution With Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining
[pdf] [supp] [bibtex]
1. Unpaired Image Super-Resolution Using Pseudo-Supervision [pdf] [supp] [bibtex]
Abstract
In most studies on learning-based image super-resolution (SR), the paired training dataset is created by downscaling high-resolution (HR) images with a predetermined operation (e.g., bicubic). However, these methods fail to super-resolve real-world low-resolution (LR) images, for which the degradation process is much more complicated and unknown.
In this paper, we propose an unpaired SR method using a generative adversarial network that does not require a paired/aligned training dataset. Our network consists of an unpaired kernel/noise correction network and a pseudo-paired SR network. The correction network removes noise and adjusts the kernel of the inputted LR image; then, the corrected clean LR image is upscaled by the SR network. In the training phase, the correction network also produces a pseudo-clean LR image from the inputted HR image, and then a mapping from the pseudo-clean LR image to the inputted HR image is learned by the SR network in a paired manner. Because our SR network is independent of the correction network, well-studied existing network architectures and pixel-wise loss functions can be integrated with the proposed framework.
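2. Network structure: an unpaired kernel/noise correction network and a pseudo-paired SR network.
correction network: removes noise and adjusts the kernel of the input LR image; during training, it also generates a pseudo-clean LR image from the input HR image.
SR network: upscales the corrected clean LR image; during training, it learns the mapping from the pseudo-clean LR image to the input HR image in a paired manner.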
Figure 3: Data-flow diagram of proposed method. SR network can be learned in a paired manner through , even if the training dataset is not paired. The whole network is end-to-end trainable.
Experiments on diverse datasets show that the proposed method is superior to existing solutions to the unpaired SR problem.
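The symbols in the formulas correspond to those in Figure 3.
Adversarial loss
1. (Figure 3, the GAN in the middle)
2. (Figure 3, the GAN on the right) (in the formulas, the circle ∘ denotes function composition)
Cycle consistency loss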
The normal CycleGAN learns one-to-one mappings because it imposes cycle consistency on both cycles (i.e., and).
This loss trains the two generators.
Identity mapping loss
This loss trains the generator.
Geometric ensemble loss
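Geometric consistency was introduced in the recent work [Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping]; it shrinks the space of possible translations so that scene geometry is preserved.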
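To make the geometric ensemble idea concrete, here is a minimal NumPy sketch. It assumes the transforms are the eight flip/rotation patterns of a square grid, and that the loss compares a generator output with the average of its outputs over those patterns (each mapped back by the inverse transform). The function names and the mean-absolute-difference distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def eight_patterns():
    """Return (forward, inverse) pairs for the 8 flip/rotation patterns of an H x W x C image."""
    pairs = []
    for k in range(4):
        # Rotation by k * 90 degrees, undone by rotating back.
        pairs.append((lambda im, k=k: np.rot90(im, k, axes=(0, 1)),
                      lambda im, k=k: np.rot90(im, -k, axes=(0, 1))))
        # Horizontal flip followed by rotation; the inverse undoes the rotation, then the flip.
        pairs.append((lambda im, k=k: np.rot90(im[:, ::-1], k, axes=(0, 1)),
                      lambda im, k=k: np.rot90(im, -k, axes=(0, 1))[:, ::-1]))
    return pairs

def geometric_ensemble(generator, image):
    """Average the generator outputs over all 8 patterns, each mapped back to the original frame."""
    outputs = [inv(generator(fwd(image))) for fwd, inv in eight_patterns()]
    return np.mean(outputs, axis=0)

def geometric_ensemble_loss(generator, image):
    """Assumed distance: mean absolute difference between the plain output and the ensemble."""
    return float(np.mean(np.abs(generator(image) - geometric_ensemble(generator, image))))
```

A generator that commutes with flips and rotations (for example, a per-pixel scaling) gets zero loss, while one that shifts the image content is penalized.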
The operators represent eight distinct patterns of flip and rotation.
Full objective
The RCAN consists of 10 residual groups (RGs), where each RG contains 20 residual channel attention blocks (RCABs).
Our GXY↓ (UY↓Y ) is a reduced version of the RCAN consisting of five RGs with 10 (20) RCABs.
RCAN : Image super-resolution using very deep residual channel attention networks
use several residual blocks with 5×5 filters and several fusion layers with 1×1 filters, where each convolution layer is followed by batch normalization (BN)  and LeakyReLU.
Figure 4: Intermediate images of proposed method. x is image “0886” from the DIV2K realistic-wild validation set, and y is image “0053” from the DIV2K training ground-truth set.
2. Structure-Preserving Super Resolution With Gradient Guidance [pdf] [supp] [bibtex]
Abstract
Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial network (GAN) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images.
In this paper, we propose a structure-preserving super resolution method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptual-pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic, which can be potentially used for off-the-shelf SR networks.
Experimental results show that we achieve the best PI and LPIPS performance and meanwhile comparable PSNR and SSIM compared with state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.
Details in Architecture
Figure 2. Overall framework of our SPSR method. Our architecture consists of two branches, the SR branch and the gradient branch. The gradient branch aims to super-resolve LR gradient maps to the HR counterparts. It incorporates multi-level representations from the SR branch to reduce parameters and outputs gradient information to guide the SR process by a fusion block in turn. The final SR outputs are optimized by not only conventional image-space losses, but also the proposed gradient-space objectives.
Gradient Branch
In Figure 2, the function M(·) denotes the operation that extracts the gradient map from an image.
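A minimal sketch of such a gradient-map operator, assuming central differences in the horizontal and vertical directions, zero padding at the borders, and the L2 magnitude of the resulting gradient vector. The function name `gradient_map` and the padding choice are assumptions made for illustration.

```python
import numpy as np

def gradient_map(image):
    """Gradient magnitude of a 2-D grayscale image using central differences."""
    padded = np.pad(image.astype(np.float64), 1, mode="constant")
    # Horizontal difference: I(x+1, y) - I(x-1, y)
    grad_x = padded[1:-1, 2:] - padded[1:-1, :-2]
    # Vertical difference: I(x, y+1) - I(x, y-1)
    grad_y = padded[2:, 1:-1] - padded[:-2, 1:-1]
    # Keep only the magnitude; the direction is discarded.
    return np.sqrt(grad_x ** 2 + grad_y ** 2)
```

On a vertical step edge, the interior response is non-zero only in the two columns next to the step, which is why the map works as a compact description of structure.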
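Once the gradient branch has produced the SR gradient map, the resulting gradient features can be integrated into the SR branch to guide SR reconstruction in turn.
In practice, the feature maps produced by the next-to-last layer of the gradient branch are fed to the SR branch. The same feature maps are also passed through a 1x1 convolution layer to generate the output gradient map.
Structure-Preserving SR Branch
The first part is a regular SR network composed of multiple generative neural blocks, which can have any architecture.
The paper adopts the Residual-in-Residual Dense Block (RRDB) proposed in ESRGAN. The original model has 23 RRDB blocks, so the feature maps of the 5th, 10th, 15th, and 20th blocks are incorporated into the gradient branch. Since a regular SR model outputs an image with only 3 channels, the last convolutional reconstruction layer is removed and the output features are fed to the subsequent part. The second part of the SR branch wires in the SR gradient feature maps obtained from the gradient branch described above. A fusion block merges the features of the two branches so that the structural information is integrated.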
Objective Functions
pixelwise loss
where φi(.) denotes the ith layer output of the VGG model.
adversarial loss
Figure 3. An illustration of a simple 1-D case. The first row shows the pixel sequences and the second row shows their corresponding gradient maps.
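Figure 3 makes the motivation clear. Consider only a simple one-dimensional case. The ground-truth (HR) edge is Figure 3 (a), and the super-resolved (SR) edge is Figure 3 (b). If the model is optimized only with an L1 loss in image space, it fails to recover sharp edges, because it tends to output a statistical average of the possible HR solutions seen in the training data. In this case, if we compute and plot the gradient magnitudes of the two sequences, the SR gradient is flat with low values, whereas the HR gradient is a spike with high values.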
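The 1-D case can be reproduced numerically. The sketch below is a toy, assuming a step edge for the HR sequence, a 3-tap moving average as a stand-in for the over-smoothed SR output, and forward differences for the gradient; none of these values come from the paper.

```python
import numpy as np

# Sharp HR edge: a step from 0 to 1.
hr_edge = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Over-smoothed SR edge: a 3-tap moving average of the HR edge (assumed stand-in).
kernel = np.ones(3) / 3.0
sr_edge = np.convolve(np.pad(hr_edge, 1, mode="edge"), kernel, mode="valid")

def gradient_1d(sequence):
    """Gradient magnitude of a 1-D sequence using forward differences."""
    return np.abs(np.diff(sequence))

hr_grad = gradient_1d(hr_edge)  # a single spike at the step
sr_grad = gradient_1d(sr_edge)  # a low plateau spread over three positions

print("HR gradient peak:", hr_grad.max())
print("SR gradient peak:", sr_grad.max())
```

The two pixel sequences differ at only two positions, yet the gradient peak of the smoothed edge is one third of the HR peak, so the mismatch is far more visible in gradient space.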
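This suggests that adding a second-order gradient constraint to the optimization objective lets the model learn more from gradient space. It makes the model focus on the neighboring configuration, so that the local intensity of sharpness can be inferred more accurately. Therefore, if the gradient information in Figure 3 (f) is captured, the probability of recovering Figure 3 (c) increases significantly. SR methods can benefit from such guidance to avoid over-smoothed or over-sharpened restorations. Geometric features are also easier to extract in gradient space, so geometric structures can be well preserved, yielding more realistic SR images.
A gradient loss is proposed to achieve this goal. Since the gradient map is an ideal tool for reflecting the structural information of an image, it can also serve as a second-order constraint that supervises the generator. The gradient loss is formulated by reducing the distance between the gradient map extracted from the SR image and the one extracted from the corresponding HR image. Supervised in both the image domain and the gradient domain, the generator not only learns fine appearance but also pays attention to avoiding detailed geometric distortions. Two loss terms are therefore designed to penalize the difference between the gradient maps (GM) of the SR and HR images. One of them is a pixelwise loss on the gradient maps.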
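A minimal sketch of a pixelwise gradient loss, assuming an L1 distance between the gradient maps of the SR and HR images. The central-difference operator is repeated here so the block is self-contained; the names `gradient_map` and `gradient_pixel_loss` are assumptions for illustration.

```python
import numpy as np

def gradient_map(image):
    """Central-difference gradient magnitude with zero padding (assumed operator)."""
    padded = np.pad(image.astype(np.float64), 1, mode="constant")
    grad_x = padded[1:-1, 2:] - padded[1:-1, :-2]
    grad_y = padded[2:, 1:-1] - padded[:-2, 1:-1]
    return np.sqrt(grad_x ** 2 + grad_y ** 2)

def gradient_pixel_loss(sr, hr):
    """Mean absolute difference between the SR and HR gradient maps."""
    return float(np.mean(np.abs(gradient_map(sr) - gradient_map(hr))))
```

The loss is zero when the two images share the same gradient map and grows as edges in the SR image become softer or misplaced relative to the HR image.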
gradient discriminator network
3. Learning Texture Transformer Network for Image Super-Resolution [pdf] [supp] [bibtex]
Abstract
We study on image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from Ref images, which limits these approaches in challenging cases.
Motivation: existing SR methods neglect to use attention mechanisms to transfer high-resolution (HR) textures from the reference (Ref) image.
In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively. TTSR consists of four closely-related modules optimized for image generation tasks, including a learnable texture extractor by DNN, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. Such a design encourages joint feature learning across LR and Ref images, in which deep feature correspondences can be discovered by attention, and thus accurate texture features can be transferred. The proposed texture transformer can be further stacked in a cross-scale way, which enables texture recovery from different levels (e.g., from 1x to 4x magnification).
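In short, the paper proposes a Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as the queries (Q) and keys (K) of a transformer, respectively. TTSR consists of four modules: a DNN-based learnable texture extractor, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. This design encourages joint feature learning across the LR and Ref images: attention discovers deep feature correspondences, so accurate texture features can be transferred. The texture transformer can further be stacked in a cross-scale way, which enables texture recovery at different levels (e.g., from 1x to 4x magnification).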
Figure 2. The proposed texture transformer. Q, K and V are the texture features extracted from an up-sampled LR image, a sequentially down/up-sampled Ref image, and an original Ref image, respectively. H and S indicate the hard/soft attention map, calculated from relevance embedding. F is the LR features extracted from a DNN backbone, and is further fused with the transferred texture features T for generating the SR output.
Extensive experiments show that TTSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations.
The texture transformer is shown in Figure 2.
In Figure 2, LR, LR↑ and Ref represent the input image, the 4× bicubic-upsampled input image and the reference image, respectively.
There are four parts in the texture transformer: the learnable texture extractor (LTE), the relevance embedding module (RE), the hard-attention module for feature transfer (HA) and the soft-attention module for feature synthesis (SA).
Learnable Texture Extractor
We design a learnable texture extractor whose parameters will be updated during end-to-end training. Such a design encourages a joint feature learning across the LR and Ref image, in which more accurate texture features can be captured. The process of texture extraction can be expressed as:
$Q = LTE(LR\uparrow), \quad K = LTE(Ref\downarrow\uparrow), \quad V = LTE(Ref)$
where LTE(·) denotes the output of our learnable texture extractor. The extracted texture features, Q (query), K (key), and V (value) indicate three basic elements of the attention mechanism inside a transformer and will be further used in our relevance embedding module.
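Feature extraction: this essentially renames the convolutional layers as a Learnable Texture Extractor, which supports the paper's central theme of texture.
Note Ref↓↑ here. This operation first applies bicubic down-sampling to the Ref image and then bicubic up-sampling, so that it stays domain-consistent with LR↑ (both are obtained through bicubic resampling).
Relevance Embedding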
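Relevance embedding aims to embed the relevance between the LR and Ref image by estimating the similarity between Q and K. We unfold both Q and K into patches, denoted as $q_i$ ($i \in [1, H_{LR} \times W_{LR}]$) and $k_j$ ($j \in [1, H_{Ref} \times W_{Ref}]$). Then for each patch $q_i$ in Q and $k_j$ in K, we calculate the relevance $r_{i,j}$ between these two patches by normalized inner product:
$r_{i,j} = \left\langle \frac{q_i}{\|q_i\|}, \frac{k_j}{\|k_j\|} \right\rangle$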
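The patch-wise relevance can be sketched in NumPy as follows. This assumes 3x3 patches extracted with stride 1 and zero padding, so that every spatial position yields one patch; the helper names `unfold_patches` and `relevance`, the patch size, and the small `eps` term are assumptions for illustration.

```python
import numpy as np

def unfold_patches(feat, size=3):
    """Unfold a C x H x W feature map into H*W flattened size x size patches (stride 1, zero padding)."""
    c, h, w = feat.shape
    pad = size // 2
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))
    patches = [padded[:, i:i + size, j:j + size].ravel()
               for i in range(h) for j in range(w)]
    return np.stack(patches)  # shape (H*W, C*size*size)

def relevance(q_feat, k_feat, size=3, eps=1e-8):
    """Normalized inner product between every Q patch and every K patch."""
    q = unfold_patches(q_feat, size)
    k = unfold_patches(k_feat, size)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + eps)
    return q @ k.T  # entry [i, j] is the relevance of Q patch i and K patch j
```

The result has one row per Q position and one column per K position, and every entry lies in [-1, 1] because both patches are normalized.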
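This computes the patch-wise relevance between Q and K. Note that the resulting r is a matrix holding one relevance value for every pair of a Q patch and a K patch.
Hard-Attention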
We propose a hard-attention module to transfer the HR texture features V from the Ref image. Traditional attention mechanism takes a weighted sum of V for each query $q_i$. However, such an operation may cause blur effect which lacks the ability of transferring HR texture features. Therefore, in our hard-attention module, we only transfer features from the most relevant position in V for each query $q_i$.
More specifically, we first calculate a hard-attention map H in which the i-th element $h_i$ is calculated from the relevance $r_{i,j}$:
$h_i = \arg\max_j r_{i,j}$
The value of $h_i$ can be regarded as a hard index, which represents the most relevant position in the Ref image to the i-th position in the LR image. To obtain the transferred HR texture features T from the Ref image, we apply an index selection operation to the unfolded patches of V using the hard-attention map as the index:
$t_i = v_{h_i}$
where $t_i$ denotes the value of T in the i-th position, which is selected from the $h_i$-th position of V. As a result, we obtain a HR feature representation T for the LR image which will be further used in our soft-attention module.
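Hard attention reduces to an argmax followed by an index selection. The sketch below assumes the relevance matrix has one row per LR position and one column per Ref position, and that the patches of V are already unfolded into rows; the function name `hard_attention` is hypothetical.

```python
import numpy as np

def hard_attention(r, v_patches):
    """Select, for each query position, the single most relevant patch of V.

    r:         (N_lr, N_ref) relevance matrix
    v_patches: (N_ref, D) unfolded patches of V
    """
    h = np.argmax(r, axis=1)  # hard index: most relevant Ref position per LR position
    t = v_patches[h]          # index selection: one V patch per LR position
    return h, t

r = np.array([[0.1, 0.9, 0.3],
              [0.8, 0.2, 0.5]])
v_patches = np.array([[1.0, 1.0],
                      [2.0, 2.0],
                      [3.0, 3.0]])
h, t = hard_attention(r, v_patches)
```

Unlike a weighted sum, each transferred row is an unmodified copy of exactly one patch of V, so no averaging across Ref positions takes place.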
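A traditional attention mechanism takes a weighted sum of V for each query. However, such an operation may blur the result (the value at each position becomes an aggregate, and aggregation cannot cleanly extract the single most important element), so it lacks the ability to transfer HR texture features. Therefore, in the hard-attention module, only the features from the most relevant position in V are transferred for each query.
More specifically, a hard-attention map H is computed first. By definition, $h_i$ is the position index $j$ at which $r_{i,j}$ attains its maximum, i.e., the Ref position most relevant to position $i$ in the LR image.
To obtain the HR texture features T transferred from the Ref image, an index selection operation is applied to the unfolded patches of V, using the hard-attention map as the index. In other words, the value of T at position $i$ is the unfolded patch of V at the position indexed by $h_i$ (a bit convoluted).
Soft-Attention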
We propose a soft-attention module to synthesize features from the transferred HR texture features T and the LR features F of the LR image from a DNN backbone. During the synthesis process, relevant texture transfer should be enhanced while the less relevant ones should be relieved. To achieve that, a soft-attention map S is computed from $r_{i,j}$ to represent the confidence of the transferred texture features for each position in T:
$s_i = \max_j r_{i,j}$
where $s_i$ denotes the i-th position of the soft-attention map S. Instead of directly applying the attention map S to T, we first fuse the HR texture features T with the LR features F to leverage more information from the LR image. Such fused features are further element-wisely multiplied by the soft-attention map S and added back to F to get the final output of the texture transformer. This operation can be represented as:
$F_{out} = F + \mathrm{Conv}(\mathrm{Concat}(F, T)) \odot S$
where $F_{out}$ indicates the synthesized output features. Conv and Concat represent a convolutional layer and Concatenation operation, respectively. The operator $\odot$ denotes element-wise multiplication between feature maps.
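The soft-attention fusion can be sketched with features flattened to one row per spatial position. A single weight matrix stands in for the convolution (equivalent to a 1x1 convolution), which is a simplifying assumption; the function name `soft_attention_fusion` and the argument layout are hypothetical.

```python
import numpy as np

def soft_attention_fusion(f, t, r, weight):
    """Fuse LR features F with transferred features T, gated by the soft-attention map.

    f:      (N, C) LR features
    t:      (N, C) transferred texture features
    r:      (N, M) relevance matrix
    weight: (2C, C) matrix standing in for the convolution (assumed 1x1)
    """
    s = r.max(axis=1, keepdims=True)                 # confidence per position, shape (N, 1)
    fused = np.concatenate([f, t], axis=1) @ weight  # concatenate on channels, then project
    return f + fused * s                             # gate by confidence, add back to F
```

Where the best relevance is close to zero, the output falls back to the LR features F; where it is close to one, the fused texture features are added at full strength.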
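Interestingly, $s_i$ takes the maximum value of $r_{i,j}$ over $j$, whereas $h_i$ records the index $j$ at which that maximum occurs.
Instead of applying the attention map S directly to T, the HR texture features T are first fused with the LR features F to leverage more information from the LR image. The fused features are then multiplied element-wise by the soft-attention map and added back to F, giving the final output of the texture transformer.
In summary, the texture transformer can effectively transfer relevant HR texture features from the Ref image into the LR features, improving the accuracy of texture generation.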
Cross-Scale Feature Integration
Our texture transformer can be further stacked in a cross-scale way with a cross-scale feature integration module. The architecture is shown in Figure 3. Stacked texture transformers output the synthesized features for three resolution scales (1×, 2× and 4×), such that the texture features of different scales can be fused into the LR image. To learn a better representation across different scales, inspired by [25, 37], we propose a cross-scale feature integration module (CSFI) to exchange information among the features of different scales. A CSFI module is applied each time the LR feature is up-sampled to the next scale. Each scale inside the CSFI module receives the exchanged features from other scales by up/down-sampling, followed by a concatenation operation in the channel dimension. Then a convolutional layer will map the features into the original number of channels. In such a design, the texture features transferred from the stacked texture transformers are exchanged across each scale, which achieves a more powerful feature representation. This cross-scale feature integration module further improves the performance of our approach.
Figure 3. Architecture of stacking multiple texture transformers in a cross-scale way with the proposed cross-scale feature integration module (CSFI). RBs indicates a group of residual blocks.
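The exchange step of a CSFI module can be sketched as follows. This is a simplified stand-in: nearest-neighbor up-sampling, average-pooling down-sampling, and a per-scale weight matrix acting as a 1x1 convolution are all assumptions chosen to keep the example short, and the function names are hypothetical.

```python
import numpy as np

def upsample(feat, factor):
    """Nearest-neighbor up-sampling of a C x H x W feature map (assumed resampler)."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample(feat, factor):
    """Average-pooling down-sampling of a C x H x W feature map (assumed resampler)."""
    c, h, w = feat.shape
    return feat.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def csfi(feats, weights):
    """Exchange features across scales.

    feats:   list of C x H_s x W_s maps ordered coarse to fine, each scale 2x the previous
    weights: one (C, C * len(feats)) matrix per scale, standing in for a 1x1 convolution
    """
    outputs = []
    for s in range(len(feats)):
        gathered = []
        for o, other in enumerate(feats):
            if o < s:
                gathered.append(upsample(other, 2 ** (s - o)))    # coarser scale: up-sample
            elif o > s:
                gathered.append(downsample(other, 2 ** (o - s)))  # finer scale: down-sample
            else:
                gathered.append(other)
        stacked = np.concatenate(gathered, axis=0)                # concatenate on channels
        outputs.append(np.einsum("oc,chw->ohw", weights[s], stacked))  # back to C channels
    return outputs
```

Each output keeps the resolution and channel count of its own scale while mixing in information from every other scale.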
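To recap: the texture transformer can be stacked in a cross-scale way through a cross-scale feature integration module, with the architecture shown in Figure 3. The stacked texture transformers output synthesized features at three resolution scales (1×, 2× and 4×), so that texture features of different scales are fused into the LR image. To learn a better cross-scale representation, and inspired by [25, 37], the paper proposes the cross-scale feature integration module (CSFI) to exchange information among features of different scales. A CSFI module is applied each time the LR feature is up-sampled to the next scale. Each scale inside the CSFI module receives the exchanged features from the other scales by up/down-sampling and concatenates them along the channel dimension; a convolutional layer then maps the features back to the original number of channels. With this design, the texture features transferred from the stacked texture transformers are exchanged across all scales, giving a more powerful feature representation.
Loss Function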
4. Deep Unfolding Network for Image Super-Resolution [pdf] [bibtex]
Learning-based single image super-resolution (SISR) methods are continuously showing superior effectiveness and efficiency over traditional model-based methods, largely due to the end-to-end training. However, different from model-based methods that can handle the SISR problem with different scale factors, blur kernels and noise levels under a unified MAP (maximum a posteriori) framework, learning-based methods generally lack such flexibility. To address this issue, this paper proposes an end-to-end trainable unfolding network which leverages both learning-based methods and model-based methods. Specifically, by unfolding the MAP inference via a half-quadratic splitting algorithm, a fixed number of iterations consisting of alternately solving a data subproblem and a prior subproblem can be obtained. The two subproblems then can be solved with neural modules, resulting in an end-to-end trainable, iterative network. As a result, the proposed network inherits the flexibility of model-based methods to super-resolve blurry, noisy images for different scale factors via a single model, while maintaining the advantages of learning-based methods. Extensive experiments demonstrate the superiority of the proposed deep unfolding network in terms of flexibility, effectiveness and also generalizability.
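For readers who want the unfolding spelled out, the following is a sketch of how half-quadratic splitting (HQS) separates a MAP objective into a data subproblem and a prior subproblem. The notation is an assumption made for illustration: $y$ is the LR image, $k$ the blur kernel, $\downarrow_s$ down-sampling by factor $s$, $\sigma$ the noise level, $\Phi$ the prior, and $z$ the auxiliary variable introduced by HQS.

```latex
% MAP objective: data term plus prior term
E(x) = \frac{1}{2\sigma^2} \left\| y - (x \otimes k)\downarrow_s \right\|^2 + \lambda \Phi(x)

% HQS introduces an auxiliary variable z, coupled to x by a penalty with weight \mu
E_\mu(x, z) = \frac{1}{2\sigma^2} \left\| y - (z \otimes k)\downarrow_s \right\|^2
            + \lambda \Phi(x) + \frac{\mu}{2} \left\| z - x \right\|^2

% Alternating minimization: data subproblem (z) and prior subproblem (x)
z_k = \arg\min_z \left\| y - (z \otimes k)\downarrow_s \right\|^2 + \mu \sigma^2 \left\| z - x_{k-1} \right\|^2
x_k = \arg\min_x \frac{\mu}{2} \left\| z_k - x \right\|^2 + \lambda \Phi(x)
```

Running a fixed number of these two alternating steps and replacing each solver with a neural module is what turns the iteration into an end-to-end trainable network.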
For details on this paper, see my blog post: MyDLNote-Enhancment: [SR转文] Deep Unfolding Network for Image Super-Resolution
5. Meta-Transfer Learning for Zero-Shot Super-Resolution [pdf] [supp] [bibtex]
Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance based on the external dataset, they cannot exploit internal information within a specific image. Another problem is that they are applicable only to the specific condition of data that they are supervised. For instance, the low-resolution (LR) image should be a "bicubic" downsampled noise-free image from a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, they require thousands of gradient updates, i.e., long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Precisely, it is based on finding a generic initial parameter that is suitable for internal learning. Thus, we can exploit both external and internal information, where one single gradient update can yield quite considerable results. With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
For details on this paper, see my blog post: MyDLNote-Enhancement: [SR转文] 2020CVPR : Meta-Transfer Learning for Zero-Shot Super-Resolution
Preview of Part 2:
Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution [pdf] [supp] [bibtex]
Deep neural networks have exhibited promising performance in image super-resolution (SR) by learning a nonlinear mapping function from low-resolution (LR) images to high-resolution (HR) images. However, there are two underlying limitations to existing SR methods. First, learning the mapping function from LR to HR images is typically an ill-posed problem, because there exist infinite HR images that can be downsampled to the same LR image. As a result, the space of the possible functions can be extremely large, which makes it hard to find a good solution. Second, the paired LR-HR data may be unavailable in real-world applications and the underlying degradation method is often unknown. For such a more general case, existing SR models often incur the adaptation problem and yield poor performance. To address the above issues, we propose a dual regression scheme by introducing an additional constraint on LR data to reduce the space of the possible functions. Specifically, besides the mapping from LR to HR images, we learn an additional dual regression mapping estimates the down-sampling kernel and reconstruct LR images, which forms a closed-loop to provide additional supervision. More critically, since the dual regression process does not depend on HR images, we can directly learn from LR images. In this sense, we can easily adapt SR models to real-world data, e.g., raw video frames from YouTube. Extensive experiments with paired training data and unpaired real-world data demonstrate our superiority over existing methods.
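The closed loop described in the abstract can be sketched as a loss with two terms: a primal term that compares the SR output with the HR image, and a dual term that maps the SR output back to LR and compares it with the input. The L1 distance, the weight `lam`, the toy up/down-sampling functions, and all names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def toy_upsample(x):
    """Toy primal mapping: 2x nearest-neighbor up-sampling of a 2-D image."""
    return np.kron(x, np.ones((2, 2)))

def toy_downsample(x):
    """Toy dual mapping: 2x average pooling of a 2-D image."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def dual_regression_loss(primal, dual, lr, hr=None, lam=0.1):
    """Primal loss on (SR, HR) plus a weighted dual loss on (down-sampled SR, LR).

    When hr is None (unpaired data), only the dual term is used,
    since it does not depend on HR images.
    """
    sr = primal(lr)
    dual_loss = np.mean(np.abs(dual(sr) - lr))
    if hr is None:
        return float(lam * dual_loss)
    primal_loss = np.mean(np.abs(sr - hr))
    return float(primal_loss + lam * dual_loss)
```

Because the dual term needs only LR images, the same function covers both paired training data and unpaired real-world data.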
Residual Feature Aggregation Network for Image Super-Resolution [pdf] [supp] [bibtex]
Recently, very deep convolutional neural networks (CNNs) have shown great power in single image super-resolution (SISR) and achieved significant improvements against traditional methods. Among these CNN-based methods, the residual connections play a critical role in boosting the network performance. As the network depth grows, the residual features gradually focused on different aspects of the input image, which is very useful for reconstructing the spatial details. However, existing methods neglect to fully utilize the hierarchical features on the residual branches. To address this issue, we propose a novel residual feature aggregation (RFA) framework for more efficient feature extraction. The RFA framework groups several residual modules together and directly forwards the features on each local residual branch by adding skip connections. Therefore, the RFA framework is capable of aggregating these informative residual features to produce more representative features. To maximize the power of the RFA framework, we further propose an enhanced spatial attention (ESA) block to make the residual features to be more focused on critical spatial contents. The ESA block is designed to be lightweight and efficient. Our final RFANet is constructed by applying the proposed RFA framework with the ESA blocks. Comprehensive experiments demonstrate the necessity of our RFA framework and the superiority of our RFANet over state-of-the-art SISR methods.
Correction Filter for Single Image Super-Resolution: Robustifying Off-the-Shelf Deep Super-Resolvers [pdf] [supp] [bibtex]
The single image super-resolution task is one of the most examined inverse problems in the past decade. In the recent years, Deep Neural Networks (DNNs) have shown superior performance over alternative methods when the acquisition process uses a fixed known downscaling kernel---typically a bicubic kernel. However, several recent works have shown that in practical scenarios, where the test data mismatch the training data (e.g. when the downscaling kernel is not the bicubic kernel or is not available at training), the leading DNN methods suffer from a huge performance drop. Inspired by the literature on generalized sampling, in this work we propose a method for improving the performance of DNNs that have been trained with a fixed kernel on observations acquired by other kernels. For a known kernel, we design a closed-form correction filter that modifies the low-resolution image to match one which is obtained by another kernel (e.g. bicubic), and thus improves the results of existing pre-trained DNNs. For an unknown kernel, we extend this idea and propose an algorithm for blind estimation of the required correction filter. We show that our approach outperforms other super-resolution methods, which are designed for general downscaling kernels.
Image Super-Resolution With Cross-Scale Non-Local Attention and Exhaustive Self-Exemplars Mining [pdf] [supp] [bibtex]
Deep convolution-based single image super-resolution (SISR) networks embrace the benefits of learning from large-scale external image resources for local recovery, yet most existing works have ignored the long-range feature-wise similarities in natural images. Some recent works have successfully leveraged this intrinsic feature correlation by exploring non-local attention modules. However, none of the current deep models have studied another inherent property of images: cross-scale feature correlation. In this paper, we propose the first Cross-Scale Non-Local (CS-NL) attention module with integration into a recurrent neural network. By combining the new CS-NL prior with local and in-scale non-local priors in a powerful recurrent fusion cell, we can find more cross-scale feature correlations within a single low-resolution (LR) image. The performance of SISR is significantly improved by exhaustively integrating all possible priors. Extensive experiments demonstrate the effectiveness of the proposed CS-NL module by setting new state-of-the-arts on multiple SISR benchmarks.