Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup

Published in SNU AIIS Blog · Apr 3, 2022

By Seyeon An

Puzzle Mix image sample.

Deep neural networks are used in almost all fields requiring artificial intelligence, such as object recognition, speech recognition, natural language processing, and reinforcement learning. Hence, they have become the bedrock of modern AI-related tasks. Yet, they have limitations as well: these models can completely memorize the training data and make overconfident predictions, which commonly leads to degraded generalization performance on test examples. Data augmentation aims to alleviate these problems, improving model generalization by adding slightly modified copies of already existing data or synthetic data newly created from existing data.

Mixup, a data augmentation method that creates previously unseen virtual examples by interpolating pairs of given data, has recently been proposed by a group of researchers as a plausible solution. Let’s first take a look at what mixup has contributed to the development of artificial intelligence:

Mixup

Mixup mainly targets the over-confidence issue of deep neural networks trained with empirical risk minimization, which leads to high prediction probabilities even for unseen inputs or wrong predictions.

Convex combinations of data are used for this method. Specifically, for given input and label pairs (x_1, y_1) and (x_2, y_2), mixup generates

x̃ = λ x_1 + (1 − λ) x_2,  ỹ = λ y_1 + (1 − λ) y_2

for a mixing ratio λ in [0,1].
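
As a concrete illustration, here is a minimal sketch of input mixup in PyTorch (our own code, not from the paper); mixup samples the mixing ratio from a Beta(α, α) distribution, and the function below assumes one-hot or soft label vectors:

```python
import torch

def input_mixup(x1, y1, x2, y2, alpha=1.0):
    """Minimal input-mixup sketch: convexly combine two labeled examples.

    Assumes y1 and y2 are one-hot (or soft) label vectors, so labels can
    be interpolated directly alongside the inputs.
    """
    # mixup samples the mixing ratio from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix
```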

Networks trained with mixup data show improved generalization performance on various tasks, including image classification and speech recognition. Moreover, mixup improves the adversarial robustness of networks (their resilience against adversarial inputs) and stabilizes network training.

After the proposal of the first mixup technique, input mixup, more variants have been suggested, including Manifold Mixup, which performs input mixup on the hidden representations of data, and CutMix, which cuts and pastes a rectangular box between images.

Image samples of existing mixup methods. Input mixup does not preserve local statistics (e.g., color), and CutMix does not preserve salient information.

Yet, some flaws have been discovered in the previously proposed mixup methods, which can give false supervisory signals to the networks: input mixup does not preserve the color information of the data, and CutMix tends to delete informative regions of the data.

Striving to find a mixup method that alleviates these problems, we have come up with Puzzle Mix, which overcomes these issues by considering the saliency (usefulness of information) and local statistics of the data.

Puzzle Mix

Puzzle Mix: Method

  1. We calculate a saliency map of the data.
  2. We calculate a down-sampled saliency map of the data.
  3. Based on the saliency, we assign labels indicating the ratio between input sources at each location of the mixup output.
  4. We perform transport on each input to maximize the saliency information.

The following is an image sample of Puzzle Mix:

An overview of Puzzle Mix.

The following formula is used for mixup, given n-dimensional data x_1 and x_2:

h(x_1, x_2) = (1 − z) ⊙ Π_1ᵀ x_1 + z ⊙ Π_2ᵀ x_2

Here, z is an n-dimensional mask with values in [0,1], ⊙ denotes elementwise multiplication, and Π_k is a transportation plan, an n×n binary matrix.
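
To make the formula concrete, here is a minimal sketch (our own illustration, assuming flattened n-dimensional inputs and a precomputed mask and transport plans):

```python
import torch

def puzzle_mix_combine(x1, x2, z, pi1, pi2):
    """Combine two flattened inputs with mask z and transport plans.

    x1, x2: (n,) data vectors; z: (n,) mask with values in [0, 1];
    pi1, pi2: (n, n) binary permutation (transport) matrices.
    """
    x1_t = pi1.T @ x1  # transport salient regions of x1
    x2_t = pi2.T @ x2  # transport salient regions of x2
    return (1.0 - z) * x1_t + z * x2_t

# Toy usage: identity transport reduces to a masked combination.
n = 4
x1, x2 = torch.arange(n, dtype=torch.float), torch.ones(n)
z = torch.tensor([0.0, 0.0, 1.0, 1.0])
eye = torch.eye(n)
print(puzzle_mix_combine(x1, x2, z, eye, eye))  # tensor([0., 1., 1., 1.])
```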

The goal of Puzzle Mix is to optimize the mask z and the transportation plans to maximize the salient information of each input while preserving local statistics.

It is effective in doing so, as shown in the figure below:

Statistics of various mixup data along mixing ratios in [0,1]. Saliency represents the saliency information retained in the mixed data. Total variation is measured on each mixup sample. Loss means the cross-entropy loss of a pretrained model on the mixup data, and Top-k acc. means the prediction accuracy of the model on the mixup data.

The figure above shows that Puzzle Mix data

  • preserves saliency information (Saliency)
  • preserves local statistics (Total variation)
  • is more consistent with the target soft label than other mixup methods (Loss, Top-k acc.).

Objective and Algorithm

What is the algorithm behind Puzzle Mix? In other words, how can we find the optimal mask and transportation plans that maximize saliency while maintaining local statistics?

First, we discretize the mask values and denote the discretized label set as 𝓛 (for example, {0, 1/2, 1}).

For the given saliency map s_k of each input, we solve the following optimization problem to obtain the optimal mask and transportation plans:

  1. The first term of the objective represents the negative of the retained saliency information.
  2. The next terms penalize local non-smoothness: ψ enforces smoothness of the mask labels, and ϕ enforces smoothness of the mixup data.
  3. The prior term p controls the mixing ratio of the inputs.
  4. We penalize the transport cost to preserve the local statistics of the data. To solve the resulting minimization problem, we propose an algorithm that alternates between optimizing the mask z and the transportation plans; a schematic form of the objective is sketched below.
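
Putting the four terms together, the objective has the following schematic form. This is our paraphrase of its structure, not the paper's exact notation: β and γ are the smoothness coefficients mentioned in the figure captions below, while the names η and ξ for the prior and transport-cost coefficients, and the cost matrix C, are chosen here only for illustration:

```latex
\min_{z,\ \{\Pi_k\}} \;
  -\,\mathrm{saliency}(z, \{\Pi_k\})
  \;+\; \beta\,\psi(z) \;+\; \gamma\,\phi(z, \{\Pi_k\})
  \;+\; \eta\, p(z)
  \;+\; \xi \sum_{k} \langle C, \Pi_k \rangle
```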

Puzzle Mix images with increasing smoothness coefficients β and γ.
Puzzle Mix images with increasing mixing ratio λ.

1) Optimize Mask

How do we optimize the mask? The saliency, smoothness, and mixing-ratio prior terms are the terms related to the mask.

Using the ℓ_1 norm, the saliency term is expressed as a sum of unary terms u_i, and the objective with respect to the mask is represented as the following:

We refer to the pairwise smoothness terms as submodular functions from now on. We then apply a submodular minimization algorithm, the alpha-beta swap, to obtain the optimal mask z.
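
Implementing the alpha-beta swap graph cut is fairly involved, so as a lightweight stand-in for illustration we sketch the same unary-plus-pairwise structure with iterated conditional modes (ICM); the function name, default label set, and 4-neighbour smoothness below are our own simplifications, not the paper's algorithm:

```python
import numpy as np

def optimize_mask_icm(unary, labels=(0.0, 0.5, 1.0), beta=0.1, iters=10):
    """Simplified mask optimization via iterated conditional modes (ICM).

    unary:  (H, W, L) cost of assigning label l at position (i, j).
    labels: the discretized mask values (the set called 𝓛 in the text).
    beta:   weight of the pairwise smoothness penalty |z_i - z_j| between
            4-connected neighbours (plays the role of a smoothness coeff.).
    The paper uses an alpha-beta swap graph cut; ICM is a simple local
    minimizer that illustrates the same unary + pairwise structure.
    """
    H, W, L = unary.shape
    lab = np.asarray(labels, dtype=float)
    z = np.argmin(unary, axis=2)          # init with unary-optimal labels
    for _ in range(iters):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                # Add smoothness cost w.r.t. the current neighbour labels.
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        costs += beta * np.abs(lab - lab[z[ni, nj]])
                z[i, j] = int(np.argmin(costs))
    return lab[z]                          # (H, W) mask with values in [0, 1]
```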

2) Optimize Transportation Plan

Next, with the mask fixed, the objective with respect to the transportation plans becomes

The above problem is separable across the transportation plans Π_k, so we perform the optimization independently for each index.
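
Since each Π_k is an n×n binary matrix assigning source locations to target locations, each subproblem can be viewed as an assignment problem. Below is a hedged sketch using SciPy's Hungarian solver; the exact cost construction in the paper differs in detail:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimize_transport(saliency, weight, cost, xi=0.1):
    """Binary transport plan via the Hungarian algorithm (a sketch).

    saliency: (n,) saliency of each source location of this input.
    weight:   (n,) contribution of each target location to the mixup
              output for this input (e.g., a flattened 1 - z or z).
    cost:     (n, n) transport cost between locations; xi weights it.

    We maximize transported saliency minus transport cost, i.e. minimize
    xi * cost[i, j] - saliency[i] * weight[j] over one-to-one assignments.
    """
    gain = np.outer(saliency, weight)          # value of moving i -> j
    rows, cols = linear_sum_assignment(xi * cost - gain)
    plan = np.zeros_like(cost)
    plan[rows, cols] = 1.0                     # binary transport plan
    return plan
```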

3) Adversarial Training and Regularization

We obtain the saliency map s_k by calculating the gradient of the loss with respect to the inputs and taking the ℓ_2 norm across the input channels.
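
A minimal PyTorch sketch of this computation (our own illustration):

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, y):
    """Gradient-based saliency: l2 norm of d(loss)/d(input) over channels.

    x: (B, C, H, W) input batch; y: (B,) integer class labels.
    Returns a (B, H, W) saliency map per example.
    """
    x = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad.norm(p=2, dim=1)  # l2 norm across the channel dimension
```

Beyond computing the saliency map, we utilize this gradient information to improve the Puzzle Mix training algorithm in two more ways: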

  1. We propose adversarial training with mixup data, adding an adversarial perturbation to each input before mixup (see the sketch after this list).
  2. We regularize the training objective by using the gradient information with respect to the network weights on clean input data:

where ℓ is a training objective, and θ represents the network weights.
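
For the adversarial perturbation in item 1, one standard realization is a fast gradient sign (FGSM)-style step that reuses the input gradient already computed for the saliency map; the step size below is illustrative, and the paper's exact scheme may differ:

```python
import torch

def adversarial_perturb(x, grad, eps=8 / 255):
    """FGSM-style perturbation reusing the input gradient (a sketch)."""
    x_adv = x + eps * grad.sign()
    return x_adv.clamp(0.0, 1.0)  # keep pixels in a valid range
```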

Experimental Results

Time for validation: we compared Puzzle Mix with other mixup methods to see whether it really excels at image classification tasks. In this article, we provide results on CIFAR-100 with PreActResNet-18 and on ImageNet with ResNet-50. Puzzle Mix simultaneously outperforms the mixup baselines in generalization and adversarial robustness by a large margin across various datasets and models, as seen in the table below:

Left: Performance on CIFAR-100 with PreActResNet-18. Note: half denotes the model trained with the same number of network forward calls, and adv denotes the adversarially trained model. Right: Performance on ImageNet with ResNet-50.

Conclusion

Puzzle Mix is useful for optimally leveraging the saliency information while respecting the underlying local statistics of the data. It can serve as a strong baseline mixup augmentation method for further research, as it clearly outperforms existing methods in terms of generalization performance and robustness against adversarial perturbations.

We hope that our research can serve as a bedrock for further development of deep neural networks, and of the image recognition domain in general.

Acknowledgements

We thank Jang-Hyun Kim and the co-authors of the paper “Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup” for their contributions and discussions in preparing this blog. The views and opinions expressed in this blog are solely those of the authors.

This post is based on the following paper:

  • Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup. Jang-Hyun Kim, Wonho Choo, Hyun Oh Song. International Conference on Machine Learning (ICML) 2020. arXiv, GitHub.

This post was originally published on our Notion blog on January 5, 2022.


AIIS is an intercollegiate institution of Seoul National University, committed to integrating and supporting AI-related research at Seoul National University.