We show parts of our analysis of the distribution mismatch in SVHN. More details on our method and the results can be found in the paper.
We measure how similar random subsets of the training and test sets are, and we compare this to how similar two non-overlapping random subsets of the training set are. In an unbiased training/test split (as, e.g., in CIFAR), both measurements should yield similar distances. However, we find that in SVHN the distance between training and test set is much larger than the distance between two random subsets of the training data, as shown in Table 1 below. Specifically, we measure distances with the Fréchet inception distance (FID), which quantifies the semantic dissimilarity between two finite sets of images using a feature extractor.
Table 1 below shows the FID evaluated between a random subset of the training set ($\mathcal{D}_\text{train}''$) and a random subset of the test set ($\mathcal{D}_\text{test}'$). Additionally, Table 1 shows the FID evaluated between two random non-overlapping subsets of the training set ($\mathcal{D}_\text{train}''$ and $\mathcal{D}_\text{train}'$). For SVHN, the FID between training and test set differs significantly from the FID between two random non-overlapping subsets of the training set, indicating a distribution mismatch between $\mathcal{D}_\text{train}$ and $\mathcal{D}_\text{test}$. For comparison, on CIFAR-10 the two FIDs are practically indistinguishable.
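For illustration, the following minimal Python sketch shows how such an FID comparison could be set up with `torchvision` and `torchmetrics` (which requires `torch-fidelity` for its Inception features). Subset sizes, seeds, and batch sizes are placeholder choices and are not meant to reproduce the exact numbers in Table 1.

```python
# Sketch: FID between two non-overlapping training subsets vs. a training/test pair.
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms
from torchmetrics.image.fid import FrechetInceptionDistance

to_tensor = transforms.ToTensor()
train = datasets.SVHN("data", split="train", transform=to_tensor, download=True)
test = datasets.SVHN("data", split="test", transform=to_tensor, download=True)

g = torch.Generator().manual_seed(0)
perm = torch.randperm(len(train), generator=g).tolist()
n = len(test)  # use equally sized subsets so all comparisons are on equal footing
train_a = Subset(train, perm[:n])          # D_train''
train_b = Subset(train, perm[n:2 * n])     # D_train'
test_sub = Subset(test, torch.randperm(len(test), generator=g).tolist()[:n])  # D_test'

def fid_between(set_a, set_b):
    """FID treating set_a as the 'real' and set_b as the 'fake' set (FID itself is symmetric)."""
    fid = FrechetInceptionDistance(feature=2048, normalize=True)  # expects floats in [0, 1]
    for imgs, _ in DataLoader(set_a, batch_size=256):
        fid.update(imgs, real=True)
    for imgs, _ in DataLoader(set_b, batch_size=256):
        fid.update(imgs, real=False)
    return fid.compute().item()

print("FID(D_train'', D_train'):", fid_between(train_a, train_b))
print("FID(D_train'', D_test'): ", fid_between(train_a, test_sub))
```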
The inception score (IS) evaluates how confidently the data points in a set can be classified by a classifier trained on a training set, and how diverse their predicted labels are. We compute the IS of a subset of the training set and of a subset of the test set, using all remaining training samples to train the classifier. As Table 1 shows, the IS of the training subset and of the test subset are similar for both SVHN and CIFAR-10. This indicates that the IS is not a suitable metric for measuring sample quality in terms of distribution similarity.
Table 1: FID (lower means higher similarity) and IS (higher means better sample quality) on three datasets, averaged over 5 random seeds. For SVHN, the FID between random subsets of the training and test sets (bold) is significantly higher than the FID between non-overlapping training subsets of the same size, while the IS of $\mathcal{D}_\text{train}'$ and $\mathcal{D}_\text{test}'$ is similar within each dataset.
FID ($\downarrow$), IS ($\uparrow$) | SVHN | SVHN-Remix | CIFAR-10 |
---|---|---|---|
$\mathrm{FID}(\mathcal{D}_\text{train}'', \mathcal{D}_\text{train}')$ | 3.309 $\pm$ 0.029 | 3.334 $\pm$ 0.018 | 5.196 $\pm$ 0.040 |
$\mathrm{FID}(\mathcal{D}_\text{train}'', \mathcal{D}_\text{test}')$ | **16.687 $\pm$ 0.325** | 3.326 $\pm$ 0.015 | 5.206 $\pm$ 0.031 |
$\mathrm{IS}(\mathcal{D}_\text{train}', \bar{\mathcal{D}}_\text{train})$ | 8.507 $\pm$ 0.114 | 8.348 $\pm$ 0.568 | 7.700 $\pm$ 0.043 |
$\mathrm{IS}(\mathcal{D}_\text{test}', \bar{\mathcal{D}}_\text{train})$ | 8.142 $\pm$ 0.501 | 8.269 $\pm$ 0.549 | 7.692 $\pm$ 0.023 |
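To make the IS entries above concrete, the sketch below computes the inception score $\mathrm{IS} = \exp\!\big(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))]\big)$ from a matrix of class probabilities. The matrix `probs` is assumed to hold the softmax outputs of a classifier trained on the remaining training samples ($\bar{\mathcal{D}}_\text{train}$); details of the paper's classifier and evaluation protocol may differ.

```python
# Sketch: inception score from classifier probabilities.
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ) for probs of shape (N, num_classes)."""
    p_y = probs.mean(axis=0, keepdims=True)                                 # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)    # KL per sample
    return float(np.exp(kl.mean()))

# Example with a dummy probability matrix (replace with real classifier outputs):
rng = np.random.default_rng(0)
dummy = rng.dirichlet(np.ones(10), size=1000)
print(inception_score(dummy))
```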
Below, we summarize the key points of our analysis of the implications of this mismatch. More details on our method and the results can be found in the paper.
We find that the distribution mismatch has little effect on classification performance in supervised learning and on sample quality in generative modeling. Figure 1 (left) shows the loss of a classifier trained on SVHN and on SVHN-Remix, evaluated on the respective training and test sets; SVHN and SVHN-Remix show similar losses.
However, we show that for probabilistic generative models such as variational autoencoders (VAEs) and variational diffusion models (VDMs), the mismatch leads to a false assessment of model performance when evaluating test set likelihoods: test set likelihoods on SVHN are deceptive because the test set appears to be drawn from a simpler distribution than the training set. Figure 1 (middle left), (middle right), and (right) show bits per dimension (BPD) for SVHN and SVHN-Remix evaluated on the training set and test set during training. For SVHN, the order of training and test set BPD is flipped compared to SVHN-Remix. Since probabilistic generative models are usually evaluated by their likelihood on the test set, a distribution mismatch between training and test set can lead to a false evaluation of these models.
Figure 1: (left): classification loss evaluated on the training set (dashed) and test set (solid) for SVHN (blue) and SVHN-Remix (green), averaged over five random seeds (lines are means, shaded areas are one standard deviation); the losses are similar. (middle left), (middle right), and (right): bits per dimension (BPD) as a function of training progress, evaluated on the training set (dotted) and test set (solid) for SVHN (blue) and SVHN-Remix (green). For SVHN, the order of training and test set performance is flipped compared to SVHN-Remix.
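As a reference for how BPD values like those in Figure 1 are obtained, the sketch below converts a per-image negative log-likelihood in nats (e.g., a VAE's negative ELBO) into bits per dimension for 3×32×32 SVHN images. The example input value is purely illustrative.

```python
# Sketch: nats per image -> bits per dimension (BPD).
import math

def bits_per_dim(nll_nats_per_image, num_dims=3 * 32 * 32):
    """BPD = NLL / (D * ln 2): divide by the number of dimensions and convert nats to bits."""
    return nll_nats_per_image / (num_dims * math.log(2))

print(bits_per_dim(5000.0))  # illustrative value, roughly 2.35 BPD
```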
We provide the new split, SVHN-Remix, for download below. We provide the files in the original file formats, together with the indices used to create them. Since SVHN-Remix is based on the SVHN dataset, it is for non-commercial use only.
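A hedged sketch of how such an index-based split can be applied is shown below: the original SVHN training and test sets are pooled, and index arrays select the remixed training and test sets. The file names `remix_train_indices.npy` and `remix_test_indices.npy` are hypothetical placeholders; the actual files and their format are those provided with the download above.

```python
# Sketch: rebuilding a remixed split from the original SVHN data and index files.
import numpy as np
from torchvision import datasets

train = datasets.SVHN("data", split="train", download=True)
test = datasets.SVHN("data", split="test", download=True)

# Pool images/labels of the original split; indices into this pool define the remixed split.
pool_imgs = np.concatenate([train.data, test.data])        # (N, 3, 32, 32) uint8
pool_labels = np.concatenate([train.labels, test.labels])  # (N,)

remix_train_idx = np.load("remix_train_indices.npy")  # hypothetical file name
remix_test_idx = np.load("remix_test_indices.npy")    # hypothetical file name

remix_train = pool_imgs[remix_train_idx], pool_labels[remix_train_idx]
remix_test = pool_imgs[remix_test_idx], pool_labels[remix_test_idx]
```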
@article{xiao2023the,
title={The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch},
author={Xiao, Tim Z. and Zenn, Johannes and Bamler, Robert},
journal={NeurIPS 2023 Workshop on Distribution Shifts},
year={2023}
}