SVHN-Remix

The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch

Tim Z. Xiao1,2,*     Johannes Zenn1,2,*     Robert Bamler1
1University of Tübingen     2IMPRS-IS
*Equal contribution, order determined by coin flip.

Abstract

The Street View House Numbers (SVHN) dataset (Netzer et al., 2011) is a popular benchmark dataset in deep learning. Originally designed for digit classification tasks, the SVHN dataset has been widely used as a benchmark for various other tasks, including generative modeling. However, with this work, we aim to warn the community about an issue of the SVHN dataset as a benchmark for generative modeling tasks: we discover that the official training set and test set of the SVHN dataset are not drawn from the same distribution. We empirically show that this distribution mismatch has little impact on the classification task (which may explain why this issue has not been detected before), but it severely affects the evaluation of probabilistic generative models, such as Variational Autoencoders and diffusion models. As a workaround, we propose to mix and re-split the official training and test set when SVHN is used for tasks other than classification. We publish a new split that we call SVHN-Remix, together with the indices we used to create it, below.

What's Wrong With SVHN?

We show parts of our analysis of the distribution mismatch in SVHN. More details on our method and the results can be found in the paper.

We measure how similar random subsets of the training set and the test set are, and we compare this to how similar two non-overlapping random subsets of the training set are. In an unbiased training/test split (as, e.g., in CIFAR), both measurements should yield similar distances. However, we found that, in SVHN, the training and test set are much more dissimilar than two random subsets of the training data. This is shown in Table 1 below. Concretely, we measure distances with the Fréchet inception distance (FID), which quantifies the semantic dissimilarity between two finite sets of images using a feature extractor.

Table 1 below shows the FID evaluated between a random subset of the training set ($\mathcal{D}_\text{train}''$) and a random subset of the test set ($\mathcal{D}_\text{test}'$). Additionally, Table 1 shows the FID evaluated between two random non-overlapping subsets of the training set ($\mathcal{D}_\text{train}''$ and $\mathcal{D}_\text{train}'$). For SVHN, the FID between training set and test set differs significantly from the FID between two random non-overlapping subsets of the training set, indicating a distribution mismatch between $\mathcal{D}_\text{train}$ and $\mathcal{D}_\text{test}$. For comparison, on CIFAR-10, the two FIDs are practically indistinguishable.
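To illustrate the comparison, here is a minimal sketch of how such a diagnostic can be set up. It uses the FID implementation from torchmetrics, which is our choice for this illustration and not necessarily the implementation used in the paper; `compare_splits` returns the FID between two disjoint training subsets and the FID between a training subset and a test subset.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance


def fid_between(set_a: torch.Tensor, set_b: torch.Tensor) -> float:
    """FID between two sets of uint8 images of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(set_a, real=True)
    fid.update(set_b, real=False)
    return fid.compute().item()


def compare_splits(train_images, test_images, subset_size, seed=0):
    """Compare FID(D_train'', D_train') with FID(D_train'', D_test')."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(train_images), generator=g)
    train_a = train_images[perm[:subset_size]]                  # D_train''
    train_b = train_images[perm[subset_size:2 * subset_size]]   # D_train' (disjoint from D_train'')
    test_perm = torch.randperm(len(test_images), generator=g)
    test_sub = test_images[test_perm[:subset_size]]             # D_test'

    # For an unbiased split, these two numbers should be close.
    return fid_between(train_a, train_b), fid_between(train_a, test_sub)
```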

The inception score (IS) evaluates how confidently the data points in a set can be classified by a classifier that was trained on a training set, and how diverse their predicted labels are. We compute the IS of a subset of the training set and of a subset of the test set, using all remaining training samples ($\bar{\mathcal{D}}_\text{train}$) as the training set for the classifier. As Table 1 shows, the IS of the training subset and the IS of the test subset are similar for both SVHN and CIFAR-10. This tells us that the IS is not a suitable metric if we want to measure sample quality in terms of distribution similarity.

Table 1: FID (lower means higher similarity) and IS (higher means better sample quality) on three datasets, averaged over 5 random seeds. For SVHN, we find that the FID between random subsets of the training and test set (bold) is significantly higher than the FID between non-overlapping subsets of the training set of the same size, while the IS for $\mathcal{D}_\text{train}'$ and $\mathcal{D}_\text{test}'$ is similar within each dataset.

| FID ($\downarrow$), IS ($\uparrow$) | SVHN | SVHN-Remix | CIFAR-10 |
|---|---|---|---|
| $\mathrm{FID}(\mathcal{D}_\text{train}'', \mathcal{D}_\text{train}')$ | 3.309 $\pm$ 0.029 | 3.334 $\pm$ 0.018 | 5.196 $\pm$ 0.040 |
| $\mathrm{FID}(\mathcal{D}_\text{train}'', \mathcal{D}_\text{test}')$ | **16.687 $\pm$ 0.325** | 3.326 $\pm$ 0.015 | 5.206 $\pm$ 0.031 |
| $\mathrm{IS}(\mathcal{D}_\text{train}', \bar{\mathcal{D}}_\text{train})$ | 8.507 $\pm$ 0.114 | 8.348 $\pm$ 0.568 | 7.700 $\pm$ 0.043 |
| $\mathrm{IS}(\mathcal{D}_\text{test}', \bar{\mathcal{D}}_\text{train})$ | 8.142 $\pm$ 0.501 | 8.269 $\pm$ 0.549 | 7.692 $\pm$ 0.023 |
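For reference, the IS of a set is computed from the classifier's predictive distributions as $\mathrm{IS} = \exp\big(\mathbb{E}_x[\mathrm{KL}(p(y \mid x) \,\|\, p(y))]\big)$. Below is a minimal NumPy sketch of this formula; the `probs` matrix is assumed to come from a classifier trained on $\bar{\mathcal{D}}_\text{train}$ as described above, and this is an illustration rather than the paper's exact evaluation code.

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception score from predicted class probabilities.

    probs: array of shape (N, num_classes); row i is p(y | x_i) under a classifier
    trained on the held-out training samples (bar{D}_train above).
    """
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), the marginal label distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```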

SVHN-Remix: A New Split

We propose a new split called SVHN-Remix to alleviate the distribution mismatch in SVHN. SVHN-Remix is created by (i) joining the original training set and test set, (ii) shuffling the indices, and (iii) re-splitting the indices into a new, remixed training set and test set. We make sure that the new training and test sets have the same sizes as the original ones, and that the number of samples per class is preserved in both. We provide a notebook that implements this process; a sketch of the idea is shown below.
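The notebook linked above is the authoritative implementation (the published indices define SVHN-Remix exactly). The following is only a minimal sketch of the same class-count-preserving re-split, using torchvision's SVHN loader; the random seed and the order in which indices are concatenated are illustrative choices of this sketch.

```python
import numpy as np
from torchvision.datasets import SVHN

# (i) Join the original training set and test set.
train = SVHN(root="./data", split="train", download=True)
test = SVHN(root="./data", split="test", download=True)
labels = np.concatenate([train.labels, test.labels])

# (ii) + (iii) Shuffle indices within each class and re-split so that the
# new training set has the same per-class counts as the original one.
rng = np.random.default_rng(0)  # illustrative seed, not the one used for SVHN-Remix
new_train_idx, new_test_idx = [], []
for c in np.unique(labels):
    class_idx = rng.permutation(np.flatnonzero(labels == c))
    n_train_c = int(np.sum(train.labels == c))  # per-class count in the original training set
    new_train_idx.append(class_idx[:n_train_c])
    new_test_idx.append(class_idx[n_train_c:])

new_train_idx = np.concatenate(new_train_idx)
new_test_idx = np.concatenate(new_test_idx)
# Indices refer to the concatenation [train; test]: an index i >= len(train.labels)
# points to sample i - len(train.labels) of the original test set.
```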

Implications

Below, we summarize the key points of our analysis of the implications of this distribution mismatch. More details on our method and the results can be found in the paper.

We find that the distribution mismatch has little effect on supervised classification tasks and on the sample quality of generative models. Figure 1 (left) shows the loss of a classifier trained on SVHN and on SVHN-Remix, evaluated on the respective training and test sets. SVHN and SVHN-Remix show similar losses.

However, we show that for probabilistic generative models such as Variational Autoencoders (VAEs) and variational diffusion models (VDMs), the mismatch leads to a false assessment of model performance when evaluating test set likelihoods: test set likelihoods on the SVHN dataset are deceptive because the test set appears to be drawn from a simpler distribution than the training set. Figure 1 (middle left), (middle right), and (right) show bits per dimension (BPD) for SVHN and SVHN-Remix, evaluated on the training set and the test set during training. For SVHN, the order of training and test set BPD is flipped compared to SVHN-Remix. Since we normally evaluate probabilistic generative models by their likelihood on the test set, a distribution mismatch between the training set and the test set can lead to a false evaluation of these models.

Figure 1: (left): classification loss evaluated on the training set (dashed) and test set (solid) for SVHN (blue) and SVHN-Remix (green), averaged over five random seeds (lines are means, shaded areas are one standard deviation). The losses are similar. (middle left), (middle right), and (right): bits per dimension (BPD) as a function of training progress, evaluated on the training set (dotted) and test set (solid) for SVHN (blue) and SVHN-Remix (green). For SVHN, the order of training and test set performance is flipped compared to SVHN-Remix.
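For reference, the BPD values reported above are a rescaled negative log-likelihood: $\mathrm{BPD} = \mathrm{NLL}_\text{nats} / (D \ln 2)$ for $D$ data dimensions. Below is a minimal sketch of this conversion, assuming a per-image NLL (or negative ELBO) in nats for 32x32 RGB images.

```python
import math


def bits_per_dim(nll_nats: float, num_dims: int = 3 * 32 * 32) -> float:
    """Convert a per-image negative log-likelihood (in nats) to bits per dimension.

    For 32x32 RGB images (SVHN, CIFAR-10), num_dims = 3 * 32 * 32 = 3072.
    """
    return nll_nats / (num_dims * math.log(2))
```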


Download

We provide the new split SVHN-Remix below for download, both as files in the original file types and as the indices used to create them. Since SVHN-Remix is derived from the SVHN dataset, it is for non-commercial use only.
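Since the files come in the original SVHN file types (the cropped-digit format is a MATLAB .mat file with variables `X` and `y`), they can presumably be loaded like the original SVHN files. A minimal sketch using scipy; the filename is a placeholder and should be replaced with the actual name of the downloaded SVHN-Remix file.

```python
from scipy.io import loadmat

# Placeholder filename; substitute the actual SVHN-Remix file you downloaded.
remix_train = loadmat("svhn_remix_train_32x32.mat")
images = remix_train["X"]  # shape (32, 32, 3, N), as in the original SVHN .mat files
labels = remix_train["y"]  # shape (N, 1); in the original SVHN format, label 10 denotes digit 0
```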

BibTeX


    @article{xiao2023the,
      title={The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch},
      author={Xiao, Tim Z. and Zenn, Johannes and Bamler, Robert},
      journal={NeurIPS 2023 Workshop on Distribution Shifts},
      year={2023}
    }