
Create a mirror of the MNIST dataset to reduce load on the original site and improve reliability

Recently, we have seen an increasing number of errors caused by the MNIST dataset site (http://yann.lecun.com/exdb/mnist/) being unavailable (for example, https://pragit.diee.unica.it/secml/secml/-/jobs/55110). We use the dataset downloaded from this site in our unit tests and as part of the CCDataLoaderMNIST class.

In fact, the original site itself encourages making copies when the dataset is loaded from CI scripts and similar automation: "Please refrain from accessing these files from automated scripts with high frequency. Make copies!"

We can thus create a mirror in our model zoo (issue https://gitlab.com/secml/secml-zoo/-/issues/14) and update CCDataLoaderMNIST to download the files from there, using our dl_file_gitlab function pointed at the model zoo once the mirror has been created.
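As a rough illustration of the mirror idea, the sketch below tries a list of URLs in order and accepts the first download whose MD5 matches the expected checksum. It is not the dl_file_gitlab implementation: the names fetch_with_fallback and md5_of are hypothetical, and it uses only the standard library.

```python
import hashlib
import urllib.request
from urllib.error import URLError


def md5_of(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def fetch_with_fallback(urls, expected_md5, timeout=30):
    """Return the content of the first URL whose MD5 matches expected_md5.

    Hypothetical helper: URLs are tried in order, so the mirror can be
    listed first and the original site kept as a last resort.
    """
    errors = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                data = resp.read()
        except (URLError, OSError) as exc:
            # Source unavailable: remember why and try the next one.
            errors.append((url, repr(exc)))
            continue
        if md5_of(data) == expected_md5:
            return data
        errors.append((url, "MD5 mismatch"))
    raise RuntimeError("all sources failed: {}".format(errors))
```

Checking the checksum before accepting a file means a truncated or corrupted copy on one source is treated the same way as an unavailable source.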

EDIT: We observed that the same issue also affects the torchvision data loader. As reported on GitHub (https://github.com/pytorch/vision/pull/3544), the loader is being updated with new mirrors, but the change will only be available in version 0.9.1. To fix our tests in the meantime, we will apply a workaround by setting the following in TestCDataLoaderTorchDataset._create_ds:

torchvision_dataset.resources = [
    ('https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz', 'f68b3c2dcbeaaa9fbdd348bbdeb94873'),
    ('https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz', 'd53e105ee54ea40749a09fcbcd1e9432'),
    ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz', '9fb629c4189551a2d022fa330f9573f3'),
    ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz', 'ec29112dd5afa0611ce80d1b7f02629c')
]
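
Since the workaround is only needed until torchvision 0.9.1, one option is to build the resource list from a single base URL and apply it only to older versions. The sketch below is an assumption about how that could look, not code from the test suite: mirror_resources and needs_workaround are hypothetical names, and the version check simply compares the first three numeric components of the version string.

```python
import re

MIRROR = "https://ossci-datasets.s3.amazonaws.com/mnist/"

# File names and MD5 checksums, as listed in the workaround above.
MNIST_MD5 = {
    "train-images-idx3-ubyte.gz": "f68b3c2dcbeaaa9fbdd348bbdeb94873",
    "train-labels-idx1-ubyte.gz": "d53e105ee54ea40749a09fcbcd1e9432",
    "t10k-images-idx3-ubyte.gz": "9fb629c4189551a2d022fa330f9573f3",
    "t10k-labels-idx1-ubyte.gz": "ec29112dd5afa0611ce80d1b7f02629c",
}


def mirror_resources(base_url=MIRROR):
    """Build the (url, md5) pairs for the given mirror base URL."""
    return [(base_url + name, md5) for name, md5 in MNIST_MD5.items()]


def needs_workaround(version):
    """Return True if the version string is older than 0.9.1.

    Hypothetical check: local suffixes such as '+cu111' are dropped and
    only the first three numeric components are compared.
    """
    release = version.split("+")[0]
    parts = tuple(int(p) for p in re.findall(r"\d+", release)[:3])
    return parts < (0, 9, 1)
```

In the test, the override would then be applied only when needs_workaround(torchvision.__version__) is true, so that it becomes a no-op once the fixed release is installed.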