Momentum Contrast in Frequency and Spatial Domain (MocoFSD), inspired by the MoCo [1] framework, learns feature representations by combining frequency- and spatial-domain information during the pre-training phase. Features learned by MocoFSD outperform their self-supervised and supervised counterparts on two downstream tasks: fine-grained image classification and image classification.
This project was carried out during a research internship under Prof. Jiapan Guo at the University of Groningen, as part of the course WMCS021-15.
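As an illustration of the idea only, the sketch below derives a frequency-domain view of an image with a channel-wise 2-D DCT, which could then be paired with a standard spatial augmentation during contrastive pre-training. The exact frequency transform used by MocoFSD is defined in the pre-training code and may differ from this.

# Hypothetical sketch of a frequency-domain view (channel-wise 2-D DCT);
# not necessarily the exact transform used by MocoFSD.
import numpy as np
from scipy.fft import dctn

def frequency_view(image):
    """image: (H, W, C) float array -> per-channel 2-D DCT coefficients."""
    return np.stack(
        [dctn(image[:, :, c], norm="ortho") for c in range(image.shape[2])],
        axis=2,
    )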
1. Clone the repository:
$ git clone git@github.com:Rohit8y/MocoFSD.git
$ cd MocoFSD
2. Create a new Python environment and activate it:
$ python3 -m venv py_env
$ source py_env/bin/activate
3. Install necessary packages:
$ pip install -r requirements.txt
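4. (Optional) Verify that PyTorch was installed with CUDA support and can see your GPUs:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"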
- Download the ImageNet dataset from http://www.image-net.org/.
- Then, move and extract the training and validation images to labeled subfolders, using the following shell script
- The following fine-tuning datasets are downloaded automatically by the code using the PyTorch API (see the sketch after this list):
- Stanford Dogs
- Stanford Cars
- FGVC Aircraft
- CIFAR 100
- DTD
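A minimal sketch of this automatic download, assuming the code uses torchvision's dataset classes (CIFAR-100 and DTD are shown here; the exact calls in the repository may differ):

# Sketch only: torchvision downloads the archives on first use and reuses the local copy afterwards.
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

cifar100 = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
dtd = datasets.DTD(root="./data", split="train", download=True, transform=transform)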
This implementation only supports multi-gpu, DistributedDataParallel training, which is faster and simpler; single-gpu or DataParallel training is not supported.
To perform self-supervised pre-training of a ResNet-50 model on ImageNet on an 8-gpu machine, run:
python pretrain.py \
-a resnet50 \
--lr 0.03 \
--batch-size 256 \
--mlp --moco-t 0.2 --aug-plus --cos \
--dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
[your imagenet-folder with train and val folders]
Note: for 4-gpu training, we recommend the linear lr scaling recipe, i.e. scale the learning rate in proportion to the total batch size (lr = 0.03 × batch_size / 256), which gives --lr 0.015 --batch-size 128 for 4 gpus.
positional arguments:
DIR path to dataset (default: imagenet)
optional arguments:
--help show this help message and exit
--arch model architecture: resnet18 | resnet34 | resnet50 (default: resnet18)
--workers number of data loading workers (default: 4)
--epochs number of total epochs to run
--start-epoch N manual epoch number (useful on restarts)
--batch-size mini-batch size (default: 256); this is the total batch size across all GPUs on the current node when using Data Parallel or Distributed Data Parallel
--lr initial learning rate
--momentum momentum
--weight-decay weight decay (default: 1e-4)
--resume path to latest checkpoint (default: none)
--evaluate evaluate model on validation set
--pretrained use pre-trained model
--world-size number of nodes for distributed training
--rank node rank for distributed training
--dist-url url used to set up distributed training
--dist-backend distributed backend
--seed seed for initializing training.
--gpu GPU id to use.
--multiprocessing-distributed
Use multi-processing distributed training to launch N processes per node, which has N GPUs. This is the fastest way to use PyTorch for either single-node or multi-node data parallel training
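For example, to resume an interrupted pre-training run from a saved checkpoint, pass --resume with the checkpoint path:
python pretrain.py \
-a resnet50 \
--lr 0.03 \
--batch-size 256 \
--mlp --moco-t 0.2 --aug-plus --cos \
--resume [path to saved checkpoint] \
--dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
[your imagenet-folder with train and val folders]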
Using the pre-trained model, we provide the option to fine-tune on five downstream datasets: Stanford Cars, Stanford Dogs, CIFAR100, FGVC Aircraft, and DTD. To list the available fine-tuning options, run:
python main.py -h
usage: main.py [-h] [--arch ARCH] [--epochs EPOCHS] [--lr LR] [--batch-size BS]
               [--wd WD] [--dataset DATASET] [--model MODEL]
options:
--help show this help message and exit
--arch model architecture: resnet18 | resnet34 | resnet50 (default: resnet18)
--epochs number of total epochs to run (default: 100)
--lr initial learning rate (default: 0.001)
--batch-size mini-batch size (default: 32)
--wd weight decay (default: 1e-4)
--dataset fine-tuning dataset to use: stanfordCars | stanfordDogs | aircraft | cifar100 | dtd (default: stanfordCars)
--model path to the pre-trained model
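For example, to fine-tune a ResNet-50 backbone on Stanford Dogs with the default hyperparameters:
python main.py \
--arch resnet50 \
--dataset stanfordDogs \
--epochs 100 \
--lr 0.001 \
--batch-size 32 \
--wd 1e-4 \
--model [path to the pre-trained MocoFSD checkpoint]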
This implementation works on a single gpu and does not support multi-gpu training. We used grid search to find the optimal values of the remaining hyperparameters. Once training completes, the final model is saved as <dataset_name>_best_model.pth.tar.
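A minimal sketch of loading the saved checkpoint for inference, assuming the file stores either a plain state_dict or a dictionary holding one under a 'state_dict' key (adjust to how main.py actually saves it):

# Sketch only: load a fine-tuned checkpoint for evaluation.
import torch
from torchvision import models

# The architecture and class count must match the fine-tuned model
# (e.g. 120 classes for Stanford Dogs).
model = models.resnet50(num_classes=120)
checkpoint = torch.load("stanfordDogs_best_model.pth.tar", map_location="cpu")
state_dict = checkpoint["state_dict"] if "state_dict" in checkpoint else checkpoint
model.load_state_dict(state_dict)
model.eval()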
[1] He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729-9738).