Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
VideoMAE | no | ViT-S | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 79.0 | 93.8 |
VideoMAE | no | ViT-B | 800 | 16x5x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 80.0 | 94.4 |
VideoMAE | no | ViT-B | 800 | 16x5x3 | same as above | TODO | 81.0 | 94.8 |
VideoMAE | no | ViT-B | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 81.5 | 95.1 |
VideoMAE | no | ViT-L | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 85.2 | 96.8 |
VideoMAE | no | ViT-H | 1600 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 86.6 | 97.1 |
Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
VideoMAE | no | ViT-S | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 66.8 | 90.3 |
VideoMAE | no | ViT-B | 800 | 16x2x3 | script/log/checkpoint | script/log/checkpoint (w/o repeated aug) | 69.6 | 92.0 |
VideoMAE | no | ViT-B | 2400 | 16x2x3 | script/log/checkpoint | script/log/checkpoint | 70.8 | 92.4 |
Method | Extra Data | Backbone | Epoch | #Frame | Pre-train | Fine-tune | Top-1 | Top-5 |
---|---|---|---|---|---|---|---|---|
VideoMAE | no | ViT-B | 3200 | 16x5x3 | script/log/checkpoint | script/log/checkpoint | 91.3 | 98.5 |
- We report the results of VideoMAE fine-tuned with I3D dense sampling on Kinetics400 and TSN uniform sampling on Something-Something V2, respectively.
- #Frame = #input_frame x #clip x #crop.
- #input_frame is the number of frames fed to the model during the test phase.
- #crop means spatial crops (e.g., 3 for left/right/center crop).
- #clip means temporal clips (e.g., 5 means five temporal clips sampled with different start indices).
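
To make the #Frame notation concrete, here is a minimal sketch that splits an entry such as `16x5x3` into its three factors and derives the per-video totals. The helper names (`parse_frame_spec`, `views_per_video`, `frames_per_video`) are hypothetical and not part of the VideoMAE codebase. The sketch assumes each (clip, crop) pair is evaluated as one separate view, and it does not cover how per-view predictions are combined.

```python
from typing import Tuple


def parse_frame_spec(spec: str) -> Tuple[int, int, int]:
    """Split a #Frame entry like '16x5x3' into (input_frames, clips, crops)."""
    input_frames, clips, crops = (int(part) for part in spec.lower().split("x"))
    return input_frames, clips, crops


def views_per_video(spec: str) -> int:
    """Views per test video, assuming one view per (clip, crop) pair."""
    _, clips, crops = parse_frame_spec(spec)
    return clips * crops


def frames_per_video(spec: str) -> int:
    """Total frames processed per test video: input_frames x clips x crops."""
    input_frames, clips, crops = parse_frame_spec(spec)
    return input_frames * clips * crops


if __name__ == "__main__":
    for spec in ("16x5x3", "16x2x3"):
        print(spec, views_per_video(spec), frames_per_video(spec))
```

Under these assumptions, `16x5x3` corresponds to 15 views and 240 frames per video, while `16x2x3` corresponds to 6 views and 96 frames.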