MACHINE LEARNING BERKELEY

The bad

● "We train iGPT-S, iGPT-M, and iGPT-L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT-XL, a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web."
● "iGPT-L was trained for roughly 2500 V100-days while a similarly performing MoCo model can be trained in roughly 70 V100-days"
  ○ For reference, MoCo is another self-supervised model, but it has a ResNet backbone capable of handling a 224 x 224 image resolution.
● All that for only a 64x64 resolution!
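To put the quoted training costs in perspective, a quick back-of-the-envelope comparison, using only the V100-day figures quoted on the slide:

```python
# Compute comparison from the quoted figures (slide-level numbers only).
igpt_l_v100_days = 2500  # iGPT-L pretraining cost, per the quote
moco_v100_days = 70      # similarly performing MoCo model, per the quote

ratio = igpt_l_v100_days / moco_v100_days
print(f"iGPT-L uses roughly {ratio:.0f}x the compute of MoCo")  # roughly 36x
```

And MoCo gets that performance while operating on 224 x 224 inputs rather than iGPT's 64 x 64, which is what makes the gap so stark.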